Data In, Data Out
In Elasticsearch, all data in every field is indexed by default. That is, every field has a dedicated inverted index for fast retrieval. And, unlike most other databases, it can use all of those inverted indices in the same query, to return results at breathtaking speed.
Document format
{
"name": "John Smith",
"age": 42,
"confirmed": true,
"join_date": "2014-06-01",
"home": {
"lat": 51.5,
"lon": 0.1
},
"accounts": [
{
"type": "facebook",
"id": "johnsmith"
},
{
"type": "twitter",
"id": "johnsmith"
}
]
}
Warning: Field names can be any valid string, but may not include periods.
Every document carries three pieces of metadata:
- _index: Where the document lives
- _type: The class of object that the document represents
- _id: The unique identifier for the document
Indexing a document
PUT /{index}/{type}/{id}
{
"field": "value",
...
}
For example:
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
{
"_index": "website",
"_type": "blog",
"_id": "123",
"_version": 1,
"created": true
}
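Every document has a version number. Each time the document is changed or deleted, _version is incremented.
If you do not supply an ID, Elasticsearch generates one for you. Note that the request must then use POST ("store this document under this URL") instead of PUT ("store this document at this URL").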
POST /website/blog/
{
"title": "My second blog entry",
"text": "Still trying this out...",
"date": "2014/01/01"
}
{
"_index": "website",
"_type": "blog",
"_id": "AVFgSgVHUP18jI2wRx0w",
"_version": 1,
"created": true
}
Autogenerated IDs are 20 characters long, URL-safe, Base64-encoded GUID strings. These GUIDs are generated from a modified FlakeID scheme, which allows multiple nodes to generate unique IDs in parallel with essentially zero chance of collision.
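The length of 20 characters follows from Base64 arithmetic: 15 bytes encode to exactly 20 characters with no padding. The sketch below only illustrates the shape of such an ID. It is not the generator Elasticsearch uses; the byte layout (a 6-byte millisecond timestamp followed by 9 random bytes) and the name make_id are assumptions made for the example.

```python
import base64
import os
import time


def make_id() -> str:
    """Return a 20-character, URL-safe, Base64-encoded identifier.

    Illustrative layout only: 6 bytes of millisecond timestamp followed by
    9 random bytes. This is NOT the scheme Elasticsearch itself uses.
    """
    millis = int(time.time() * 1000)
    timestamp = (millis & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    raw = timestamp + os.urandom(9)  # 15 bytes in total
    # 15 bytes map to exactly 20 Base64 characters, so no "=" padding appears.
    return base64.urlsafe_b64encode(raw).decode("ascii")


if __name__ == "__main__":
    print(make_id())
```

Putting the timestamp first keeps IDs generated close together in time close together in sort order, which is one common motivation for Flake-style schemes.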
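Retrieving a document
Use GET to retrieve the document with a given ID. Appending the ?pretty parameter pretty-prints the JSON response. If the document is not found, the response status is 404. Passing -i to curl prints the response headers:
curl -i -XGET 'http://localhost:9200/website/blog/124?pretty'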
HTTP/1.1 404 Not Found
Content-Type: application/json; charset=UTF-8
Content-Length: 83
{
"_index" : "website",
"_type" : "blog",
"_id" : "124",
"found" : false
}
By default, GET returns the whole document, stored under the _source field. To return only some of the fields, list them in the _source parameter:
GET /website/blog/123?_source=title,text
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"found" : true,
"_source" : {
"title": "My first blog entry" ,
"text": "Just trying this out..."
}
}
To get just the _source without any metadata, use the _source endpoint:
GET /website/blog/123/_source
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
Checking whether a document exists
Use the HEAD method, which returns only headers and no body:
curl -i -XHEAD http://localhost:9200/website/blog/123
HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
curl -i -XHEAD http://localhost:9200/website/blog/124
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
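Updating a whole document
Documents stored in Elasticsearch are immutable: individual fields cannot be changed in place. To update a document, you reindex or replace it.
The index API performs the replace: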
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "I am starting to get the hang of this...",
"date": "2014/01/02"
}
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 2,
"created": false
}
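_version has been incremented, and created is false because a document with this ID already existed.
Internally, Elasticsearch marks the old document as deleted and adds an entirely new document. The old version does not disappear immediately, although you can no longer access it. Elasticsearch cleans up deleted documents in the background.
The update API, covered later, appears to change part of a document in place. In reality it follows the same process described above: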
- Retrieve the JSON from the old document
- Change it
- Delete the old document
- Index a new document
The difference is that update does all of this within a single client request, instead of requiring separate GET and index requests.
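The retrieve, change, and reindex cycle can be sketched with a plain dictionary standing in for an index. This is a toy model under stated assumptions, not Elasticsearch code: the names store, index, and update are hypothetical, and the "delete the old document" step is modeled by overwriting the old entry.

```python
import copy

# A plain dict stands in for an index:
# {doc_id: {"_version": n, "_source": {...}}}
store = {}


def index(doc_id, source):
    """Replace the whole document and bump its version (like PUT /index/type/id)."""
    version = store.get(doc_id, {}).get("_version", 0) + 1
    store[doc_id] = {"_version": version, "_source": copy.deepcopy(source)}
    return version


def update(doc_id, partial):
    """Emulate a partial update: retrieve, change, then reindex the whole document."""
    source = copy.deepcopy(store[doc_id]["_source"])  # 1. retrieve the old JSON
    source.update(partial)                            # 2. change it
    return index(doc_id, source)                      # 3-4. old copy replaced by a new one
```

Calling index("123", {"title": "My first blog entry"}) followed by update("123", {"views": 0}) leaves a single document at version 2 that contains both fields. Nothing was modified in place: the second call wrote a complete new copy.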
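Creating a new document
The easiest way to make sure a request creates a new document rather than overwriting an existing one is to use POST without specifying an ID: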
POST /website/blog/
{ ... }
If you must supply your own _id, use either the op_type query string parameter or the /_create endpoint:
PUT /website/blog/123?op_type=create
{ ... }
PUT /website/blog/123/_create
{ ... }
If the document is created, the response is 201 Created; if a document with that ID already exists, the response is 409 Conflict.
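The create-only behavior amounts to "write if absent". A minimal sketch, assuming a dictionary as the store and a hypothetical function name create, with HTTP-like status codes as return values:

```python
def create(store, doc_id, source):
    """Index `source` only if `doc_id` is unused; return an HTTP-like status code."""
    if doc_id in store:
        return 409  # Conflict: a document with this ID already exists
    store[doc_id] = {"_version": 1, "_source": dict(source)}
    return 201  # Created


if __name__ == "__main__":
    docs = {}
    print(create(docs, "123", {"title": "My first blog entry"}))
    print(create(docs, "123", {"title": "Would overwrite"}))
```

The second call leaves the stored document untouched, which is the point of using create rather than index.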
Deleting a document
DELETE /website/blog/123
If the document is found, the response is 200 OK and the _version in the response body is incremented by 1:
{
"found" : true,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3
}
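If the document is not found, the response is 404 Not Found. Note that _version is still incremented, even though the document does not exist: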
{
"found" : false,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 4
}
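Dealing with conflicts
When a document is updated through the index API, the whole document is read, changed, and reindexed, so the most recent index request wins and a concurrent change can be silently overwritten. In relational database terms this is a lost update, a close relative of the non-repeatable read. In most Elasticsearch use cases it does not matter: Elasticsearch often serves as a search cache in front of a relational database, where data is almost always inserted and rarely modified.
When such conflicts do need to be handled, there are two approaches:
- Pessimistic concurrency control: Widely used by relational databases. It assumes that conflicting changes are likely, so access to a resource is blocked while it is being changed. A typical example is locking a row before reading it, so that only the thread holding the lock can modify that row.
- Optimistic concurrency control: Used by Elasticsearch. It assumes that conflicts are unlikely and does not block operations. If the underlying data changed between the read and the write, the update fails, and the application decides how to resolve the conflict, for example by retrying with fresh data or by reporting the situation to the user.
Optimistic concurrency control
Every document has a _version number that is incremented whenever the document changes. An application can pass the version it last read along with a write, so that the write succeeds only if nobody else has changed the document in the meantime.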
PUT /website/blog/1/_create
{
"title": "My first blog entry",
"text": "Just trying this out..."
}
GET /website/blog/1
{
"_index" : "website",
"_type" : "blog",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title": "My first blog entry",
"text": "Just trying this out..."
}
}
PUT /website/blog/1?version=1
{
"title": "My first blog entry",
"text": "Starting to get the hang of this..."
}
This request succeeds and increments _version to 2. Repeating the same request, still with version=1, now fails with a 409 Conflict:
{
"error": {
"root_cause": [
{
"type": "version_conflict_engine_exception",
"reason": "[blog][1]: version conflict, current [2], provided [1]",
"index": "website",
"shard": "3"
}
],
"type": "version_conflict_engine_exception",
"reason": "[blog][1]: version conflict, current [2], provided [1]",
"index": "website",
"shard": "3"
},
"status": 409
}
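If versions are managed by an external system, append version_type=external to the query string. Elasticsearch then checks that the supplied version is greater than the current _version, rather than equal to it, and stores the supplied number as the new _version: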
PUT /website/blog/2?version=5&version_type=external
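The two version checks can be modeled in a few lines. The following is a simplified in-memory sketch, not Elasticsearch's implementation; the class names VersionedStore and VersionConflict and the external flag are hypothetical stand-ins for the version and version_type=external parameters.

```python
class VersionConflict(Exception):
    """Raised when the supplied version fails the check (HTTP 409 in the API)."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, version=None, external=False):
        current = self._docs.get(doc_id, (0, None))[0]
        if version is None:
            new_version = current + 1  # no check: last write wins
        elif external:
            if version <= current:     # external versions must increase
                raise VersionConflict(f"current [{current}], provided [{version}]")
            new_version = version
        else:
            if version != current:     # internal versions must match exactly
                raise VersionConflict(f"current [{current}], provided [{version}]")
            new_version = current + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


if __name__ == "__main__":
    s = VersionedStore()
    s.put("1", {"text": "Just trying this out..."})
    s.put("1", {"text": "Starting to get the hang of this..."}, version=1)
    try:
        s.put("1", {"text": "A stale write"}, version=1)
    except VersionConflict as exc:
        print("conflict:", exc)
```

On a conflict the application would typically re-read the document, reapply its change to the fresh copy, and try again with the new version.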
Partial updates to documents
The update API mentioned earlier is used like this:
POST /website/blog/1/_update
{
"doc" : {
"tags" : [ "testing" ],
"views": 0
}
}
Response:
{
"_index" : "website",
"_id" : "1",
"_type" : "blog",
"_version" : 3
}
Retrieving the document shows the merged result:
{
"_index": "website",
"_type": "blog",
"_id": "1",
"_version": 3,
"found": true,
"_source": {
"title": "My first blog entry",
"text": "Starting to get the hang of this...",
"tags": [ "testing" ],
"views": 0
}
}
Using scripts to make partial updates
POST /website/blog/1/_update
{
"script" : "ctx._source.views+=1"
}
Scripts can be used in the update API to change the contents of the _source field, which is referred to inside an update script as ctx._source.
Scripting with Groovy
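Elasticsearch lets you embed your own logic as scripts, and many APIs support them. A script can be passed in the request, retrieved from the special .scripts index, or loaded from disk.
The default scripting language is Groovy, but dynamic Groovy scripting is disabled by default. Dynamic scripts can also be turned off explicitly by adding this setting on every node in the cluster:
script.groovy.sandbox.enabled: false
This turns off the Groovy sandbox, so Groovy scripts are no longer accepted as part of a request or retrieved from the .scripts index. Scripts stored as files in the config/scripts/ directory of each node can still be used.
More on scripting: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html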
POST /website/blog/1/_update
{
"script" : "ctx._source.tags+=new_tag",
"params" : {
"new_tag" : "search"
}
}
Fetching the document afterwards shows the effect of both scripted updates:
{
"_index": "website",
"_type": "blog",
"_id": "1",
"_version": 5,
"found": true,
"_source": {
"title": "My first blog entry",
"text": "Starting to get the hang of this...",
"tags": ["testing", "search"],
"views": 1
}
}
A script can even delete a document based on its contents, by setting ctx.op to delete:
POST /website/blog/1/_update
{
"script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
"params" : {
"count": 1
}
}
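Updating a document that may not yet exist
Suppose we keep a page-view counter. For a new page the counter document does not exist yet, so an update would fail.
In that case we need the upsert parameter, which specifies the document to create if it does not already exist: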
POST /website/pageviews/1/_update
{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 1
}
}
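The update API uses _version internally to detect a conflicting change between its read and its write. To retry automatically instead of failing, set the retry_on_conflict parameter to the number of times the update should be retried: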
POST /website/pageviews/1/_update?retry_on_conflict=5
{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 0
}
}
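How upsert and retry_on_conflict interact can be sketched as a read, modify, conditional-write loop. This is a simplified model, not the actual implementation: the function names, the dictionary store, and the before_write hook (used only to simulate a concurrent writer) are all hypothetical.

```python
class VersionConflict(Exception):
    """The document changed between our read and our write."""


def write_if_version(store, doc_id, expected_version, source):
    """Write `source` only if the stored version still equals `expected_version`."""
    current = store.get(doc_id, {"_version": 0})["_version"]
    if current != expected_version:
        raise VersionConflict(doc_id)
    store[doc_id] = {"_version": current + 1, "_source": source}


def update(store, doc_id, script, upsert, retry_on_conflict=0, before_write=None):
    """Apply `script` to the document, or index `upsert` if it does not exist.

    `before_write` is a hypothetical hook that lets a caller simulate another
    client writing between this function's read and its write.
    """
    for attempt in range(retry_on_conflict + 1):
        doc = store.get(doc_id)
        if doc is None:
            version, source = 0, dict(upsert)  # document missing: use the upsert body
        else:
            version, source = doc["_version"], dict(doc["_source"])
            script(source)                     # document exists: run the script
        if before_write is not None:
            before_write(attempt)
        try:
            write_if_version(store, doc_id, version, source)
            return store[doc_id]
        except VersionConflict:
            if attempt == retry_on_conflict:
                raise                          # retries exhausted: report the conflict
```

Retrying is safe for an operation like incrementing a counter, where the order of increments does not matter. For changes whose order does matter, letting the conflict surface to the application is usually the better choice.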
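Retrieving multiple documents
To fetch several documents from Elasticsearch in one request, use the mget API. It expects a docs array in which each element specifies the _index, _type, and _id of a document. An element can also carry a _source parameter listing the fields to return.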
GET /_mget
{
"docs" : [
{
"_index" : "website",
"_type" : "blog",
"_id" : 2
},
{
"_index" : "website",
"_type" : "pageviews",
"_id" : 1,
"_source": "views"
}
]
}
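The response body also contains a docs array, with one entry per requested document, in the same order as the request: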
{
"docs" : [
{
"_index" : "website",
"_id" : "2",
"_type" : "blog",
"found" : true,
"_source" : {
"text" : "This is a piece of cake...",
"title" : "My first external blog entry"
},
"_version" : 10
},
{
"_index" : "website",
"_id" : "1",
"_type" : "pageviews",
"found" : true,
"_version" : 2,
"_source" : {
"views" : 2
}
}
]
}
A default _index and _type can also be given in the URL, and individual entries can still override them:
GET /website/blog/_mget
{
"docs" : [
{ "_id" : 2 },
{ "_type" : "pageviews", "_id" : 1 }
]
}
If all the documents share the same _index and _type, you can pass just an ids array:
GET /website/blog/_mget
{
"ids" : [ "2", "1" ]
}
The second requested document does not exist under this type, and the response reports that:
{
"docs" : [
{
"_index" : "website",
"_type" : "blog",
"_id" : "2",
"_version" : 10,
"found" : true,
"_source" : {
"title": "My first external blog entry",
"text": "This is a piece of cake..."
}
},
{
"_index" : "website",
"_type" : "blog",
"_id" : "1",
"found" : false
}
]
}
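NOTE:
mget always returns HTTP status 200, even when none of the documents is found, because the mget request itself succeeded. Check the found flag of each entry to see whether that document was retrieved.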
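Because the HTTP status says nothing about individual documents, client code has to inspect each entry. A minimal sketch, assuming the response body has already been parsed into a Python dictionary with the shape shown above; the function name split_mget is hypothetical.

```python
def split_mget(response):
    """Separate found documents from missing ones in a parsed mget response."""
    found, missing = {}, []
    for doc in response["docs"]:
        key = (doc["_index"], doc["_type"], doc["_id"])
        if doc.get("found"):
            found[key] = doc["_source"]
        else:
            missing.append(key)
    return found, missing


if __name__ == "__main__":
    response = {
        "docs": [
            {"_index": "website", "_type": "blog", "_id": "2", "_version": 10,
             "found": True,
             "_source": {"title": "My first external blog entry",
                         "text": "This is a piece of cake..."}},
            {"_index": "website", "_type": "blog", "_id": "1", "found": False},
        ]
    }
    found, missing = split_mget(response)
    print(sorted(found), missing)
```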
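Bulk operations
Whereas _mget only retrieves multiple documents at once, the bulk API can perform multiple create, index, update, or delete operations in a single request.
The body of a bulk request has the following format: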
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
...
It is like a stream of valid one-line JSON documents joined by newline characters (\n). Two points to note:
- Every line must end with a newline character (\n), including the last line. These are used as markers to allow for efficient line separation.
- The lines cannot contain unescaped newline characters, as they would interfere with parsing. This means that the JSON must not be pretty-printed.
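The action/metadata line specifies which operation to perform on which document.
The action must be one of the following:
- create: Create a document only if it does not already exist
- index: Create a new document or replace an existing one
- update: Do a partial update on a document
- delete: Delete a document
The metadata specifies the _index, _type, and _id of the document: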
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
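The request body line contains the document _source itself for index and create operations. For update operations it contains the same body you would pass to the update API: doc, upsert, script, and so forth. A delete operation takes no request body line.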
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
If no ID is specified, one is generated automatically:
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
Putting it all together:
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
Take particular care not to omit the newline after the last line.
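Building the body programmatically avoids both formatting pitfalls. The sketch below uses only the standard json module; the function name bulk_body and the (action, metadata, body) tuple convention are assumptions for this example, not part of any client library.

```python
import json


def bulk_body(operations):
    """Serialize (action, metadata, body) triples into a bulk request body.

    `body` is None for actions that take no request body line (delete).
    """
    lines = []
    for action, metadata, body in operations:
        lines.append(json.dumps({action: metadata}))
        if body is not None:
            lines.append(json.dumps(body))
    # json.dumps without indent never emits raw newlines (newlines inside
    # strings are escaped), so every entry stays on a single line. The final
    # "\n" terminates the last line, as the bulk format requires.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    meta = {"_index": "website", "_type": "blog", "_id": "123"}
    print(bulk_body([
        ("delete", meta, None),
        ("create", meta, {"title": "My first blog post"}),
        ("index", {"_index": "website", "_type": "blog"}, {"title": "My second blog post"}),
    ]), end="")
```

The resulting string can be sent as the body of POST /_bulk with whatever HTTP client you use.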
Elasticsearch returns the result of each operation in the items array, in the same order as the operations in the request:
{
"took": 4,
"errors": false,
"items": [
{ "delete": {
"_index": "website",
"_type": "blog",
"_id": "123",
"_version": 2,
"status": 200,
"found": true
}},
{ "create": {
"_index": "website",
"_type": "blog",
"_id": "123",
"_version": 3,
"status": 201
}},
{ "create": {
"_index": "website",
"_type": "blog",
"_id": "EiwfApScQiiy7TIKFxRCTw",
"_version": 1,
"status": 201
}},
{ "update": {
"_index": "website",
"_type": "blog",
"_id": "123",
"_version": 4,
"status": 200
}}
]
}
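Each operation is executed independently, so the failure of one does not affect the others. If any operation fails, the top-level errors flag is set to true: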
POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "Cannot create - it already exists" }
{ "index": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "But we can update it" }
{
"took": 3,
"errors": true,
"items": [
{ "create": {
"_index": "website",
"_type": "blog",
"_id": "123",
"status": 409,
"error": "DocumentAlreadyExistsException[[website][4] [blog][123]: document already exists]"
}},
{ "index": {
"_index": "website",
"_type": "blog",
"_id": "123",
"_version": 5,
"status": 200
}}
]
}
This also means that bulk requests are not atomic: they cannot be used to implement transactions.
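Since some operations can succeed while others fail, a client should check each item rather than only the HTTP status. A minimal sketch, assuming the response has been parsed into a dictionary with the shape shown above; the function name failed_items and the rule "status 400 or higher means failure" are choices made for this example.

```python
def failed_items(response):
    """Return (position, action, status, error) for every failed bulk item."""
    if not response.get("errors"):
        return []  # fast path: the top-level flag says nothing failed
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a one-key dict of the form {action: result}.
        (action, result), = item.items()
        if result.get("status", 200) >= 400:
            failures.append((position, action, result["status"], result.get("error")))
    return failures
```

The position in the returned tuples matches the position of the operation in the request, so failed operations can be picked out and resubmitted on their own.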
Don't repeat yourself
When many operations target the same _index or _type, you can give defaults in the URL, just as with _mget:
POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }
The defaults from the URL can still be overridden in the metadata line of an individual operation:
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
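How big is too big?
The entire bulk request is loaded into memory on the node that receives it, so the practical batch size depends on your hardware.
A reasonable starting point is 1,000 to 5,000 documents per batch, adjusted for the size of each document.
It is also useful to keep an eye on the physical size of the request; a good bulk size is often around 5 to 15 MB.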
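Both limits can be enforced at once by cutting a batch whenever either the document count or the serialized size would be exceeded. The sketch below is one way to do this; the function name batches and its default limits (1,000 documents, 5 MB) are assumptions taken from the low end of the ranges above, and the size estimate counts only the document lines, not the action/metadata lines.

```python
import json


def batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents bounded by both a count and a serialized size."""
    batch, size = [], 0
    for doc in docs:
        # Size of the document as one JSON line, plus 1 byte for the newline.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch


if __name__ == "__main__":
    events = ({"event": "User logged in", "n": i} for i in range(2500))
    print([len(b) for b in batches(events)])
```

A single document larger than max_bytes still gets a batch of its own rather than being dropped. In practice the right limits are found by experiment: start small and increase the batch size until throughput stops improving.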