Mapping and Analysis

ES automatically guesses the type of each field in a document and generates a mapping from those guesses.

GET /gb/_mapping/tweet

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "strict_date_optional_time||epoch_millis"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}

The _all field is not listed in the mapping because it is included by default; its type is string.

Exact Values Versus Full Text

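Full text is often called unstructured data: it is stored in a human-readable form that is hard for a computer to parse.

For full text, ES first analyzes the text and then uses the result to build an inverted index.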

Inverted Index

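ES uses a structure called an inverted index for full-text search. An inverted index consists of a list of all the distinct terms that appear in any document and, for each term, the list of documents in which it appears.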

An example:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

To search for quick brown, we only need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1
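
The lookup above can be reproduced with a short Python sketch. The two document texts are an assumption (the notes show only the term table, not the documents), chosen so that a plain whitespace split yields exactly the 15 terms listed. The names build_inverted_index and search are illustrative, not ES APIs.

```python
from collections import defaultdict

# Assumed document texts, consistent with the term table above.
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(documents):
    """Map each distinct term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenizer, no normalization
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)

index = build_inverted_index(docs)
print(sorted(index))                 # the distinct terms, as in the table
print(search(index, "quick brown"))  # per-document match counts
```

Both documents match, but Doc_1 contains more of the query terms, which is the same ranking the Total row shows.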

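A leading + marks a term as required. The query +Quick +fox therefore matches only documents that contain both Quick and fox; the order of the terms does not matter. Against the index above it matches nothing, because Doc_1 contains quick and fox while Doc_2 contains Quick and foxes.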

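Analysis and Analyzers

The inverted index depends heavily on analysis. ES ships with built-in analyzers, and externally provided analyzers can be used as well.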

https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html
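
To make that dependence concrete, here is a minimal sketch of an analysis step applied to both the documents and the query. It is a toy, not one of ES's analyzers: analyze, matches_all, the TOY_STEMS table, and the two document texts are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a real stemmer: maps plural forms to their root.
TOY_STEMS = {"foxes": "fox", "dogs": "dog"}

def analyze(text):
    """Split on non-letter characters, lowercase, then apply the toy stem table."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [TOY_STEMS.get(t.lower(), t.lower()) for t in tokens]

def matches_all(doc_text, query):
    """True if every query term appears in the document after analysis."""
    doc_terms = set(analyze(doc_text))
    return all(term in doc_terms for term in analyze(query))

doc_1 = "The quick brown fox jumped over the lazy dog"
doc_2 = "Quick brown foxes leap over lazy dogs in summer"

print(analyze(doc_2))
print(matches_all(doc_1, "+Quick +fox"), matches_all(doc_2, "+Quick +fox"))
```

Because the same normalization runs at index time and at query time, a query for Quick and fox now finds both documents, even though neither contains those two exact tokens.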

Mapping

ES supports the following simple field types:

String: string
Whole number: byte, short, integer, long
Floating-point: float, double
Boolean: boolean
Date: date

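When you index a document that contains a new field, ES uses dynamic mapping to guess the field type from the JSON value, according to these rules: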

JSON type                         Field type
---------------------------------------------
Boolean: true or false          | boolean
Whole number: 123               | long
Floating point: 123.45          | double
String, valid date: 2014-09-15  | date
String: foo bar                 | string
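
The rules in this table can be written down as a small Python function. This is only a sketch of the table, not ES's actual detection logic: guess_field_type is a hypothetical name, and date.fromisoformat stands in for ES's date detection, which uses its own formats.

```python
from datetime import date

def guess_field_type(value):
    """Return the field type the table above assigns to a JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # accepts YYYY-MM-DD
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported JSON value: {value!r}")

for sample in (True, 123, 123.45, "2014-09-15", "foo bar"):
    print(repr(sample), "->", guess_field_type(sample))
```

One consequence worth noting: a number sent as a quoted string, such as "123", falls under the string rule rather than the whole-number rule.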

View the mapping:

GET /gb/_mapping/tweet

Customizing the Mapping

{
    "number_of_clicks": {
        "type": "integer"
    }
}

For string fields, the two most important attributes are index and analyzer.

index accepts three values:

analyzed: First analyze the string and then index it. In other words, index this field as full text.
not_analyzed: Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.
no: Don’t index this field at all. This field will not be searchable.

By default, the index attribute of a string field is analyzed. To treat the field as an exact value, set it to not_analyzed:

{
    "tag": {
        "type":     "string",
        "index":    "not_analyzed"
    }
}

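NOTE: Fields of the other simple types also accept the index attribute, but the only valid values are no and not_analyzed, because their values are never analyzed.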
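
The effect of the three index values on what ends up in the inverted index can be sketched as follows. The function name index_terms is hypothetical, and the lowercase-and-split step is only a stand-in for a real analyzer.

```python
import re

def index_terms(value, index="analyzed"):
    """Return the terms that would be indexed for a string value."""
    if index == "analyzed":
        # Stand-in analyzer: lowercase, then split into word tokens.
        return re.findall(r"\w+", value.lower())
    if index == "not_analyzed":
        # The value is indexed exactly as given, as a single term.
        return [value]
    if index == "no":
        # Nothing is indexed, so the field cannot be searched.
        return []
    raise ValueError(f"invalid index setting: {index!r}")

print(index_terms("Quick Brown Fox"))
print(index_terms("Quick Brown Fox", index="not_analyzed"))
print(index_terms("Quick Brown Fox", index="no"))
```

With not_analyzed, only a query for the exact original value matches, which is what you want for fields such as tags or identifiers.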

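Updating a Mapping

You can specify the mapping when you first create an index. Later, you can use the /_mapping endpoint to add a mapping for a new type or to update the mapping of an existing type.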

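NOTE: You can add new fields to an existing mapping, but you cannot change the mapping of a field that already exists. Data for that field has probably been indexed already, and changing the mapping would leave that indexed data wrong and not properly searchable.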

Delete the gb index, then specify the mapping when recreating it:

DELETE /gb

PUT /gb 
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          "analyzer": "english"
        },
        "date" : {
          "type" :   "date"
        },
        "name" : {
          "type" :   "string"
        },
        "user_id" : {
          "type" :   "long"
        }
      }
    }
  }
}

Update the tweet mapping by adding a not_analyzed field named tag:

PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {
      "type" :    "string",
      "index":    "not_analyzed"
    }
  }
}
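
A mapping update of this kind is additive: new fields are accepted, while redefining a field that already exists is not. The following sketch models only that rule. merge_mapping is a hypothetical helper, and ES's real merge logic is more nuanced than a plain equality check.

```python
def merge_mapping(existing, update):
    """Return a new properties dict; reject changes to fields that already exist."""
    merged = dict(existing)
    for field, definition in update.items():
        if field in merged and merged[field] != definition:
            raise ValueError(f"cannot change mapping of existing field {field!r}")
        merged[field] = definition
    return merged

properties = {
    "tweet": {"type": "string", "analyzer": "english"},
    "date": {"type": "date"},
}

# Adding a new field succeeds.
properties = merge_mapping(properties, {"tag": {"type": "string", "index": "not_analyzed"}})
print(sorted(properties))

# Redefining an existing field is rejected.
try:
    merge_mapping(properties, {"date": {"type": "string"}})
except ValueError as err:
    print(err)
```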

Complex Types

Besides simple values, a JSON document can contain null values, arrays, and nested objects.

Multivalue Fields

For example, the tag field can hold an array:

{ "tag": [ "search", "nosql" ]}

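Arrays need no special mapping. An array field can hold any number of values, and every value is analyzed by the same rules (the same index and analyzer settings). For that reason, all values in an array must be of the same type.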

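NOTE: Arrays are indexed, and so made searchable, as multivalue fields, which are not ordered. You cannot refer to "the first element" or "the last element"; think of an array as a bag of values.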

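Empty Fields

Lucene has no way to store a null value, so a field that contains null is treated as an empty field.

The following three fields are all considered empty, and none of them is indexed: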

"null_value":               null,
"empty_array":              [],
"array_with_null_value":    [ null ]

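The three cases listed above can be captured in a small predicate. This is an illustrative sketch only: is_empty_field is a hypothetical name, and treating an array whose elements are all null as empty generalizes the [ null ] case shown in the notes.

```python
def is_empty_field(value):
    """True for null, an empty array, or an array that contains only nulls."""
    if value is None:
        return True
    if isinstance(value, list):
        # all() over an empty list is True, so [] is covered here as well.
        return all(item is None for item in value)
    return False

doc = {
    "null_value": None,
    "empty_array": [],
    "array_with_null_value": [None],
    "tag": ["search", "nosql"],
}

indexed_fields = [name for name, value in doc.items() if not is_empty_field(value)]
print(indexed_fields)
```
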
Multilevel Objects

Given a multilevel object:

{
    "tweet":            "Elasticsearch is very flexible",
    "user": {
        "id":           "@johnsmith",
        "gender":       "male",
        "age":          26,
        "name": {
            "full":     "John Smith",
            "first":    "John",
            "last":     "Smith"
        }
    }
}

ES generates a mapping like this:

{
  "gb": {
    "tweet": { 
      "properties": {
        "tweet":            { "type": "string" },
        "user": { 
          "type":             "object",
          "properties": {
            "id":           { "type": "string" },
            "gender":       { "type": "string" },
            "age":          { "type": "long"   },
            "name":   { 
              "type":         "object",
              "properties": {
                "full":     { "type": "string" },
                "first":    { "type": "string" },
                "last":     { "type": "string" }
              }
            }
          }
        }
      }
    }
  }
}

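An inner object is mapped just like the root object (the type mapping itself), except that it lacks the special top-level metadata fields such as _source and _all.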

Lucene does not understand inner objects, so ES flattens the document into something like this:

{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}
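
The key flattening can be sketched in a few lines of Python. This shows only how the dotted field names are produced; it does not analyze the values into terms as the listing above does, and flatten is an illustrative helper rather than anything ES exposes.

```python
def flatten(obj, prefix=""):
    """Collapse nested dicts into a single dict with dotted key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))  # recurse into inner objects
        else:
            flat[path] = value
    return flat

doc = {
    "tweet": "Elasticsearch is very flexible",
    "user": {
        "id": "@johnsmith",
        "gender": "male",
        "age": 26,
        "name": {"full": "John Smith", "first": "John", "last": "Smith"},
    },
}

for path, value in flatten(doc).items():
    print(path, "=", value)
```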

For an array of inner objects:

{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}

the document is flattened like this:

{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}

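Note that the correlation between each follower's age and name is lost, because an array in ES is just a bag of values with no order. You can find out whether any follower is 26 years old, but you cannot get an accurate answer to a question such as: is there a follower who is 26 years old and named Alex Jones?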
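
A short sketch shows why the combined question cannot be answered from the flattened form. The helper names flatten_array and flat_match are hypothetical, and the lowercase-and-split step only approximates analysis of the name field.

```python
followers = [
    {"age": 35, "name": "Mary White"},
    {"age": 26, "name": "Alex Jones"},
    {"age": 19, "name": "Lisa Smith"},
]

def flatten_array(field, objects):
    """Merge the values of every object in the array into one bag per key."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            bag = flat.setdefault(f"{field}.{key}", set())
            if isinstance(value, str):
                bag.update(value.lower().split())  # crude stand-in for analysis
            else:
                bag.add(value)
    return flat

def flat_match(flat, age, name_token):
    """Check age and name independently, which is all the flat form allows."""
    return age in flat["followers.age"] and name_token in flat["followers.name"]

flat = flatten_array("followers", followers)
print(flat_match(flat, 26, "alex"))  # a real follower
print(flat_match(flat, 26, "mary"))  # also matches, although no follower is 26 and named Mary
```

The second lookup is a false positive: 26 and mary both exist in the bags, but they come from different followers.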

For more details, see Nested Objects in the official documentation.
