ES Study Notes 10: Data Modeling

handling relationships

Changes to a single document are ACIDic, but transactions involving multiple documents are not. There is no way to roll back the index to its previous state if part of a transaction fails.

application-side joins

In short, ES does not support joins. You can still model simple relations and get the result you want in application code by querying twice.
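A minimal sketch of the "query twice" idea in Python. The in-memory lists, the `search` helper and the sample documents are hypothetical stand-ins that only imitate an exact-match lookup; a real application would send two search requests to Elasticsearch instead.

```python
# Hypothetical in-memory stand-ins for the user and blogpost types.
USERS = [
    {"_id": 1, "name": "John Smith"},
    {"_id": 2, "name": "Alice White"},
]
BLOGPOSTS = [
    {"_id": 10, "title": "Relationships", "user": 1},
    {"_id": 11, "title": "Nest eggs", "user": 2},
    {"_id": 12, "title": "More relationships", "user": 1},
]


def search(docs, field, values):
    """Return the documents whose field equals one of the given values."""
    wanted = set(values)
    return [doc for doc in docs if doc[field] in wanted]


def posts_by_user_name(name):
    # Query 1: find the IDs of the users with the given name.
    user_ids = [user["_id"] for user in search(USERS, "name", [name])]
    # Query 2: find the blog posts whose user field is one of those IDs.
    return search(BLOGPOSTS, "user", user_ids)


titles = [post["title"] for post in posts_by_user_name("John Smith")]
print(titles)
```

The join happens in `posts_by_user_name`: the result of the first lookup becomes the filter of the second one.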

denormalizing your data

In short, the recommended way to handle relationships between documents is a reasonable amount of data redundancy.

PUT /my_index/user/1
{
  "name":  "John Smith",
  "email": "john@smith.com",
  "dob":   "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body":  "It's complicated...",
  "user":  {
    "id":   1,
    "name": "John Smith"
  }
}

The advantage of data denormalization is speed. Each document contains all the information it needs, so no join is required.

field collapsing

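In short, field collapsing groups search results by a field, for example showing the top blog posts per user. It works on denormalized data, where JSON's nested structure keeps the user's fields together inside each document, for example:

{
  "blob": {
    "title": "...",
    "user": {
      "name": {
        "firstname": "ddd",
        "lastname":  "dddd"
      }
    }
  }
}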

denormalization and concurrency

nested objects

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body":  "Making your money work...",
  "tags":  [ "cash", "shares" ],
  "comments": [
    {
      "name":    "John Smith",
      "comment": "Great article",
      "age":     28,
      "stars":   4,
      "date":    "2014-09-01"
    },
    {
      "name":    "Alice White",
      "comment": "More like this please",
      "age":     31,
      "stars":   5,
      "date":    "2014-10-22"
    }
  ]
}

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Alice" }},
        { "match": { "age":  28      }}
      ]
    }
  }
}

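This query matches the document. The reason is that the inner objects are flattened at index time, so the relationship between the values of a single comment is lost: the term alice is there and the age 28 is there, but the index no longer records that they belong to different comments.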

{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ],
  "comments.name":    [ alice, john, smith, white ],
  "comments.comment": [ article, great, like, more, please, this ],
  "comments.age":     [ 28, 31 ],
  "comments.stars":   [ 4, 5 ],
  "comments.date":    [ 2014-09-01, 2014-10-22 ]
}
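The loss of association can be reproduced with a small Python sketch. The `flatten` helper below is a hypothetical simplification: it collects whole values under dotted field paths and does not tokenize or lowercase them as a real analyzer would.

```python
def flatten(doc, prefix=""):
    """Collect every value under its dotted field path, as flat indexing does."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Inner objects are merged into the parent under a dotted path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, set()).update(sub_values)
            else:
                flat.setdefault(path, set()).add(item)
    return flat


post = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}

flat = flatten(post)
# Both conditions hold on the flattened document, although no single
# comment has the name Alice White together with the age 28.
matches = "Alice White" in flat["comments.name"] and 28 in flat["comments.age"]
print(matches)
```

After flattening, `comments.name` and `comments.age` are two independent sets of values, which is why the cross-object match succeeds.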

How to fix it

When defining the mapping, set the type of the field to nested. Each nested object is then indexed as a separate hidden document:

{
  "comments.name":    [ john, smith ],
  "comments.comment": [ article, great ],
  "comments.age":     [ 28 ],
  "comments.stars":   [ 4 ],
  "comments.date":    [ 2014-09-01 ]
}
{
  "comments.name":    [ alice, white ],
  "comments.comment": [ like, more, please, this ],
  "comments.age":     [ 31 ],
  "comments.stars":   [ 5 ],
  "comments.date":    [ 2014-10-22 ]
}
{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ]
}

PUT /my_index
{
  "mappings": {
    "blogpost": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string" },
            "comment": { "type": "string" },
            "age":     { "type": "short"  },
            "stars":   { "type": "short"  },
            "date":    { "type": "date"   }
          }
        }
      }
    }
  }
}

Because nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query or nested filter to access them:

GET /my_index/blogpost/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "eggs" }},
        {
          "nested": {
            "path": "comments",
            "query": {
              "bool": {
                "must": [
                  { "match": { "comments.name": "john" }},
                  { "match": { "comments.age":  28     }}
                ]
              }
            }
          }
        }
      ]
    }
  }
}
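The per-object semantics of a nested query can be sketched in Python as follows. The `nested_match` helper is hypothetical and uses exact equality instead of analyzed text matching; it only illustrates that all conditions must hold within the same nested object.

```python
def nested_match(doc, path, conditions):
    """True if at least one object under path satisfies every condition."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in doc.get(path, [])
    )


post = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}

# Alice is 31, so no single comment satisfies both conditions.
cross_object = nested_match(post, "comments", {"name": "Alice White", "age": 28})
# John is 28, so his comment satisfies both conditions.
same_object = nested_match(post, "comments", {"name": "John Smith", "age": 28})
print(cross_object, same_object)
```

Unlike the flattened case, a value from one comment can no longer be combined with a value from another comment.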

sorting by nested fields

PUT /my_index/blogpost/2
{
  "title": "Investment secrets",
  "body":  "What they don't tell you ...",
  "tags":  [ "shares", "equities" ],
  "comments": [
    {
      "name":    "Mary Brown",
      "comment": "Lies, lies, lies",
      "age":     42,
      "stars":   1,
      "date":    "2014-10-18"
    },
    {
      "name":    "John Smith",
      "comment": "You're making it up!",
      "age":     28,
      "stars":   2,
      "date":    "2014-10-16"
    }
  ]
}

GET /_search
{
  "query": {
    "nested": {
      "path": "comments",
      "filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  },
  "sort": {
    "comments.stars": {
      "order": "asc",
      "mode":  "min",
      "nested_filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  }
}

Notes on this request: the nested clause in the query filters on the nested comments. The sort is on comments.stars, in ascending order, using the minimum value. The nested_filter in the sort clause is the same as the nested query in the main query clause.

Why do we need to repeat the query conditions in the nested_filter? The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. If we didn’t include the nested_filter clause, we would end up sorting based on any comments that the blog post has ever received, not just those received in October. (I did not understand this on first reading.)
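The effect of repeating the condition can be seen in a small Python simulation. The two sample posts and the helper names are made up for illustration; the sort uses the minimum stars value, as in the request above.

```python
posts = [
    {
        "title": "Post A",
        "comments": [
            {"stars": 1, "date": "2014-09-05"},  # received in September
            {"stars": 4, "date": "2014-10-10"},
        ],
    },
    {
        "title": "Post B",
        "comments": [
            {"stars": 2, "date": "2014-10-16"},
        ],
    },
]


def in_october(comment):
    # ISO-formatted dates compare correctly as plain strings.
    return "2014-10-01" <= comment["date"] < "2014-11-01"


def min_stars(post, comment_filter=None):
    """Sort value: the minimum stars over all (or only the filtered) comments."""
    comments = post["comments"]
    if comment_filter is not None:
        comments = [c for c in comments if comment_filter(c)]
    return min(c["stars"] for c in comments)


# Query step: keep the posts with at least one October comment.
hits = [p for p in posts if any(in_october(c) for c in p["comments"])]

# Without a filter, the September comment decides the order.
without_filter = [p["title"] for p in sorted(hits, key=min_stars)]
# With the filter, only the October comments are considered.
with_filter = [
    p["title"] for p in sorted(hits, key=lambda p: min_stars(p, in_october))
]
print(without_filter, with_filter)
```

Both posts match the query, but the query returns whole posts, so the sort step sees every comment of each post unless it is given the same filter again.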

nested aggregations

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "by_month": {
          "date_histogram": {
            "field":    "comments.date",
            "interval": "month",
            "format":   "yyyy-MM"
          },
          "aggs": {
            "avg_stars": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "age_group": {
          "histogram": {
            "field":    "comments.age",
            "interval": 10
          },
          "aggs": {
            "blogposts": {
              "reverse_nested": {},
              "aggs": {
                "tags": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Notes on this request: reverse_nested steps back from the nested objects to the root object. Without it, fields of the root object cannot be aggregated. tags is a field of the root object.

parent-child relationship

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}

finding parents by their children

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}

finding children by their parents

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}

children aggregation

GET /company/branch/_search?search_type=count
{
  "aggs": {
    "country": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": {
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": {
                "field": "employee.hobby"
              }
            }
          }
        }
      }
    }
  }
}

grandparents and grandchildren

The shard routing of the employee document would be decided by the parent ID (london), but the london document was routed to a shard by its own parent ID (uk). It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning. (Why? The grandparent and the child are on the same shard. Are the child and the grandchild on the same shard too? As the analysis below shows, they are not guaranteed to be, so the grandparent and the grandchild are not guaranteed to share a shard either. Co-location is not transitive by default.)

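routing: shard = hash(routing) % number_of_primary_shards, where routing defaults to the document ID, or to the parent ID for a child document

Grandparent: hash("uk")

Child: hash("uk")

Grandchild: hash("london"). The grandchild's shard therefore depends on hash("london"), which in general differs from hash("uk") (whether the two land on the same shard depends on the hash function and the number of shards), so routing=uk has to be added.

With routing=uk, the grandchild is routed by hash("uk"),

and all three generations end up on the same shard.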

Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:

PUT /company/employee/1?parent=london&routing=uk
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}
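The routing rule can be sketched in Python. This is only an illustration: `zlib.crc32` is a deterministic stand-in for Elasticsearch's own hash function, and the `shard_for` helper and the shard count of 5 are assumptions made for the example.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(doc_id, parent=None, routing=None, num_shards=NUM_SHARDS):
    """Pick a shard from the explicit routing, else the parent ID, else the ID."""
    key = routing or parent or doc_id
    # crc32 is a stand-in hash, not the function Elasticsearch uses.
    return zlib.crc32(key.encode("utf-8")) % num_shards


country = shard_for("uk")                                  # grandparent
branch = shard_for("london", parent="uk")                  # child
employee_default = shard_for("1", parent="london")         # grandchild, parent only
employee_routed = shard_for("1", parent="london", routing="uk")
print(country, branch, employee_default, employee_routed)
```

With only `parent=london`, the employee is placed by the hash of "london", which has no fixed relation to the shard of "uk". With `routing=uk`, all three documents are placed by the same key and therefore always share a shard.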

practical considerations

Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

You can check how much memory is being used by the parent-child cache by consulting the indices-stats API (for a summary at the index level) or the node-stats API (for a summary at the node level):

GET /_nodes/stats/indices/id_cache?human

multigenerations and concluding thoughts

The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:

The more joins you have, the worse performance will be.
Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

Use parent-child relationships sparingly, and only when there are many more children than parents.
Avoid using multiple parent-child joins in a single query.
Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
Keep the parent IDs short, so that they require less memory.

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.

Prefer the first two relationship techniques where possible.