handling relationships

Changes to a single document are ACID-compliant, but transactions involving multiple documents are not. There is no way to roll back the index to its previous state if part of a transaction fails.

application-side joins

In short, ES does not support join operations. You can still model simple relations and get the result you want in application code by querying twice.
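The "query twice" idea can be sketched in Python. This is only an illustration: USERS and POSTS are hypothetical in-memory stand-ins for two document types, and the two helper functions stand in for the two search requests an application would send.

```python
# Application-side join: look up the users first, then their blog posts.
# USERS and POSTS are hypothetical in-memory stand-ins for two ES types.
USERS = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "Alice White"},
]
POSTS = [
    {"title": "Relationships", "user": 1},
    {"title": "Nest eggs", "user": 2},
]


def find_user_ids(name_part):
    """First query: IDs of users whose name contains name_part."""
    needle = name_part.lower()
    return [u["id"] for u in USERS if needle in u["name"].lower()]


def find_posts_by_users(user_ids):
    """Second query: posts whose user field is one of user_ids."""
    wanted = set(user_ids)
    return [p for p in POSTS if p["user"] in wanted]


def posts_by_user_name(name_part):
    """The join itself happens here, in application code."""
    return find_posts_by_users(find_user_ids(name_part))


john_posts = posts_by_user_name("john")
```

The second lookup only needs the IDs returned by the first one, which is why this approach works best when the first query returns a small number of documents.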

denormalizing your data

In short, the recommended way to handle relationships between data is to use an appropriate amount of data redundancy.

PUT /my_index/user/1
{
  "name":  "John Smith",
  "email": "john@smith.com",
  "dob":   "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body":  "It's complicated...",
  "user":  {
    "id":   1,
    "name": "John Smith"
  }
}

The advantage of data denormalization is speed. Each document contains all the information it needs, so no join is required at query time.
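As a rough sketch of how an application might build such a document before indexing it, the Python helper below copies selected user fields into the blog post. The function name denormalize_post and its fields parameter are made up for this example.

```python
def denormalize_post(post, user, fields=("id", "name")):
    """Return a copy of post with the chosen user fields embedded.

    Hypothetical helper: only the fields needed at search time are copied,
    so the redundant data stays small.
    """
    doc = dict(post)
    doc["user"] = {field: user[field] for field in fields}
    return doc


user = {
    "id": 1,
    "name": "John Smith",
    "email": "john@smith.com",
    "dob": "1970/10/24",
}
post = {"title": "Relationships", "body": "It's complicated..."}

blogpost = denormalize_post(post, user)
```

Here blogpost carries the user's id and name, while email and dob stay in the user document only.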

field collapsing

In short, the data is folded up: JSON's nesting is used to split the data into blocks, for example:

{
  "blob": {
    "title": "..."
  },
  "user": {
    "name": {
      "firstname": "ddd",
      "lastname":  "dddd"
    }
  }
}

denormalization and concurrency

nested objects

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body":  "Making your money work...",
  "tags":  [ "cash", "shares" ],
  "comments": [
    {
      "name":    "John Smith",
      "comment": "Great article",
      "age":     28,
      "stars":   4,
      "date":    "2014-09-01"
    },
    {
      "name":    "Alice White",
      "comment": "More like this please",
      "age":     31,
      "stars":   5,
      "date":    "2014-10-22"
    }
  ]
}

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Alice" }},
        { "match": { "age":  28      }}
      ]
    }
  }
}

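This query matches the document, even though Alice is 31. The reason is that the inner objects are flattened at index time, which destroys the association between the values of each comment. The term alice is in the index and so is the age 28, but the fact that they belong to different comments is lost. The document is indexed like this:

{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ],
  "comments.name":    [ alice, john, smith, white ],
  "comments.comment": [ article, great, like, more, please, this ],
  "comments.age":     [ 28, 31 ],
  "comments.stars":   [ 4, 5 ],
  "comments.date":    [ 2014-09-01, 2014-10-22 ]
}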
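The lost association can be reproduced with a small Python sketch. It imitates only the flattening step and does no text analysis, so names are kept as whole strings rather than lowercase terms. The flatten function is an illustration, not Elasticsearch code.

```python
def flatten(doc, prefix=""):
    """Collect every leaf value under its dotted path, the way an
    object (non-nested) field is flattened at index time."""
    fields = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, set()).update(sub_values)
            else:
                fields.setdefault(path, set()).add(item)
    return fields


post = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}
flat = flatten(post)

# Alice is 31, yet both "Alice White" and 28 are present in the flattened
# fields, so a query for name=Alice AND age=28 matches the document.
cross_match = "Alice White" in flat["comments.name"] and 28 in flat["comments.age"]
```

After flattening, nothing records which name goes with which age, so the combination cannot be checked per comment.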
How to fix it

When defining the mapping, set the field's type to nested. Each nested object is then indexed as a separate document:

{
  "comments.name":    [ john, smith ],
  "comments.comment": [ article, great ],
  "comments.age":     [ 28 ],
  "comments.stars":   [ 4 ],
  "comments.date":    [ 2014-09-01 ]
}
{
  "comments.name":    [ alice, white ],
  "comments.comment": [ like, more, please, this ],
  "comments.age":     [ 31 ],
  "comments.stars":   [ 5 ],
  "comments.date":    [ 2014-10-22 ]
}
{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ]
}

PUT /my_index
{
  "mappings": {
    "blogpost": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string" },
            "comment": { "type": "string" },
            "age":     { "type": "short"  },
            "stars":   { "type": "short"  },
            "date":    { "type": "date"   }
          }
        }
      }
    }
  }
}

Because nested objects are indexed as separate hidden documents, we can't query them directly. Instead, we have to use the nested query or nested filter to access them:

GET /my_index/blogpost/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "eggs" }},
        {
          "nested": {
            "path": "comments",
            "query": {
              "bool": {
                "must": [
                  { "match": { "comments.name": "john" }},
                  { "match": { "comments.age":  28     }}
                ]
              }
            }
          }
        }
      ]
    }
  }
}

sorting by nested fields

PUT /my_index/blogpost/2
{
  "title": "Investment secrets",
  "body":  "What they don't tell you ...",
  "tags":  [ "shares", "equities" ],
  "comments": [
    {
      "name":    "Mary Brown",
      "comment": "Lies, lies, lies",
      "age":     42,
      "stars":   1,
      "date":    "2014-10-18"
    },
    {
      "name":    "John Smith",
      "comment": "You're making it up!",
      "age":     28,
      "stars":   2,
      "date":    "2014-10-16"
    }
  ]
}

GET /_search
{
  "query": {
    "nested": {
      "path": "comments",
      "filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  },
  "sort": {
    "comments.stars": {
      "order": "asc",
      "mode":  "min",
      "nested_filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  }
}

Notes on this request: the query part is a nested filter on comments. The sort clause sorts on comments.stars in ascending order ("order": "asc"), using the minimum value found in each post ("mode": "min"). The nested_filter in the sort clause is the same as the nested query in the main query clause.

Why do we need to repeat the query conditions in the nested_filter? The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. If we didn't include the nested_filter clause, we would end up sorting based on any comments that the blog post has ever received, not just those received in October. (I did not really understand this part when I first read it.)
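A small Python sketch may make the point clearer. It computes the sort value for the "Nest eggs" post from the earlier example, once over all of its comments and once over its October comments only. The sort_value function is a made-up stand-in for the sort mode "min" combined with a nested_filter, not Elasticsearch code.

```python
def sort_value(post, date_from=None, date_to=None):
    """Minimum stars over the post's comments (sort mode "min").

    When a date range is given, only comments with
    date_from <= date < date_to are considered, which is the role
    nested_filter plays in the sort clause.
    """
    stars = [
        comment["stars"]
        for comment in post["comments"]
        if date_from is None or date_from <= comment["date"] < date_to
    ]
    return min(stars, default=None)


nest_eggs = {
    "title": "Nest eggs",
    "comments": [
        {"stars": 4, "date": "2014-09-01"},
        {"stars": 5, "date": "2014-10-22"},
    ],
}

# Without a filter, the September comment decides the sort value.
unfiltered = sort_value(nest_eggs)
# With the October range, only the October comment is considered.
october = sort_value(nest_eggs, "2014-10-01", "2014-11-01")
```

The post matches the query because of its October comment, but without the filter it would be ranked by the 4 stars of a September comment instead of the 5 stars given in October.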

nested aggregations

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "by_month": {
          "date_histogram": {
            "field":    "comments.date",
            "interval": "month",
            "format":   "yyyy-MM"
          },
          "aggs": {
            "avg_stars": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "age_group": {
          "histogram": {
            "field":    "comments.age",
            "interval": 10
          },
          "aggs": {
            "blogposts": {
              "reverse_nested": {},
              "aggs": {
                "tags": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

In the second request, reverse_nested steps back out of the nested objects to the root object. Without reverse_nested, fields of the root object cannot be aggregated. Here tags is a field of the root object.

parent-child relationship

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}

finding parents by their children

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}
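What has_child returns can be imitated in a few lines of Python. This is a sketch only: the branch and employee records are sample data invented for the example (apart from Alice Smith), and the has_child function here is a stand-in for the query, not an Elasticsearch API.

```python
# Hypothetical sample data: each employee points at its parent branch.
BRANCHES = {
    "london": {"country": "UK"},
    "liverpool": {"country": "UK"},
}
EMPLOYEES = [
    {"name": "Alice Smith", "dob": "1970-10-24", "parent": "london"},
    {"name": "Mark Thomas", "dob": "1982-05-16", "parent": "london"},
    {"name": "Barry Smith", "dob": "1979-04-01", "parent": "liverpool"},
]


def has_child(child_matches):
    """IDs of branches that have at least one matching employee."""
    parent_ids = {e["parent"] for e in EMPLOYEES if child_matches(e)}
    return sorted(pid for pid in parent_ids if pid in BRANCHES)


# Branches with at least one employee born on or after 1980-01-01.
# ISO dates compare correctly as strings.
born_since_1980 = has_child(lambda e: e["dob"] >= "1980-01-01")
```

The result contains parents (branches), not the children that matched, which mirrors the first request above.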

finding children by their parents

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}

children aggregation

GET /company/branch/_search?search_type=count
{
  "aggs": {
    "country": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": {
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": {
                "field": "employee.hobby"
              }
            }
          }
        }
      }
    }
  }
}

grandparents and grandchildren

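The shard routing of the employee document would be decided by the parent ID (london), but the london document was routed to a shard by its own parent ID (uk). It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning. (Why? The grandparent and the child are on the same shard, and I assumed the child and the grandchild would be too, so shouldn't the grandchild end up with the grandparent by transitivity? The analysis below shows that the child and the grandchild are not necessarily on the same shard, so neither are the grandparent and the grandchild.)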

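routing: shard = hash(routing) % number_of_primary_shards

Grandparent (country uk): hash("uk")

Child (branch london, parent uk): hash("uk")

Grandchild (employee, parent london): hash("london"). The shard that stores the grandchild therefore depends on hash("london"), which in general points to a different shard than hash("uk"). Whether the two happen to coincide depends on the hash function and the number of shards. That is why routing=uk has to be added.

With routing=uk, the grandchild is routed by hash("uk") as well.

All three generations then live on the same shard.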

Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:

PUT /company/employee/1?parent=london&routing=uk
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}
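The routing rule can be checked with a short Python sketch. Two things here are assumptions made for illustration: the shard count of 5, and zlib.crc32 as the hash, which is only a stand-in for Elasticsearch's own hash function. The function names are made up as well.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(routing):
    """shard = hash(routing) % number_of_primary_shards.

    crc32 is a stand-in hash chosen for this sketch; Elasticsearch
    uses its own hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


def shard_of_document(doc_id, parent=None, routing=None):
    """Routing value: explicit routing first, then parent ID, then own ID."""
    return shard_for(routing or parent or doc_id)


country_shard = shard_of_document("uk")
branch_shard = shard_of_document("london", parent="uk")

# Without routing=uk the employee is placed by hash("london") ...
employee_default_shard = shard_of_document("1", parent="london")
# ... with routing=uk it is placed by hash("uk"), like the other two.
employee_shard = shard_of_document("1", parent="london", routing="uk")
```

Whatever hash function is used, country_shard, branch_shard and employee_shard are equal because all three are computed from the same routing value, while employee_default_shard depends on hash("london") and matches the others only by coincidence.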

practical considerations

Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

You can check how much memory is being used by the parent-child cache by consulting the indices-stats API (for a summary at the index level) or the node-stats API (for a summary at the node level):

GET /_nodes/stats/indices/id_cache?human 

multigenerations and concluding thoughts

The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:

  • The more joins you have, the worse performance will be.
  • Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

  • Use parent-child relationships sparingly (keep the hierarchy simple), and only when there are many more children than parents.
  • Avoid using multiple parent-child joins in a single query.
  • Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
  • Keep the parent IDs short, so that they require less memory.

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.

Prefer the first two kinds of relationship handling wherever possible.
