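Preface

In "Elasticsearch (2): Elasticsearch Core Concepts" I briefly mentioned how the document data model differs from the relational database model. This article walks through several commonly used data models in detail.

File search
This part models data that has a multi-level hierarchy, such as a file system.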

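1. Building the file system data
Create the fs index with a custom analyzer named paths that uses the path_hierarchy tokenizer: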
PUT /fs
{"settings": {"analysis": {"analyzer": {"paths": { "tokenizer": "path_hierarchy"}}}}
}

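How the path_hierarchy tokenizer works
The tokenizer emits one token for the path itself and one for each of its ancestors, so a document can later be matched by any parent directory:
/a/b/c/d --> path_hierarchy --> /a, /a/b, /a/b/c, /a/b/c/d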
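The tokenization above can be imitated in a few lines of Python. This is only a rough sketch for absolute paths with the default / delimiter. The function name path_hierarchy is chosen here for illustration and is not an Elasticsearch API.

```python
def path_hierarchy(path, delimiter="/"):
    """Return one token per ancestor of an absolute path, shortest first."""
    tokens = []
    current = ""
    for part in path.split(delimiter):
        if not part:
            continue  # skip the empty piece in front of the leading delimiter
        current += delimiter + part
        tokens.append(current)
    return tokens


print(path_hierarchy("/a/b/c/d"))
```

Every token is a prefix of the full path, which is what later allows a single term filter to match all files below a directory.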
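Next, define the mapping of the file type in the fs (filesystem) index. path is an exact keyword field, and its sub-field path.tree is analyzed with the paths analyzer: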
PUT /fs/_mapping/file
{"properties": {"name": { "type":  "keyword"},"path": { "type":  "keyword","fields": {"tree": { "type":     "text","analyzer": "paths"}}}}
}
PUT /fs/file/1
{"name":     "README.txt", "path":     "/workspace/projects/helloworld", "contents": "这是我的第一个elasticsearch程序"
}

2. Searching the file system
Requirement 1: find files whose contents include elasticsearch and that sit directly in the /workspace/projects/helloworld directory.
GET /fs/file/_search
{"query": {"bool": {"must": [{"match": {"contents": "elasticsearch"}},{"constant_score": {"filter": {"term": {"path": "/workspace/projects/helloworld"}}}}]}}
}
{"took": 2,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1.284885,"hits": [{"_index": "fs","_type": "file","_id": "1","_score": 1.284885,"_source": {"name": "README.txt","path": "/workspace/projects/helloworld","contents": "这是我的第一个elasticsearch程序"}}]}
}

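Requirement 2: find all files anywhere under /workspace whose contents include elasticsearch.

The path.tree sub-field of doc1 holds one token per ancestor directory, so each of the following terms points to doc1:
/workspace/projects/helloworld  doc1
/workspace/projects             doc1
/workspace                      doc1
A term filter on path.tree therefore matches every file below /workspace: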
GET /fs/file/_search
{"query": {"bool": {"must": [{"match": {"contents": "elasticsearch"}},{"constant_score": {"filter": {"term": {"path.tree": "/workspace"}}}}]}}
}
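The two searches differ only in the field used by the term filter. As a small sketch, the request body could be built like this in Python. The helper name build_file_search and its recursive flag are assumptions made for this example; the resulting dictionary mirrors the two request bodies shown in the article.

```python
import json


def build_file_search(keyword, directory, recursive=False):
    """Build the search body for files in (or below) a directory."""
    # "path" matches the exact directory only; "path.tree" holds one token
    # per ancestor directory, so it matches files at any depth below it.
    path_field = "path.tree" if recursive else "path"
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"contents": keyword}},
                    {"constant_score": {"filter": {"term": {path_field: directory}}}},
                ]
            }
        }
    }


body = build_file_search("elasticsearch", "/workspace", recursive=True)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET /fs/file/_search.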

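Nested relationships
1. An experiment that shows why nested objects are needed
Modeling with redundant (denormalized) data relies on the object type. Here we introduce a related type, the nested type.
The data model is a blog post together with its comments: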
PUT /website/blogs/6
{"title": "花无缺发表的一篇帖子","content":  "我是花无缺,大家要不要考虑一下投资房产和买股票的事情啊。。。","tags":  [ "投资", "理财" ],"comments": [ {"name":    "小鱼儿","comment": "什么股票啊?推荐一下呗","age":     28,"stars":   4,"date":    "2016-09-01"},{"name":    "黄药师","comment": "我喜欢投资房产,风险大,收益也大","age":     31,"stars":   5,"date":    "2016-10-22"}]
}

Search for blog posts that were commented on by 黄药师 at the age of 28:
GET /website/blogs/_search
{"query": {"bool": {"must": [{ "match": { "comments.name": "黄药师" }},{ "match": { "comments.age":  28      }} ]}}
}
{"took": 102,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1.8022683,"hits": [{"_index": "website","_type": "blogs","_id": "6","_score": 1.8022683,"_source": {"title": "花无缺发表的一篇帖子","content": "我是花无缺,大家要不要考虑一下投资房产和买股票的事情啊。。。","tags": ["投资","理财"],"comments": [{"name": "小鱼儿","comment": "什么股票啊?推荐一下呗","age": 28,"stars": 4,"date": "2016-09-01"},{"name": "黄药师","comment": "我喜欢投资房产,风险大,收益也大","age": 31,"stars": 5,"date": "2016-10-22"}]}}]}
}

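The post is returned, which is wrong: 黄药师 is 31, and the 28-year-old commenter is 小鱼儿.
The reason lies in how the object type is stored. Internally the document looks roughly like this: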
{"title":            [ "花无缺", "发表", "一篇", "帖子" ],"content":             [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],"tags":             [ "投资", "理财" ],"comments.name":    [ "小鱼儿", "黄药师" ],"comments.comment": [ "什么", "股票", "推荐", "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],"comments.age":     [ 28, 31 ],"comments.stars":   [ 4, 5 ],"comments.date":    [ 2016-09-01, 2016-10-22 ]
}

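The object type flattens an array of JSON objects into one multi-valued list per field, so the link between the name and the age of a single comment is lost.
The document therefore matches: comments.name contains 黄药师 and comments.age contains 28, even though the two values come from different comments.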
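The effect of flattening can be reproduced with a short Python sketch. It uses the two comments from the blog post above and exact value comparison instead of real text analysis. The function names are made up for this illustration.

```python
def flatten_comments(comments):
    """Imitate object-type storage: one multi-valued list per field."""
    flat = {}
    for comment in comments:
        for key, value in comment.items():
            flat.setdefault("comments." + key, []).append(value)
    return flat


def object_match(comments, name, age):
    """Match against the flattened lists, as the object type does."""
    flat = flatten_comments(comments)
    return name in flat["comments.name"] and age in flat["comments.age"]


def per_comment_match(comments, name, age):
    """Match only when one and the same comment has both values."""
    return any(c["name"] == name and c["age"] == age for c in comments)


comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]

print(object_match(comments, "黄药师", 28))       # True: the false positive
print(per_comment_match(comments, "黄药师", 28))  # False: the expected answer
```

The second function checks each comment on its own, which is the behavior the nested type is meant to provide.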
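2. Using the nested type to fix the problem caused by object flattening
Change the mapping so that comments is of type nested instead of object. The type of an existing field cannot be changed in place, so delete the index first (DELETE /website), create it with the mapping below, and index the blog post again: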
PUT /website
{"mappings": {"blogs": {"properties": {"comments": {"type": "nested", "properties": {"name":    { "type": "text"  },"comment": { "type": "text"  },"age":     { "type": "short"   },"stars":   { "type": "short"   },"date":    { "type": "date"    }}}}}}
}
With nested, each comment is indexed as its own hidden document, and the root document keeps the remaining fields:
{ "comments.name":    [ "小鱼儿" ],"comments.comment": [ "什么", "股票", "推荐" ],"comments.age":     [ 28 ],"comments.stars":   [ 4 ],"comments.date":    [ 2016-09-01 ]
}
{ "comments.name":    [ "黄药师" ],"comments.comment": [ "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],"comments.age":     [ 31 ],"comments.stars":   [ 5 ],"comments.date":    [ 2016-10-22 ]
}
{ "title":            [ "花无缺", "发表", "一篇", "帖子" ],"content":          [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],"tags":             [ "投资", "理财" ]
}

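Search again with a nested query. This time it behaves correctly: the post is no longer returned, because no single comment has both the name 黄药师 and the age 28.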
GET /website/blogs/_search
{"query": {"bool": {"must": [{"match": {"title": "花无缺"}},{"nested": {"path": "comments","score_mode": "max","query": {"bool": {"must": [{"match": {"comments.name": "黄药师"}},{"match": {"comments.age": 28}}]}}}}]}}
}

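score_mode accepts avg (the default), max, min, sum and none.
When a search matches several nested documents of one root document, score_mode controls how their scores are combined into a single score.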
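How the modes combine scores can be shown with a small Python sketch. The input scores are invented numbers, and combine_scores is a hypothetical helper, not Elasticsearch code. It only illustrates the arithmetic behind each mode.

```python
def combine_scores(scores, score_mode="avg"):
    """Combine the scores of the matching nested documents into one value."""
    if not scores:
        raise ValueError("at least one nested document must match")
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "none":
        return 0.0  # nested scores are ignored
    raise ValueError("unknown score_mode: " + score_mode)


nested_scores = [1.0, 2.0, 3.0]  # made-up scores of three matching comments
for mode in ("avg", "max", "min", "sum", "none"):
    print(mode, combine_scores(nested_scores, mode))
```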
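Next, aggregations over the data inside nested objects.
Aggregation requirement 1: bucket the comments by the month of their date, then compute the average stars rating of the comments in each month. The responses in this part evidently come from an index that holds two blog posts with four comments in total; the second post is not shown in this article.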
GET /website/blogs/_search
{"size": 0, "aggs": {"comments_path": {"nested": {"path": "comments"}, "aggs": {"group_by_comments_date": {"date_histogram": {"field": "comments.date","interval": "month","format": "yyyy-MM"},"aggs": {"avg_stars": {"avg": {"field": "comments.stars"}}}}}}}
}
{"took": 52,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 0,"hits": []},"aggregations": {"comments_path": {"doc_count": 4,"group_by_comments_date": {"buckets": [{"key_as_string": "2016-08","key": 1470009600000,"doc_count": 1,"avg_stars": {"value": 3}},{"key_as_string": "2016-09","key": 1472688000000,"doc_count": 2,"avg_stars": {"value": 4.5}},{"key_as_string": "2016-10","key": 1475280000000,"doc_count": 1,"avg_stars": {"value": 5}}]}}}
}
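What this aggregation computes can be sketched in plain Python. The example uses only the two comments of the blog post shown in this article, so its numbers are smaller than those in the response above. The function name avg_stars_by_month is an assumption for illustration.

```python
from collections import defaultdict

# The two comments of the blog post shown earlier (date and stars only).
comments = [
    {"date": "2016-09-01", "stars": 4},
    {"date": "2016-10-22", "stars": 5},
]


def avg_stars_by_month(comments):
    """Bucket comments by yyyy-MM and average the stars in each bucket."""
    buckets = defaultdict(list)
    for comment in comments:
        month = comment["date"][:7]  # "2016-09-01" -> "2016-09"
        buckets[month].append(comment["stars"])
    return {month: sum(stars) / len(stars) for month, stars in sorted(buckets.items())}


print(avg_stars_by_month(comments))
```

The outer nested aggregation in the request plays the role of iterating over comments rather than over blog posts.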

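When drilling down inside a nested aggregation, a reverse_nested aggregation steps back out to the root document, so that fields outside the nested object, such as tags, can be used for further bucketing. In the request below that aggregation is named reverse_path.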

GET /website/blogs/_search
{"size": 0,"aggs": {"comments_path": {"nested": {"path": "comments"},"aggs": {"group_by_comments_age": {"histogram": {"field": "comments.age","interval": 10},"aggs": {"reverse_path": {"reverse_nested": {}, "aggs": {"group_by_tags": {"terms": {"field": "tags.keyword"}}}}}}}}}
}
{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 0,"hits": []},"aggregations": {"comments_path": {"doc_count": 4,"group_by_comments_age": {"buckets": [{"key": 20,"doc_count": 1,"reverse_path": {"doc_count": 1,"group_by_tags": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "投资","doc_count": 1},{"key": "理财","doc_count": 1}]}}},{"key": 30,"doc_count": 3,"reverse_path": {"doc_count": 2,"group_by_tags": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "大侠","doc_count": 1},{"key": "投资","doc_count": 1},{"key": "理财","doc_count": 1},{"key": "练功","doc_count": 1}]}}}]}}}
}
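Parent-child relationships
The drawback of nested modeling is that it stores everything together in a redundant fashion: changing one comment means reindexing the whole document, so maintenance cost is high.
Parent-child modeling resembles third normal form in a relational database. Entities are stored separately and linked through a parent-child relation. The data does not have to live in one document, and updating a parent document or a child document does not affect the other.
This makes one-to-many relationships easy to maintain. As noted earlier, relational-style modeling with application-side joins performs poorly because it needs several searches. The parent-child model avoids that: although the entities are stored separately, Elasticsearch resolves the relation itself at search time and takes steps to keep the search fast.
Compared with the nested model, the advantage is that parent and child documents do not affect each other.
Key point: Elasticsearch keeps mapping metadata for the parent-child relation to keep queries fast, with one restriction: a parent and its children must live on the same shard.
Because related documents share a shard and the relation metadata is available there, a parent-child search never has to cross shards. Each shard resolves its own relations locally, which is what keeps the search fast.
Case study: R&D center staff management. An IT company has several R&D centers, and each center has several employees.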
PUT /company
{"mappings": {"rd_center": {},"employee": {"_parent": {"type": "rd_center" }}}
}

This is the core of parent-child modeling: types are related as parent and child, and _parent names the parent type.
POST /company/rd_center/_bulk
{ "index": { "_id": "1" }}
{ "name": "北京研发总部", "city": "北京", "country": "中国" }
{ "index": { "_id": "2" }}
{ "name": "上海研发中心", "city": "上海", "country": "中国" }
{ "index": { "_id": "3" }}
{ "name": "硅谷人工智能实验室", "city": "硅谷", "country": "美国" }

For shard routing, the rd_center document with id=1 is by default routed by its own id to some shard.
PUT /company/employee/1?parent=1
{"name":  "张三","birthday":   "1970-10-24","hobby": "爬山"
}

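parent=1 is what maintains the relation: it gives the id of this document's parent.
The parent-child relation thus guarantees that parent and child are stored on the same shard. The mechanism is ordinary document routing: the rd_center document and its employee documents both use the parent id as the routing value, so they land on the same shard.
In other words, the employee document with id=1 is not routed by its own id but by parent=1, the id of its parent document, and the routing mechanism keeps parent and children together on one shard.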
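The routing rule can be sketched in Python. Note the assumptions: zlib.crc32 merely stands in for the hash function that Elasticsearch uses internally, the shard count of 5 is taken from the "_shards" section of the responses in this article, and both function names are hypothetical.

```python
import zlib

NUM_SHARDS = 5  # taken from "_shards": {"total": 5} in the responses


def routing_value(doc_id, parent=None):
    """A child is routed by its parent id, any other document by its own id."""
    return parent if parent is not None else doc_id


def shard_for(value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (crc32 is only a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_shards


center_shard = shard_for(routing_value("1"))  # rd_center 1
employee_shards = [
    shard_for(routing_value(employee_id, parent="1"))  # employees of rd_center 1
    for employee_id in ("1", "2")
]

print(center_shard, employee_shards)
```

Whatever hash is used, children that share a routing value with their parent always end up on the parent's shard.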
POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "1" }}
{ "name": "李四", "birthday": "1982-05-16", "hobby": "游泳" }
{ "index": { "_id": 3, "parent": "2" }}
{ "name": "王二", "birthday": "1979-04-01", "hobby": "爬山" }
{ "index": { "_id": 4, "parent": "3" }}
{ "name": "赵五", "birthday": "1987-05-11", "hobby": "骑马" }
With the parent-child model in place, we can run searches and aggregations on it.
1. Find R&D centers that have employees born in or after 1980
GET /company/rd_center/_search
{"query": {"has_child": {"type": "employee","query": {"range": {"birthday": {"gte": "1980-01-01"}}}}}
}
{"took": 33,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}},{"_index": "company","_type": "rd_center","_id": "3","_score": 1,"_source": {"name": "硅谷人工智能实验室","city": "硅谷","country": "美国"}}]}
}

2. Find R&D centers that have an employee named 张三
GET /company/rd_center/_search
{"query": {"has_child": {"type":       "employee","query": {"match": {"name": "张三"}}}}
}
{"took": 2,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}}]}
}

3. Find R&D centers that have at least 2 employees
GET /company/rd_center/_search
{"query": {"has_child": {"type":         "employee","min_children": 2, "query": {"match_all": {}}}}
}
{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}}]}
}
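The logic of the three has_child searches above can be imitated with an in-memory join in Python. The data is copied from the bulk requests in this article. The has_child function below is a simplified stand-in written for this example and ignores scoring and text analysis.

```python
employees = [
    {"name": "张三", "birthday": "1970-10-24", "parent": "1"},
    {"name": "李四", "birthday": "1982-05-16", "parent": "1"},
    {"name": "王二", "birthday": "1979-04-01", "parent": "2"},
    {"name": "赵五", "birthday": "1987-05-11", "parent": "3"},
]


def has_child(predicate, min_children=1):
    """Return ids of rd_centers with at least min_children matching employees."""
    counts = {}
    for employee in employees:
        if predicate(employee):
            counts[employee["parent"]] = counts.get(employee["parent"], 0) + 1
    return sorted(cid for cid, n in counts.items() if n >= min_children)


# ISO dates compare correctly as plain strings.
print(has_child(lambda e: e["birthday"] >= "1980-01-01"))
print(has_child(lambda e: e["name"] == "张三"))
print(has_child(lambda e: True, min_children=2))
```

The returned ids agree with the hits in the three responses: centers 1 and 3, then center 1, then center 1.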

4. Find the employees of R&D centers located in China (country 中国)
GET /company/employee/_search
{"query": {"has_parent": {"parent_type": "rd_center","query": {"term": {"country.keyword": "中国"}}}}
}
{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 3,"max_score": 1,"hits": [{"_index": "company","_type": "employee","_id": "3","_score": 1,"_routing": "2","_parent": "2","_source": {"name": "王二","birthday": "1979-04-01","hobby": "爬山"}},{"_index": "company","_type": "employee","_id": "1","_score": 1,"_routing": "1","_parent": "1","_source": {"name": "张三","birthday": "1970-10-24","hobby": "爬山"}},{"_index": "company","_type": "employee","_id": "2","_score": 1,"_routing": "1","_parent": "1","_source": {"name": "李四","birthday": "1982-05-16","hobby": "游泳"}}]}
}

5. Count, for each country, how many employees have each hobby

GET /company/rd_center/_search
{"size": 0,"aggs": {"group_by_country": {"terms": {"field": "country.keyword"},"aggs": {"group_by_child_employee": {"children": {"type": "employee"},"aggs": {"group_by_hobby": {"terms": {"field": "hobby.keyword"}}}}}}}
}
{"took": 15,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 3,"max_score": 0,"hits": []},"aggregations": {"group_by_country": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "中国","doc_count": 2,"group_by_child_employee": {"doc_count": 3,"group_by_hobby": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "爬山","doc_count": 2},{"key": "游泳","doc_count": 1}]}}},{"key": "美国","doc_count": 1,"group_by_child_employee": {"doc_count": 1,"group_by_hobby": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "骑马","doc_count": 1}]}}}]}}
}
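The same counts can be reproduced by hand with a short Python sketch, which may help when reading the nested buckets of the response. The data comes from the bulk requests above. The helper name hobbies_by_country is invented for this example.

```python
from collections import Counter, defaultdict

# rd_center id -> country, from the rd_center bulk request
center_country = {"1": "中国", "2": "中国", "3": "美国"}

# (hobby, parent rd_center id) of each employee
employees = [("爬山", "1"), ("游泳", "1"), ("爬山", "2"), ("骑马", "3")]


def hobbies_by_country(center_country, employees):
    """Group employees by their center's country, then count each hobby."""
    result = defaultdict(Counter)
    for hobby, parent in employees:
        result[center_country[parent]][hobby] += 1
    return {country: dict(counter) for country, counter in result.items()}


print(hobbies_by_country(center_country, employees))
```

In the request, the terms aggregation on country.keyword corresponds to the outer grouping, the children aggregation to the step from centers to employees, and the inner terms aggregation to the hobby count.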
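Parent-child relationships: modeling and searching three levels (grandparent, parent, child)
The company index already exists from the previous example, so delete it first (DELETE /company) before creating it with the new mapping: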
PUT /company
{"mappings": {"country": {},"rd_center": {"_parent": {"type": "country" }},"employee": {"_parent": {"type": "rd_center" }}}
}

country -> rd_center -> employee: a three-level grandparent, parent and child model
POST /company/country/_bulk
{ "index": { "_id": "1" }}
{ "name": "中国" }
{ "index": { "_id": "2" }}
{ "name": "美国" }
POST /company/rd_center/_bulk
{ "index": { "_id": "1", "parent": "1" }}
{ "name": "北京研发总部" }
{ "index": { "_id": "2", "parent": "1" }}
{ "name": "上海研发中心" }
{ "index": { "_id": "3", "parent": "2" }}
{ "name": "硅谷人工智能实验室" }

PUT /company/employee/1?parent=1&routing=1
{"name":  "张三","dob":   "1970-10-24","hobby": "爬山"
} 

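About the routing parameter: it must equal the id of the grandparent, otherwise the three levels can be separated.
A country is routed by its own id. An rd_center that specifies parent is routed by the id of its country. If an employee also specified only parent, it would be routed by the id of its rd_center, and the three levels would not be guaranteed to sit on the same shard.
So the grandchild must set routing explicitly, to the id of its grandparent document.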
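The rule can be traced with a small Python sketch. The employee with id 4 under rd_center 3 is a hypothetical document used only for illustration, since the article indexes just one employee whose parent and grandparent ids happen to coincide. The function name effective_routing is also an assumption.

```python
def effective_routing(doc_id, parent=None, routing=None):
    """Explicit routing wins, then the parent id, then the document's own id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


country = effective_routing("2")                 # 美国, routed by its own id
rd_center = effective_routing("3", parent="2")   # routed by the country id

# Hypothetical employee 4 of rd_center 3:
parent_only = effective_routing("4", parent="3")                 # rd_center id
with_routing = effective_routing("4", parent="3", routing="2")   # country id

print(country, rd_center, parent_only, with_routing)
```

With parent alone, the employee's routing value differs from that of its country and rd_center. Setting routing to the country id makes all three levels share one routing value.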
Find the countries that have employees whose hobby is 爬山 (hiking):
GET /company/country/_search
{"query": {"has_child": {"type": "rd_center","query": {"has_child": {"type": "employee","query": {"match": {"hobby": "爬山"}}}}}}
}
{"took": 10,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "country","_id": "1","_score": 1,"_source": {"name": "中国"}}]}
}
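The two chained has_child clauses resolve the relation one level at a time. A rough Python equivalent, using the three-level data indexed above and exact matching in place of the match query, might look like this. The function name countries_with_hobby is made up for the sketch.

```python
countries = {"1": "中国", "2": "美国"}          # country id -> name
center_country = {"1": "1", "2": "1", "3": "2"}  # rd_center id -> country id
employees = [{"name": "张三", "hobby": "爬山", "parent": "1"}]


def countries_with_hobby(hobby):
    """Walk employee -> rd_center -> country, one has_child level per step."""
    # Inner has_child: rd_centers that have an employee with this hobby.
    centers = {e["parent"] for e in employees if e["hobby"] == hobby}
    # Outer has_child: countries that have one of those rd_centers.
    country_ids = {center_country[center] for center in centers}
    return sorted(countries[cid] for cid in country_ids)


print(countries_with_hobby("爬山"))
```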

Reposted from: https://www.cnblogs.com/wuzhiwei549/p/9113457.html
