Elasticsearch (33): Document Data Modeling in Practice: File Search and Nested Relationships

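Preface

"Elasticsearch (2): Elasticsearch Core Concepts" briefly mentioned how the document data model differs from a relational database data model. This article covers several commonly used data models in detail. It starts with file search: modeling data that has a multi-level hierarchy, such as a file system.

1. Building the file system data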

PUT /fs
{"settings": {"analysis": {"analyzer": {"paths": { "tokenizer": "path_hierarchy"}}}}
}

How the path_hierarchy tokenizer works

/a/b/c/d --> path_hierarchy --> /a, /a/b, /a/b/c, /a/b/c/d
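To make the tokenization concrete, here is a minimal Python sketch that mimics what path_hierarchy does to an absolute path. It is an illustration only, not the Elasticsearch implementation; the function name path_hierarchy and its delimiter parameter are made up for this example, and relative paths are not handled.

```python
def path_hierarchy(path, delimiter="/"):
    """Return every ancestor prefix of an absolute path, shortest first.

    Illustrative stand-in for the path_hierarchy tokenizer; assumes the
    path starts with the delimiter.
    """
    tokens = []
    current = ""
    for part in path.split(delimiter):
        if not part:
            continue  # skip the empty piece produced by the leading delimiter
        current += delimiter + part
        tokens.append(current)
    return tokens


print(path_hierarchy("/a/b/c/d"))
```

Each token is a directory that contains the file, which is what later allows a single term filter to match everything under a directory.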

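The index name fs stands for filesystem. The path field is mapped as keyword for exact directory matches, and its path.tree sub-field uses the paths analyzer defined above.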

PUT /fs/_mapping/file
{"properties": {"name": { "type":  "keyword"},"path": { "type":  "keyword","fields": {"tree": { "type":     "text","analyzer": "paths"}}}}
}

PUT /fs/file/1
{"name":     "README.txt", "path":     "/workspace/projects/helloworld", "contents": "这是我的第一个elasticsearch程序"
}

2. Searching the file system

File search requirement 1: find files whose contents include elasticsearch and that are located in the directory /workspace/projects/helloworld.

GET /fs/file/_search
{"query": {"bool": {"must": [{"match": {"contents": "elasticsearch"}},{"constant_score": {"filter": {"term": {"path": "/workspace/projects/helloworld"}}}}]}}
}

{"took": 2,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1.284885,"hits": [{"_index": "fs","_type": "file","_id": "1","_score": 1.284885,"_source": {"name": "README.txt","path": "/workspace/projects/helloworld","contents": "这是我的第一个elasticsearch程序"}}]}
}

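Search requirement 2: find all files under /workspace, at any depth, whose contents include elasticsearch.

The path.tree field of doc1 is indexed under each of its ancestor directories, so the inverted index looks like this:

/workspace/projects/helloworld doc1

/workspace/projects doc1

/workspace doc1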

GET /fs/file/_search
{"query": {"bool": {"must": [{"match": {"contents": "elasticsearch"}},{"constant_score": {"filter": {"term": {"path.tree": "/workspace"}}}}]}}
}
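The difference between the two term filters can be sketched in a few lines of Python. This is a toy in-memory model, not Elasticsearch code: doc "1" is the file from the article, doc "2" is a hypothetical extra file added for contrast, and all function names are made up for the example.

```python
def path_tree_tokens(path):
    """Ancestor prefixes of an absolute path, as path.tree would index them."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


docs = {
    "1": "/workspace/projects/helloworld",  # the document from the article
    "2": "/home/user/notes",                # hypothetical second document
}


def term_on_path(value):
    """keyword field: only the exact, complete path matches."""
    return sorted(doc_id for doc_id, path in docs.items() if path == value)


def term_on_path_tree(value):
    """path.tree field: any ancestor directory matches."""
    return sorted(
        doc_id for doc_id, path in docs.items()
        if value in path_tree_tokens(path)
    )


print(term_on_path("/workspace"))       # no exact match on the keyword field
print(term_on_path_tree("/workspace"))  # matches via the indexed prefix
```

So requirement 1 (files directly in one directory) fits the path keyword field, while requirement 2 (files anywhere below a directory) needs path.tree.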

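Nested relationships

1. An experiment that shows why nested objects are needed

Modeling with redundant (denormalized) data actually relies on the object type. Here we introduce another kind of object type: the nested object.

The data model is a blog post together with its comments.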

PUT /website/blogs/6
{"title": "花无缺发表的一篇帖子","content":  "我是花无缺，大家要不要考虑一下投资房产和买股票的事情啊。。。","tags":  [ "投资", "理财" ],"comments": [ {"name":    "小鱼儿","comment": "什么股票啊？推荐一下呗","age":     28,"stars":   4,"date":    "2016-09-01"},{"name":    "黄药师","comment": "我喜欢投资房产，风，险大收益也大","age":     31,"stars":   5,"date":    "2016-10-22"}]
}

Search for blog posts that have a comment from a 28-year-old 黄药师:

GET /website/blogs/_search
{"query": {"bool": {"must": [{ "match": { "comments.name": "黄药师" }},{ "match": { "comments.age":  28      }} ]}}
}

{"took": 102,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1.8022683,"hits": [{"_index": "website","_type": "blogs","_id": "6","_score": 1.8022683,"_source": {"title": "花无缺发表的一篇帖子","content": "我是花无缺，大家要不要考虑一下投资房产和买股票的事情啊。。。","tags": ["投资","理财"],"comments": [{"name": "小鱼儿","comment": "什么股票啊？推荐一下呗","age": 28,"stars": 4,"date": "2016-09-01"},{"name": "黄药师","comment": "我喜欢投资房产，风，险大收益也大","age": 31,"stars": 5,"date": "2016-10-22"}]}}]}
}

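The result looks wrong: the post is returned, yet 黄药师 is 31, and the 28-year-old commenter is 小鱼儿.

This is how the object type is stored under the hood: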

{"title":            [ "花无缺", "发表", "一篇", "帖子" ],"content":             [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],"tags":             [ "投资", "理财" ],"comments.name":    [ "小鱼儿", "黄药师" ],"comments.comment": [ "什么", "股票", "推荐", "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],"comments.age":     [ 28, 31 ],"comments.stars":   [ 4, 5 ],"comments.date":    [ 2016-09-01, 2016-10-22 ]
}

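Under the hood, the object type flattens the objects in a JSON array into one multi-valued list per field, so the link between the name and the age of a single comment is lost.

That is why the document is hit: name=黄药师 appears somewhere in comments.name and age=28 appears somewhere in comments.age, so both conditions are satisfied.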
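The false positive can be reproduced with a small Python sketch of the flattening. This only models the idea described above; flatten and object_match are hypothetical helper names, and only the name and age fields of the two comments are kept.

```python
from collections import defaultdict

# The two comments from the article, reduced to the fields used in the query.
comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def flatten(prefix, objects):
    """Collapse an array of objects into one value list per field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{prefix}.{key}"].append(value)
    return dict(flat)


def object_match(flat_doc, name, age):
    """Each condition is checked against its own list, independently."""
    return name in flat_doc["comments.name"] and age in flat_doc["comments.age"]


flat = flatten("comments", comments)
print(flat)
print(object_match(flat, "黄药师", 28))  # True, although no single comment has both
```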

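2. Introducing the nested object type to solve the problem caused by how the object type is stored

Change the mapping so that comments is of type nested instead of object. The type of an existing field cannot be changed in place, so the index has to be deleted and recreated with the new mapping, and the document indexed again.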

PUT /website
{"mappings": {"blogs": {"properties": {"comments": {"type": "nested", "properties": {"name":    { "type": "string"  },"comment": { "type": "string"  },"age":     { "type": "short"   },"stars":   { "type": "short"   },"date":    { "type": "date"    }}}}}}
}

With the nested type, each comment is indexed as its own hidden document, next to the root document:

{ "comments.name":    [ "小鱼儿" ],"comments.comment": [ "什么", "股票", "推荐" ],"comments.age":     [ 28 ],"comments.stars":   [ 4 ],"comments.date":    [ 2016-09-01 ]
}
{ "comments.name":    [ "黄药师" ],"comments.comment": [ "我", "喜欢", "投资", "房产", "风险", "收益", "大" ],"comments.age":     [ 31 ],"comments.stars":   [ 5 ],"comments.date":    [ 2016-10-22 ]
}
{ "title":            [ "花无缺", "发表", "一篇", "帖子" ],"content":          [ "我", "是", "花无缺", "大家", "要不要", "考虑", "一下", "投资", "房产", "买", "股票", "事情" ],"tags":             [ "投资", "理财" ]
}
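For comparison with the flattened object case, the following Python sketch models nested matching: every condition has to hold inside the same comment. It is a simplified illustration with a hypothetical helper name (nested_match), using only the name and age fields from the article's data.

```python
# Each comment is kept as its own unit, like a hidden nested document.
nested_comments = [
    {"name": "小鱼儿", "age": 28},
    {"name": "黄药师", "age": 31},
]


def nested_match(nested_docs, name, age):
    """True only if one single nested document satisfies both conditions."""
    return any(c["name"] == name and c["age"] == age for c in nested_docs)


print(nested_match(nested_comments, "黄药师", 28))  # False: no such comment
print(nested_match(nested_comments, "黄药师", 31))  # True
```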

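Search again, this time with a nested query. Now the result is correct: name and age must match within the same comment, so the post is no longer returned for a 28-year-old 黄药师.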

GET /website/blogs/_search
{"query": {"bool": {"must": [{"match": {"title": "花无缺"}},{"nested": {"path": "comments","score_mode": "max","query": {"bool": {"must": [{"match": {"comments.name": "黄药师"}},{"match": {"comments.age": 28}}]}}}}]}}
}

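score_mode: max, min, avg, sum, none; the default is avg

When a search hits several nested documents of the same root document, score_mode decides how their scores are combined into a single score.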
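As a rough illustration of what score_mode does, the sketch below combines the scores of several matching nested documents into one value. The function combine_scores is hypothetical and the sample scores are invented; besides the modes listed above, the nested query also accepts sum, and none is modeled here as contributing a score of 0.

```python
def combine_scores(scores, score_mode="avg"):
    """Combine the scores of matching nested documents into a single score."""
    if not scores:
        raise ValueError("at least one nested document must match")
    if score_mode == "none":
        return 0.0  # nested scores are ignored
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown score_mode: {score_mode}")


# Invented scores for two matching comments of one blog post.
nested_scores = [1.0, 3.0]
for mode in ("max", "min", "avg", "sum", "none"):
    print(mode, combine_scores(nested_scores, mode))
```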

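Next, aggregations over the data inside nested objects.

Aggregation requirement 1: bucket the comments by month of the comment date, then compute the average star rating of the comments in each month.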

GET /website/blogs/_search
{"size": 0, "aggs": {"comments_path": {"nested": {"path": "comments"}, "aggs": {"group_by_comments_date": {"date_histogram": {"field": "comments.date","interval": "month","format": "yyyy-MM"},"aggs": {"avg_stars": {"avg": {"field": "comments.stars"}}}}}}}
}

{"took": 52,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 0,"hits": []},"aggregations": {"comments_path": {"doc_count": 4,"group_by_comments_date": {"buckets": [{"key_as_string": "2016-08","key": 1470009600000,"doc_count": 1,"avg_stars": {"value": 3}},{"key_as_string": "2016-09","key": 1472688000000,"doc_count": 2,"avg_stars": {"value": 4.5}},{"key_as_string": "2016-10","key": 1475280000000,"doc_count": 1,"avg_stars": {"value": 5}}]}}}
}
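The logic of this nested date_histogram with an avg sub-aggregation can be sketched in plain Python. The sketch uses only the two comments of post 6 shown earlier, so its numbers differ from the response above, which evidently also covers a second post that the article does not show. The function name avg_stars_by_month is made up for the example.

```python
from collections import defaultdict

# Comments of post 6 from the article, reduced to date and stars.
comments = [
    {"date": "2016-09-01", "stars": 4},
    {"date": "2016-10-22", "stars": 5},
]


def avg_stars_by_month(nested_comments):
    """Bucket comments by yyyy-MM and average the stars in each bucket."""
    buckets = defaultdict(list)
    for comment in nested_comments:
        month = comment["date"][:7]  # "2016-09-01" -> "2016-09"
        buckets[month].append(comment["stars"])
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


print(avg_stars_by_month(comments))
```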

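When drilling down inside an aggregation on nested objects, you can use the reverse_nested aggregation (named reverse_path in the request below) to step back to the root document and keep drilling down on its other fields.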

GET /website/blogs/_search
{"size": 0,"aggs": {"comments_path": {"nested": {"path": "comments"},"aggs": {"group_by_comments_age": {"histogram": {"field": "comments.age","interval": 10},"aggs": {"reverse_path": {"reverse_nested": {}, "aggs": {"group_by_tags": {"terms": {"field": "tags.keyword"}}}}}}}}}
}

{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 0,"hits": []},"aggregations": {"comments_path": {"doc_count": 4,"group_by_comments_age": {"buckets": [{"key": 20,"doc_count": 1,"reverse_path": {"doc_count": 1,"group_by_tags": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "投资","doc_count": 1},{"key": "理财","doc_count": 1}]}}},{"key": 30,"doc_count": 3,"reverse_path": {"doc_count": 2,"group_by_tags": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "大侠","doc_count": 1},{"key": "投资","doc_count": 1},{"key": "理财","doc_count": 1},{"key": "练功","doc_count": 1}]}}}]}}}
}
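The following Python sketch imitates the nested, histogram, reverse_nested, terms chain on a tiny data set. It is a simplified model under stated assumptions: only post 6 from the article is included (reduced to tags and comment ages), and tags_by_comment_age is a hypothetical helper. The point it shows is that after stepping back to the root level, each post is counted once per age bucket, no matter how many of its comments fall into that bucket.

```python
from collections import Counter, defaultdict

# Post 6 from the article, reduced to tags and comment ages.
blogs = [
    {"id": "6", "tags": ["投资", "理财"], "comments": [{"age": 28}, {"age": 31}]},
]


def tags_by_comment_age(posts, interval=10):
    """Bucket comments by age, then count the tags of the owning posts."""
    parents_per_bucket = defaultdict(set)
    for index, post in enumerate(posts):
        for comment in post["comments"]:
            bucket = (comment["age"] // interval) * interval
            parents_per_bucket[bucket].add(index)  # a post counts once per bucket
    result = {}
    for bucket, indexes in sorted(parents_per_bucket.items()):
        counts = Counter()
        for index in indexes:
            counts.update(set(posts[index]["tags"]))
        result[bucket] = dict(counts)
    return result


print(tags_by_comment_age(blogs))
```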

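Parent-child relationships

Nested object modeling has a drawback: it works like redundant data, with everything stored together in one document, so the maintenance cost is fairly high. Changing a single nested object means reindexing the whole document.

Parent-child modeling resembles third-normal-form modeling in a relational database. The entities are split apart and linked through a parent-child association, so the data does not all have to live in one document, and updating a parent doc or a child doc does not affect the other.

This makes one-to-many relationships easy to maintain. As mentioned before, with relational-style modeling an application-side join performs poorly because it needs several searches. The parent-child model avoids that: although the entities are stored separately, es resolves the association for us at search time and takes measures to keep the search fast.

Compared with the nested model, the advantage of the parent-child model is that parent docs and child docs do not affect each other.

Key point: the metadata that maps the parent-child relationship is what keeps queries fast, but it comes with one restriction: a parent and its children must live on the same shard.

Because parent and child data sit on one shard, together with the metadata that maps their relationship, a parent-child search does not need to cross shards. Each shard resolves the relationship locally, which is why it performs well.

Case background: employee management for R&D centers. An IT company has several R&D centers, and each R&D center has several employees.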

PUT /company
{"mappings": {"rd_center": {},"employee": {"_parent": {"type": "rd_center" }}}
}

The core of parent-child modeling: the types have a parent-child relationship, and _parent specifies the parent type.

POST /company/rd_center/_bulk
{ "index": { "_id": "1" }}
{ "name": "北京研发总部", "city": "北京", "country": "中国" }
{ "index": { "_id": "2" }}
{ "name": "上海研发中心", "city": "上海", "country": "中国" }
{ "index": { "_id": "3" }}
{ "name": "硅谷人工智能实验室", "city": "硅谷", "country": "美国" }

During shard routing, the rd_center doc with id=1 is routed by its own id by default and ends up on some shard.

PUT /company/employee/1?parent=1
{"name":  "张三","birthday":   "1970-10-24","hobby": "爬山"
}

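The core of maintaining the relationship is parent=1, which specifies the id of this doc's parent doc.

The parent-child relationship then guarantees that the parent doc and the child doc are stored on the same shard. Internally this is still doc routing: the employee uses its parent id as the routing value, which is the same value the rd_center is routed by, so both land on one shard.

In other words, the employee doc with id=1 is not routed by its own id but by parent=1, the id of its parent doc. The underlying routing mechanism therefore ensures that parent and child data end up on the same shard.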
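The routing rule described above can be sketched as follows. This is a toy model, not Elasticsearch's actual code: zlib.crc32 stands in for the real hash function, the shard count of 5 is taken from the _shards.total value in the responses, and shard_for and routing_key are hypothetical names.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # matches "_shards": {"total": 5} in the responses


def shard_for(routing):
    """Map a routing value to a shard (crc32 is only a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_PRIMARY_SHARDS


def routing_key(doc_id, parent=None):
    """A child is routed by its parent id, any other doc by its own id."""
    return parent if parent is not None else doc_id


rd_center_shard = shard_for(routing_key("1"))              # rd_center 1
employee_1_shard = shard_for(routing_key("1", parent="1"))  # employee 1
employee_2_shard = shard_for(routing_key("2", parent="1"))  # employee 2

# Both employees of rd_center 1 land on the shard of their parent.
print(rd_center_shard == employee_1_shard == employee_2_shard)  # True
```

Because the child never uses its own id for routing, the same parent value has to be supplied again whenever the child doc is fetched, updated, or deleted.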

POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "1" }}
{ "name": "李四", "birthday": "1982-05-16", "hobby": "游泳" }
{ "index": { "_id": 3, "parent": "2" }}
{ "name": "王二", "birthday": "1979-04-01", "hobby": "爬山" }
{ "index": { "_id": 4, "parent": "3" }}
{ "name": "赵五", "birthday": "1987-05-11", "hobby": "骑马" }

With the parent-child data model in place, we can run searches and aggregations on it.

1. Search for R&D centers that have employees born in 1980 or later

GET /company/rd_center/_search
{"query": {"has_child": {"type": "employee","query": {"range": {"birthday": {"gte": "1980-01-01"}}}}}
}

{"took": 33,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 2,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}},{"_index": "company","_type": "rd_center","_id": "3","_score": 1,"_source": {"name": "硅谷人工智能实验室","city": "硅谷","country": "美国"}}]}
}

2. Search for R&D centers that have an employee named 张三

GET /company/rd_center/_search
{"query": {"has_child": {"type":       "employee","query": {"match": {"name": "张三"}}}}
}

{"took": 2,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}}]}
}

3. Search for R&D centers that have at least 2 employees

GET /company/rd_center/_search
{"query": {"has_child": {"type":         "employee","min_children": 2, "query": {"match_all": {}}}}
}

{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "rd_center","_id": "1","_score": 1,"_source": {"name": "北京研发总部","city": "北京","country": "中国"}}]}
}
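The three has_child searches above can be replayed on an in-memory copy of the article's data. The sketch below is only a model of the query semantics, not how es executes it; has_child here is a hypothetical Python function that takes a predicate instead of a query DSL.

```python
rd_centers = {"1": "北京研发总部", "2": "上海研发中心", "3": "硅谷人工智能实验室"}

employees = [
    {"id": "1", "parent": "1", "name": "张三", "birthday": "1970-10-24"},
    {"id": "2", "parent": "1", "name": "李四", "birthday": "1982-05-16"},
    {"id": "3", "parent": "2", "name": "王二", "birthday": "1979-04-01"},
    {"id": "4", "parent": "3", "name": "赵五", "birthday": "1987-05-11"},
]


def has_child(child_filter, min_children=1):
    """Return ids of rd_centers with at least min_children matching employees."""
    counts = {}
    for employee in employees:
        if child_filter(employee):
            counts[employee["parent"]] = counts.get(employee["parent"], 0) + 1
    return sorted(
        parent_id for parent_id, n in counts.items()
        if n >= min_children and parent_id in rd_centers
    )


# ISO dates compare correctly as strings.
print(has_child(lambda e: e["birthday"] >= "1980-01-01"))  # search 1
print(has_child(lambda e: e["name"] == "张三"))             # search 2
print(has_child(lambda e: True, min_children=2))           # search 3
```

The printed ids line up with the hits in the three responses above.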

4. Search for the employees of R&D centers located in China (country 中国)

GET /company/employee/_search
{"query": {"has_parent": {"parent_type": "rd_center","query": {"term": {"country.keyword": "中国"}}}}
}

{"took": 5,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 3,"max_score": 1,"hits": [{"_index": "company","_type": "employee","_id": "3","_score": 1,"_routing": "2","_parent": "2","_source": {"name": "王二","birthday": "1979-04-01","hobby": "爬山"}},{"_index": "company","_type": "employee","_id": "1","_score": 1,"_routing": "1","_parent": "1","_source": {"name": "张三","birthday": "1970-10-24","hobby": "爬山"}},{"_index": "company","_type": "employee","_id": "2","_score": 1,"_routing": "1","_parent": "1","_source": {"name": "李四","birthday": "1982-05-16","hobby": "游泳"}}]}
}

5. Count, for each country, how many employees have each hobby

GET /company/rd_center/_search
{"size": 0,"aggs": {"group_by_country": {"terms": {"field": "country.keyword"},"aggs": {"group_by_child_employee": {"children": {"type": "employee"},"aggs": {"group_by_hobby": {"terms": {"field": "hobby.keyword"}}}}}}}
}

{"took": 15,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 3,"max_score": 0,"hits": []},"aggregations": {"group_by_country": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "中国","doc_count": 2,"group_by_child_employee": {"doc_count": 3,"group_by_hobby": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "爬山","doc_count": 2},{"key": "游泳","doc_count": 1}]}}},{"key": "美国","doc_count": 1,"group_by_child_employee": {"doc_count": 1,"group_by_hobby": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "骑马","doc_count": 1}]}}}]}}
}
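What the terms, children, terms chain computes can be shown with a short Python sketch over the article's data. It models the result only; hobbies_by_country is a made-up name, and the rd_center docs are reduced to their country while the employees are reduced to parent id and hobby.

```python
from collections import Counter, defaultdict

# rd_center id -> country, as indexed in the article.
rd_center_country = {"1": "中国", "2": "中国", "3": "美国"}

# (parent rd_center id, hobby) for the four employees in the article.
employees = [("1", "爬山"), ("1", "游泳"), ("2", "爬山"), ("3", "骑马")]


def hobbies_by_country():
    """Group parents by country, then count the hobbies of their children."""
    result = defaultdict(Counter)
    for parent_id, hobby in employees:
        result[rd_center_country[parent_id]][hobby] += 1
    return {country: dict(counts) for country, counts in result.items()}


print(hobbies_by_country())
```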

Parent-child relationships: modeling and searching three levels of data (grandparent, parent, grandchild)

PUT /company
{"mappings": {"country": {},"rd_center": {"_parent": {"type": "country" }},"employee": {"_parent": {"type": "rd_center" }}}
}

country -> rd_center -> employee: a three-level data model

POST /company/country/_bulk
{ "index": { "_id": "1" }}
{ "name": "中国" }
{ "index": { "_id": "2" }}
{ "name": "美国" }

POST /company/rd_center/_bulk
{ "index": { "_id": "1", "parent": "1" }}
{ "name": "北京研发总部" }
{ "index": { "_id": "2", "parent": "1" }}
{ "name": "上海研发中心" }
{ "index": { "_id": "3", "parent": "2" }}
{ "name": "硅谷人工智能实验室" }

PUT /company/employee/1?parent=1&routing=1
{"name":  "张三","dob":   "1970-10-24","hobby": "爬山"
}

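About the routing parameter: it must be the same as the grandparent's routing value, otherwise the relationship breaks.

A country doc is routed by its own id. An rd_center doc specifies a parent, so it is routed by the country's id. If an employee doc also specified only a parent, it would be routed by the rd_center's id, and the three levels of data would not be guaranteed to sit on the same shard.

So the grandchild must specify routing manually, set to the id of the grandparent doc.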
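The routing problem with three levels can be traced with a small Python sketch. As before this is a toy model: zlib.crc32 is only a stand-in for the real hash, the shard count of 5 comes from the responses in this article, and effective_routing and shard_for are hypothetical names. The employee with id "9" under rd_center 3 is invented for the illustration.

```python
import zlib

NUM_PRIMARY_SHARDS = 5


def shard_for(routing):
    """Map a routing value to a shard (crc32 is only a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_PRIMARY_SHARDS


def effective_routing(doc_id, parent=None, routing=None):
    """Explicit routing wins, then the parent id, then the doc's own id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


country_key = effective_routing("2")             # country 2 (美国)
center_key = effective_routing("3", parent="2")  # rd_center 3, child of country 2

# Hypothetical employee "9" working at rd_center 3.
implicit_key = effective_routing("9", parent="3")
explicit_key = effective_routing("9", parent="3", routing="2")

print(country_key, center_key)  # both "2": country and rd_center share a shard
print(implicit_key)             # "3": hashed on its own, may differ from the shard of "2"
print(explicit_key)             # "2": same routing value as the grandparent
```

With only parent set, the employee is routed by "3" while its rd_center and country are routed by "2", so nothing guarantees a common shard. Passing routing explicitly restores that guarantee.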

Search for the countries that have employees whose hobby is 爬山 (mountain climbing)

GET /company/country/_search
{"query": {"has_child": {"type": "rd_center","query": {"has_child": {"type": "employee","query": {"match": {"hobby": "爬山"}}}}}}
}

{"took": 10,"timed_out": false,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 1,"hits": [{"_index": "company","_type": "country","_id": "1","_score": 1,"_source": {"name": "中国"}}]}
}

Reposted from: https://www.cnblogs.com/wuzhiwei549/p/9113457.html