Elasticsearch (ES) Aggregation Search (From Beginner to Expert) 4

1 Running an aggregation on an analyzed field returns an error

GET /test_index/test_type/_search
{
"aggs": {
"group_by_test_field": {
"terms": {
"field": "test_field"
}
}
}
}

{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "test_index",
"node": "4onsTYVZTjGvIj9_spWz2w",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
},
"status": 400
}

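What the error means
Running an aggregation directly on an analyzed (text) field fails. The message says that fielddata must be enabled first, so that the forward-index data (built by uninverting the inverted index) is loaded into memory. Only then can the analyzed field be aggregated, and doing so can use a lot of memory.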
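
The error message itself names the switch. As a rough sketch, assuming a release with `text` fields and reusing the index, type, and field names from the request above, the mapping update that enables fielddata could be assembled like this. The snippet only builds and prints the request; it does not contact a cluster.

```python
import json

# Names reused from the failing request above.
index, doc_type, field = "test_index", "test_type", "test_field"

# Mapping update that turns fielddata on for the analyzed field.
# Assumption: a release where the field type is "text" and "fielddata" is a boolean.
path = f"/{index}/_mapping/{doc_type}"
body = {"properties": {field: {"type": "text", "fielddata": True}}}

print("PUT", path)
print(json.dumps(body, indent=2))
```

Keep in mind the warning in the error text: fielddata for an analyzed field can take a significant amount of memory.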

Aggregating on a string field through its built-in non-analyzed sub-field

GET /test_index/test_type/_search
{
"size": 0,
"aggs": {
"group_by_test_field": {
"terms": {
"field": "test_field.keyword"
}
}
}
}

{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_test_field": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "test",
"doc_count": 2
}
]
}
}
}

An aggregation on a non-analyzed field runs directly; there is no need to set fielddata=true.

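How analyzed fields and fielddata work

Doc values are generated automatically at index time, and aggregations on non-analyzed fields use them automatically.

Analyzed fields have no doc values. At index time, no doc-value forward index is built for an analyzed field, because the analyzed output would take up too much space. This is why aggregating directly on an analyzed field is not supported by default and returns the error above.

With fielddata enabled, the forward index is loaded into memory, which costs memory. Analyzed strings have to be aggregated term by term, which takes more complex algorithms and operations. Done against disk and the OS cache, performance would be very poor.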
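
To make "uninverting the inverted index" concrete, here is a minimal sketch in plain Python. It is a toy model, not how Lucene stores anything: the dictionary-based index, the sample terms, and the function names `uninvert` and `terms_aggregation` are all made up for illustration. It turns a term-to-docs structure into a doc-to-terms structure and then counts terms over the matching docs.

```python
from collections import Counter, defaultdict

# Toy inverted index: term -> ids of the documents that contain it.
# (Hypothetical data; a real index is a Lucene structure, not a dict.)
inverted_index = {
    "hello": [1, 2],
    "world": [1],
    "test": [2, 3],
}

def uninvert(index):
    """Build a doc -> terms mapping (the forward view) from term -> docs."""
    forward = defaultdict(list)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            forward[doc_id].append(term)
    return dict(forward)

def terms_aggregation(forward, matching_docs):
    """Count, per term, how many of the matching documents contain it."""
    counts = Counter()
    for doc_id in matching_docs:
        counts.update(set(forward.get(doc_id, [])))
    return counts

fielddata = uninvert(inverted_index)
buckets = terms_aggregation(fielddata, matching_docs=[1, 2, 3])
print(fielddata)
print(buckets.most_common())
```

The aggregation needs the doc-to-terms direction, which is why the inverted index alone is not enough and why the whole forward structure has to sit somewhere fast.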

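2 fielddata memory control and the circuit breaker

1. Core principles of fielddata

Fielddata is loaded into memory lazily: it is loaded only when an aggregation is run on an analyzed field, and it is loaded at field level.
For a given field of a given index, the values of all docs are loaded, not just a few docs.
It is built at query time, not at index time.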

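2. fielddata memory limit

indices.fielddata.cache.size: 20% caps fielddata at 20% of the heap. Once fielddata exceeds this share, fielddata already in memory is evicted to make room.
The default is unbounded. Setting a limit caps memory use, but it can cause frequent evict and reload cycles, with heavy IO cost as well as memory fragmentation and GC pressure.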
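
The evict-and-reload cost is easy to see in a small simulation. This is only a sketch under simplifying assumptions: the class `BoundedFielddataCache`, the byte sizes, and the least-recently-used eviction order are illustrative and not taken from Elasticsearch's code. Two fields that cannot fit in the cache at the same time are aggregated alternately.

```python
from collections import OrderedDict

class BoundedFielddataCache:
    """Toy size-bounded cache that evicts least recently used entries."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # field name -> size in bytes
        self.loads = 0
        self.evictions = 0

    def get(self, field, size_bytes):
        if field in self.entries:
            self.entries.move_to_end(field)  # mark as recently used
            return "hit"
        # Evict old entries until the new field fits under the limit.
        while self.entries and self.used_bytes + size_bytes > self.max_bytes:
            _, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
            self.evictions += 1
        self.entries[field] = size_bytes
        self.used_bytes += size_bytes
        self.loads += 1
        return "load"

cache = BoundedFielddataCache(max_bytes=100)
accesses = [("title", 60), ("tags", 60), ("title", 60), ("tags", 60)]
results = [cache.get(field, size) for field, size in accesses]
print(results, "loads:", cache.loads, "evictions:", cache.evictions)
```

Every access turns into a reload because each field pushes the other one out, which is the thrashing the limit can cause when it is set too low for the working set.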

3. Monitoring fielddata memory usage

GET /_stats/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?level=indices&fields=*

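4. circuit breaker

If the fielddata loaded by a single query exceeds the total memory, the node runs out of memory (OOM).

The circuit breaker estimates how much fielddata a query would load. If the estimate exceeds the configured limit, the breaker trips and the query fails immediately.

indices.breaker.fielddata.limit: memory limit for fielddata, default 60%
indices.breaker.request.limit: memory limit for executing aggregations, default 40%
indices.breaker.total.limit: overall limit across the two above, keeping them within 70%

These defaults differ between Elasticsearch versions, so check the documentation for the release in use.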
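
The idea of checking an estimate before loading can be sketched in a few lines. This is an illustration only: the `CircuitBreaker` class, its `reserve` method, and the byte figures are hypothetical, and a single breaker is modeled rather than the fielddata, request, and total breakers together.

```python
class CircuitBreakingError(Exception):
    """Raised instead of attempting a load that would exceed the limit."""

class CircuitBreaker:
    def __init__(self, name, heap_bytes, limit_percent):
        self.name = name
        self.limit_bytes = heap_bytes * limit_percent // 100
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        """Account for an estimated load, or fail before it happens."""
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"[{self.name}] would use {self.used_bytes + estimated_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimated_bytes

# A 1000-byte "heap" with a 60% limit, so at most 600 bytes may be reserved.
breaker = CircuitBreaker("fielddata", heap_bytes=1000, limit_percent=60)
breaker.reserve(400)  # fits under the limit
try:
    breaker.reserve(300)  # 400 + 300 exceeds 600, so the query is rejected
    tripped = False
except CircuitBreakingError as err:
    tripped = True
    print("query rejected:", err)
```

The point of the design is that the failure is a rejected query, raised from an estimate, rather than an out-of-memory error after the data has already been loaded.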

3 fielddata preloading and eager global ordinals

1. fielddata preloading

POST /test_index/_mapping/test_type
{
"properties": {
"test_field": {
"type": "string",
"fielddata": {
"loading" : "eager"
}
}
}
}

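This moves fielddata generation and loading from query time to index time: when the inverted index is built, fielddata is generated alongside it and loaded into memory, so aggregations on the analyzed field become much faster.

Note that this mapping uses the older "string" type with the fielddata.loading syntax. On releases that use "text" fields (such as the one that produced the error in section 1) the mapping options differ ("fielddata": true, "eager_global_ordinals": true), so check the mapping documentation for your version.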

2. Preloading global ordinals

How global ordinals work

doc1: status1
doc2: status2
doc3: status2
doc4: status1

When a field has many repeated values, each distinct value is given a global ordinal:

status1 --> 0
status2 --> 1

doc1: 0
doc2: 1
doc3: 1
doc4: 0

Fielddata is built in this form as well. The benefit is that each repeated string is stored only once, which reduces memory use.
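
The same mapping can be reproduced with a short sketch. It is plain Python for illustration, with a hypothetical helper `build_ordinals`; it assigns ordinals in sorted term order and then aggregates over integers instead of strings.

```python
def build_ordinals(values):
    """Return the distinct terms and, per document, the ordinal of its term."""
    terms = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(terms)}
    return terms, [ordinal_of[value] for value in values]

# Values of doc1..doc4 from the example above.
doc_values = ["status1", "status2", "status2", "status1"]
terms, ordinals = build_ordinals(doc_values)

# Aggregate by ordinal: one integer counter per distinct term.
counts = [0] * len(terms)
for ordinal in ordinals:
    counts[ordinal] += 1

# Resolve ordinals back to strings only when reporting the buckets.
buckets = {terms[i]: count for i, count in enumerate(counts)}
print(terms, ordinals, buckets)
```

Each string is stored once in `terms`, and the per-document data shrinks to small integers.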

POST /test_index/_mapping/test_type
{
"properties": {
"test_field": {
"type": "string",
"fielddata": {
"loading" : "eager_global_ordinals"
}
}
}
}

4 Optimizing for huge numbers of buckets: from depth-first to breadth-first

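Number of reviews per actor --> number of reviews per film for each actor

The 10 actors with the most reviews --> for each of those actors, the 5 films with the most reviews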
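
Why the collection order matters for a nested top-N like this can be seen in a small simulation. The sketch below is plain Python, not Elasticsearch code; the review data and the function names are hypothetical. It counts how many buckets are materialized when the whole actor/film tree is built before pruning (depth-first) versus when the top actors are chosen first and film buckets are built only for them (breadth-first).

```python
from collections import Counter

# Hypothetical reviews as (actor, film) pairs.
reviews = [
    ("a1", "f1"), ("a1", "f1"), ("a1", "f2"),
    ("a2", "f3"), ("a2", "f3"),
    ("a3", "f4"),
    ("a4", "f5"),
]

def depth_first(reviews, top_actors, top_films):
    """Build every actor bucket and every film sub-bucket, then prune."""
    tree = {}
    for actor, film in reviews:
        tree.setdefault(actor, Counter())[film] += 1
    buckets_built = len(tree) + sum(len(films) for films in tree.values())
    totals = Counter({actor: sum(films.values()) for actor, films in tree.items()})
    result = {actor: tree[actor].most_common(top_films)
              for actor, _ in totals.most_common(top_actors)}
    return result, buckets_built

def breadth_first(reviews, top_actors, top_films):
    """Pick the top actors first, then build film buckets only for them."""
    actor_counts = Counter(actor for actor, _ in reviews)
    keep = [actor for actor, _ in actor_counts.most_common(top_actors)]
    films = {actor: Counter() for actor in keep}
    for actor, film in reviews:  # second pass over the same documents
        if actor in films:
            films[actor][film] += 1
    buckets_built = len(actor_counts) + sum(len(c) for c in films.values())
    result = {actor: films[actor].most_common(top_films) for actor in keep}
    return result, buckets_built

df_result, df_buckets = depth_first(reviews, top_actors=2, top_films=1)
bf_result, bf_buckets = breadth_first(reviews, top_actors=2, top_films=1)
print(df_result == bf_result, df_buckets, bf_buckets)
```

Both strategies return the same answer; the difference is how many buckets exist at the peak. With many actors and many films per actor, the tree built by the first approach grows far faster than the pruned one.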

{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10,
"collect_mode" : "breadth_first"
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "films",
"size" : 5
}
}
}
}
}
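}

"collect_mode" : "breadth_first" selects breadth-first collection: the top 10 actors are determined first, and only then are the top 5 films collected for each of them. With the default depth-first mode, the full tree of actor and film buckets is built before it is pruned.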
