ES Study Notes 8: Aggregations

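Aggregations in ES can be thought of as the GROUP BY of a relational database: documents that share a condition are grouped together, and each group is then analyzed separately.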

high-level concepts

To understand aggregation queries (analytics), you first need the following two key concepts.

Buckets

Collections of documents that meet a criterion.

Metrics

Statistics calculated on the documents in a bucket.

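GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color"
            }
        }
    }
}

Notes on the request: "aggs" marks this as an aggregation request; "colors" is the name of the aggregation, which we choose ourselves; the terms bucket with "field": "color" defines the grouping condition, here one bucket per color.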

You’ll notice that we used the count search_type. Because we don’t care about search results, only the aggregation totals, the count search_type will be faster because it omits the fetch phase.

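As covered in the notes on query execution, Elasticsearch runs a search in two phases: the query phase and the fetch phase. Here we need only the aggregation totals, not the matching documents, so search_type=count skips the fetch phase and makes the aggregation request more efficient.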

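{
...
   "hits": {
      "hits": []
   },
   "aggregations": {
      "colors": {
         "buckets": [
            {
               "key": "red",
               "doc_count": 4
            },
            {
               "key": "blue",
               "doc_count": 2
            },
            {
               "key": "green",
               "doc_count": 2
            }
         ]
      }
   }
}

Notes on the response: "hits" is empty because search_type=count skips the fetch phase; "colors" is the aggregation name we defined; each "key" is one color bucket (for example, red); "doc_count" is the number of documents that fall into that bucket.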
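To make the bucketing idea concrete, here is a minimal sketch in plain Python (not a call to Elasticsearch) of what a terms bucket computes. The sample documents and the helper name terms_agg are hypothetical, chosen so that the counts match the response above.

```python
from collections import Counter

# Hypothetical sample documents; only the "color" field matters here.
docs = [
    {"color": "red"}, {"color": "red"}, {"color": "red"}, {"color": "red"},
    {"color": "blue"}, {"color": "blue"},
    {"color": "green"}, {"color": "green"},
]

def terms_agg(documents, field):
    """Create one bucket per distinct field value and count its documents."""
    counts = Counter(doc[field] for doc in documents)
    # most_common() orders by descending count, similar to the default
    # doc_count ordering of a terms aggregation.
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]

buckets = terms_agg(docs, "color")
print(buckets[0])  # {'key': 'red', 'doc_count': 4}
```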

adding a metric to the mix

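GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": {
               "avg": {
                  "field": "price"
               }
            }
         }
      }
   }
}

Notes on the request: the inner "aggs" wraps the metric we want to compute inside each color bucket; "avg_price" is the name of the metric; the avg metric on "price" computes the average price of each group.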
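As a rough illustration of a metric nested in a bucket, the following Python sketch groups documents by color and averages the price within each group. The sample data and the function name avg_price_by are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample documents.
cars = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 15000},
    {"color": "blue", "price": 25000},
    {"color": "green", "price": 30000},
    {"color": "green", "price": 12000},
]

def avg_price_by(documents, field):
    """Bucket documents by `field`, then compute the avg metric per bucket."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[field]].append(doc["price"])
    return {key: sum(prices) / len(prices) for key, prices in groups.items()}

averages = avg_price_by(cars, "color")
print(averages["red"])  # 32500.0
```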

buckets inside buckets

Buckets can be nested, like GROUP BY color, make: documents are grouped first by color, then by make within each color.

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": {
               "avg": {
                  "field": "price"
               }
            },
            "make": {
               "terms": {
                  "field": "make"
               }
            }
         }
      }
   }
}

Note where avg_price sits: a metric is computed over the bucket that directly encloses it, so this average is per color, not per make.

one final modification

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": { "avg": { "field": "price" } },
            "make" : {
               "terms" : {
                  "field" : "make"
               },
               "aggs" : {
                  "min_price" : { "min": { "field": "price" } },
                  "max_price" : { "max": { "field": "price" } }
               }
            }
         }
      }
   }
}

The second nested "aggs" adds metrics that are computed over the documents grouped by both color and make: min_price is the lowest price and max_price the highest price in each (color, make) bucket.
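The two-level grouping can be sketched in plain Python as a dictionary of dictionaries. This is only an illustration of the semantics; the sample data is hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents.
cars = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "red", "make": "bmw", "price": 80000},
    {"color": "blue", "make": "toyota", "price": 15000},
    {"color": "blue", "make": "ford", "price": 25000},
    {"color": "green", "make": "ford", "price": 30000},
    {"color": "green", "make": "toyota", "price": 12000},
]

# Outer bucket: color. Inner bucket: make. Leaf: list of prices.
nested = defaultdict(lambda: defaultdict(list))
for car in cars:
    nested[car["color"]][car["make"]].append(car["price"])

# Metrics are computed on the innermost (color, make) bucket.
summary = {
    color: {
        make: {"min_price": min(prices), "max_price": max(prices)}
        for make, prices in makes.items()
    }
    for color, makes in nested.items()
}
print(summary["red"]["honda"])  # {'min_price': 10000, 'max_price': 20000}
```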

building bar charts

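{
   "aggs":{
      "price":{
         "histogram":{
            "field": "price",
            "interval": 20000
         },
         "aggs":{
            "revenue": {
               "sum": {
                  "field" : "price"
               }
            }
         }
      }
   }
}

With an interval of 20000, the resulting buckets cover the ranges [0-19999, 20000-39999, 40000-59999, 60000-79999, ...]. The revenue metric sums the price within each bucket.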

As you can see, our query is built around the price aggregation, which contains a histogram bucket. This bucket requires a numeric field to calculate buckets on, and an interval size. The interval defines how "wide" each bucket is. An interval of 20000 means we will have the ranges [0-19999, 20000-39999, ...].
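The way a value is assigned to a fixed-width bucket can be sketched as rounding it down to a multiple of the interval. The Python below is a simplified model for non-negative prices, with hypothetical sample values.

```python
from collections import defaultdict

# Hypothetical sample prices.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
interval = 20000

def bucket_key(value, interval):
    """Round a value down to the start of its bucket."""
    return (value // interval) * interval

revenue = defaultdict(int)    # sum metric per bucket
doc_count = defaultdict(int)  # number of documents per bucket
for price in prices:
    key = bucket_key(price, interval)
    revenue[key] += price
    doc_count[key] += 1

for key in sorted(revenue):
    print(key, doc_count[key], revenue[key])
# 0 3 37000
# 20000 4 95000
# 80000 1 80000
```

Note that the 40000 and 60000 buckets are absent because no price falls into them.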

If search is the most popular activity in Elasticsearch, building date histograms must be the second most popular. Why would you want to use a date histogram?

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd"
         }
      }
   }
}

returning empty buckets

Yep, that’s right. We are missing a few months! By default, the date_histogram (and histogram too) returns only buckets that have a nonzero document count.

Some months are missing because they contain no data, but more often than not we want every month to be shown, even the empty ones.

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,
            "extended_bounds" : {
               "min" : "2014-01-01",
               "max" : "2014-12-31"
            }
         }
      }
   }
}

"min_doc_count": 0 forces buckets to be returned even when they are empty. Why is extended_bounds still needed on top of that? Because by default Elasticsearch will return only buckets that are between the minimum and maximum value in your data. The extended_bounds parameter forces the entire year to be returned, so every month shows up.
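The interplay of the two parameters can be modeled in a few lines of Python. This is a simplified sketch, not Elasticsearch code: monthly_histogram and month_range are hypothetical helpers, and the sale dates are made-up sample data.

```python
from collections import Counter
from datetime import date

# Hypothetical sale dates.
sold = [
    date(2014, 10, 28), date(2014, 11, 5), date(2014, 5, 18), date(2014, 7, 2),
    date(2014, 8, 19), date(2014, 11, 5), date(2014, 1, 1), date(2014, 2, 12),
]

def month_range(first, last):
    """Yield (year, month) pairs from first to last, inclusive."""
    year, month = first
    while (year, month) <= last:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def monthly_histogram(dates, min_doc_count=1, extended_bounds=None):
    counts = Counter((d.year, d.month) for d in dates)
    lo, hi = min(counts), max(counts)  # default range: min..max of the data
    if extended_bounds is not None:    # widen the range if requested
        lo = min(lo, extended_bounds[0])
        hi = max(hi, extended_bounds[1])
    return [(ym, counts[ym]) for ym in month_range(lo, hi)
            if counts[ym] >= min_doc_count]

default = monthly_histogram(sold)
with_empty = monthly_histogram(sold, min_doc_count=0)
full_year = monthly_histogram(sold, min_doc_count=0,
                              extended_bounds=((2014, 1), (2014, 12)))
print(len(default), len(with_empty), len(full_year))  # 7 11 12
```

In this sample, min_doc_count=0 alone still stops at November, the last month with data; only the extended bounds bring December in.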

extended example

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "quarter",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,
            "extended_bounds" : {
               "min" : "2014-01-01",
               "max" : "2014-12-31"
            }
         },
         "aggs": {
            "per_make_sum": {
               "terms": {
                  "field": "make"
               },
               "aggs": {
                  "sum_price": {
                     "sum": { "field": "price" }
                  }
               }
            },
            "total_sum": {
               "sum": { "field": "price" }
            }
         }
      }
   }
}
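For intuition, the quarterly report requested above can be approximated in plain Python: one bucket per quarter, each holding a total and a per-make sum. The sales records and the helper quarter_of are hypothetical; pre-creating all four quarters stands in for min_doc_count 0 combined with extended_bounds.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records.
sales = [
    {"make": "honda", "price": 10000, "sold": date(2014, 10, 28)},
    {"make": "honda", "price": 20000, "sold": date(2014, 11, 5)},
    {"make": "ford", "price": 30000, "sold": date(2014, 5, 18)},
    {"make": "toyota", "price": 15000, "sold": date(2014, 7, 2)},
    {"make": "toyota", "price": 12000, "sold": date(2014, 8, 19)},
    {"make": "honda", "price": 20000, "sold": date(2014, 11, 5)},
    {"make": "bmw", "price": 80000, "sold": date(2014, 1, 1)},
    {"make": "ford", "price": 25000, "sold": date(2014, 2, 12)},
]

def quarter_of(d):
    """Map a date to its quarter number, 1 through 4."""
    return (d.month - 1) // 3 + 1

# One bucket per quarter, created up front so empty quarters would still appear.
report = {q: {"total_sum": 0, "per_make_sum": defaultdict(int)}
          for q in range(1, 5)}
for sale in sales:
    bucket = report[quarter_of(sale["sold"])]
    bucket["total_sum"] += sale["price"]
    bucket["per_make_sum"][sale["make"]] += sale["price"]

print(report[1]["total_sum"], dict(report[1]["per_make_sum"]))
# 105000 {'bmw': 80000, 'ford': 25000}
```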

scoping aggregations

GET /cars/transactions/_search
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color"
            }
        }
    }
}

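query and aggs sit at the same level of the request body; the aggregation runs over the documents matched by the query.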

global bucket

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        },
        "all": {
            "global" : {},
            "aggs" : {
                "avg_price": {
                    "avg" : { "field" : "price" }
                }
            }
        }
    }
}

single_avg_price is computed over all documents that match ford. The global bucket has no parameters. The avg_price metric nested inside it runs over all documents, not only those that match ford.
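The difference between the query scope and the global bucket can be sketched in Python as two averages over two different document sets. The sample data is hypothetical.

```python
# Hypothetical sample documents.
cars = [
    {"make": "honda", "price": 10000},
    {"make": "honda", "price": 20000},
    {"make": "ford", "price": 30000},
    {"make": "toyota", "price": 15000},
    {"make": "toyota", "price": 12000},
    {"make": "honda", "price": 20000},
    {"make": "bmw", "price": 80000},
    {"make": "ford", "price": 25000},
]

def avg_price(documents):
    return sum(d["price"] for d in documents) / len(documents)

# Query scope: only the documents matched by the query.
query_scope = [d for d in cars if d["make"] == "ford"]
single_avg_price = avg_price(query_scope)

# Global bucket: ignores the query and sees every document.
global_avg_price = avg_price(cars)

print(single_avg_price, global_avg_price)  # 27500.0 26500.0
```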

filtered query

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "filtered": {
            "filter": {
                "range": {
                    "price": {
                        "gte": 10000
                    }
                }
            }
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        }
    }
}

filter bucket

{
   "query":{
      "match": {
         "make": "ford"
      }
   },
   "aggs":{
      "recent_sales": {
         "filter": {
            "range": {
               "sold": {
                  "from": "now-1M"
               }
            }
         },
         "aggs": {
            "average_price":{
               "avg": {
                  "field": "price"
               }
            }
         }
      }
   }
}

Here the filter is used inside aggs. average_price is the average price of the documents that satisfy both the match query and the filter.

post filter

You may be thinking to yourself, "hmm…is there a way to filter just the search results but not the aggregation?" The answer is to use a post_filter.

When a filter should apply only to the search results and leave the aggregations untouched, use post_filter.

GET /cars/transactions/_search?search_type=count
{
    "query": {
        "match": {
            "make": "ford"
        }
    },
    "post_filter": {
        "term" : {
            "color" : "green"
        }
    },
    "aggs" : {
        "all_colors": {
            "terms" : { "field" : "color" }
        }
    }
}

recap

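Key points:

A filter inside a filtered query affects both the search results and the aggregations.

A filter bucket inside aggs affects only the aggregations.

A top-level post_filter (a sibling of query, not nested inside it) affects only the search results.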
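The three placements can be compared side by side with a small Python model. The search function below is a hypothetical stand-in for a search request: it returns the documents that would appear as hits and the documents that the aggregations would see. The sample data is made up.

```python
# Hypothetical sample documents.
cars = [
    {"make": "ford", "color": "green", "price": 30000},
    {"make": "ford", "color": "blue", "price": 25000},
    {"make": "honda", "color": "red", "price": 20000},
    {"make": "toyota", "color": "green", "price": 12000},
]

def search(docs, query, agg_filter=None, post_filter=None):
    """Return (hits, agg_docs) for a given combination of filters."""
    # The query (or a filtered query) narrows everything downstream.
    scope = [d for d in docs if query(d)]
    # A filter bucket narrows only what the aggregation sees.
    agg_docs = [d for d in scope if agg_filter is None or agg_filter(d)]
    # A post_filter narrows only the returned hits.
    hits = [d for d in scope if post_filter is None or post_filter(d)]
    return hits, agg_docs

def is_ford(d):
    return d["make"] == "ford"

def is_green(d):
    return d["color"] == "green"

hits_a, aggs_a = search(cars, is_ford)
hits_b, aggs_b = search(cars, is_ford, agg_filter=is_green)
hits_c, aggs_c = search(cars, is_ford, post_filter=is_green)

print(len(hits_a), len(aggs_a))  # 2 2
print(len(hits_b), len(aggs_b))  # 2 1
print(len(hits_c), len(aggs_c))  # 1 2
```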

sorting multivalue buckets

The buckets of a multivalue aggregation can be sorted. By default they are ordered by doc_count, descending.

intrinsic sorts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "_count" : "asc"
                }
            }
        }
    }
}

"_count": "asc" sorts the buckets by doc_count in ascending order.

We introduce an order object into the aggregation, which allows us to sort on one of several values:

_count

Sort by document count. Works with terms, histogram, date_histogram.

_term

Sort by the string value of a term alphabetically. Works only with terms.

_key

Sort by the numeric value of each bucket’s key (conceptually similar to _term). Works only with histogram and date_histogram.
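As a rough sketch of these intrinsic sorts, the Python below orders a list of buckets either by document count or by key. The helper order_buckets is hypothetical, and how Elasticsearch breaks ties between equal counts is not modeled here.

```python
# Buckets shaped like those in an aggregation response (sample values).
buckets = [
    {"key": "red", "doc_count": 4},
    {"key": "blue", "doc_count": 2},
    {"key": "green", "doc_count": 2},
]

def order_buckets(buckets, by, direction="asc"):
    """Sort by '_count' (doc_count) or by '_term' / '_key' (the bucket key)."""
    field = "doc_count" if by == "_count" else "key"
    return sorted(buckets, key=lambda b: b[field],
                  reverse=(direction == "desc"))

by_count = order_buckets(buckets, "_count", "asc")
by_term = order_buckets(buckets, "_term", "desc")

print([b["key"] for b in by_term])  # ['red', 'green', 'blue']
```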

sorting by a metric

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "avg_price" : "asc"
                }
            },
            "aggs": {
                "avg_price": {
                    "avg": {"field": "price"}
                }
            }
        }
    }
}

This lets you override the sort order with any metric, simply by referencing the name of the metric. Some metrics, however, emit multiple values. The extended_stats metric is a good example: it provides half a dozen individual metrics. To sort on one of them, reference it with a dot path, as in stats.variance:

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "stats.variance" : "asc"
                }
            },
            "aggs": {
                "stats": {
                    "extended_stats": {"field": "price"}
                }
            }
        }
    }
}
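Sorting buckets on one value of a multi-value metric, as with stats.variance, can be illustrated in Python. This sketch uses the population variance from the standard library as a stand-in for the variance value; the sample data is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical sample documents.
cars = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 15000},
    {"color": "blue", "price": 25000},
    {"color": "green", "price": 30000},
    {"color": "green", "price": 12000},
]

prices_by_color = defaultdict(list)
for car in cars:
    prices_by_color[car["color"]].append(car["price"])

# One metric value per bucket, then sort the buckets by it, ascending.
variance = {color: pvariance(prices)
            for color, prices in prices_by_color.items()}
ordered = sorted(variance, key=variance.get)

print(ordered)  # ['blue', 'green', 'red']
```

Red sorts last because its prices are the most spread out, even though it has the most documents.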

sorting based on "deep" metrics

finding distinct counts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color"
            }
        }
    }
}
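Conceptually, a distinct count is the size of the set of values, which the Python sketch below computes exactly on hypothetical sample data. Keep in mind that this is a swapped-in technique: the cardinality metric in Elasticsearch is an approximate count based on the HyperLogLog++ algorithm, so on large data sets its result may differ slightly from an exact set-based count.

```python
# Hypothetical sample documents.
cars = [
    {"color": "red"}, {"color": "red"}, {"color": "green"}, {"color": "blue"},
    {"color": "green"}, {"color": "red"}, {"color": "red"}, {"color": "blue"},
]

def distinct_count(documents, field):
    """Exact number of distinct values of `field` (set-based, not approximate)."""
    return len({doc[field] for doc in documents})

distinct_colors = distinct_count(cars, "color")
print(distinct_colors)  # 3
```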