Distributed Search Engine: ElasticSearch (Part 1)

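About the Author

The author is a third-year university student from Heyuan. The notes below are modest lessons from the author's self-study; corrections are welcome. They will keep being improved in the hope of helping more Java enthusiasts get started.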

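Contents

About the Author
Distributed Search Engine: ElasticSearch (Part 1)
- What is ElasticSearch
- ElasticSearch concepts
- The underlying index in ElasticSearch
- elasticsearch and relational databases (MySQL)
- Things to watch out for in elasticsearch
  - Cross-origin (CORS) issues
  - Sluggishness caused by high memory usage
  - elasticsearch and Kibana version mismatch
- The IK analyzer
  - Using the IK analyzer
  - Extending the IK dictionary
- elasticsearch operations (REST style)
  - Creating an index
  - Deleting an index
  - Inserting data (a document) into an index
  - Deleting a specific document from an index (by id)
  - Updating a specific document in an index
  - Deleting a specific document from an index
  - Creating field mappings
    - Field mappings can only be specified once at index creation
    - Adding fields to an index with "_mapping"
    - Migrating data with _reindex
  - Getting index information
  - Getting all records in an index (_search)
  - Getting a specific document from an index
  - Getting all data in an index (match_all:{})
  - match query (one field per match clause)
    - What happens if we add a second condition
  - Exact query (term) vs. full-text query (match)
  - multi_match: Baidu-style search across fields
  - Phrase (exact) search (match_phrase)
  - Choosing which fields are returned (_source)
  - Sorting with sort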

Distributed Search Engine: ElasticSearch (Part 1)

Note: these notes use ElasticSearch 7.6.1.

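What is ElasticSearch

ElasticSearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is written in Java and released as open source under the Apache license, and it is a popular enterprise search engine. It is designed for cloud environments and offers near-real-time search that is stable, reliable, fast, and easy to install and use.

Suppose we are building a website or application and want to add search, but building search from scratch is very hard. We want a search solution that is fast, needs zero configuration, and is completely free. We want to index data simply by sending JSON over HTTP. We want the search server to be always available. We want to start with one machine and scale to hundreds. We want real-time search and simple multi-tenancy, and we want a solution built for the cloud. Elasticsearch addresses all of these problems, and possibly more. (Excerpted from Baidu Baike.)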

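ElasticSearch concepts

elasticsearch is a near-real-time, distributed full-text search engine built on top of Lucene. Instead of a conventional forward index (similar to a MySQL index), it uses an inverted index, which makes full-text and keyword searches very fast.

All elasticsearch operations are sent as HTTP requests with JSON bodies.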

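The underlying index in ElasticSearch

MySQL's LIKE can be used for fuzzy search, but it is slow. A pattern with a leading wildcard, such as LIKE '%keyword%', cannot use the index, because a conventional (forward) index is organised around the complete column value and can only be searched by the full value or a prefix of it. elasticsearch's inverted index, by contrast, can be searched with just part of the text. elasticsearch runs each document through an analyzer (tokenizer) and builds an inverted index for every indexed field (the document id is what the index entries point to, so it is not tokenized). A query is then matched against the individual terms to find the documents that contain them.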

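For example, suppose the index hello contains three documents:

documentid    age    name
1             18     张三
2             20     李四
3             18     李四

The following inverted indexes are built.

First inverted index (age):

term    document ids
18      1, 3
20      2

Second inverted index (name):

term    document ids
张三    1
李四    2, 3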
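
The two inverted indexes in this example can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how Lucene stores its index: the function name build_inverted_indexes is made up for illustration, and whole field values are used as terms, whereas a real analyzer would first split text into tokens.

```python
from collections import defaultdict

# The three example documents, keyed by document id.
docs = {
    1: {"age": 18, "name": "张三"},
    2: {"age": 20, "name": "李四"},
    3: {"age": 18, "name": "李四"},
}


def build_inverted_indexes(documents):
    """Return {field: {term: [document ids]}}, one inverted index per field."""
    indexes = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(documents):
        for field, value in documents[doc_id].items():
            # Simplification: the whole value is one term (no tokenizing).
            indexes[field][value].append(doc_id)
    return indexes


indexes = build_inverted_indexes(docs)
print(dict(indexes["age"]))
print(dict(indexes["name"]))
```

Looking up a term is then a single dictionary access, which is why searching by term does not need to scan every document.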

elasticsearch and relational databases (MySQL)

For now, es and mysql can be compared roughly as follows:

MySQL              elasticsearch
database           index
table              type (deprecated in 7.x and due to be removed)
row (record)       document
column (field)     field

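Things to watch out for in elasticsearch

Cross-origin (CORS) issues

Open elasticsearch.yml in elasticsearch's config directory

and append the following at the bottom (allowing every origin with "*" is convenient for local development but too permissive for production):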

http.cors.enabled: true
http.cors.allow-origin: "*"

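Sluggishness caused by high memory usage

elasticsearch is resource-hungry. Its JVM configuration file shows that by default it starts with a 1 GB heap. We can change this.

Open elasticsearch's JVM configuration file, jvm.options

and find:

# initial (minimum) heap size
-Xms1g
# maximum heap size
-Xmx1g

Change them to:

-Xms256m
-Xmx512m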

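elasticsearch and Kibana version mismatch

If startup fails or other unexplained errors appear, check that es and Kibana are the same version. For example, if es is 7.6, Kibana must be 7.6 as well.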

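The IK analyzer

Using the IK analyzer

IK is a Chinese analyzer (tokenizer). Some words, such as personal names, are not in its dictionary and are not kept as whole words, so we can extend it.

To use it, download the IK analyzer plugin and unpack it into a directory named ik under elasticsearch's plugins directory.

IK provides two analyzers: ik_smart and ik_max_word.

ik_smart: coarsest split (as few tokens as possible)

ik_max_word: finest split (as many tokens as possible)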

=============================

ik_smart:

# _analyze is a fixed endpoint name
GET _analyze
{"text": ["分布式搜索"],"analyzer": "ik_smart"}

ik_max_word:

GET _analyze
{"text": ["分布式搜索"],"analyzer": "ik_max_word"}
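
To get a feel for "fewest tokens" versus "most tokens", here is a toy dictionary-based splitter in Python. It is a rough sketch under loud assumptions: the three-word DICTIONARY is hypothetical, the function names coarse_split and fine_split are made up, and the logic is not the IK plugin's actual algorithm, so real IK output may differ.

```python
# Hypothetical mini dictionary; the real IK plugin ships a much larger one.
DICTIONARY = {"分布式", "分布", "搜索"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def coarse_split(text):
    """Greedy longest match: as few tokens as possible (the spirit of ik_smart)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word fits.
            if candidate in DICTIONARY or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


def fine_split(text):
    """Emit every dictionary word found at any position (the spirit of ik_max_word)."""
    tokens = []
    for i in range(len(text)):
        for size in range(MAX_WORD_LEN, 0, -1):
            candidate = text[i:i + size]
            if len(candidate) == size and candidate in DICTIONARY:
                tokens.append(candidate)
    return tokens


print(coarse_split("分布式搜索"))
print(fine_split("分布式搜索"))
```

The fine-grained split produces overlapping tokens, which increases recall at search time at the cost of a larger index.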

Extending the IK dictionary

GET _analyze
{"text": ["我是张三，very nice"],"analyzer": "ik_max_word"}

The name is not kept as one token. We can add a dictionary file that lists the words we want recognised.

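1. Find the IKAnalyzer.cfg.xml file in the IK plugin directory

<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- Users can configure their own extension dictionaries here; put the name of any .dic file you created, such as my.dic -->
  <entry key="ext_dict">my.dic</entry>
  <!-- Users can configure their own extension stop-word dictionaries here -->
  <entry key="ext_stopwords"></entry>
  <!-- Users can configure a remote extension dictionary here -->
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
  <!-- Users can configure a remote extension stop-word dictionary here -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>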

2. Create my.dic and add the words you want recognised

For example, to have “张三” treated as a single word, put it on its own line in my.dic.

3. Restart all services

GET _analyze
{"text": ["我是张三，very nice"],"analyzer": "ik_max_word"}

{"tokens" : [{"token" : "我","start_offset" : 0,"end_offset" : 1,"type" : "CN_CHAR","position" : 0},{"token" : "是","start_offset" : 1,"end_offset" : 2,"type" : "CN_CHAR","position" : 1},{"token" : "张三","start_offset" : 2,"end_offset" : 5,"type" : "CN_WORD","position" : 2},{"token" : "very","start_offset" : 6,"end_offset" : 10,"type" : "ENGLISH","position" : 3},{"token" : "nice","start_offset" : 11,"end_offset" : 15,"type" : "ENGLISH","position" : 4}]
}

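elasticsearch operations (REST style)

The operations below use Kibana as the console for talking to es; Postman works too.

method    URL                                                         description
PUT       localhost:9200/index_name/type_name/document_id             create a document (with a given id)
POST      localhost:9200/index_name/type_name                         create a document (random id)
POST      localhost:9200/index_name/type_name/document_id/_update     update a document
DELETE    localhost:9200/index_name/type_name/document_id             delete a document
GET       localhost:9200/index_name/type_name/document_id             get a document by id
POST      localhost:9200/index_name/type_name/_search                 search all documents

9200 is the default HTTP port of es. In 7.x the type name is always _doc, and an update is written as POST index_name/_update/document_id, which is the form used in the examples below.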

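At first sight elasticsearch seems to differ from the RESTful convention that many tutorials teach, where PUT updates data and POST adds data. In fact it follows HTTP semantics: PUT to a URL that includes the document id creates the document, or replaces it if it already exists, while POST without an id creates a new document under an id generated by es.

The difference between PUT and POST:

PUT is idempotent and POST is not. No matter how many times the same PUT is sent, the stored document ends up the same (only its _version increases). POST without an id can be thought of as generating a uuid-style id on each call; every call gets a different id and creates another document, so POST is not idempotent.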
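
The idempotence difference can be illustrated with a tiny in-memory model. This is a sketch only: TinyIndex is a made-up class, not an Elasticsearch client, and uuid4 merely stands in for whatever id es generates for a POST without an id.

```python
import uuid


class TinyIndex:
    """In-memory stand-in for an index (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        # Like PUT index/_doc/<id>: the caller chooses the id,
        # so repeating the call overwrites the same document.
        self.docs[doc_id] = doc
        return doc_id

    def post(self, doc):
        # Like POST index/_doc: a fresh id is generated on every call.
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = doc
        return doc_id


index = TinyIndex()
for _ in range(3):
    index.put("1", {"name": "yzj", "age": 18})
count_after_put = len(index.docs)

for _ in range(3):
    index.post({"name": "yzj", "age": 18})
count_after_post = len(index.docs)

print(count_after_put, count_after_post)
```

Three identical PUT calls leave one document, while three identical POST calls add three more.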

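Creating an index

Template: PUT /index_name

Example 1:

Create an empty index named hello03. Documents are added later: creating one with PUT requires a document id (record id), while POST can omit it, because es then assigns a random id (POST is not idempotent).

# An empty request body creates the index without any data, settings or mappings
PUT /hello03
{}

Response: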

{"acknowledged" : true,"shards_acknowledged" : true,"index" : "hello03"
}

Deleting an index

DELETE hello01

Inserting data (a document) into an index

PUT /hello03/_doc/1
{"name": "yzj","age" : 18}

Response:

{"_index" : "hello03","_type" : "_doc","_id" : "1","_version" : 1,"result" : "created","_shards" : {"total" : 2,"successful" : 1,"failed" : 0},"_seq_no" : 0,"_primary_term" : 1
}

Next, look at the index information for hello03:

{
  "state": "open",
  "settings": {
    "index": {
      "creation_date": "1618408917052",
      "number_of_shards": "1",
      "number_of_replicas": "1",
      "uuid": "OEVNL7cCQgG74KMPG5LjLA",
      "version": {
        "created": "7060199"
      },
      "provided_name": "hello03"
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"    // the name.keyword sub-field is a keyword (not analyzed); name itself is text
            }
          }
        },
        "age": {
          "type": "long"  // age was mapped as long
        }
      }
    }
  },
  "aliases": [ ],
  "primary_terms": {
    "0": 1
  },
  "in_sync_allocations": {
    "0": [
      "17d4jyS9RgGEVid4rIANQA"
    ]
  }
}

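As we can see, when no field type is specified, es falls back to its defaults.

For example, name above was mapped as text (analyzed with the default standard analyzer, which splits Chinese into single characters) plus a name.keyword sub-field that is not analyzed.

So it is well worth specifying the types, and the analyzer, when the index is created.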
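
The defaults seen in the hello03 mapping can be summarised as a small function. This is a simplified sketch, assuming only the basic JSON value types; the name default_mapping is made up, and real dynamic mapping has more rules (for example date detection on strings) that are ignored here.

```python
def default_mapping(value):
    """Roughly mimic the dynamic-mapping default for one JSON value."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become analyzed text plus a non-analyzed keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value type: {type(value).__name__}")


document = {"name": "yzj", "age": 18}
inferred = {field: default_mapping(value) for field, value in document.items()}
print(inferred)
```

Note that a number sent as a string, such as "3998", would be mapped as text rather than as a numeric type.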

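Deleting a specific document from an index (by id)

DELETE hello01/_doc/004

Updating a specific document in an index

POST hello02/_update/001
{"doc": {"d2":"Java"}}

Deleting a specific document from an index

DELETE hello02/_doc/001

Creating field mappings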

PUT /hello05
{"mappings": {"properties": {"name":{"type": "text","analyzer": "ik_max_word"},"say":{"type": "text","analyzer": "ik_max_word"}}}
}

Look at the index information for hello05:

{
  "state": "open",
  "settings": {
    "index": {
      "creation_date": "1618410744334",
      "number_of_shards": "1",
      "number_of_replicas": "1",
      "uuid": "isCuH2wTQ8S3Yw2MSspvGA",
      "version": {
        "created": "7060199"
      },
      "provided_name": "hello05"
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "analyzer": "ik_max_word",     // the mapping we specified was applied
          "type": "text"
        },
        "say": {
          "analyzer": "ik_max_word",
          "type": "text"
        }
      }
    }
  },
  "aliases": [ ],
  "primary_terms": {
    "0": 1
  },
  "in_sync_allocations": {
    "0": [
      "lh6O9N8KQNKtLqD3PSU-Fg"
    ]
  }
}

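Field mappings can only be specified once at index creation

Now try to send the mappings to the hello05 index again, this time with an extra age field: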

PUT /hello05
{"mappings": {"properties": {"name":{"type": "text","analyzer": "ik_max_word"},"say":{"type": "text","analyzer": "ik_max_word"},"age":{"type": "integer"}}}
}

The request fails:

{"error" : {"root_cause" : [{"type" : "resource_already_exists_exception","reason" : "index [hello05/isCuH2wTQ8S3Yw2MSspvGA] already exists","index_uuid" : "isCuH2wTQ8S3Yw2MSspvGA","index" : "hello05"}],"type" : "resource_already_exists_exception","reason" : "index [hello05/isCuH2wTQ8S3Yw2MSspvGA] already exists","index_uuid" : "isCuH2wTQ8S3Yw2MSspvGA","index" : "hello05"},"status" : 400
}

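Important:

PUT /hello05 is a create-index request, so it is rejected because the index already exists. More generally, once a field's mapping has been created, es builds the inverted index from it and that mapping cannot be modified afterwards. What we can do is add new fields, or create a new index and use reindex to copy the old index's data into it.

So think the mappings through carefully when creating an index.

Otherwise, any field left unspecified ends up with the defaults es picks.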

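Adding fields to an index with "_mapping"

As noted above, existing field mappings cannot be modified, but new fields can still be added. This uses a slightly different request.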

PUT hello05/_mapping
{"properties": {"ls":{"type": "keyword"}}}
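
The "add but never change" rule can be modelled in a few lines. This is a toy sketch: MappingSketch and its put_mapping method are hypothetical names, and the rule is simplified (real es does allow a handful of mapping parameters to be updated in place).

```python
class MappingSketch:
    """Toy model of an index mapping; not the Elasticsearch implementation."""

    def __init__(self, properties):
        self.properties = dict(properties)

    def put_mapping(self, new_properties):
        # Reject any attempt to redefine a field that already exists.
        for field, definition in new_properties.items():
            existing = self.properties.get(field)
            if existing is not None and existing != definition:
                raise ValueError(f"cannot change mapping of existing field [{field}]")
        # Genuinely new fields are simply added.
        self.properties.update(new_properties)


mapping = MappingSketch({
    "name": {"type": "text", "analyzer": "ik_max_word"},
    "say": {"type": "text", "analyzer": "ik_max_word"},
})

mapping.put_mapping({"ls": {"type": "keyword"}})  # allowed: ls is a new field

try:
    mapping.put_mapping({"name": {"type": "keyword"}})  # rejected: name exists
    changed = True
except ValueError:
    changed = False

print(sorted(mapping.properties), changed)
```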

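Migrating data with _reindex

Use case: after the mapping has been set, we find that a few fields need to be "modified". We can create a new index with the fields defined the way we want, then import all of the old index's data into it. (The "type" key in the request below is deprecated in 7.x, which is why the response starts with a warning; it can be left out.)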

POST _reindex
{"source": {"index": "hello05","type": "_doc"}, "dest": {"index": "hello06"}}

#! Deprecation: [types removal] Specifying types in reindex requests is deprecated.
{"took" : 36,"timed_out" : false,"total" : 5,"updated" : 0,"created" : 5,"deleted" : 0,"batches" : 1,"version_conflicts" : 0,"noops" : 0,"retries" : {"bulk" : 0,"search" : 0},"throttled_millis" : 0,"requests_per_second" : -1.0,"throttled_until_millis" : 0,"failures" : [ ]
}
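
The whole migration is two requests: create the new index with the desired mappings, then copy the data. The sketch below only builds those requests as Python data and prints them; it does not talk to a server. The helper name reindex_plan is made up, and the new mappings reuse the name, say and age definitions from the earlier hello05 attempt.

```python
import json


def reindex_plan(old_index, new_index, new_properties):
    """Return the (method, path, body) requests for a create-then-copy migration."""
    return [
        # Step 1: create the new index with the mappings we actually want.
        ("PUT", f"/{new_index}", {"mappings": {"properties": new_properties}}),
        # Step 2: copy every document from the old index into the new one.
        ("POST", "/_reindex", {"source": {"index": old_index}, "dest": {"index": new_index}}),
    ]


plan = reindex_plan(
    "hello05",
    "hello06",
    {
        "name": {"type": "text", "analyzer": "ik_max_word"},
        "say": {"type": "text", "analyzer": "ik_max_word"},
        "age": {"type": "integer"},
    },
)

for method, path, body in plan:
    print(method, path)
    print(json.dumps(body, ensure_ascii=False))
```

The reindex body here leaves out "type", so it would not trigger the types-removal deprecation warning.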

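Getting index information

GET hello05

Getting all records in an index (_search)

GET hello05/_search
{"query": {"match_all": {}}
}

Getting a specific document from an index

GET hello05/_doc/1

Getting all data in an index (match_all:{})

GET hello05/_search
{
}

An empty body is equivalent to an explicit match_all:

GET hello05/_search
{"query": {"match_all": {}}}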

match query (one field per match clause)

A match query runs the search text through the analyzer before matching.

# "李" is the search text
GET hello05/_search
{"query": {"match": {"name": "李"}}}

{"took" : 1,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.9395274,"hits" : [{"_index" : "hello05","_type" : "_doc","_id" : "2","_score" : 0.9395274,"_source" : {"name" : "李四","age" : 3}},{"_index" : "hello05","_type" : "_doc","_id" : "4","_score" : 0.79423964,"_source" : {"name" : "李小龙","age" : 45}}]}
}

What happens if we add a second condition

GET hello05/_search
{"query": {"match": {"name": "李", "age": 45}}}

This fails, because a match clause accepts only one field. Multiple conditions can be combined with query, bool, must.

{"error" : {"root_cause" : [{"type" : "parsing_exception","reason" : "[match] query doesn't support multiple fields, found [name] and [age]","line" : 6,"col" : 18}],"type" : "parsing_exception","reason" : "[match] query doesn't support multiple fields, found [name] and [age]","line" : 6,"col" : 18},"status" : 400
}
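
A bool query with a must list is the usual way to require several conditions at once. The sketch below builds such a request body in Python and prints it as JSON, so it can be pasted under GET hello05/_search. The helper name bool_must is made up; the field names and values come from the failing request above.

```python
import json


def bool_must(*clauses):
    """Wrap any number of query clauses so that all of them must match."""
    return {"query": {"bool": {"must": list(clauses)}}}


body = bool_must(
    {"match": {"name": "李"}},  # first condition: name matches "李"
    {"match": {"age": 45}},     # second condition: age is 45
)

print(json.dumps(body, ensure_ascii=False, indent=2))
```

Each condition lives in its own match clause, which is exactly what the single-field restriction requires.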

Exact query (term) vs. full-text query (match)

match:

GET hello05/_search
{"query": {"match": {"name": "李龙"}}}

{"took" : 0,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 2.0519087,"hits" : [{"_index" : "hello05","_type" : "_doc","_id" : "4","_score" : 2.0519087,"_source" : {"name" : "李小龙","age" : 45}},{"_index" : "hello05","_type" : "_doc","_id" : "2","_score" : 0.9395274,"_source" : {"name" : "李四","age" : 3}}]}
}

==================

term :

GET hello05/_search
{"query": {"term": {"name": "李龙"}}}

{"took" : 0,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 0,"relation" : "eq"},"max_score" : null,"hits" : [ ]}
}

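The differences:

1. The search text of a match query is run through the analyzer first, and the resulting tokens are then looked up in the inverted index (slightly more work than term). That is why "李龙" matches both documents: each of them contains at least one of its tokens.

2. The search text of a term query is not analyzed. It is looked up in the inverted index as is, which is cheaper. No indexed token equals "李龙" as a whole, so the term query above returns nothing.

3. Like match, a term clause accepts only one field.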
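
The two behaviours can be simulated on a tiny inverted index. This is a sketch under stated assumptions: it uses only the two documents visible in the responses above, and the analyze function emits one token per character, which is a simplification (the real tokens depend on the analyzer configured in the mapping). The function names are made up.

```python
from collections import defaultdict

# Names of the two documents that appear in the responses above.
docs = {2: "李四", 4: "李小龙"}


def analyze(text):
    # Assumption: one token per character.
    return list(text)


# Build the inverted index: token -> set of document ids.
inverted = defaultdict(set)
for doc_id, name in docs.items():
    for token in analyze(name):
        inverted[token].add(doc_id)


def match(query):
    """Analyze the query, then return documents containing any of its tokens."""
    hits = set()
    for token in analyze(query):
        hits |= inverted.get(token, set())
    return hits


def term(query):
    """Look the query up as one unanalyzed term."""
    return set(inverted.get(query, set()))


print(match("李龙"))
print(term("李龙"))
print(term("李"))
```

match finds both documents through the tokens 李 and 龙, while term finds nothing because "李龙" was never stored as a single token.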

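multi_match: Baidu-style search across fields

The difference between match and multi_match is that match searches the text in a single field, while multi_match can search it in several fields.

For example, to take the input 李小龙 and search both the title and content fields, we need multi_match; a plain match cannot do it.

Simulating a JD.com product search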

PUT /goods
{"mappings": {"properties": {"title":{"analyzer": "standard","type" : "text"},"content":{"analyzer": "standard","type": "text"}}}}

# "华为" is analyzed, then searched for in both title and content
GET goods/_search
{"query": {"multi_match": {"query": "华为","fields": ["title","content"]}}}

{"took" : 1,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 1.1568705,"hits" : [{"_index" : "goods","_type" : "_doc","_id" : "2","_score" : 1.1568705,"_source" : {"title" : "华为Mate30","content" : "华为Mate30 8+128G，麒麟990Soc","price" : "3998"}},{"_index" : "goods","_type" : "_doc","_id" : "1","_score" : 1.0173018,"_source" : {"title" : "华为P40","content" : "华为P40 8+256G，麒麟990Soc，贼牛逼","price" : "4999"}}]}
}

Phrase (exact) search (match_phrase)

GET goods/_search
{"query": {"match_phrase": {"content": "华为P40手机"}}}

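No data is found. match_phrase is a phrase search: every term of the analyzed phrase must appear in the field, adjacent and in the same order. No content contains "手机", so "华为P40手机" matches nothing.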

{"took" : 0,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 0,"relation" : "eq"},"max_score" : null,"hits" : [ ]}
}
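
Phrase matching can be pictured as looking for the query tokens as one consecutive run inside the field's tokens. The sketch below is an approximation: tokenize only imitates the standard analyzer roughly (one token per Chinese character, lowercased runs of letters and digits), and both function names are made up. The content string is the one stored for document 1.

```python
import re


def tokenize(text):
    # Rough imitation of the standard analyzer: each CJK character is a token,
    # and each run of ASCII letters/digits is a lowercased token.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def phrase_match(text, phrase):
    """True if the phrase tokens occur consecutively and in order in the text."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    if not wanted:
        return False
    return any(
        tokens[i:i + len(wanted)] == wanted
        for i in range(len(tokens) - len(wanted) + 1)
    )


content = "华为P40 8+256G，麒麟990Soc，贼牛逼"

print(phrase_match(content, "华为P40"))
print(phrase_match(content, "华为P40手机"))
print(phrase_match(content, "P40华为"))
```

The second call fails because 手 and 机 never occur in the content, and the third fails because the order is wrong.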

Choosing which fields are returned (_source)

By default elasticsearch returns every field, much like MySQL's select * from xxx. We can narrow this down, similar to select id,name from xxx.

# _source limits the returned fields to title and content
GET goods/_search
{"query": {"multi_match": {"query": "华为","fields": ["title","content"]}}, "_source": ["title","content"]}

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 1.1568705,"hits" : [{"_index" : "goods","_type" : "_doc","_id" : "2","_score" : 1.1568705,"_source" : {"title" : "华为Mate30","content" : "华为Mate30 8+128G，麒麟990Soc"}},{"_index" : "goods","_type" : "_doc","_id" : "1","_score" : 1.0173018,"_source" : {"title" : "华为P40","content" : "华为P40 8+256G，麒麟990Soc，贼牛逼"}}]}
}
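
Conceptually, _source filtering is a projection applied to each hit. A minimal Python sketch, using the two goods documents from the responses above (filter_source is a made-up helper, not part of any client library):

```python
# The stored documents, as returned in full by the earlier multi_match query.
hits = [
    {"title": "华为Mate30", "content": "华为Mate30 8+128G，麒麟990Soc", "price": "3998"},
    {"title": "华为P40", "content": "华为P40 8+256G，麒麟990Soc，贼牛逼", "price": "4999"},
]


def filter_source(doc, fields):
    """Keep only the requested fields, like select title,content instead of select *."""
    return {name: doc[name] for name in fields if name in doc}


projected = [filter_source(doc, ["title", "content"]) for doc in hits]
for doc in projected:
    print(doc)
```

The price field is dropped from every hit, which matches the response shown above.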

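Sorting with sort

Because of a mistake in the earlier mapping design, price was never declared. It was indexed from a string value, so dynamic mapping made it a text field, which cannot be sorted on and does not support a numeric range filter. So we add another field, od, with a numeric value.

POST goods/_update/1
{"doc": {"od":1}
}

(Documents 2, 3 and 4 are updated the same way.)

# "order" is asc for ascending or desc for descending
GET goods/_search
{"query": {"multi_match": {"query": "华为","fields": ["title","content"]}}, "sort": [{"od": {"order": "desc"}}]}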
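
Even where sorting on a string field is possible (for example through a keyword sub-field), strings are ordered character by character, not by numeric value. The short sketch below shows the difference in plain Python. The value "10999" is made up to expose the problem, and the od values for the second document are hypothetical, since the text only shows od for document 1.

```python
# Prices stored as strings; "10999" is a hypothetical extra value.
prices_as_text = ["3998", "4999", "10999"]

lexicographic = sorted(prices_as_text)       # compares character by character
numeric = sorted(prices_as_text, key=int)    # compares by numeric value

print(lexicographic)
print(numeric)

# Sorting hits by a numeric field, descending, like "sort": [{"od": {"order": "desc"}}].
hits = [{"id": "1", "od": 1}, {"id": "2", "od": 2}]  # od of document 2 is assumed
by_od_desc = sorted(hits, key=lambda doc: doc["od"], reverse=True)
print([doc["id"] for doc in by_od_desc])
```

This is why a field that will be sorted or range-filtered should be mapped with a numeric type when the index is created.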
