Elasticsearch custom pinyin and ik analyzers
Contents
- 1 Corpus mapping OpenAPI
- 1.1 Define the index (mapping) interface
- 1.2 Implement the index (mapping) service
- 1.3 Add the controller
- 1.4 Create the mapping
- 2 Corpus document OpenAPI
- 2.1 Define the bulk-add document interface
- 2.2 Implement bulk-add documents
- 2.3 Define the bulk-add document controller
- 2.4 Make the bulk-add call
1 Corpus mapping OpenAPI
Environment preparation:
- Download the ik and pinyin analysis plugins and put them in the corresponding directories under the Elasticsearch plugins folder.
Check in Kibana: GET /_cat/plugins?v&s=component&h=name,component,version,description
Result:
name component version description
WIN-A5KARTU1A65 analysis-ik 7.10.1 IK Analyzer for Elasticsearch
WPhvS8c analysis-pinyin 7.10.1 Pinyin Analysis for Elasticsearch
- Define a pinyin step that runs after ik tokenization, i.e. a custom analyzer named ik_pinyin_analyzer:
PUT test_index
{
  "settings": {
    "number_of_shards": "1",
    "index.refresh_interval": "15s",
    "index": {
      "analysis": {
        "analyzer": {
          "ik_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": "pinyin_filter"
          }
        },
        "filter": {
          "pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": false
          }
        }
      }
    }
  }
}
The goal of the sections below is to reproduce this setup through the Java API.
The tokenizer here is ik_smart: text is first segmented by ik, and each resulting token is then passed through the pinyin token filter.
Test it:
POST test_index/_analyze
{
  "analyzer": "ik_pinyin_analyzer",
  "text": "测试"
}
Result:
{
  "tokens": [
    { "token": "ce", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "shi", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 }
  ]
}
With this in place, when we define the index mapping we can use ik_pinyin_analyzer exactly like the built-in ik_smart analyzer.
For example, the mapping for a lawbasis field could look like this:
PUT test_index/_mapping
{
  "properties": {
    "lawbasis": {
      "type": "text",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "fields": {
        "my_pinyin": {
          "type": "text",
          "analyzer": "ik_pinyin_analyzer",
          "search_analyzer": "ik_pinyin_analyzer"
        }
      }
    }
  }
}
Multi-fields (fields) index the same field in different ways for different purposes: lawbasis is indexed with the Chinese ik_smart analyzer, and its my_pinyin sub-field is indexed with the pinyin of the ik tokens, so the field supports both Chinese and pinyin search.
- Test it
Add two documents:
POST test_index/_doc
{
  "lawbasis": "测试一下"
}
POST test_index/_doc
{
  "lawbasis": "测试东西"
}
Search by pinyin:
GET test_index/_search
{
  "query": {
    "match": { "lawbasis.my_pinyin": "ceshi" }
  }
}
Both documents come back in the results.
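Searching the base field directly in Chinese also matches, which confirms the dual indexing described above (a sketch mirroring the pinyin query; not part of the original):

```json
GET test_index/_search
{
  "query": {
    "match": { "lawbasis": "测试" }
  }
}
```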
1.1 Define the index (mapping) interface
package com.oldlu.service;
import com.oldlu.commons.pojo.CommonEntity;
import org.elasticsearch.rest.RestStatus;
import java.util.List;
import java.util.Map;
/**
 * @Class: ElasticsearchIndexService
 * @Package com.oldlu.service
 * @Description: index operations interface
 * @Company: oldlu
 */
public interface ElasticsearchIndexService {

    // create index + mapping
    boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception;
}
1.2 Implement the index (mapping) service
/**
 * @Class: ElasticsearchIndexServiceImpl
 * @Package com.oldlu.service.impl
 * @Description: index operations implementation
 * @Company: oldlu
 */
@Service("ElasticsearchIndexServiceImpl")
public class ElasticsearchIndexServiceImpl implements ElasticsearchIndexService {

    @Resource
    private RestHighLevelClient client;

    private static final int START_OFFSET = 0;
    private static final int MAX_COUNT = 5;

    /**
     * @Description: create index + settings + mapping + custom pinyin analyzer
     *               (the custom pinyin analyzer lives inside the settings;
     *               both settings and mapping may be empty)
     * @Method: addIndexAndMapping
     * @Param: [commonEntity]
     * @since: 1.0.0
     * @Return: boolean
     */
    public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception {
        // build the create-index request
        CreateIndexRequest request = new CreateIndexRequest(commonEntity.getIndexName());
        // parameters posted by the caller
        Map<String, Object> map = commonEntity.getMap();
        // walk the top-level "settings" and "mapping" entries
        for (Map.Entry<String, Object> entry : map.entrySet()) {
            if ("settings".equals(entry.getKey())) {
                if (entry.getValue() instanceof Map && ((Map) entry.getValue()).size() > 0) {
                    request.settings((Map<String, Object>) entry.getValue());
                }
            }
            if ("mapping".equals(entry.getKey())) {
                if (entry.getValue() instanceof Map && ((Map) entry.getValue()).size() > 0) {
                    request.mapping((Map<String, Object>) entry.getValue());
                }
            }
        }
        // create the index through the indices client
        IndicesClient indices = client.indices();
        CreateIndexResponse response = indices.create(request, RequestOptions.DEFAULT);
        // acknowledged == true means the index was created
        return response.isAcknowledged();
    }
}
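The implementation injects a RestHighLevelClient but never shows how it is configured. A minimal Spring configuration sketch could look like the following (the class name, host, and port are assumptions, not from the original):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticsearchClientConfig {

    // host and port are assumptions; point this at your own cluster
    @Bean(destroyMethod = "close")
    public RestHighLevelClient restHighLevelClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
    }
}
```

Declaring the bean with destroyMethod = "close" lets Spring shut down the underlying HTTP connection pool when the context closes.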
1.3 Add the controller
package com.oldlu.controller;
import com.oldlu.commons.enums.ResultEnum;
import com.oldlu.commons.enums.TipsEnum;
import com.oldlu.commons.pojo.CommonEntity;
import com.oldlu.commons.result.ResponseData;
import com.oldlu.service.ElasticsearchIndexService;
import org.apache.commons.lang.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
/**
 * @Class: ElasticsearchIndexController
 * @Package com.oldlu.controller
 * @Description: index operations controller
 * @Company: oldlu
 */
@RestController
@RequestMapping("v1/indices")
public class ElasticsearchIndexController {

    private static final Logger logger = LoggerFactory.getLogger(ElasticsearchIndexController.class);

    @Autowired
    ElasticsearchIndexService elasticsearchIndexService;

    /**
     * @Description: create index + mapping
     * @Method: addIndexAndMapping
     * @Param: [commonEntity]
     * @since: 1.0.0
     * @Return: com.oldlu.commons.result.ResponseData
     */
    @PostMapping(value = "/add")
    public ResponseData addIndexAndMapping(@RequestBody CommonEntity commonEntity) {
        // build the response wrapper
        ResponseData rData = new ResponseData();
        if (StringUtils.isEmpty(commonEntity.getIndexName())) {
            rData.setResultEnum(ResultEnum.PARAM_ISNULL);
            return rData;
        }
        // whether the index was created
        boolean isSuccess = false;
        try {
            // call the high-level client through the service
            isSuccess = elasticsearchIndexService.addIndexAndMapping(commonEntity);
            // build the success response (overload resolved by type inference)
            rData.setResultEnum(isSuccess, ResultEnum.SUCCESS, 1);
            logger.info(TipsEnum.CREATE_INDEX_SUCCESS.getMessage());
        } catch (Exception e) {
            logger.error(TipsEnum.CREATE_INDEX_FAIL.getMessage(), e);
            // build the error response
            rData.setResultEnum(ResultEnum.ERROR);
        }
        return rData;
    }
}
1.4 Create the mapping
POST http://172.17.0.225:8888/v1/indices/add
or
POST http://127.0.0.1:8888/v1/indices/add
Request body
The settings define the custom analyzer ik_pinyin_analyzer (ik and pinyin combined).
Tip: the pinyin plugin must be installed before the mapping is created.
{
  "indexName": "product_completion_index",
  "map": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 2,
      "analysis": {
        "analyzer": {
          "ik_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": "pinyin_filter"
          }
        },
        "filter": {
          "pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": false,
            "keep_full_pinyin": true,
            "keep_original": true,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "remove_duplicated_term": true
          }
        }
      }
    },
    "mapping": {
      "properties": {
        "name": { "type": "keyword" },
        "searchkey": { "type": "completion", "analyzer": "ik_pinyin_analyzer" }
      }
    }
  }
}
Everything under settings is the index configuration, passed through dynamically and following the standard DSL.
Everything under mapping is the field definition, also passed through dynamically and following the DSL.
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": true
}
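After the index has been created through the API, the analyzer can be spot-checked in Kibana the same way as before (a sketch; the exact token list depends on the pinyin filter options above):

```json
POST product_completion_index/_analyze
{
  "analyzer": "ik_pinyin_analyzer",
  "text": "小米手机"
}
```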
2 Corpus document OpenAPI
2.1 Define the bulk-add document interface
package com.oldlu.service;
import com.oldlu.commons.pojo.CommonEntity;
import org.elasticsearch.action.DocWriteResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import java.util.List;
import java.util.Map;
/**
 * @Class: ElasticsearchDocumentService
 * @Package com.oldlu.service
 * @Description: document operations interface
 * @Company:
 */
public interface ElasticsearchDocumentService {

    // bulk-add documents
    RestStatus bulkAddDoc(CommonEntity commonEntity) throws Exception;
}
2.2 Implement bulk-add documents
/**
 * @Description: bulk-add documents; the index and mapping are created automatically if absent
 * @Method: bulkAddDoc
 * @Param: [commonEntity]
 * @since: 1.0.0
 * @Return: org.elasticsearch.rest.RestStatus
 */
@Override
public RestStatus bulkAddDoc(CommonEntity commonEntity) throws Exception {
    // build a bulk request against the target index
    BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName());
    // add one IndexRequest per document posted by the caller
    for (int i = 0; i < commonEntity.getList().size(); i++) {
        bulkRequest.add(new IndexRequest().source(XContentType.JSON,
                SearchTools.mapToObjectGroup(commonEntity.getList().get(i))));
    }
    // execute the bulk insert
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    return bulkResponse.status();
}
Official documentation: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.4/java-rest-high-document-bulk.html
As that page shows, IndexRequest.source(XContentType, Object...) takes the document as alternating field names and values, which is why SearchTools.mapToObjectGroup converts each document map into such an array.
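SearchTools.mapToObjectGroup itself is not shown in the original; a plausible stdlib-only sketch of what it has to do (the class name here is hypothetical) is:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of SearchTools.mapToObjectGroup (the real helper is not
// shown in the original): flatten a document map into the alternating
// key/value array that IndexRequest.source(XContentType, Object...) expects.
public class MapToObjectGroupSketch {

    public static Object[] mapToObjectGroup(Map<String, Object> doc) {
        Object[] pairs = new Object[doc.size() * 2];
        int i = 0;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            pairs[i++] = e.getKey();   // field name
            pairs[i++] = e.getValue(); // field value
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("searchkey", "小米手机");
        doc.put("name", "小米(MI)");
        System.out.println(Arrays.toString(mapToObjectGroup(doc)));
    }
}
```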
2.3 Define the bulk-add document controller
/**
 * @Description: bulk-add documents; the index and mapping are created automatically if absent
 * @Method: bulkAddDoc
 * @Param: [commonEntity]
 * @since: 1.0.0
 * @Return: com.oldlu.commons.result.ResponseData
 */
@PostMapping(value = "/batch")
public ResponseData bulkAddDoc(@RequestBody CommonEntity commonEntity) {
    // build the response wrapper
    ResponseData rData = new ResponseData();
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || CollectionUtils.isEmpty(commonEntity.getList())) {
        rData.setResultEnum(ResultEnum.PARAM_ISNULL);
        return rData;
    }
    // result of the bulk insert
    RestStatus result = null;
    try {
        // call the high-level client through the service
        result = elasticsearchDocumentService.bulkAddDoc(commonEntity);
        // build the success response (overload resolved by type inference)
        rData.setResultEnum(result, ResultEnum.SUCCESS, null);
        logger.info(TipsEnum.BATCH_CREATE_DOC_SUCCESS.getMessage());
    } catch (Exception e) {
        logger.error(TipsEnum.BATCH_CREATE_DOC_FAIL.getMessage(), e);
        // build the error response
        rData.setResultEnum(ResultEnum.ERROR);
    }
    return rData;
}
2.4 Make the bulk-add call
POST http://172.17.0.225:8888/v1/docs/batch
or
POST http://127.0.0.1:8888/v1/docs/batch
Request body
23 suggest entries are defined ("小米手机" appears twice, to check deduplication):
{
  "indexName": "product_completion_index",
  "list": [
    { "searchkey": "小米手机", "name": "小米(MI)" },
    { "searchkey": "小米10", "name": "小米(MI)" },
    { "searchkey": "小米电视", "name": "小米(MI)" },
    { "searchkey": "小米路由器", "name": "小米(MI)" },
    { "searchkey": "小米9", "name": "小米(MI)" },
    { "searchkey": "小米手机", "name": "小米(MI)" },
    { "searchkey": "小米耳环", "name": "小米(MI)" },
    { "searchkey": "小米8", "name": "小米(MI)" },
    { "searchkey": "小米10Pro", "name": "小米(MI)" },
    { "searchkey": "小米笔记本", "name": "小米(MI)" },
    { "searchkey": "小米摄像头", "name": "小米(MI)" },
    { "searchkey": "小米电饭煲", "name": "小米(MI)" },
    { "searchkey": "小米充电宝", "name": "小米(MI)" },
    { "searchkey": "adidas男鞋", "name": "adidas男鞋" },
    { "searchkey": "adidas女鞋", "name": "adidas女鞋" },
    { "searchkey": "adidas外套", "name": "adidas外套" },
    { "searchkey": "adidas裤子", "name": "adidas裤子" },
    { "searchkey": "adidas官方旗舰店", "name": "adidas官方旗舰店" },
    { "searchkey": "阿迪达斯袜子", "name": "阿迪达斯袜子" },
    { "searchkey": "阿迪达斯外套", "name": "阿迪达斯外套" },
    { "searchkey": "阿迪达斯运动鞋", "name": "阿迪达斯运动鞋" },
    { "searchkey": "耐克外套", "name": "耐克外套" },
    { "searchkey": "耐克运动鞋", "name": "耐克运动鞋" }
  ]
}
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": "OK"
}
Inspect the indexed documents with GET product_completion_index/_search
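Since searchkey is a completion field, a suggest query is the natural way to exercise it. A sketch, not from the original: the suggestion name my_suggest is made up, the prefix "xm" relies on the keep_first_letter option in the filter above, and skip_duplicates addresses the duplicate "小米手机" entry:

```json
POST product_completion_index/_search
{
  "suggest": {
    "my_suggest": {
      "prefix": "xm",
      "completion": {
        "field": "searchkey",
        "size": 10,
        "skip_duplicates": true
      }
    }
  }
}
```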