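Introduction

Search suggestions are one of the most basic features of a search box. They bring the search box to life and improve the user's search experience. In this article I build my own search suggestion feature by imitating the position/company search on the Boss Zhipin (boss 直聘) home page.

Requirements analysis

Search suggestions have to cover quite a few cases, such as full pinyin, pinyin initials, and Chinese characters. Let's look at how Boss Zhipin handles them.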

Chinese prefix

Chinese infix

Full-pinyin prefix

Note that both shanghai and shangha bring up 上海 here.

Full-pinyin infix

Pinyin-initials prefix

Pinyin-initials infix

Full pinyin + Chinese

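Implementation analysis

The Chinese search suggestion feature is implemented in 4 steps:

  1. Sync data to ElasticSearch: sync the MySQL data to ElasticSearch through a CDC framework I wrapped myself
  2. ElasticSearch index design: design a mapping that supports Chinese search
  3. Write the ElasticSearch DSL
  4. Code implementation

Syncing data to ElasticSearch

To implement search, we first need to sync both the existing data and the incremental data in MySQL to ES. The common approach today is CDC (Change Data Capture).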

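Introduction to CDC

What is CDC

CDC is short for Change Data Capture. The core idea is to monitor and capture changes in a database (inserts, updates, and deletes of data or tables), record those changes completely in the order they occur, and write them to message middleware so that other services can subscribe to and consume them.

CDC use cases

Types of CDC

There are two mechanisms for implementing CDC, that is, for capturing database change data: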

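                                              Query-based CDC    Log-based CDC
Typical products                              Sqoop, DataX       Canal, Debezium
Execution mode                                Batch              Streaming
Captures every data change                    No                 Yes
Low latency                                   No                 Yes
Adds no load to the database                  No                 Yes
Non-intrusive (no lastUpdate column needed)   No                 Yes
Captures delete events                        No                 Yes
Captures the previous state of a record       No                 Yes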
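
To make the "No" column more concrete, here is a minimal sketch of query-based CDC against an in-memory map. The class, the Row type, and the lastUpdate-based polling are all assumptions made up for illustration; real tools such as Sqoop or DataX work against an actual database. The sketch only shows why polling sees the final state of a row and nothing else.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a table that has a lastUpdate column. */
class PollingCdcSketch {

    static final class Row {
        final long id;
        final String name;
        final long lastUpdate;

        Row(long id, String name, long lastUpdate) {
            this.id = id;
            this.name = name;
            this.lastUpdate = lastUpdate;
        }
    }

    /** Query-based CDC: return the rows whose lastUpdate is newer than the previous poll. */
    static List<Row> poll(Map<Long, Row> table, long since) {
        List<Row> changed = new ArrayList<>();
        for (Row row : table.values()) {
            if (row.lastUpdate > since) {
                changed.add(row);
            }
        }
        return changed;
    }

    /** Applies a few changes between two polls and returns what the second poll sees. */
    static List<Row> demo() {
        Map<Long, Row> table = new LinkedHashMap<>();
        table.put(1L, new Row(1, "v1", 1));
        table.put(2L, new Row(2, "to-be-deleted", 1));
        long lastPoll = 1; // the first poll happened at t=1

        table.put(1L, new Row(1, "v2", 2)); // first update
        table.put(1L, new Row(1, "v3", 3)); // second update overwrites the first
        table.remove(2L);                   // a delete leaves no row to select

        return poll(table, lastPoll);
    }

    public static void main(String[] args) {
        for (Row row : demo()) {
            System.out.println(row.id + " " + row.name);
        }
    }
}
```

The second poll returns only row 1 in its final state: the intermediate value v2 and the deletion of row 2 are invisible. A log-based approach reads each of these as a separate event from the binlog.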

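Comparing common open-source CDC solutions

Wrapping a CDC framework

Why I wrapped my own CDC framework

When it comes to syncing MySQL data to ES, the first thing that comes to mind is Canal. I used it for a while. My impression is that it is feature-complete and powerful, but also fairly complex and hard to use. Its two main configuration files are very long, the Java client is not very friendly, it takes real effort to understand fully, and as far as I can tell it is no longer actively maintained.

My requirements for CDC are simple and I don't need particularly complex features. Just notify me whenever data in the database is inserted, updated, or deleted, and let me decide whether to sync it to ES, Redis, or a message queue.

By chance I came across Flink CDC; project: https://github.com/ververica/flink-cdc-connectors. The project provides a demo, and to my surprise a single method is enough to do CDC. The example is as follows: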

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;

public class MySqlSourceExample {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("yourHostname")
                .port(yourPort)
                .databaseList("yourDatabaseName") // set captured database
                .tableList("yourDatabaseName.yourTableName") // set captured table
                .username("yourUsername")
                .password("yourPassword")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpoint
        env.enableCheckpointing(3000);

        env
                .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
                // set 4 parallel source tasks
                .setParallelism(4)
                .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute("Print MySQL Snapshot + Binlog");
    }
}

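After learning more, I found that Flink CDC has a whole ecosystem with high performance and rich features, and that it is very common in big data, data warehousing, and similar fields.

However, learning the entire Flink CDC ecosystem just for simple CDC is more than I can manage. Fortunately, simply listening for insert, update, and delete events on MySQL data is fairly easy and the example above is enough for that, so I decided to wrap it into a simple, easy-to-use CDC framework: easy-flink-cdc.

easy-flink-cdc

I won't go into how it is wrapped; let's look directly at how to use it. Project: https://github.com/Maskvvv/easy-flink-cdc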

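Concepts

application

Corresponds to one Spring project.

Flink Job

The data sources to listen to are set in a configuration file, and each data source corresponds to one Flink Job. A Flink Job can listen for data changes in multiple databases and tables.

sink

When a Flink Job receives a data change, it calls the sinks under it. In a sink you are free to handle insert, update, and delete events with your own business code. Whether to sync to ES, Redis, a message queue, or anything else is up to you.

cursor

Every Flink Job has its own cursor, which records the binlog position that the Flink Job has synced up to. When the CDC project restarts, it resumes syncing from where the previous run stopped.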

# cursor data structure
- application2
  - port number
    - meta.dat
      - flink job cursor
      - flink job cursor
- application2
  - port number
    - meta.dat
      - flink job cursor
      - flink job cursor
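
As a rough illustration of a file-backed cursor, the sketch below keeps the latest position per Flink Job in memory and writes everything to one file when flush() is called. This is not the code of easy-flink-cdc: the class name, the properties file format, and the position string are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cursor store: job name -> last synced binlog position. */
class FileCursorStore {
    private final Path file;
    private final Map<String, String> positions = new ConcurrentHashMap<>();

    FileCursorStore(Path file) {
        this.file = file;
    }

    /** Cheap in-memory update, intended to be called for every processed change event. */
    void update(String jobName, String binlogPosition) {
        positions.put(jobName, binlogPosition);
    }

    /** Writes all cursors to a temp file, then swaps it in; intended to be called from a timer. */
    synchronized void flush() {
        Properties props = new Properties();
        props.putAll(positions);
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            try (Writer writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                props.store(writer, "flink job cursors");
            }
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the cursors back, e.g. when the application restarts. */
    static Map<String, String> load(Path file) {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("meta", ".dat");
        FileCursorStore store = new FileCursorStore(file);
        store.update("ourea", "mysql-bin.000003:154"); // made-up position
        store.flush();
        System.out.println(load(file));
    }
}
```

Anything updated after the last flush is lost on a crash, so after a restart some events are delivered again. That is why the sinks have to tolerate replays.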

Usage

With the following 4 steps you can easily do CDC on MySQL.

Add the dependency

<dependency>
    <groupId>com.easy-flink-cdc</groupId>
    <artifactId>easy-flink-cdc-boot-starter</artifactId>
    <version>1.0-SNAPSHOT</version>
</dependency>

Write the configuration file

Create an easy-flink.conf file under the resources path. The syntax is typesafe.config.

ourea = {
    name = "ourea"
    hostname = "myserver.com"
    port = "3308"
    databaseList = "ourea"
    tableList = "ourea.company,ourea.occupation"
    username = "root"
    password = "1234567788"
    startupMode = "INITIAL"
}

athena = {
    name = "athena"
    hostname = "myserver.com"
    port = "3308"
    databaseList = "ourea"
    tableList = "ourea.sort"
    username = "root"
    password = "1234567788"
    startupMode = "INITIAL"
}

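  • name: must match the root key. Each root key corresponds to one Flink Job, and duplicate names are not allowed.
  • hostname: host name of the database to listen to
  • port: port of the database to listen to
  • databaseList: names of the databases to listen to; separate multiple names with ,
  • tableList: names of the tables to listen to; separate multiple names with ,
  • username: database account
  • password: database password
  • startupMode: startup mode. If a cursor exists, the cursor takes precedence.
    • INITIAL: initial snapshot, i.e. a full import followed by incremental import (detected updates are written)
    • LATEST: incremental import only (historical changes are not read)
    • TIMESTAMP: import starting from a specified timestamp (reads changes at or after that timestamp)

Enable easy-flink-cdc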

application.properties

easy-flink.enable=true
easy-flink.meta-model=file

Startup class

@EasyFlinkScan("com.esflink.demo.sink")
@SpringBootApplication
public class EasyFlinkCdcDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(EasyFlinkCdcDemoApplication.class, args);
    }
}

The @EasyFlinkScan annotation specifies the packages where the sink classes live. More than one can be given.

Write a sink

@FlinkSink(value = "ourea", database = "ourea", table = "ourea.company")
public class DemoSink implements FlinkJobSink {

    @Override
    public void invoke(DataChangeInfo value, Context context) throws Exception {
    }

    @Override
    public void insert(DataChangeInfo value, Context context) throws Exception {
    }

    @Override
    public void update(DataChangeInfo value, Context context) throws Exception {
    }

    @Override
    public void delete(DataChangeInfo value, Context context) throws Exception {
    }

    @Override
    public void handleError(DataChangeInfo value, Context context, Throwable throwable) {
    }
}

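The FlinkJobSink interface

Here you implement the FlinkJobSink interface and override the methods for the events you care about.

  • insert(), update(), delete(): correspond to insert, update, and delete events respectively
  • invoke(): called for every insert, update, and delete event
  • handleError(): handles exceptions thrown while these methods are being called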
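
The relationship between these methods can be sketched in plain Java as follows. This is only my illustration of the contract described above, not the framework's source: ChangeEvent, ChangeSink, the dispatch() method, and the choice to call invoke() before the specific handler are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SinkDispatchSketch {

    enum Op { INSERT, UPDATE, DELETE }

    /** Hypothetical change event; the real framework passes a DataChangeInfo. */
    static final class ChangeEvent {
        final Op op;
        final String data;

        ChangeEvent(Op op, String data) {
            this.op = op;
            this.data = data;
        }
    }

    /** Hypothetical sink contract mirroring the five methods listed above. */
    interface ChangeSink {
        default void invoke(ChangeEvent event) throws Exception {}
        default void insert(ChangeEvent event) throws Exception {}
        default void update(ChangeEvent event) throws Exception {}
        default void delete(ChangeEvent event) throws Exception {}
        default void handleError(ChangeEvent event, Throwable error) {}
    }

    /** Calls invoke() for every event, then the handler that matches the operation. */
    static void dispatch(ChangeSink sink, ChangeEvent event) {
        try {
            sink.invoke(event);
            switch (event.op) {
                case INSERT: sink.insert(event); break;
                case UPDATE: sink.update(event); break;
                case DELETE: sink.delete(event); break;
            }
        } catch (Exception e) {
            // Any failure in the callbacks is routed to the sink's own error handler.
            sink.handleError(event, e);
        }
    }

    /** Records which callbacks fired and fails on delete to exercise handleError(). */
    static final class RecordingSink implements ChangeSink {
        final List<String> calls = new ArrayList<>();

        @Override public void invoke(ChangeEvent event) { calls.add("invoke"); }
        @Override public void insert(ChangeEvent event) { calls.add("insert"); }
        @Override public void delete(ChangeEvent event) throws Exception {
            throw new IllegalStateException("simulated failure");
        }
        @Override public void handleError(ChangeEvent event, Throwable error) { calls.add("error"); }
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        dispatch(sink, new ChangeEvent(Op.INSERT, "{\"id\":1}"));
        dispatch(sink, new ChangeEvent(Op.DELETE, "{\"id\":1}"));
        System.out.println(sink.calls); // [invoke, insert, invoke, error]
    }
}
```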

@FlinkSink

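You also need the @FlinkSink annotation to mark the class as a sink. The annotation has the following attributes:

  • value: the Flink Job that this sink belongs to; required
  • database: which databases of the Flink Job to receive data changes from; defaults to the ones configured for the Flink Job; optional
  • table: which tables of the Flink Job to receive data changes from; defaults to the ones configured for the Flink Job; optional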

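Known issues

Overall, the goal of wrapping a simple, easy-to-use CDC framework has basically been met. But since this is my first time building a framework, it still has many problems, for example:

  • The framework is not modular enough
  • The package layout is messy (mainly because I don't know how to organize it)
  • Extensibility is limited, e.g. there is no way to plug in a custom serialization format or a custom way of loading the configuration file
  • Cursor recording does not fit today's mainstream distributed deployments. Currently the cursor is written to memory first and flushed to disk periodically; support for recording the cursor in ZooKeeper is planned
  • Configuration loading is not distributed either. Only local configuration files can be loaded for now; support for nacos-config is planned
  • If a cursor is specified at startup, the database is briefly blocked, so listening on a replica is recommended
  • My understanding of Flink CDC is not deep enough, so some cases may not be covered
  • Crash safety is not guaranteed, so your sink code needs to be idempotent
  • As for sync performance, I haven't tested with massive data sets; syncing roughly 8,000 rows to ES took about a few minutes
  • The framework has not been published to Maven Central yet (I'll do that after polishing it further)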
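
Since crash safety is not guaranteed, the same events may be delivered more than once. A minimal sketch of what "idempotent" means for a sink, assuming documents are keyed by id; the HashMap below is only a stand-in for an ES index, and the class and method names are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentSinkSketch {
    /** Hypothetical stand-in for an ES index: document id -> document body. */
    private final Map<String, String> index = new HashMap<>();

    /** Insert and update both become "put by id", so a replay overwrites with the same value. */
    void upsert(String id, String doc) {
        index.put(id, doc);
    }

    /** Deleting a missing id is a no-op, so a replayed delete is harmless. */
    void delete(String id) {
        index.remove(id);
    }

    Map<String, String> snapshot() {
        return new HashMap<>(index);
    }

    /** Feeds the same batch of events into a fresh sink the given number of times. */
    static Map<String, String> replay(int times) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        for (int i = 0; i < times; i++) {
            sink.upsert("1", "doc-a");
            sink.upsert("2", "doc-b");
            sink.delete("2");
        }
        return sink.snapshot();
    }

    public static void main(String[] args) {
        // Processing the batch once or three times leaves the index in the same state.
        System.out.println(replay(1).equals(replay(3)));
    }
}
```

The counterexample would be a sink that appends a new document with a generated id on every insert event: replaying the batch would then create duplicates.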

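Only after writing a framework myself did I realize how hard it is to write a (good) one. For where I am right now I did my best, so please treat this CDC framework as a starting point for better ideas.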

ElasticSearch index design

With the data source solved, the next step is to design an ES index that supports Chinese search suggestions.

To get full-featured search suggestions we need custom analyzers.

Designing and testing the custom analyzers

Chinese prefix analyzer

Indexing

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 50
  },
  "text": ["北京字节跳动"]
}

Searching

GET /_analyze
{
  "tokenizer": "keyword",
  "text": ["北京字节跳动"]
}

Results

# Indexing
{
  "tokens" : [
    {"token" : "北", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0},
    {"token" : "北京", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1},
    {"token" : "北京字", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "北京字节", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "北京字节跳", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 4},
    {"token" : "北京字节跳动", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 5}
  ]
}

# Searching
{
  "tokens" : [
    {"token" : "北京字节跳动", "start_offset" : 0, "end_offset" : 12, "type" : "word", "position" : 0}
  ]
}
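
To see why this pair gives prefix matching, here is a small plain-Java model of it. It does not call Elasticsearch; edgeNgrams() imitates what the edge_ngram tokenizer produced above, and prefixMatches() imitates looking up the unmodified user input among the indexed tokens. Both method names are mine.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeNgramSketch {

    /** Emits every prefix of the text whose length is between minGram and maxGram code points. */
    static List<String> edgeNgrams(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        int length = text.codePointCount(0, text.length());
        for (int n = minGram; n <= Math.min(maxGram, length); n++) {
            tokens.add(text.substring(0, text.offsetByCodePoints(0, n)));
        }
        return tokens;
    }

    /** The input is kept as one token, so it matches only if it equals an indexed prefix. */
    static boolean prefixMatches(String indexedText, String userInput) {
        return edgeNgrams(indexedText, 1, 50).contains(userInput);
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("北京字节跳动", 1, 50));
        System.out.println(prefixMatches("北京字节跳动", "北京")); // true: a prefix
        System.out.println(prefixMatches("北京字节跳动", "字节")); // false: an infix
    }
}
```

The cost of this design is index size: a name of n characters produces n tokens, which is acceptable for short names such as companies and positions.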

Chinese infix analyzer

Indexing

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["北京字节跳动"]
}

Searching

Same as indexing

Results

{
  "tokens" : [
    {"token" : "北", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0},
    {"token" : "京", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1},
    {"token" : "字", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2},
    {"token" : "节", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3},
    {"token" : "跳", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4},
    {"token" : "动", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5}
  ]
}
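
One token per character only becomes an infix match when the query requires the characters to be adjacent and in order, which is what a phrase query with slop 0 does. The plain-Java sketch below models that idea. It is a simplification: it treats every code point as one token, which mirrors how the standard tokenizer handled the Chinese text above but not how it handles Latin words, and the method names are mine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PhraseInfixSketch {

    /** One lowercased token per code point (a simplification that fits CJK ideographs only). */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        text.toLowerCase(Locale.ROOT).codePoints()
                .forEach(cp -> result.add(new String(Character.toChars(cp))));
        return result;
    }

    /** Phrase match with slop 0: the query tokens must appear consecutively and in order. */
    static boolean phraseMatches(String indexedText, String query) {
        List<String> doc = tokens(indexedText);
        List<String> wanted = tokens(query);
        if (wanted.isEmpty()) {
            return false; // an empty query matches nothing
        }
        return Collections.indexOfSubList(doc, wanted) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("北京字节跳动", "字节")); // true: adjacent, in order
        System.out.println(phraseMatches("北京字节跳动", "字跳")); // false: not adjacent
        System.out.println(phraseMatches("北京字节跳动", "节字")); // false: wrong order
    }
}
```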

Full-pinyin prefix analyzer

Indexing

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 50
  },
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_first_letter": false,
      "keep_full_pinyin": false,
      "keep_joined_full_pinyin": true,
      "keep_none_chinese_together": true,
      "keep_none_chinese_in_joined_full_pinyin": true,
      "none_chinese_pinyin_tokeniz": false,
      "keep_none_chinese": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京字节跳动"]
}

Searching

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_first_letter": false,
      "keep_full_pinyin": false,
      "keep_joined_full_pinyin": true,
      "keep_none_chinese_together": true,
      "keep_none_chinese_in_joined_full_pinyin": true,
      "none_chinese_pinyin_tokeniz": false,
      "keep_none_chinese": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京"]
}

Results

# Indexing
{
  "tokens" : [
    {"token" : "bei", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0},
    {"token" : "beijing", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1},
    {"token" : "beijingzi", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "beijingzijie", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "beijingzijietiao", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 4},
    {"token" : "beijingzijietiaodong", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 5}
  ]
}

# Searching
{
  "tokens" : [
    {"token" : "beijing", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0}
  ]
}

Full-pinyin infix analyzer

Indexing

GET /_analyze
{
  "tokenizer": {
    "type": "pinyin",
    "keep_original": false,
    "keep_first_letter": false,
    "keep_full_pinyin": true,
    "none_chinese_pinyin_tokeniz": false,
    "ignore_pinyin_offset": false
  },
  "text": ["北京字节跳动"]
}

Searching

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_first_letter": false,
      "keep_full_pinyin": true,
      "none_chinese_pinyin_tokeniz": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京"]
}

Results

# Indexing
{
  "tokens" : [
    {"token" : "bei", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0},
    {"token" : "jing", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1},
    {"token" : "zi", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "jie", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "tiao", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 4},
    {"token" : "dong", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 5}
  ]
}

# Searching
{
  "tokens" : [
    {"token" : "bei", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0},
    {"token" : "jing", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1}
  ]
}

Pinyin-initials prefix analyzer

Indexing

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 50
  },
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_full_pinyin": false,
      "limit_first_letter_length": 50,
      "none_chinese_pinyin_tokeniz": false,
      "keep_none_chinese": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京字节跳动"]
}

Searching

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_full_pinyin": false,
      "limit_first_letter_length": 50,
      "none_chinese_pinyin_tokeniz": false,
      "keep_none_chinese": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京"]
}

Results

# Indexing
{
  "tokens" : [
    {"token" : "b", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0},
    {"token" : "bj", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1},
    {"token" : "bjz", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "bjzj", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "bjzjt", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 4},
    {"token" : "bjzjtd", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 5}
  ]
}

# Searching
{
  "tokens" : [
    {"token" : "bj", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0}
  ]
}

Pinyin-initials infix analyzer

Indexing

GET /_analyze
{
  "tokenizer": {
    "type": "pinyin",
    "keep_original": false,
    "keep_separate_first_letter": true,
    "keep_first_letter": false,
    "keep_full_pinyin": false,
    "none_chinese_pinyin_tokeniz": false,
    "ignore_pinyin_offset": false
  },
  "text": ["北京字节跳动"]
}

Searching

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pinyin",
      "keep_original": false,
      "keep_separate_first_letter": true,
      "keep_first_letter": false,
      "keep_full_pinyin": false,
      "none_chinese_pinyin_tokeniz": false,
      "ignore_pinyin_offset": false
    }
  ],
  "text": ["北京"]
}

Results

# Indexing
{
  "tokens" : [
    {"token" : "b", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0},
    {"token" : "j", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1},
    {"token" : "z", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "j", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "t", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 4},
    {"token" : "d", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 5}
  ]
}

# Searching
{
  "tokens" : [
    {"token" : "b", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0},
    {"token" : "j", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1}
  ]
}

Building the index

PUT /ourea-home-suggestion-v15
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_standard": {"tokenizer": "standard", "filter": "lowercase"},
        "prefix_index_analyzer": {"tokenizer": "edge_ngram_tokenizer"},
        "full_pinyin_index_analyzer": {"tokenizer": "full_pinyin_tokenizer"},
        "full_pinyin_prefix_index_analyzer": {"tokenizer": "edge_ngram_tokenizer", "filter": ["full_pinyin_prefix_filter"]},
        "first_letter_prefix_index_analyzer": {"tokenizer": "edge_ngram_tokenizer", "filter": ["first_letter_prefix_filter"]},
        "first_letter_index_analyzer": {"tokenizer": "first_letter_tokenizer"},
        "full_pinyin_search_analyzer": {"tokenizer": "keyword", "filter": ["full_pinyin_filter"]},
        "full_pinyin_prefix_search_analyzer": {"tokenizer": "keyword", "filter": ["full_pinyin_prefix_filter"]},
        "first_letter_prefix_search_analyzer": {"tokenizer": "keyword", "filter": ["first_letter_prefix_filter"]},
        "first_letter_search_analyzer": {"tokenizer": "keyword", "filter": ["first_letter_filter"]}
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {"type": "edge_ngram", "min_gram": 1, "max_gram": 50},
        "full_pinyin_tokenizer": {
          "type": "pinyin",
          "keep_original": false,
          "keep_first_letter": false,
          "keep_full_pinyin": true,
          "none_chinese_pinyin_tokeniz": false,
          "ignore_pinyin_offset": false
        },
        "first_letter_tokenizer": {
          "type": "pinyin",
          "keep_original": false,
          "keep_separate_first_letter": true,
          "keep_first_letter": false,
          "keep_full_pinyin": false,
          "none_chinese_pinyin_tokeniz": false,
          "ignore_pinyin_offset": false
        }
      },
      "filter": {
        "full_pinyin_filter": {
          "type": "pinyin",
          "keep_original": false,
          "keep_first_letter": false,
          "keep_full_pinyin": true,
          "none_chinese_pinyin_tokeniz": false,
          "ignore_pinyin_offset": false
        },
        "full_pinyin_prefix_filter": {
          "type": "pinyin",
          "keep_original": false,
          "keep_first_letter": false,
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_together": true,
          "keep_none_chinese_in_joined_full_pinyin": true,
          "none_chinese_pinyin_tokeniz": false,
          "keep_none_chinese": false,
          "ignore_pinyin_offset": false
        },
        "edge_ngram_filter": {"type": "edge_ngram", "min_gram": 1, "max_gram": 50},
        "first_letter_filter": {
          "type": "pinyin",
          "keep_original": false,
          "keep_separate_first_letter": true,
          "keep_first_letter": false,
          "keep_full_pinyin": false,
          "none_chinese_pinyin_tokeniz": false,
          "ignore_pinyin_offset": false
        },
        "first_letter_prefix_filter": {
          "type": "pinyin",
          "keep_original": false,
          "keep_full_pinyin": false,
          "limit_first_letter_length": 50,
          "none_chinese_pinyin_tokeniz": false,
          "keep_none_chinese": false,
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "keyword",
        "fields": {
          "standard": {"type": "text", "analyzer": "lowercase_standard"},
          "prefix": {"type": "text", "analyzer": "prefix_index_analyzer"},
          "full_pinyin": {
            "type": "text",
            "analyzer": "full_pinyin_index_analyzer",
            "search_analyzer": "full_pinyin_search_analyzer",
            "fields": {
              "prefix": {
                "type": "text",
                "analyzer": "full_pinyin_prefix_index_analyzer",
                "search_analyzer": "full_pinyin_prefix_search_analyzer"
              }
            }
          },
          "first_letter": {
            "type": "text",
            "analyzer": "first_letter_index_analyzer",
            "search_analyzer": "first_letter_search_analyzer",
            "fields": {
              "prefix": {
                "type": "text",
                "analyzer": "first_letter_prefix_index_analyzer",
                "search_analyzer": "first_letter_prefix_search_analyzer"
              }
            }
          }
        }
      },
      "status": {"type": "short"},
      "type": {"type": "short"},
      "top": {"type": "short"},
      "onlined": {"type": "short"},
      "sequence": {"type": "double"}
    }
  }
}

Chinese search suggestions are mainly about suggesting names, so the name property gets a number of sub-fields, one for each search case described above.

Writing the ElasticSearch DSL

With the index built, writing the search DSL is just as important.

Note that neither the full-pinyin prefix analyzer nor the full-pinyin analyzer can, on its own, deliver the shangh => 上海 prefix suggestion and highlight the match at the same time, so the two have to be used together.

GET /ourea-home-suggestion/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"onlined": {"value": 1, "boost": 1}}}
      ],
      "should": [
        {"term": {"name.prefix": {"value": "上海", "boost": 10}}},
        {"match_phrase": {"name.standard": {"query": "上海", "slop": 0, "zero_terms_query": "NONE", "boost": 5}}},
        {
          "bool": {
            "filter": [
              {
                "match_phrase_prefix": {
                  "name.full_pinyin.prefix": {
                    "query": "上海",
                    "analyzer": "full_pinyin_prefix_search_analyzer",
                    "slop": 0,
                    "max_expansions": 100,
                    "zero_terms_query": "NONE",
                    "boost": 1
                  }
                }
              }
            ],
            "should": [
              {
                "match_phrase_prefix": {
                  "name.full_pinyin": {
                    "query": "上海",
                    "analyzer": "full_pinyin_search_analyzer",
                    "slop": 0,
                    "max_expansions": 50,
                    "zero_terms_query": "NONE",
                    "boost": 1
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "minimum_should_match": "1",
            "boost": 3
          }
        },
        {
          "match_phrase_prefix": {
            "name.full_pinyin": {
              "query": "上海",
              "analyzer": "full_pinyin_search_analyzer",
              "slop": 0,
              "max_expansions": 50,
              "zero_terms_query": "NONE",
              "boost": 1.5
            }
          }
        },
        {
          "match": {
            "name.first_letter.prefix": {
              "query": "上海",
              "operator": "OR",
              "analyzer": "first_letter_prefix_search_analyzer",
              "prefix_length": 0,
              "max_expansions": 100,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        },
        {
          "match_phrase": {
            "name.first_letter": {
              "query": "上海",
              "analyzer": "first_letter_search_analyzer",
              "slop": 0,
              "zero_terms_query": "NONE",
              "boost": 0.8
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "minimum_should_match": "1",
      "boost": 1
    }
  },
  "highlight": {
    "type": "plain",
    "fields": {
      "name.prefix": {},
      "name.standard": {},
      "name.full_pinyin": {},
      "name.first_letter.prefix": {},
      "name.first_letter": {}
    }
  }
}

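A note on why I did not use the Completion suggester

The Completion suggester is a data type that ES designed specifically for prefix matching. It loads completion-type data into memory, so it is very fast, but it has the following problems and cannot meet our needs:

  • It only supports prefix matching, so infix matching is not possible
  • There is no way to specify an analyzer at search time
  • Results cannot be filtered
  • Highlighting is not supported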

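Code implementation

With the groundwork above, the code itself turns out to be remarkably simple.

Syncing data to ES

Here we need to sync the position table and the company table to ES, so writing two sinks is enough.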

CompanySink

@FlinkSink(value = "ourea", database = "ourea", table = "ourea.company")
public class CompanySink implements FlinkJobSink {

    @Autowired(required = false)
    private OureaHomeSuggestionDocMapper homeSuggestionDocMapper;

    @Override
    public void update(DataChangeInfo value, Context context) throws Exception {
        String afterData = value.getAfterData();
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(afterData, OureaHomeSuggestionDoc.class);
        homeSuggestionDoc.setType(1);
        homeSuggestionDoc.setOnlined(1);
        homeSuggestionDocMapper.insert(homeSuggestionDoc);
    }

    @Override
    public void insert(DataChangeInfo value, Context context) throws Exception {
        String afterData = value.getAfterData();
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(afterData, OureaHomeSuggestionDoc.class);
        homeSuggestionDoc.setType(1);
        homeSuggestionDoc.setOnlined(1);
        homeSuggestionDocMapper.insert(homeSuggestionDoc);
    }

    @Override
    public void delete(DataChangeInfo value, Context context) throws Exception {
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(value.getBeforeData(), OureaHomeSuggestionDoc.class);
        homeSuggestionDocMapper.deleteById(homeSuggestionDoc.getId());
    }
}

OccupationSink

@FlinkSink(value = "ourea", database = "ourea", table = "ourea.occupation")
public class OccupationSink implements FlinkJobSink {

    @Autowired(required = false)
    private OureaHomeSuggestionDocMapper homeSuggestionDocMapper;

    @Override
    public void update(DataChangeInfo value, Context context) throws Exception {
        String afterData = value.getAfterData();
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(afterData, OureaHomeSuggestionDoc.class);
        homeSuggestionDoc.setType(2);
        homeSuggestionDocMapper.insert(homeSuggestionDoc);
    }

    @Override
    public void insert(DataChangeInfo value, Context context) throws Exception {
        String afterData = value.getAfterData();
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(afterData, OureaHomeSuggestionDoc.class);
        homeSuggestionDoc.setType(2);
        homeSuggestionDocMapper.insert(homeSuggestionDoc);
    }

    @Override
    public void delete(DataChangeInfo value, Context context) throws Exception {
        OureaHomeSuggestionDoc homeSuggestionDoc = JSON.parseObject(value.getBeforeData(), OureaHomeSuggestionDoc.class);
        homeSuggestionDocMapper.deleteById(homeSuggestionDoc.getId());
    }
}

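For convenience we use Easy-Es here, an open-source ES ORM framework similar to MyBatis-Plus. For simple inserts, updates, and deletes it saves writing a pile of RestHighLevelClient code. See the official site for details: https://www.easy-es.cn/

Search: server side

Because the search is fairly complex, it is written with the native RestHighLevelClient rather than Easy-Es.

Controller

For convenience, the business logic is written directly in the Controller.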

@RestController
@RequestMapping("ourea_home_v2")
public class OureaHomeSuggestiongV2Controller {

    @Autowired(required = false)
    private CompanyDocumentMapper companyDocumentMapper;

    @GetMapping("suggest")
    private List<OureaHomeSuggestionModel> getCompanies(String key) throws IOException {
        SearchRequest searchRequest = new SearchRequest("ourea-home-suggestion");
        List<OureaHomeSuggestionModel> result = new ArrayList<>();
        String[] highlightFieldName = {"name.prefix", "name.standard", "name.full_pinyin", "name.first_letter.prefix", "name.first_letter"};

        query(key, searchRequest);
        highlight(searchRequest, highlightFieldName);

        SearchResponse search = companyDocumentMapper.search(searchRequest, RequestOptions.DEFAULT);
        SearchHit[] hits = search.getHits().getHits();
        for (SearchHit hit : hits) {
            String sourceAsString = hit.getSourceAsString();
            OureaHomeSuggestionModel homeSuggestionModel = JSON.parseObject(sourceAsString, OureaHomeSuggestionModel.class);
            result.add(homeSuggestionModel);

            // Use the first field that produced a highlight, in the order listed above
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            for (String hfName : highlightFieldName) {
                HighlightField hf = highlightFields.get(hfName);
                if (hf == null) continue;
                Text[] fragments = hf.getFragments();
                homeSuggestionModel.setHighlight(fragments[0].toString());
                break;
            }
        }
        return result;
    }

    private void highlight(SearchRequest searchRequest, String[] highlightField) {
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        for (String field : highlightField) {
            highlightBuilder.field(field).highlighterType("plain");
        }
        searchRequest.source().highlighter(highlightBuilder);
    }

    private void query(String key, SearchRequest searchRequest) {
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery().minimumShouldMatch(1);
        boolQueryBuilder.filter(QueryBuilders.termQuery("onlined", 1));
        List<QueryBuilder> should = boolQueryBuilder.should();

        // Chinese prefix
        should.add(QueryBuilders.termQuery("name.prefix", key).boost(10));

        // Chinese infix
        should.add(QueryBuilders.matchPhraseQuery("name.standard", key).boost(5f));

        // Full-pinyin prefix
        BoolQueryBuilder fullPinyinPrefixBoolQueryBuilder = new BoolQueryBuilder();
        should.add(fullPinyinPrefixBoolQueryBuilder);
        fullPinyinPrefixBoolQueryBuilder.minimumShouldMatch(1);
        fullPinyinPrefixBoolQueryBuilder.boost(3);
        fullPinyinPrefixBoolQueryBuilder.filter(QueryBuilders.matchPhrasePrefixQuery("name.full_pinyin.prefix", key).analyzer("full_pinyin_prefix_search_analyzer").maxExpansions(100));
        fullPinyinPrefixBoolQueryBuilder.should().add(QueryBuilders.matchPhrasePrefixQuery("name.full_pinyin", key).analyzer("full_pinyin_search_analyzer"));

        // Full-pinyin infix
        should.add(QueryBuilders.matchPhrasePrefixQuery("name.full_pinyin", key).analyzer("full_pinyin_search_analyzer").boost(1.5f));

        // Pinyin-initials prefix
        should.add(QueryBuilders.matchQuery("name.first_letter.prefix", key).analyzer("first_letter_prefix_search_analyzer").maxExpansions(100).boost(1));

        // Pinyin-initials infix
        should.add(QueryBuilders.matchPhraseQuery("name.first_letter", key).analyzer("first_letter_search_analyzer").boost(0.8f));

        searchRequest.source().query(boolQueryBuilder);
    }
}

Model

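public class OureaHomeSuggestionModel {

    /** id */
    private String id;

    /** position/company name */
    private String name;

    /**
     * Position: status 0 not submitted, 1 pending review, 2 approved, 3 rejected
     * Company: status 0 unverified, 1 pending review, 2 verified, 3 not approved
     */
    private Integer status;

    /** 1 company, 2 position */
    private Integer type;

    /** whether the item is pinned to the top */
    private Integer top;

    /** whether the position is online: 1 online, 0 offline */
    private Integer onlined;

    /** sort order */
    private Double sequence;

    private String highlight;

    // getters and setters omitted
}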

Search: front end

My front-end skills are limited, so I found a search example online and adapted it. Please bear with me : )

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Test Baidu</title>
    <style>
        * {
            margin: 0;
            padding: 0;
        }
        em {
            color: red;
        }
    </style>
    <script>
        window.onload = function () {
            // the text input box
            var textElment = document.getElementById("text");
            // the suggestion dropdown
            var div = document.getElementById("tips");
            textElment.onkeyup = function () {
                // the value typed by the user
                var text = textElment.value;
                // hide the dropdown when the input is empty
                if (text == "") {
                    div.style.display = "none";
                    return;
                }
                var xhr = new XMLHttpRequest();
                // callback
                xhr.onreadystatechange = function () {
                    // only handle a completed, successful response
                    if (xhr.readyState === 4) {
                        if (xhr.status === 200) {
                            // data returned by the server
                            var str = xhr.responseText;
                            var childs = "";
                            // if nothing came back, show an empty dropdown and return
                            if (str == "") {
                                div.innerHTML = "<div></div>";
                                div.style.display = "block";
                                return;
                            }
                            // the original example split a comma-separated response; JSON is used here instead
                            var result = str.split(",");
                            var resultJson = JSON.parse(xhr.responseText);
                            console.log(resultJson)
                            // render each result as its own div and collect them in childs
                            for (var i = 0; i < resultJson.length; i++) {
                                var suggest = resultJson[i];
                                childs += "<div style='border-bottom: 1px solid pink' onclick='Write(this)' onmouseout='recoverColorwhenMouseout(this)' onmouseover='changeColorwhenMouseover(this)'>"
                                    + suggest.highlight + (suggest.type === 1 ? "(company)" : "(position)")
                                    + "</div>";
                            }
                            // put the collected divs into the dropdown container obtained above
                            div.innerHTML = childs;
                            div.style.display = "block";
                        }
                    }
                }
                // open the connection to the server
                xhr.open("GET", "ourea_home_v2/suggest?key=" + encodeURI(text).replace(/\+/g, '%2B'));
                // send the request
                xhr.send();
            }
        }

        // change the div color on mouse over
        function changeColorwhenMouseover(div) {
            div.style.backgroundColor = "blue";
        }

        // restore the div color on mouse out
        function recoverColorwhenMouseout(div) {
            div.style.backgroundColor = "";
        }

        // on click, copy the div content into the text input
        function Write(div) {
            document.getElementById("text").value = div.innerHTML;
            // hide the dropdown
            div.parentNode.style.display = "none";
        }
    </script>
</head>
<body>
<!-- text input box -->
<div id="serach" style="margin-left: 500px">
    <input type="text" name="text" id="text" />
    <input type="submit" value="Search" />
</div>
<!-- suggestion dropdown -->
<div id="tips" style="display: none;width: 300px; border: 1px solid pink; margin-left: 500px">
</div>
</body>
</html>

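Testing

Now let's check whether the search suggestions we built meet the requirements.

Chinese prefix

Chinese infix

Full-pinyin prefix

Full-pinyin infix

Chinese + full pinyin, prefix

Chinese + full pinyin, infix

Chinese + pinyin initials, prefix

Chinese + pinyin initials, infix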

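Known issues

Because the pinyin analyzer filters out symbols such as +, searching for something like C++ runs into problems. I have not found a good solution for this yet.

Summary

That brings this article to a close. This hands-on exercise in Chinese search suggestions served as a check and a summary of what I have learned recently about the Spring framework and Elastic Search. There are still plenty of open issues, but the search suggestions basically do what I wanted. That's it. I wish you all the best.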
