In many applications, highlighting search matches noticeably improves the user experience. Both of the common enterprise search engines, Elasticsearch and Solr, provide highlighting, and in both cases the feature comes from Lucene's highlight modules. Lucene can highlight matches in one or more fields and supports three highlighting implementations: highlighter, fast-vector-highlighter, and postings-highlighter. In Solr, highlighter is the default; it is the default in Elasticsearch as well.
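In Elasticsearch the implementation is chosen per field in the highlight section of the search request. A minimal sketch (my_index and text are placeholder names; type accepts plain, fvh, or postings, provided the index stores the data that type needs):

POST /my_index/_search
{
  "query": { "match": { "text": "国美电器" } },
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": { "text": { "type": "plain" } }
  }
}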

highlighter

The highlighter is also called the plain highlighter. It has real strengths and real weaknesses; the weaknesses first. The plain highlighter analyzes text at query time: once ES has collected the matching doc IDs, it loads each field to be highlighted into memory, runs that field's analyzer to tokenize the text again, and then uses a similarity calculation to pick the top-n scoring fragments, which it returns with the matches marked up. Take the ansj analyzer as an example: its published throughput is 600,000-800,000 characters per second, but real servers (whose clock speeds tend to be low) do worse, and in production ansj typically manages 400,000-500,000 characters per second. Now suppose users search large documents that all need highlighting, displayed 40 hits per page at 20 KB per document: that is roughly 800 KB of text to re-analyze per page, so even if similarity scoring and result sorting cost nothing, highlighting alone drags the query out to nearly two seconds, which is hard to tolerate.

The plain highlighter's strength also comes from the same real-time analysis. Because it stores nothing extra, it needs little I/O and little disk space (with a rich dictionary it saves about half the storage of the fvh approach), spending CPU to relieve I/O pressure. On short fields, such as article titles, it is fast, and since it touches the disk rarely, the low I/O load helps overall system throughput.

fast-vector-highlighter

To fix the plain highlighter's performance problem on large text fields, Lucene's highlight module offers a term-vector-based implementation, fast-vector-highlighter. To use fast-vector-highlighter (fvh), the field must be indexed with term vectors that store term positions and offsets (see the mapping sketch after the list below). At highlight time, fvh works as follows:
1. Parse the highlight query and extract the set of query terms to highlight.
2. Read the term vectors for the document's field from disk.
3. Walk the term-vector set and keep the vectors for terms that appear in the query.
4. For each matched term vector, read its term frequency and, from that, every position and offset.
5. Run a similarity calculation to select the top-n scoring highlight fragments.
6. Read the field content (multi-valued fields are joined with spaces) and cut the highlight fragments directly at the stored offsets. (Note: Lucene's stock highlighting has bugs here, in both the core and the highlighter modules; I have written before about how to fix them.)
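As an example, a mapping sketch that enables fvh (ES 2.x-era syntax to match the test indices shown later; the exact mapping used for the test data is not given in this post):

PUT /test_v1
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}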
As the steps show, fast-vector-highlighter skips the real-time analysis but adds disk reads, so it too has clear pros and cons.

Cons:

(1) fast-vector-highlighter requires stored term vectors, and on a system with a rich vocabulary, an index with term vectors often takes roughly twice the space of one without.

(2) fast-vector-highlighter performs at least twice as many I/O operations as plain highlighting and reads at least twice as many bytes; the flood of I/O requests lowers the search engine's concurrency.

Pros:
(1) When real-time analysis is slower than random disk reads, reading term vectors from disk wins clearly. For example, the ansj analyzer needs about two seconds for a one-million-character document, while a typical enterprise hard disk spins at about 10,000 RPM, giving roughly 160 seeks per second; one seek plus a 20 KB read costs about 7-10 ms, so fetching about 2 MB of data in 40 reads takes roughly 300 ms, and repeated reads are faster still because they hit the I/O cache. Compared with the plain approach, fvh therefore has a clear advantage when field values are large.

The default plain highlighter is compact but slow on large fields; fvh is fast on large fields but space-hungry. Is there a compromise that neither takes too much space nor is too slow on large fields? There is: Lucene also provides the postings-highlighter (postings). It likewise highlights from indexed position and offset data, but unlike fvh it does not store full term vectors; it keeps the offsets in the postings lists instead (see the mapping sketch below), so on medium-to-large fields it saves roughly 20-30% of the storage fvh needs. In practice neither its strengths nor its weaknesses stand out, so a workable rule is: use the plain highlighter for small fields and fast-vector-highlighter for large ones.
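For completeness, the postings highlighter is switched on by indexing offsets into the postings lists rather than storing term vectors; a mapping sketch under the same assumptions as above (postings_index is a placeholder name):

PUT /postings_index
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "index_options": "offsets"
        }
      }
    }
  }
}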

So today, Lucene's stock options are the default plain highlighter, which is compact but too slow on large text, and fvh, which is fast but costs too much disk space and I/O; in production, neither delivers satisfactory throughput and storage at the same time. To match the space footprint of the default highlighter while beating fast-vector-highlighter on speed, I wrote my own highlighter, modeled on the structure of Lucene's highlighters, and named it fast-highlighter.

fast-highlighter consists of several parts:
1. FastPlainHighlighter, the plugin entry point the ES environment calls, which handles the environment plumbing.
2. TreeAnalysis, a high-performance analyzer that uses the terms extracted by the FieldQuery class as its dictionary.
3. Classes that compute the highlight fragments and handle encoding/decoding.
 
The hard parts to implement:
1. Phrase highlighting (a quoted phrase is split into several terms, and only occurrences whose positions line up consecutively may be highlighted).

2. Optimal fragment selection (computing the top-n fragments that match best, i.e. contain the most highlighted terms).

3. Matching that ignores letter case and treats full-width and half-width characters as equal.

Testing the code.

Test setup:

(1) An index of documents around 10 KB each (total index size 1.5 GB).

(2) Search for articles containing the keyword "国美电器" (Gome) and return 40 highlighted hits (a request of roughly the shape sketched below).
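The exact request body is not given in the post; a sketch of a request with this shape, assuming a simple match query on the text field (for the second run, the type would name the custom highlighter instead):

POST /test_v1/_search
{
  "size": 40,
  "query": { "match": { "text": "国美电器" } },
  "highlight": {
    "fields": { "text": { "type": "fvh" } }
  }
}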

Test results:

With fast-vector-highlighter, the query takes 336 ms:

{
  "took": 336,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v1",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "且主要品类零售额增速均高于上年同期水平。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日公司宣布正式推出全球首家专业 VR 影院,地点位于国美旗下大中<b>电器</b>北京马甸店。<b>国美</b> VR 影院将打破售票入场形式,采用“时间售卖”的方式,正式对外营业后一小时将收费"
          ]
        }
      }
    ]
  }
}

With fast-highlighter, the same query takes 132 ms:

{
  "took": 132,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 115,
    "max_score": 0.19190195,
    "hits": [
      {
        "_index": "test_v2",
        "_type": "test",
        "_id": "51000508",
        "_score": 0.19190195,
        "highlight": {
          "text": [
            "家用<b>电器</b>类零售额同比增长 1.6%,相比上年同期加快了 11.8 个百分点。(2) 6 月 12 日,<b>国美</b><b>电器</b>宣布,其股东特别大会已通过公司更名议案。中文名称由“<b>国美</b><b>电器</b>控股有限公司”更改为“<b>国美</b>零售控股有限公司”。同日,公司宣布正式推出全球首家专业 VR 影院,地点位于<b>国美</b>旗下大中<b>电器</b>北京马甸店。<b>国美</b>"
          ]
        }
      }
    ]
  }
}

As the results show, my highlighter is more than twice as fast as the fvh highlighter. The core code of fast-highlighter follows.

package org.elasticsearch.search.highlight;

import com.google.common.collect.Maps;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.vectorhighlight.BoundaryScanner;
import org.apache.lucene.search.vectorhighlight.CustomFieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.SimpleBoundaryScanner;
import org.apache.lucene.search.vectorhighlight.FieldQuery.Phrase;
import org.apache.lucene.util.BytesRefHash;
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.search.fetch.FetchPhaseExecutionException;
import org.elasticsearch.search.fetch.FetchSubPhase;
import org.elasticsearch.search.internal.SearchContext;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * @author jkuang.nj
 */
public class FastPlainHighlighter implements Highlighter {
  private static final String CACHE_KEY = "highlight-fast";
  // NUL separator used to encode term gaps, so phrase adjacency can be checked with indexOf
  public static final char mark = 0;
  private static final SimpleBoundaryScanner DEFAULT_BOUNDARY_SCANNER = new SimpleBoundaryScanner();

  @Override
  public HighlightField highlight(HighlighterContext highlighterContext) {
    SearchContextHighlight.Field field = highlighterContext.field;
    SearchContext context = highlighterContext.context;
    FetchSubPhase.HitContext hitContext = highlighterContext.hitContext;
    FieldMapper mapper = highlighterContext.mapper;
    Encoder encoder = field.fieldOptions().encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT;
    if (!hitContext.cache().containsKey(CACHE_KEY)) {
      hitContext.cache().put(CACHE_KEY, new HighlighterEntry());
    }
    HighlighterEntry cache = (HighlighterEntry) hitContext.cache().get(CACHE_KEY);
    try {
      FieldQuery fieldQuery;
      if (field.fieldOptions().requireFieldMatch()) {
        if (cache.fieldMatchFieldQuery == null) {
          cache.fieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
              field.fieldOptions().requireFieldMatch());
        }
        fieldQuery = cache.fieldMatchFieldQuery;
      } else {
        if (cache.noFieldMatchFieldQuery == null) {
          cache.noFieldMatchFieldQuery = new CustomFieldQuery(highlighterContext.query, hitContext.topLevelReader(), true,
              field.fieldOptions().requireFieldMatch());
        }
        fieldQuery = cache.noFieldMatchFieldQuery;
      }
      if (!cache.analysises.containsKey(field.field())) {
        cache.setPhrases(field.field(), fieldQuery.getPhrases(field.field()));
        cache.setWords(field.field(), fieldQuery.getTermSet(field.field()));
      }
      FastHighlighter entry = cache.mappers.get(mapper);
      if (entry == null) {
        BoundaryScanner boundaryScanner = DEFAULT_BOUNDARY_SCANNER;
        if (field.fieldOptions().boundaryMaxScan() != SimpleBoundaryScanner.DEFAULT_MAX_SCAN
            || field.fieldOptions().boundaryChars() != SimpleBoundaryScanner.DEFAULT_BOUNDARY_CHARS) {
          boundaryScanner = new SimpleBoundaryScanner(field.fieldOptions().boundaryMaxScan(), field.fieldOptions().boundaryChars());
        }
        CustomFieldQuery.highlightFilters.set(field.fieldOptions().highlightFilter());
        entry = new FastHighlighter(encoder, boundaryScanner);
        cache.mappers.put(mapper, entry);
      }
      String[] fragments;
      int numberOfFragments = field.fieldOptions().numberOfFragments() == 0 ? 1 : field.fieldOptions().numberOfFragments();
      int fragmentCharSize = field.fieldOptions().numberOfFragments() == 0 ? 50 : field.fieldOptions().fragmentCharSize();
      List<Object> textsToHighlight = null;
      try {
        textsToHighlight = HighlightUtils.loadFieldValues(field, mapper, context, hitContext);
        StringBuilder buffer = new StringBuilder();
        for (Object textToHighlight : textsToHighlight) {
          String text = textToHighlight.toString();
          buffer.append(text).append(" ");
        }
        fragments = entry.getBestBestFragments(cache.analysises.get(field.field()), cache.phrases.get(field.field()), buffer,
            numberOfFragments, fragmentCharSize, field.fieldOptions().preTags(), field.fieldOptions().postTags());
      } catch (Exception e) {
        e.printStackTrace();
        if (ExceptionsHelper.unwrap(e, BytesRefHash.MaxBytesLengthExceededException.class) != null) {
          return null;
        } else {
          throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
        }
      }
      if (fragments != null && fragments.length > 0) {
        return new HighlightField(highlighterContext.fieldName, Text.convertFromStringArray(fragments));
      }
      int noMatchSize = highlighterContext.field.fieldOptions().noMatchSize();
      if (noMatchSize > 0 && textsToHighlight.size() > 0) {
        String fieldContents = textsToHighlight.get(0).toString();
        return new HighlightField(highlighterContext.fieldName,
            new Text[] { new Text(fieldContents.substring(0, Math.min(fragmentCharSize, fieldContents.length()))) });
      }
      return null;
    } catch (Exception e) {
      throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e);
    }
  }

  @Override
  public boolean canHighlight(FieldMapper fieldMapper) {
    return true;
  }

  private class HighlighterEntry {
    public FieldQuery noFieldMatchFieldQuery;
    public FieldQuery fieldMatchFieldQuery;
    public Map<String, Set<Phrase>> phrases = new HashMap<>();
    public Map<FieldMapper, FastHighlighter> mappers = Maps.newHashMap();
    public Map<String, TreeAnalysis> analysises = new HashMap<>();

    public void setPhrases(String field, Set<Phrase> phrases) {
      if (!this.phrases.containsKey(field)) {
        this.phrases.put(field, phrases);
      }
    }

    // builds the TreeAnalysis dictionary from the terms extracted out of the query
    public void setWords(String field, Set<String> words) {
      if (!analysises.containsKey(field)) {
        TreeAnalysis analysis = new TreeAnalysis();
        if (words != null && words.size() > 0) {
          for (String word : words) {
            analysis.add(word);
          }
        }
        analysises.put(field, analysis);
      }
    }
  }

  static class FragmentScore implements Comparable<FragmentScore> {
    int point = 0;
    int distance = 0;
    List<Term> terms = new ArrayList<>();
    HashSet<String> set = new HashSet<>();
    StringBuffer buffer = new StringBuffer();

    public FragmentScore(int distance) {
      this.distance = distance;
    }

    // reward fragments that contain a whole query phrase
    public void updateScore(Set<Phrase> phrases) {
      for (Phrase phrase : phrases) {
        if (buffer.indexOf(phrase.toString()) >= 0) {
          this.point += 5 * phrase.list.size();
        }
      }
    }

    public boolean add(Term term) {
      if (terms.size() == 0 || term.pos - terms.get(0).pos <= distance) {
        if (terms.size() == 0) {
          buffer.append(term.word);
        } else {
          // encode the position gap between neighbouring terms with NUL separators,
          // in the same format Phrase.toString() produces
          int dis = term.pos - terms.get(terms.size() - 1).pos;
          buffer.append(mark).append(dis).append(mark);
          buffer.append(term.word);
        }
        terms.add(term);
        this.point += term.length();
        if (set.size() > 0) {
          if (!set.contains(term.word)) {
            this.point += 2;
            set.add(term.word);
          }
          if (term.pos - terms.get(terms.size() - 1).pos == 1) {
            this.point += 2;
          }
        }
        return true;
      }
      return false;
    }

    @Override
    public int compareTo(FragmentScore o) {
      return -(this.point - o.point);
    }
  }

  public class FastHighlighter {
    BoundaryScanner boundaryScanner;
    Encoder encoder;

    public FastHighlighter(Encoder encoder, BoundaryScanner boundaryScanner) {
      this.encoder = encoder;
      this.boundaryScanner = boundaryScanner;
    }

    public String[] getBestBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, StringBuilder buffer, int maxNumFragments,
        int fragmentSize, String[] preTags, String[] postTags) {
      List<FragmentScore> fragmentScores;
      if (maxNumFragments <= 1) {
        fragmentScores = getBestFragments(analyzer, phrases, buffer.toString(), fragmentSize);
      } else {
        fragmentScores = getBestFragments(analyzer, buffer.toString(), maxNumFragments, fragmentSize);
      }
      return toString(buffer, fragmentSize, fragmentScores, preTags, postTags);
    }

    // renders the selected fragments, wrapping each matched term in pre/post tags
    public String[] toString(StringBuilder buffer, int fragmentSize, List<FragmentScore> fragmentScores, String[] preTags,
        String[] postTags) {
      List<String> list = new ArrayList<>();
      for (FragmentScore score : fragmentScores) {
        List<Term> terms = score.terms;
        Term head = terms.get(0);
        Term tail = terms.get(terms.size() - 1);
        int start = boundaryScanner.findStartOffset(buffer, head.startoffset);
        int end = boundaryScanner.findEndOffset(buffer, tail.endoffset());
        if (fragmentScores.size() == 1 && buffer.length() <= fragmentSize) {
          start = 0;
          end = buffer.length();
        } else if (fragmentSize - (tail.endoffset() - head.startoffset) > (fragmentSize / 10)) {
          int size = fragmentSize - (tail.endoffset() - head.startoffset);
          if (head.startoffset < (size * 3 / 10)) {
            start = 0;
          } else {
            start = boundaryScanner.findStartOffset(buffer, head.startoffset);
          }
          if (buffer.length() - start < fragmentSize) {
            end = buffer.length();
          } else {
            end = boundaryScanner.findEndOffset(buffer, Math.max(start + fragmentSize, tail.endoffset()));
          }
        }
        StringBuffer result = new StringBuffer();
        for (int i = 0; i < terms.size(); i++) {
          Term term = terms.get(i);
          result.append(buffer.substring(start, term.startoffset));
          result.append(getTag(preTags, i));
          result.append(encoder.encodeText(buffer.substring(term.startoffset, term.endoffset())));
          result.append(getTag(postTags, i));
          start = term.endoffset();
        }
        result.append(buffer.substring(start, end));
        list.add(result.toString());
      }
      return list.toArray(new String[0]);
    }

    // single-fragment case: slide a window over the matched terms and keep the best-scoring one
    public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, Set<Phrase> phrases, String text, int fragmentSize) {
      if (analyzer == null) {
        return new ArrayList<>();
      }
      List<FragmentScore> fragments = new ArrayList<>();
      FragmentScore fragmentScore = null;
      List<Term> terms = analyzer.find(text);
      for (int i = 0, j = 0; i < terms.size(); i++) {
        FragmentScore fScore = new FragmentScore(fragmentSize);
        for (j = i; j < terms.size(); j++) {
          if (!fScore.add(terms.get(j))) {
            break;
          }
        }
        fScore.updateScore(phrases);
        if (fragmentScore == null || fragmentScore.compareTo(fScore) >= 0) {
          fragmentScore = fScore;
        }
        if (j >= terms.size()) {
          break;
        }
      }
      if (fragmentScore != null) {
        fragments.add(fragmentScore);
      }
      return fragments;
    }

    // multi-fragment case: split the matched terms into windows, sort by score, keep the top n
    public final List<FragmentScore> getBestFragments(TreeAnalysis analyzer, String text, int maxNumFragments, int fragmentSize) {
      if (analyzer == null) {
        return null;
      }
      List<Term> terms = analyzer.find(text);
      List<FragmentScore> fragments = new ArrayList<>();
      FragmentScore fScore = new FragmentScore(fragmentSize);
      for (int i = 0; i < terms.size(); i++) {
        if (!fScore.add(terms.get(i))) {
          fragments.add(fScore);
          fScore = new FragmentScore(fragmentSize);
          fScore.add(terms.get(i));
        }
      }
      fragments.add(fScore);
      Collections.sort(fragments);
      while (fragments.size() > maxNumFragments) {
        fragments.remove(fragments.size() - 1);
      }
      return fragments;
    }

    protected String getTag(String[] tags, int num) {
      int n = num % tags.length;
      return tags[n];
    }
  }

  public static class Term {
    String word;
    int startoffset, pos;

    public Term(int startoffset, int pos, String word) {
      this.startoffset = startoffset;
      this.pos = pos;
      this.word = word;
    }

    public int endoffset() {
      return this.startoffset + word.length();
    }

    public int length() {
      return word.length();
    }

    public String toString() {
      return "start:" + startoffset + " pos:" + pos + " word:" + word;
    }
  }

  // a trie ("tree analyzer") whose dictionary is the set of query terms;
  // it scans the field text once and emits the matched terms with offsets
  public static class TreeAnalysis {
    private TNode root = new TNode((char) 0, false);
    boolean[] nodes = new boolean[64 * 1024];
    static final char ch0 = '\uFF00';
    static final char ch1 = '\uFF5F';

    public List<Term> find(String str) {
      int start = 0;
      int length = str.length();
      str = str.toLowerCase();
      char[] values = str.toCharArray();
      List<Term> terms = new ArrayList<>();
      int sumpos = 0;
      while (start < length) {
        char ch = values[start];
        // fold full-width characters to half-width
        ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
        if (!nodes[ch]) {
          start++;
          continue;
        } else {
          int pos = root.find(values, start, -1);
          if (pos >= start) {
            terms.add(new Term(start, start - sumpos + terms.size(), str.substring(start, pos + 1)));
            sumpos += pos + 1 - start;
            start = pos + 1;
          } else {
            start++;
          }
        }
      }
      return terms;
    }

    public void add(String str) {
      if (str == null || str.length() == 0) {
        return;
      }
      str = str.toLowerCase();
      nodes[(int) str.charAt(0)] = true;
      root.insert(str, 0);
    }

    private static class TNode implements Comparable<TNode> {
      // marks whether this node is the last character of a word
      boolean mark;
      // the character held by this node
      char value;
      // child nodes, kept sorted by character
      TNode[] nodes;
      int nodesize;

      public TNode(char ch, boolean mark) {
        this.value = ch;
        this.mark = mark;
      }

      // returns the end position of the longest dictionary word starting at nextPos, or -1
      public int find(char[] chs, int nextPos, int leafoffset) {
        if (nextPos >= chs.length) {
          return -1;
        }
        int size = 0;
        char ch = chs[nextPos];
        // fold full-width characters to half-width
        ch = (char) (ch > ch0 && ch < ch1 ? ch - 65248 : ch);
        while (size < this.nodesize && nodes[size++].value < ch);
        int pos = nodes[size - 1].value == ch ? size - 1 : -1;
        if (pos >= 0) {
          if (nodes[pos].mark) {
            leafoffset = nextPos;
            if (nodes[pos].nodesize == 0) {
              return nextPos;
            }
          }
          int next = nodes[pos].find(chs, nextPos + 1, leafoffset);
          return next > leafoffset ? next : leafoffset;
        } else {
          return -1;
        }
      }

      int indexOf(TNode[] nodes, int size, char node, Type type) {
        int fromIndex = 0;
        int toIndex = size - 1;
        while (fromIndex <= toIndex) {
          int mid = (fromIndex + toIndex) >> 1;
          int cmp = nodes[mid].compareTo(node);
          if (cmp < 0)
            fromIndex = mid + 1;
          else if (cmp > 0)
            toIndex = mid - 1;
          else
            return type == Type._insert ? -(mid + 1) : mid; // key found
        }
        switch (type) {
        case _insert:
          return fromIndex;
        case _index:
          return -(fromIndex + 1);
        default:
          return toIndex;
        }
      }

      public void insert(String str, int pos) {
        char ch = str.charAt(pos);
        boolean isleaf = pos == str.length() - 1;
        if (this.nodesize == 0) {
          nodes = new TNode[1];
          nodes[0] = new TNode(ch, isleaf);
          if (!isleaf) {
            nodes[0].insert(str, pos + 1);
          }
          this.nodesize++;
        } else {
          int _index = indexOf(nodes, nodesize, ch, Type._insert);
          if (_index >= 0) {
            int moved = this.nodesize - _index;
            if (this.nodesize == nodes.length) {
              nodes = Arrays.copyOf(nodes, nodes.length + 1);
            }
            if (moved > 0) {
              System.arraycopy(nodes, _index, nodes, _index + 1, moved);
            }
            nodes[_index] = new TNode(ch, isleaf);
            if (!isleaf) {
              nodes[_index].insert(str, pos + 1);
            }
            this.nodesize++;
          } else {
            if (isleaf) {
              // mark the existing node as a word end (the original marked nodes[0] here, which is a bug)
              nodes[-_index - 1].mark = true;
            } else {
              nodes[-_index - 1].insert(str, pos + 1);
            }
          }
        }
      }

      @Override
      public int compareTo(TNode o) {
        if (this.value > o.value) {
          return 1;
        } else if (this.value < o.value) {
          return -1;
        }
        return 0;
      }

      public int compareTo(char o) {
        if (this.value > o) {
          return 1;
        } else if (this.value < o) {
          return -1;
        }
        return 0;
      }

      enum Type {
        _insert, _index
      }
    }
  }
}

Below is the patched copy of Lucene's FieldQuery; the fast-highlighter additions are the Phrase class, phraseMap, and the getPhrases/getTerms accessors.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search.vectorhighlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack.TermInfo;

/**
 * FieldQuery breaks down query object into terms/phrases and keeps
 * them in a QueryPhraseMap structure.
 */
public class FieldQuery {
  final boolean fieldMatch;

  // fieldMatch==true,  Map<fieldName,QueryPhraseMap>
  // fieldMatch==false, Map<null,QueryPhraseMap>
  Map<String, QueryPhraseMap> rootMaps = new HashMap<>();

  // fieldMatch==true,  Map<fieldName,setOfTermsInQueries>
  // fieldMatch==false, Map<null,setOfTermsInQueries>
  Map<String, Set<String>> termSetMap = new HashMap<>();

  // stores the phrases of the query
  Map<String, Set<Phrase>> phraseMap = new HashMap<>();

  int termOrPhraseNumber; // used for colored tag support

  // The maximum number of different matching terms accumulated from any one MultiTermQuery
  private static final int MAX_MTQ_TERMS = 1024;

  public static class Phrase {
    public List<Term> list = new ArrayList<>();
    StringBuffer buffer = new StringBuffer();

    public void add(String word, int position) {
      if (list.size() == 0) {
        buffer.append(word);
      } else {
        // encode the position gap with NUL separators, matching FragmentScore's buffer format
        int pos = position - list.get(list.size() - 1).position;
        char z = 0;
        buffer.append(z).append(pos).append(z);
        buffer.append(word);
      }
      list.add(new Term(position, word));
    }

    public String toString() {
      return buffer.toString();
    }

    public static class Term {
      public int position;
      public String word;

      public Term(int position, String word) {
        this.position = position;
        this.word = word;
      }
    }
  }

  protected FieldQuery(Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch) throws IOException {
    this.fieldMatch = fieldMatch;
    Set<Query> flatQueries = new LinkedHashSet<>();
    flatten(query, reader, flatQueries, 1f);
    saveTerms(flatQueries, reader);
    Collection<Query> expandQueries = expand(flatQueries);
    for (Query flatQuery : expandQueries) {
      QueryPhraseMap rootMap = getRootMap(flatQuery);
      rootMap.add(flatQuery, reader);
      float boost = 1f;
      while (flatQuery instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) flatQuery;
        flatQuery = bq.getQuery();
        boost *= bq.getBoost();
      }
      if (!phraseHighlight && flatQuery instanceof PhraseQuery) {
        PhraseQuery pq = (PhraseQuery) flatQuery;
        if (pq.getTerms().length > 1) {
          for (Term term : pq.getTerms())
            rootMap.addTerm(term, boost);
        }
      }
    }
  }

  /** For backwards compatibility you can initialize FieldQuery without
   *  an IndexReader, which is only required to support MultiTermQuery */
  FieldQuery(Query query, boolean phraseHighlight, boolean fieldMatch) throws IOException {
    this(query, null, phraseHighlight, fieldMatch);
  }

  void flatten(Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost) throws IOException {
    while (true) {
      if (sourceQuery.getBoost() != 1f) {
        boost *= sourceQuery.getBoost();
        sourceQuery = sourceQuery.clone();
        sourceQuery.setBoost(1f);
      } else if (sourceQuery instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) sourceQuery;
        sourceQuery = bq.getQuery();
        boost *= bq.getBoost();
      } else {
        break;
      }
    }
    if (sourceQuery instanceof BooleanQuery) {
      BooleanQuery bq = (BooleanQuery) sourceQuery;
      for (BooleanClause clause : bq) {
        if (!clause.isProhibited()) {
          flatten(clause.getQuery(), reader, flatQueries, boost);
        }
      }
    } else if (sourceQuery instanceof DisjunctionMaxQuery) {
      DisjunctionMaxQuery dmq = (DisjunctionMaxQuery) sourceQuery;
      for (Query query : dmq) {
        flatten(query, reader, flatQueries, boost);
      }
    } else if (sourceQuery instanceof TermQuery) {
      if (boost != 1f) {
        sourceQuery = new BoostQuery(sourceQuery, boost);
      }
      if (!flatQueries.contains(sourceQuery))
        flatQueries.add(sourceQuery);
    } else if (sourceQuery instanceof PhraseQuery) {
      PhraseQuery pq = (PhraseQuery) sourceQuery;
      if (pq.getTerms().length == 1)
        sourceQuery = new TermQuery(pq.getTerms()[0]);
      if (boost != 1f) {
        sourceQuery = new BoostQuery(sourceQuery, boost);
      }
      flatQueries.add(sourceQuery);
    } else if (sourceQuery instanceof ConstantScoreQuery) {
      final Query q = ((ConstantScoreQuery) sourceQuery).getQuery();
      if (q != null) {
        flatten(q, reader, flatQueries, boost);
      }
    } else if (sourceQuery instanceof FilteredQuery) {
      final Query q = ((FilteredQuery) sourceQuery).getQuery();
      if (q != null) {
        flatten(q, reader, flatQueries, boost);
      }
    } else if (sourceQuery instanceof CustomScoreQuery) {
      final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
      if (q != null) {
        flatten(q, reader, flatQueries, boost);
      }
    } else if (reader != null) {
      Query query = sourceQuery;
      Query rewritten;
      if (sourceQuery instanceof MultiTermQuery) {
        rewritten = new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS).rewrite(reader, (MultiTermQuery) query);
      } else {
        rewritten = query.rewrite(reader);
      }
      if (rewritten != query) {
        // only rewrite once and then flatten again - the rewritten query could have a special treatment
        // if this method is overwritten in a subclass.
        flatten(rewritten, reader, flatQueries, boost);
      }
      // if the query is already rewritten we discard it
    }
    // else discard queries
  }

  /*
   * Create expandQueries from flatQueries.
   *
   * expandQueries := flatQueries + overlapped phrase queries
   *
   * ex1) flatQueries={a,b,c}
   *      => expandQueries={a,b,c}
   * ex2) flatQueries={a,"b c","c d"}
   *      => expandQueries={a,"b c","c d","b c d"}
   */
  Collection<Query> expand(Collection<Query> flatQueries) {
    Set<Query> expandQueries = new LinkedHashSet<>();
    for (Iterator<Query> i = flatQueries.iterator(); i.hasNext();) {
      Query query = i.next();
      i.remove();
      expandQueries.add(query);
      float queryBoost = 1f;
      while (query instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) query;
        queryBoost *= bq.getBoost();
        query = bq.getQuery();
      }
      if (!(query instanceof PhraseQuery))
        continue;
      for (Iterator<Query> j = flatQueries.iterator(); j.hasNext();) {
        Query qj = j.next();
        float qjBoost = 1f;
        while (qj instanceof BoostQuery) {
          BoostQuery bq = (BoostQuery) qj;
          qjBoost *= bq.getBoost();
          qj = bq.getQuery();
        }
        if (!(qj instanceof PhraseQuery))
          continue;
        checkOverlap(expandQueries, (PhraseQuery) query, queryBoost, (PhraseQuery) qj, qjBoost);
      }
    }
    return expandQueries;
  }

  /*
   * Check if PhraseQuery A and B have overlapped part.
   *
   * ex1) A="a b", B="b c" => overlap; expandQueries={"a b c"}
   * ex2) A="b c", B="a b" => overlap; expandQueries={"a b c"}
   * ex3) A="a b", B="c d" => no overlap; expandQueries={}
   */
  private void checkOverlap(Collection<Query> expandQueries, PhraseQuery a, float aBoost, PhraseQuery b, float bBoost) {
    if (a.getSlop() != b.getSlop())
      return;
    Term[] ats = a.getTerms();
    Term[] bts = b.getTerms();
    if (fieldMatch && !ats[0].field().equals(bts[0].field()))
      return;
    checkOverlap(expandQueries, ats, bts, a.getSlop(), aBoost);
    checkOverlap(expandQueries, bts, ats, b.getSlop(), bBoost);
  }

  /*
   * Check if src and dest have overlapped part and if it is, create PhraseQueries and add expandQueries.
   *
   * ex1) src="a b", dest="c d"       => no overlap
   * ex2) src="a b", dest="a b c"     => no overlap
   * ex3) src="a b", dest="b c"       => overlap; expandQueries={"a b c"}
   * ex4) src="a b c", dest="b c d"   => overlap; expandQueries={"a b c d"}
   * ex5) src="a b c", dest="b c"     => no overlap
   * ex6) src="a b c", dest="b"       => no overlap
   * ex7) src="a a a a", dest="a a a" => overlap; expandQueries={"a a a a a","a a a a a a"}
   * ex8) src="a b c d", dest="b c"   => no overlap
   */
  private void checkOverlap(Collection<Query> expandQueries, Term[] src, Term[] dest, int slop, float boost) {
    // beginning from 1 (not 0) is safe because that the PhraseQuery has multiple terms
    // is guaranteed in flatten() method (if PhraseQuery has only one term, flatten()
    // converts PhraseQuery to TermQuery)
    for (int i = 1; i < src.length; i++) {
      boolean overlap = true;
      for (int j = i; j < src.length; j++) {
        if ((j - i) < dest.length && !src[j].text().equals(dest[j - i].text())) {
          overlap = false;
          break;
        }
      }
      if (overlap && src.length - i < dest.length) {
        PhraseQuery.Builder pqBuilder = new PhraseQuery.Builder();
        for (Term srcTerm : src)
          pqBuilder.add(srcTerm);
        for (int k = src.length - i; k < dest.length; k++) {
          pqBuilder.add(new Term(src[0].field(), dest[k].text()));
        }
        pqBuilder.setSlop(slop);
        Query pq = pqBuilder.build();
        if (boost != 1f) {
          pq = new BoostQuery(pq, 1f);
        }
        if (!expandQueries.contains(pq))
          expandQueries.add(pq);
      }
    }
  }

  QueryPhraseMap getRootMap(Query query) {
    String key = getKey(query);
    QueryPhraseMap map = rootMaps.get(key);
    if (map == null) {
      map = new QueryPhraseMap(this);
      rootMaps.put(key, map);
    }
    return map;
  }

  /*
   * Return 'key' string. 'key' is the field name of the Query.
   * If not fieldMatch, 'key' will be null.
   */
  private String getKey(Query query) {
    if (!fieldMatch)
      return null;
    while (query instanceof BoostQuery) {
      query = ((BoostQuery) query).getQuery();
    }
    if (query instanceof TermQuery)
      return ((TermQuery) query).getTerm().field();
    else if (query instanceof PhraseQuery) {
      PhraseQuery pq = (PhraseQuery) query;
      Term[] terms = pq.getTerms();
      return terms[0].field();
    } else if (query instanceof MultiTermQuery) {
      return ((MultiTermQuery) query).getField();
    } else
      throw new RuntimeException("query \"" + query.toString() + "\" must be flatten first.");
  }

  /*
   * Save the set of terms in the queries to termSetMap.
   *
   * ex1) q=name:john
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john">>
   *
   * ex2) q=name:john title:manager
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john">,
   *                         "title",Set<"manager">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john","manager">>
   *
   * ex3) q=name:"john lennon"
   *      - fieldMatch==true
   *          termSetMap=Map<"name",Set<"john","lennon">>
   *      - fieldMatch==false
   *          termSetMap=Map<null,Set<"john","lennon">>
   */
  void saveTerms(Collection<Query> flatQueries, IndexReader reader) throws IOException {
    for (Query query : flatQueries) {
      while (query instanceof BoostQuery) {
        query = ((BoostQuery) query).getQuery();
      }
      Set<Phrase> terms = getTerms(query);
      Set<String> termSet = getTermSet(query);
      if (query instanceof TermQuery) {
        termSet.add(((TermQuery) query).getTerm().text());
      } else if (query instanceof PhraseQuery) {
        int[] positions = ((PhraseQuery) query).getPositions();
        Term[] terms2 = ((PhraseQuery) query).getTerms();
        Phrase phrase = new Phrase();
        for (int i = 0; i < terms2.length; i++) {
          phrase.add(terms2[i].text(), positions[i]);
          termSet.add(terms2[i].text());
        }
        if (terms2.length > 1) {
          terms.add(phrase);
        }
      } else if (query instanceof MultiTermQuery && reader != null) {
        BooleanQuery mtqTerms = (BooleanQuery) query.rewrite(reader);
        for (BooleanClause clause : mtqTerms) {
          termSet.add(((TermQuery) clause.getQuery()).getTerm().text());
        }
      } else
        throw new RuntimeException("query \"" + query.toString() + "\" must be flatten first.");
    }
  }

  private Set<String> getTermSet(Query query) {
    String key = getKey(query);
    Set<String> set = termSetMap.get(key);
    if (set == null) {
      set = new HashSet<>();
      termSetMap.put(key, set);
    }
    return set;
  }

  // used by the fast-highlighter
  private Set<Phrase> getTerms(Query query) {
    String key = getKey(query);
    Set<Phrase> set = phraseMap.get(key);
    if (set == null) {
      set = new HashSet<>();
      phraseMap.put(key, set);
    }
    return set;
  }

  public Set<String> getTermSet(String field) {
    return termSetMap.get(fieldMatch ? field : null);
  }

  /**
   * Phrases are kept whole and are not re-tokenized.
   */
  public Set<Phrase> getPhrases(String field) {
    return phraseMap.get(fieldMatch ? field : null);
  }

  /**
   * @return QueryPhraseMap of the term
   */
  public QueryPhraseMap getFieldTermMap(String fieldName, String term) {
    QueryPhraseMap rootMap = getRootMap(fieldName);
    return rootMap == null ? null : rootMap.subMap.get(term);
  }

  /**
   * @return QueryPhraseMap of the phrase
   */
  public QueryPhraseMap searchPhrase(String fieldName, final List<TermInfo> phraseCandidate) {
    QueryPhraseMap root = getRootMap(fieldName);
    if (root == null)
      return null;
    return root.searchPhrase(phraseCandidate);
  }

  private QueryPhraseMap getRootMap(String fieldName) {
    return rootMaps.get(fieldMatch ? fieldName : null);
  }

  public int nextTermOrPhraseNumber() {
    return termOrPhraseNumber++;
  }

  /**
   * Internal structure of a query for highlighting: represents
   * a nested query structure
   */
  public static class QueryPhraseMap {
    boolean terminal;
    int slop;               // valid if terminal == true and phraseHighlight == true
    float boost;            // valid if terminal == true
    int termOrPhraseNumber; // valid if terminal == true
    FieldQuery fieldQuery;
    Map<String, QueryPhraseMap> subMap = new HashMap<>();

    public QueryPhraseMap(FieldQuery fieldQuery) {
      this.fieldQuery = fieldQuery;
    }

    void addTerm(Term term, float boost) {
      QueryPhraseMap map = getOrNewMap(subMap, term.text());
      map.markTerminal(boost);
    }

    private QueryPhraseMap getOrNewMap(Map<String, QueryPhraseMap> subMap, String term) {
      QueryPhraseMap map = subMap.get(term);
      if (map == null) {
        map = new QueryPhraseMap(fieldQuery);
        subMap.put(term, map);
      }
      return map;
    }

    void add(Query query, IndexReader reader) {
      float boost = 1f;
      while (query instanceof BoostQuery) {
        BoostQuery bq = (BoostQuery) query;
        query = bq.getQuery();
        boost = bq.getBoost();
      }
      if (query instanceof TermQuery) {
        addTerm(((TermQuery) query).getTerm(), boost);
      } else if (query instanceof PhraseQuery) {
        PhraseQuery pq = (PhraseQuery) query;
        Term[] terms = pq.getTerms();
        Map<String, QueryPhraseMap> map = subMap;
        QueryPhraseMap qpm = null;
        for (Term term : terms) {
          qpm = getOrNewMap(map, term.text());
          map = qpm.subMap;
        }
        qpm.markTerminal(pq.getSlop(), boost);
      } else
        throw new RuntimeException("query \"" + query.toString() + "\" must be flatten first.");
    }

    public QueryPhraseMap getTermMap(String term) {
      return subMap.get(term);
    }

    private void markTerminal(float boost) {
      markTerminal(0, boost);
    }

    private void markTerminal(int slop, float boost) {
      this.terminal = true;
      this.slop = slop;
      this.boost = boost;
      this.termOrPhraseNumber = fieldQuery.nextTermOrPhraseNumber();
    }

    public boolean isTerminal() {
      return terminal;
    }

    public int getSlop() {
      return slop;
    }

    public float getBoost() {
      return boost;
    }

    public int getTermOrPhraseNumber() {
      return termOrPhraseNumber;
    }

    public QueryPhraseMap searchPhrase(final List<TermInfo> phraseCandidate) {
      QueryPhraseMap currMap = this;
      for (TermInfo ti : phraseCandidate) {
        currMap = currMap.subMap.get(ti.getText());
        if (currMap == null)
          return null;
      }
      return currMap.isValidTermOrPhrase(phraseCandidate) ? currMap : null;
    }

    public boolean isValidTermOrPhrase(final List<TermInfo> phraseCandidate) {
      // check terminal
      if (!terminal)
        return false;
      // if the candidate is a term, it is valid
      if (phraseCandidate.size() == 1)
        return true;
      // else check whether the candidate is a valid phrase:
      // compare position-gaps between terms to slop
      int pos = phraseCandidate.get(0).getPosition();
      for (int i = 1; i < phraseCandidate.size(); i++) {
        int nextPos = phraseCandidate.get(i).getPosition();
        if (Math.abs(nextPos - pos - 1) > slop)
          return false;
        pos = nextPos;
      }
      return true;
    }
  }
}
