Part 3: Understanding Analyzers, Tokenizers, and Filters
Filter Descriptions
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the <tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they take a TokenStream as input. In other words, a filter is placed after the <tokenizer> or after another filter, and it consumes a TokenStream. For example:
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    ...
  </analyzer>
</fieldType>
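A filter factory must implement org.apache.solr.analysis.TokenFilterFactory. Filters are similar to tokenizers in that both produce tokens from a TokenStream. You can chain any combination of filters after the tokenizer.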
Arguments may be passed to filter factories to modify their behavior by setting attributes on the <filter> element. For example:
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
    <filter class="solr.LengthFilterFactory" min="2" max="7"/>
  </analyzer>
</fieldType>
Filters discussed in this section:
ASCII Folding Filter
Beider-Morse Filter
Classic Filter
Common Grams Filter
Collation Key Filter
Daitch-Mokotoff Soundex Filter
Double Metaphone Filter
Edge N-Gram Filter
English Minimal Stem Filter
Fingerprint Filter
Hunspell Stem Filter
Hyphenated Words Filter
ICU Folding Filter
ICU Normalizer 2 Filter
ICU Transform Filter
Keep Word Filter
KStem Filter
Length Filter
Lower Case Filter
Managed Stop Filter
Managed Synonym Filter
N-Gram Filter
Numeric Payload Token Filter
Pattern Replace Filter
Phonetic Filter
Porter Stem Filter
Remove Duplicates Token Filter
Reversed Wildcard Filter
Shingle Filter
Snowball Porter Stemmer Filter
Standard Filter
Stop Filter
Suggest Stop Filter
Synonym Filter
Token Offset Payload Filter
Trim Filter
Type As Payload Filter
Type Token Filter
Word Delimiter Filter
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 128 Unicode characters, that is, the ASCII range) to their ASCII equivalents, if one exists.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
<filter class="solr.ASCIIFoldingFilterFactory"/>
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)
Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names, even if they are spelled differently or in different languages. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.BeiderMorseFilterFactory
nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or Sephardic names, use GENERIC.
ruleType: Types of rules to apply. Valid values are APPROX or EXACT.
concat: Defines if multiple possible matches should be combined with a pipe ("|").
languageSet: The language set to use. The value "auto" will allow the Filter to identify the language, or a comma-separated list can be supplied.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
        concat="true" languageSet="auto"/>
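Note: this filter is used for matching personal names. It relies on the Beider-Morse Phonetic Matching (BMPM) algorithm, which looks quite powerful and is worth reading up on. It can match names across different languages and different name types.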
Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
In: "I.B.M. cat's can't"
Tokenizer to Filter: "I.B.M", "cat's", "can't"
Out: "IBM", "cat", "can't"
Common Grams Filter
This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
words: (a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so Solr can read the stopwords file.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word file. The default is false.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"
Collation Key Filter

Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with advanced searches. We've covered this in much more detail in the section on Unicode Collation.
Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if they are
spelled differently. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
inject : (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
Double Metaphone Filter
This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec. For more
information, see the Phonetic Matching section.
Factory class: solr.DoubleMetaphoneFilterFactory
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated.
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory"/>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are
added at the same position.
Discard original token (inject="false").
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position.
Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 1) The maximum gram size.
Default behavior.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory"/>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
A range of 1 to 4.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
A range of 4 to 6.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6"/>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"
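Note: this has the same effect as the Edge N-Gram Tokenizer described earlier. The only difference is that it has no parameter for choosing the direction.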
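The edge n-gram behavior shown in the examples above can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's implementation; the function name `edge_ngrams` and its parameters are made up for this sketch and simply mirror minGramSize and maxGramSize.

```python
def edge_ngrams(tokens, min_gram=1, max_gram=1):
    """Emit leading-edge n-grams of each token, shortest first.

    Hypothetical helper: mirrors minGramSize/maxGramSize, nothing more.
    """
    out = []
    for tok in tokens:
        # Sizes larger than the token itself produce no gram.
        for size in range(min_gram, min(max_gram, len(tok)) + 1):
            out.append(tok[:size])
    return out


if __name__ == "__main__":
    print(edge_ngrams(["four", "score", "and", "twenty"], 4, 6))
```

With a range of 4 to 6, "and" yields nothing because it is shorter than the minimum gram size, which matches the third example above.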
English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
Fingerprint Filter
This filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.
This can be useful for clustering/linking use cases.
Factory class: solr.FingerprintFilterFactory
separator : The character used to separate tokens combined into the single output token. Defaults to " " (a
space character).
maxOutputTokenSize : The maximum length of the summarized output token. If exceeded, no output token is
emitted. Defaults to 1024.
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.FingerprintFilterFactory" separator="_" />
</analyzer>
In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"
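As a rough sketch of what the description above amounts to (sort, de-duplicate, join, and drop the result if it is too long), consider the following Python. The function name `fingerprint` and its parameter names are hypothetical, and the sort here uses Python's default string ordering, which is an assumption rather than a statement about Solr's exact ordering.

```python
def fingerprint(tokens, separator=" ", max_output_token_size=1024):
    """Return a one-element list holding the fingerprint, or [] if too long.

    Hypothetical helper that mirrors the separator and
    maxOutputTokenSize arguments described above.
    """
    # De-duplicate, sort, and concatenate into a single token.
    joined = separator.join(sorted(set(tokens)))
    if len(joined) > max_output_token_size:
        return []  # nothing is emitted when the limit is exceeded
    return [joined]


if __name__ == "__main__":
    words = "the quick brown fox jumped over the lazy dog".split()
    print(fingerprint(words, separator="_"))
```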
Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic) and
rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download those
language files here. Be aware that your results will vary widely based on the quality of the provided dictionary
and rules files. For example, some languages have only a minimal word list with no morphological information.
On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell
stemmer may be a good choice.
Factory class: solr.HunspellStemFilterFactory
dictionary: (required) The path of a dictionary file.
affix: (required) The path of a rules file.
ignoreCase: (boolean) controls whether matching is case sensitive or not. The default is false.
strictAffixParsing: (boolean) controls whether the affix parsing is strict or not. If true, an error while
reading an affix rule causes a ParseException, otherwise is ignored. The default is true.
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.HunspellStemFilterFactory"
          strictAffixParsing="true" />
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token and the hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"
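The joining rule can be illustrated with a small Python sketch. The name `join_hyphenated` is invented for this example, and the handling of a fragment that ends the stream with a dangling hyphen is an assumption made here, not something the text above specifies.

```python
def join_hyphenated(tokens):
    """Join a token ending in '-' with the token that follows it."""
    out = []
    pending = ""  # text carried over from tokens that ended with a hyphen
    for tok in tokens:
        if tok.endswith("-"):
            pending += tok[:-1]  # drop the hyphen and wait for the next token
        else:
            out.append(pending + tok)
            pending = ""
    if pending:
        # Assumption: a dangling fragment at the end is emitted without its hyphen.
        out.append(pending)
    return out


if __name__ == "__main__":
    print(join_hyphenated(["A", "hyphen-", "ated", "word"]))
```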
Keep Word Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
words: (required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr config directory.
ignoreCase: (true/false) If true then comparisons are done case-insensitively. If this argument is true, then the
words file is assumed to contain only lowercase words. The default is false.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Where keepwords.txt contains at least the lowercase words happy and funny (as the outputs below imply):
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Same keepwords.txt, case insensitive:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Using LowerCaseFilterFactory before filtering for keep words, no ignoreCase flag.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"
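The three variants above differ only in how case is handled. A minimal Python sketch, assuming a keep list of just "happy" and "funny" (inferred from the outputs shown, not taken from an actual keepwords.txt) and using a made-up function name `keep_words`:

```python
def keep_words(tokens, keep, ignore_case=False):
    """Discard every token that is not in the keep list."""
    if ignore_case:
        lowered = {w.lower() for w in keep}
        # The original token text is kept; only the comparison is case-insensitive.
        return [t for t in tokens if t.lower() in lowered]
    return [t for t in tokens if t in set(keep)]


if __name__ == "__main__":
    keep = ["happy", "funny"]  # assumed contents of the keep list
    tokens = ["Happy", "sad", "or", "funny"]
    print(keep_words(tokens, keep))
    print(keep_words(tokens, keep, ignore_case=True))
    # Lowercasing first, as in the LowerCaseFilterFactory variant:
    print(keep_words([t.lower() for t in tokens], keep))
```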
KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only
appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
Factory class: solr.LengthFilterFactory
min: (integer, required) Minimum token length. Tokens shorter than this are discarded.
max: (integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"
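The min/max rule is inclusive on both ends, which the following Python sketch makes explicit. The function name `length_filter` is hypothetical and the sketch ignores position handling entirely.

```python
def length_filter(tokens, min_len, max_len):
    """Keep tokens whose length lies in [min_len, max_len], inclusive."""
    if max_len < min_len:
        raise ValueError("max must be >= min")
    return [t for t in tokens if min_len <= len(t) <= max_len]


if __name__ == "__main__":
    print(length_filter(["turn", "right", "at", "Albuquerque"], 3, 7))
```

Here "at" is dropped for being shorter than 3 characters and "Albuquerque" for being longer than 7.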
Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"
Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed from a REST API.
managed: The name that should be used for this set of stop words in the managed REST API.
With this configuration the set of words is named "english" and can be managed via /solr/collection_name
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
See Stop Filter for example input/output.
Managed Synonym Filter
This is a specialized version of the Synonym Filter Factory that uses a mapping of synonyms that is managed from a REST API.
managed: The name that should be used for this mapping on synonyms in the managed REST API.
With this configuration the set of mappings is named "english" and can be managed via /solr/collection_n
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
See Synonym Filter for example input/output.
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram size.
Factory class: solr.NGramFilterFactory
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 2) The maximum gram size.
Default behavior.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "o", "ou", "u", "ur", "r", "s", "sc", "c", "co", "o", "or", "r", "re", "e"
A range of 1 to 4.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "o", "ou", "our", "u", "ur", "r", "s", "sc", "sco", "scor", "c", "co", "cor", "core", "o", "or", "ore", "r", "re", "e"
A range of 3 to 5.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
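To make the "ordered by position, then by gram size" rule concrete, here is a small Python sketch. It is an illustration only; `ngrams` is a made-up name and the defaults simply mirror minGramSize=1 and maxGramSize=2.

```python
def ngrams(tokens, min_gram=1, max_gram=2):
    """Emit all n-grams of each token, by start position, then by size."""
    out = []
    for tok in tokens:
        for start in range(len(tok)):
            for size in range(min_gram, max_gram + 1):
                # Skip grams that would run past the end of the token.
                if start + size <= len(tok):
                    out.append(tok[start:start + size])
    return out


if __name__ == "__main__":
    print(ngrams(["four", "score"], 3, 5))
```

Unlike the edge n-gram case, grams may start at any character of the token, so the number of output tokens grows quickly as the range widens.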
Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
payload: (required) A floating point value that will be added to all matching tokens.
typeMatch: (required) A token type name string. Tokens with a matching type name will have their payload set
to the above floating point value.
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75" typeMatch="word"/>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]
Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given replacement
string in place of the matched pattern. Tokens which do not match are passed through unchanged.
Factory class: solr.PatternReplaceFilterFactory
pattern: (required) The regular expression to test against each token, as per java.util.regex.Pattern.
replacement: (required) A string to substitute in place of the matched pattern. This string may contain
references to capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.
replace: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Simple string replace:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
String replacement, first occurrence only:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog" replace="first"/>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is
passed through.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$" replacement="$1_$2"/>
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"
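The three examples above can be reproduced approximately with Python's re module. This is a sketch under assumptions: Solr uses java.util.regex, whose replacement syntax is `$1`, while Python writes group references as `\g<1>`; the function name `pattern_replace` is invented here.

```python
import re


def pattern_replace(tokens, pattern, replacement, replace="all"):
    """Apply a regex substitution to each token ("all" or "first" match)."""
    regex = re.compile(pattern)
    count = 0 if replace == "all" else 1  # count=0 means "replace every match"
    return [regex.sub(replacement, t, count=count) for t in tokens]


if __name__ == "__main__":
    toks = ["cat", "concatenate", "catycat"]
    print(pattern_replace(toks, "cat", "dog"))
    print(pattern_replace(toks, "cat", "dog", replace="first"))
    # Capture groups: insert an underscore before trailing digits.
    print(pattern_replace(["cat", "foo1234", "9987", "blah1234foo"],
                          r"(\D+)(\d+)$", r"\g<1>_\g<2>"))
```

Tokens that do not match the pattern come back unchanged, which is why "9987" and "blah1234foo" survive the last call as they are.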
Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the org.apache.commons.codec.
language package. For more information, see the section on Phonetic Matching.
Factory class: solr.PhoneticFilterFactory
encoder: (required) The name of the encoder to use. The encoder name must be one of the following (case insensitive): "DoubleMetaphone", "Metaphone", "Soundex", "RefinedSoundex", "Caverphone" (v2.0), "ColognePhonetic", or "Nysiis".
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated by the Metaphone or Double
Metaphone encoders.
Default behavior for DoubleMetaphone encoding.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding).
Discard original token.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TNT"(4)
Default Soundex encoder.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball Porter
Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is not based
on Snowball. It does not accept a list of protected words and is only appropriate for English language text.
However, it has been benchmarked as four times faster than the English Snowball stemmer, so can provide a
performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same
text and position values.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
One example of where RemoveDuplicatesTokenFilterFactory is useful is a situation in which a synonym file is
used in conjunction with a stemmer, causing some synonyms to be reduced to the same stem. Consider the
following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)
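The duplicate test (same text and same position) can be sketched as follows. Tokens are modeled here as (text, position) tuples, which is a simplification chosen for this example; `remove_duplicates` is a hypothetical name.

```python
def remove_duplicates(tokens):
    """Drop tokens whose (text, position) pair has already been seen."""
    seen = set()
    out = []
    for text, pos in tokens:
        if (text, pos) not in seen:
            seen.add((text, pos))
            out.append((text, pos))
    return out


if __name__ == "__main__":
    stemmed = [("Watch", 1), ("Television", 2), ("Television", 2),
               ("TV", 2), ("TV", 2)]
    print(remove_duplicates(stemmed))
```

The same text at a different position is not a duplicate, so repeated words in running text are unaffected.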
Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.
Factory class: solr.ReversedWildcardFilterFactory
withOriginal: (boolean) If true, the filter produces both original and reversed tokens at the same positions. If false, produces only reversed tokens.
maxPosAsterisk: (integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the reversal of the query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion: (integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the reversal of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and maxPosAsterisk to 1.
maxFractionAsterisk: (float, default = 0.0) An additional parameter that triggers the reversal if the asterisk ('*') position is less than this fraction of the query token length.
minTrailing: (integer, default = 2) The minimum number of trailing characters in a query token after the last wildcard character. For good performance this should be set to a value larger than 1.
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
          maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"
Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a
single token.
Factory class: solr.ShingleFilterFactory
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
outputUnigrams: (true/false) If true (the default), then each individual token is also included at its original position.
outputUnigramsIfNoShingles: (true/false) If true, then individual tokens will be output if no shingles are possible. The default is false.
tokenSeparator: (string, default is " ") The default string to use when joining adjacent tokens to form a shingle.
Default behavior.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
A shingle size of four, do not include original token.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false"/>
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
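Note: this filter combines tokens, n-gram style, into new terms. You can configure whether or not the original tokens are also emitted at their positions.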
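A short Python sketch may help show how shingles and positions relate. It covers minShingleSize, maxShingleSize, outputUnigrams, and tokenSeparator only; outputUnigramsIfNoShingles is deliberately left out, and the name `shingles` plus the (text, position) tuple model are assumptions made for this example.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, separator=" "):
    """Return (text, position) pairs; a shingle takes its first token's position."""
    out = []
    for i, tok in enumerate(tokens):
        pos = i + 1  # 1-based positions, as in the examples above
        if output_unigrams:
            out.append((tok, pos))
        for size in range(min_size, max_size + 1):
            # Only build a shingle if enough tokens remain.
            if i + size <= len(tokens):
                out.append((separator.join(tokens[i:i + size]), pos))
    return out


if __name__ == "__main__":
    print(shingles(["To", "be", "or", "what"]))
    print(shingles(["To", "be", "or", "not", "to", "be"],
                   max_size=4, output_unigrams=False))
```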
Snowball Porter Stemmer Filter
Factory class: solr.SnowballPorterFilterFactory
language: (default "English") The name of a language, used to select the appropriate Porter stemmer to use.
Case is significant. This string is used to select a package name in the "org.tartarus.snowball.ext" class hierarchy.
protected: Path of a text file containing a list of protected words, one per line. Protected words will not be
stemmed. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file
name in the Solr config directory.
Default behavior:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
French stemmer, English words:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Spanish stemmer, Spanish words:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"


Standard Filter
This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the
tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.
Factory class: solr.StandardFilterFactory
Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words list is
included in the Solr config directory, named stopwords.txt, which is appropriate for typical English language text.
Factory class: solr.StopFilterFactory
words: (optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball"
so Solr can read the stopwords file.
ignoreCase: (true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain lowercase words.
enablePositionIncrements: if luceneMatchVersion is 4.4 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
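The way positions are preserved when stop words are removed can be sketched in Python. The stop list below ("to", "be", "or") is an assumed subset standing in for stopwords.txt, and `stop_filter` is a made-up name.

```python
def stop_filter(tokens, stop_words, ignore_case=False):
    """Drop stop words while keeping the original 1-based positions."""
    stop = {w.lower() for w in stop_words} if ignore_case else set(stop_words)
    out = []
    for pos, tok in enumerate(tokens, start=1):
        key = tok.lower() if ignore_case else tok
        if key not in stop:
            out.append((tok, pos))  # surviving tokens keep their position
    return out


if __name__ == "__main__":
    stop_words = ["to", "be", "or"]  # assumed subset of stopwords.txt
    tokens = ["To", "be", "or", "what"]
    print(stop_filter(tokens, stop_words))
    print(stop_filter(tokens, stop_words, ignore_case=True))
```

The gap between positions 1 and 4 in the first result is the "position hole" that the enablePositionIncrements argument refers to.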
Suggest Stop Filter
Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list. Suggest
Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a token
separator. For example, a query "find the" would preserve the 'the' since it was not followed by a space,
punctuation etc., and mark it as a KEYWORD so that following filters will not change or remove it. By contrast, a
query like "find the popsicle" would remove "the" as a stopword, since it's followed by a space. When
using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory in your index
analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
words: (optional; default: StopAnalyzer#ENGLISH_STOP_WORDS_SET) The name of a stopwords file to parse.
format: (optional; default: wordset) Defines how the words file will be parsed. If words is not specified, then format must not be specified. The valid values for the format option are:
wordset: This is the default format, which supports one word per line (including any intra-word
whitespace) and allows whole line comments beginning with the "#" character. Blank lines are ignored.
snowball: This format allows for multiple words specified on each line, and trailing comments may be
specified using the vertical line ("|"). Blank lines are ignored.
ignoreCase: (optional; default: false) If true, matching is case-insensitive.
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
          words="stopwords.txt" format="wordset"/>
</analyzer>
In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)
Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then
the synonym is emitted in place of the token. The position values of the new tokens are set such that they all occur at
the same position as the original token.
Factory class: solr.SynonymFilterFactory

For the following examples, assume a synonyms file named mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
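The mapping used in the two examples above can be sketched in Python. This is a simplified illustration, not the Solr implementation: it handles single-word rules only, and it assumes the synonyms file also contains the equivalence rule couch,sofa,divan, which the sample output implies. The names parse_rules and apply_synonyms are hypothetical.

```python
import re

# Assumed contents of mysynonyms.txt (the first rule is implied by the sample output).
RULES = """
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
"""

def parse_rules(text):
    """Build a token -> replacements map from single-word synonym rules."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: every word on the left becomes the words on the right.
            left, right = line.split("=>")
            sources = [w.strip() for w in left.split(",")]
            targets = [w.strip() for w in right.split(",")]
        else:
            # Equivalence list: every word expands to the whole list.
            sources = targets = [w.strip() for w in line.split(",")]
        for source in sources:
            mapping[source] = targets
    return mapping

def apply_synonyms(text, mapping):
    """Emit (token, position) pairs; synonyms share the original token's position."""
    out = []
    for position, token in enumerate(re.findall(r"\w+", text), start=1):
        for replacement in mapping.get(token, [token]):
            out.append((replacement, position))
    return out

MAPPING = parse_rules(RULES)
print(apply_synonyms("teh small couch", MAPPING))
```

Tokens that match no rule pass through unchanged at their original position.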

Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]
The token offsets record absolute character positions in the input.
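To make the offsets concrete, here is a small Python sketch that computes the same start and end offsets for whitespace-separated tokens. It only illustrates what the payload values mean; the function name token_offsets is hypothetical and the actual payload encoding is not modeled.

```python
import re

def token_offsets(text):
    """Return (token, start, end) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

print(token_offsets("bing bang boom"))
# [('bing', 0, 4), ('bang', 5, 9), ('boom', 10, 14)]
```

The end offset is exclusive, so text[start:end] recovers the token.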
Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
updateOffsets: if luceneMatchVersion is 4.3 or earlier and updateOffsets="true", trimmed tokens'
start and end offsets will be updated to those of the first and last characters (plus one) remaining in the token.
This argument is invalid if luceneMatchVersion is 5.0 or later.
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove whitespace.
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"
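The same two stages can be mimicked in Python to see why the trim step is needed: splitting on a bare comma keeps the surrounding spaces, and trimming then removes them. This is just an illustrative sketch; split_and_trim is a hypothetical name.

```python
def split_and_trim(text):
    """Split on commas (keeping whitespace), then trim each token."""
    raw = text.split(",")
    trimmed = [token.strip() for token in raw]
    return raw, trimmed

raw, trimmed = split_and_trim("one, two , three ,four ")
print(raw)      # ['one', ' two ', ' three ', 'four ']
print(trimmed)  # ['one', 'two', 'three', 'four']
```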
Type As Payload Filter
This filter adds the token's type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
In: "Pay Bob's I.O.U."
Tokenizer to Filter: "Pay", "Bob's", "I.O.U."
Out: "Pay"[<ALPHANUM>], "Bob's"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]
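As a rough illustration of what "the type as an encoded byte sequence" means, the following Python sketch attaches each token's type string as a bytes payload. UTF-8 is assumed as the encoding here, and the token types are passed in by hand rather than produced by a tokenizer; type_as_payload is a hypothetical name.

```python
def type_as_payload(tokens):
    """Attach each token's type, encoded as bytes (UTF-8 assumed), as its payload."""
    return [(text, token_type.encode("utf-8")) for text, token_type in tokens]

typed_tokens = [("Pay", "<ALPHANUM>"), ("Bob's", "<APOSTROPHE>"), ("I.O.U.", "<ACRONYM>")]
print(type_as_payload(typed_tokens))
```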
Type Token Filter
This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata
associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed
tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if
you wish.
Factory class: solr.TypeTokenFilterFactory
types: Defines the location of a file of types to filter.
useWhitelist: If true, the file defined in types is used as an include list. If false, or undefined, the file
defined in types is used as a blacklist.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false",
no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"/>
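The difference between the blacklist and whitelist modes can be sketched in Python. This is a simplified model that treats tokens as (text, type) pairs; the sample tokens and the function name filter_by_type are made up for illustration.

```python
def filter_by_type(tokens, types, use_whitelist=False):
    """Keep only listed types (whitelist) or drop listed types (blacklist)."""
    if use_whitelist:
        return [t for t in tokens if t[1] in types]
    return [t for t in tokens if t[1] not in types]

# Hypothetical typed tokens, as a tokenizer that assigns types might emit them.
tokens = [
    ("contact", "<ALPHANUM>"),
    ("bob@example.com", "<EMAIL>"),
    ("http://example.com", "<URL>"),
]

print(filter_by_type(tokens, {"<EMAIL>"}, use_whitelist=True))
# [('bob@example.com', '<EMAIL>')]
print(filter_by_type(tokens, {"<EMAIL>"}))
# [('contact', '<ALPHANUM>'), ('http://example.com', '<URL>')]
```

With the whitelist mode and a types file listing only <EMAIL>, just the e-mail addresses survive, which is the use case described above.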
Word Delimiter Filter
This filter splits tokens at word delimiters. Delimiters are determined by the following rules:
A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting split
A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" ->
"4500", "XL". This can be disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterFilterFactory
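The splitting rules listed above can be approximated with regular expressions. The Python sketch below is a simplified model for ASCII input only, not the filter's actual algorithm; the function name and its two keyword arguments are hypothetical stand-ins for the options that disable case-change and numeric splitting.

```python
import re

# Zero-width boundaries: lower-to-upper case change, and letter/digit transitions.
CASE_CHANGE = r"(?<=[a-z])(?=[A-Z])"
ALPHA_NUMERIC = r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"

def word_delimiter(token, split_on_case_change=True, split_on_numerics=True):
    """Split one token into sub-words following the rules listed above."""
    # A trailing "'s" is removed first.
    token = re.sub(r"'s$", "", token)
    boundaries = []
    if split_on_case_change:
        boundaries.append(CASE_CHANGE)
    if split_on_numerics:
        boundaries.append(ALPHA_NUMERIC)
    parts = []
    # Non-alphanumeric characters are delimiters and are discarded.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        if not chunk:
            continue  # leading or trailing delimiters leave empty chunks
        if boundaries:
            # Mark each boundary with a space, then split on it.
            chunk = re.sub("|".join(boundaries), " ", chunk)
        parts.extend(chunk.split())
    return parts

for sample in ["CamelCase", "Gonzo5000", "4500XL", "hot-spot", "O'Reilly's", "--hot-spot--"]:
    print(sample, "->", word_delimiter(sample))
```

Calling word_delimiter("CamelCase", split_on_case_change=False) leaves the token whole, mirroring the effect of turning that rule off.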