SQLite Source Code Analysis: Tokenizers (Part 2)

2021SC@SDUSC

Contents

Introduction
Code Analysis

Introduction

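Continuing from the previous post, SQLite Source Code Analysis: Tokenizers (Part 1), this post covers another Tokenizer module: Porter_Tokenizer.
Besides the "simple" tokenizer, the FTS source code also provides a tokenizer that uses the Porter stemming algorithm. It splits the input document into terms using the same rules as "simple", including folding every term to lower case, and then applies the Porter stemming algorithm to reduce related English words to a common root. For example, for the input document "Right now they're very frustrated", the porter tokenizer extracts the tokens "right now thei veri frustrat". Some of these terms are not even English words, yet in some cases a full-text index built from them is more useful than one built from the more readable output of the simple tokenizer. With the porter tokenizer, the document matches not only a full-text query such as "MATCH 'Frustrated'" but also a query such as "MATCH 'Frustration'", because the Porter stemming algorithm reduces "Frustration" to "frustrat", exactly as it does "Frustrated". So when the porter tokenizer is used, FTS can find not only exact matches for the query terms but also matches against similar English terms.
The following example shows the difference between the "simple" and "porter" tokenizers: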

-- Create a table using the simple tokenizer. Insert a document into it.
CREATE VIRTUAL TABLE simple USING fts3(tokenize=simple);
INSERT INTO simple VALUES('Right now they''re very frustrated');

-- The first of the following two queries matches the document stored in
-- table "simple". The second does not.
SELECT * FROM simple WHERE simple MATCH 'Frustrated';
SELECT * FROM simple WHERE simple MATCH 'Frustration';

-- Create a table using the porter tokenizer. Insert the same document into it
CREATE VIRTUAL TABLE porter USING fts3(tokenize=porter);
INSERT INTO porter VALUES('Right now they''re very frustrated');

-- Both of the following queries match the document stored in table "porter".
SELECT * FROM porter WHERE porter MATCH 'Frustrated';
SELECT * FROM porter WHERE porter MATCH 'Frustration';

Code Analysis

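In FTS5, Porter_Tokenizer is a wrapper tokenizer. It takes the output of another tokenizer and applies the Porter stemming algorithm to each token before returning it to FTS5. This lets a search term such as "correction" match similar words such as "corrected" or "correcting". The Porter stemming algorithm is designed for English terms only; using it with other languages may or may not improve search quality. By default, the porter tokenizer wraps the default tokenizer (unicode61). If one or more extra arguments follow "porter" in the "tokenize" option, they are treated as the specification of the underlying tokenizer that the porter stemmer uses. For example: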

-- Two ways to create an FTS5 table that uses the porter tokenizer to
-- stem the output of the default tokenizer (unicode61).
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = porter);
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61');

-- A porter tokenizer used to stem the output of the unicode61 tokenizer,
-- with diacritics removed before stemming.
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61 remove_diacritics 1');
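The wrapper idea described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the FTS5 API: the names token_cb, inner_tokenize, wrap_ctx, wrap_cb, wrapped_tokenize, collector and collect are all hypothetical, and the "stemmer" is a toy rule that drops one trailing 's', standing in for the real Porter algorithm. The point is the shape of the pattern: the wrapper hands its own callback to the inner tokenizer, rewrites each token, and only then forwards it to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical callback type: receives one token at a time. */
typedef void (*token_cb)(void *pCtx, const char *zToken, int nToken);

/* Toy inner tokenizer (an assumption, not unicode61): splits on
** non-alphanumeric bytes and folds ASCII to lower case. */
static void inner_tokenize(const char *zText, token_cb xToken, void *pCtx){
  char zBuf[64];
  int n = 0;
  const char *z;
  for(z=zText; ; z++){
    unsigned char c = (unsigned char)*z;
    if( isalnum(c) ){
      if( n<(int)sizeof(zBuf)-1 ) zBuf[n++] = (char)tolower(c);
    }else{
      if( n>0 ){ zBuf[n] = 0; xToken(pCtx, zBuf, n); n = 0; }
      if( c==0 ) break;
    }
  }
}

/* State the wrapper needs in order to forward tokens to the caller. */
typedef struct wrap_ctx {
  token_cb xToken;             /* the caller's callback */
  void *pCtx;                  /* the caller's context */
} wrap_ctx;

/* Callback the wrapper gives to the inner tokenizer. The "stemming" here
** is a toy stand-in (drop one trailing 's' from tokens of 4+ bytes),
** NOT the Porter algorithm. */
static void wrap_cb(void *p, const char *zToken, int nToken){
  wrap_ctx *w = (wrap_ctx *)p;
  char zBuf[64];
  assert( nToken<(int)sizeof(zBuf) );
  memcpy(zBuf, zToken, (size_t)nToken);
  if( nToken>=4 && zBuf[nToken-1]=='s' ) nToken--;
  zBuf[nToken] = 0;
  w->xToken(w->pCtx, zBuf, nToken);
}

/* The wrapper tokenizer: same interface as the inner one. */
static void wrapped_tokenize(const char *zText, token_cb xToken, void *pCtx){
  wrap_ctx w;
  w.xToken = xToken;
  w.pCtx = pCtx;
  inner_tokenize(zText, wrap_cb, &w);
}

/* Example consumer: joins the tokens it receives with single spaces. */
typedef struct collector { char z[256]; int n; } collector;
static void collect(void *p, const char *zToken, int nToken){
  collector *c = (collector *)p;
  if( c->n + nToken + 2 > (int)sizeof(c->z) ) return;
  if( c->n>0 ) c->z[c->n++] = ' ';
  memcpy(&c->z[c->n], zToken, (size_t)nToken);
  c->n += nToken;
  c->z[c->n] = 0;
}
```

Calling wrapped_tokenize("Two Cats, three dogs", collect, &c) on a zeroed collector leaves "two cat three dog" in c.z. Because the wrapper exposes the same interface as the tokenizer it wraps, the caller does not need to know that a stemming step sits in between.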

The file fts3_porter.c contains the following definition:

typedef struct porter_tokenizer_cursor {
  sqlite3_tokenizer_cursor base;
  const char *zInput;          /* input being tokenized */
  int nInput;                  /* size of the input */
  int iOffset;                 /* current position in zInput */
  int iToken;                  /* index of the next token to be returned */
  char *zToken;                /* storage for the current token */
  int nAllocated;              /* space allocated to the zToken buffer */
} porter_tokenizer_cursor;

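This struct plays the role of a class derived from sqlite3_tokenizer_cursor: because the base struct is its first member, a pointer to a porter_tokenizer_cursor can be handed around as a pointer to sqlite3_tokenizer_cursor and cast back when needed.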
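The "derived class" idiom can be shown with a minimal sketch. To keep it compilable without the SQLite headers, it uses hypothetical stand-in types (base_cursor, derived_cursor) and hypothetical helpers (cursor_open, cursor_input, cursor_close); none of these names come from SQLite. It relies on the C rule that a pointer to a struct, suitably converted, points to its first member and vice versa.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the base type. */
typedef struct base_cursor {
  void *pTokenizer;            /* owner of this cursor */
} base_cursor;

/* Hypothetical "subclass": the base struct must be the FIRST member. */
typedef struct derived_cursor {
  base_cursor base;
  const char *zInput;          /* extra state private to the subclass */
  int iOffset;
} derived_cursor;

/* Allocate the subclass but hand it out as a pointer to the base type. */
static base_cursor *cursor_open(const char *zInput){
  derived_cursor *c = calloc(1, sizeof(*c));
  if( c==0 ) return 0;
  c->zInput = zInput;
  return &c->base;
}

/* Recover the subclass from a base pointer. This is valid only because
** base is the first member of derived_cursor. */
static const char *cursor_input(base_cursor *pBase){
  derived_cursor *c = (derived_cursor *)pBase;
  assert( pBase!=0 );
  return c->zInput;
}

/* Free the whole subclass object through the base pointer. */
static void cursor_close(base_cursor *pBase){
  free((derived_cursor *)pBase);
}
```

Generic code only ever sees base_cursor pointers, while the tokenizer's own functions cast back to the richer type to reach fields such as zInput.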

static void copy_stemmer(const char *zIn, int nIn, char *zOut, int *pnOut){
  int i, mx, j;
  int hasDigit = 0;
  for(i=0; i<nIn; i++){
    char c = zIn[i];
    if( c>='A' && c<='Z' ){
      zOut[i] = c - 'A' + 'a';
    }else{
      if( c>='0' && c<='9' ) hasDigit = 1;
      zOut[i] = c;
    }
  }
  mx = hasDigit ? 3 : 10;
  if( nIn>mx*2 ){
    for(j=mx, i=nIn-mx; i<nIn; i++, j++){
      zOut[j] = zOut[i];
    }
    i = j;
  }
  zOut[i] = 0;
  *pnOut = i;
}

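The function takes the input word zIn[0..nIn-1] and stores the result in zOut, which must be large enough to hold at least nIn bytes plus the terminator. The actual size of the output word (not counting the '\0' terminator) is written to *pnOut. Any upper-case character in the US-ASCII set ([A-Z]) is converted to lower case; upper-case UTF characters are left unchanged. Long words are shortened by keeping only a few bytes from the beginning and the end: if the word contains a digit and is longer than 6 bytes, 3 bytes are kept from each end; if it contains no digit and is longer than 20 bytes, 10 bytes are kept from each end. US-ASCII case folding still applies in both cases. If the input word contains no digits but does contain characters outside [a-zA-Z], no stemming is attempted and the routine simply copies the input to the output with US-ASCII case folding. The output is never longer than the input, so there is no chance of overflowing the zOut buffer.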
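A few concrete inputs make the truncation rule easier to follow. The sketch below repeats copy_stemmer() as quoted above so that it compiles on its own, and adds fold_word(), a hypothetical convenience wrapper that is not part of SQLite.

```c
#include <assert.h>
#include <string.h>

/* copy_stemmer() as quoted above, repeated so the sketch is standalone. */
static void copy_stemmer(const char *zIn, int nIn, char *zOut, int *pnOut){
  int i, mx, j;
  int hasDigit = 0;
  for(i=0; i<nIn; i++){
    char c = zIn[i];
    if( c>='A' && c<='Z' ){
      zOut[i] = c - 'A' + 'a';
    }else{
      if( c>='0' && c<='9' ) hasDigit = 1;
      zOut[i] = c;
    }
  }
  mx = hasDigit ? 3 : 10;
  if( nIn>mx*2 ){
    for(j=mx, i=nIn-mx; i<nIn; i++, j++){
      zOut[j] = zOut[i];
    }
    i = j;
  }
  zOut[i] = 0;
  *pnOut = i;
}

/* Hypothetical helper: run copy_stemmer() on a NUL-terminated string.
** nOut is the capacity of zOut and must exceed strlen(zIn).
** Returns the length of the output word. */
static int fold_word(const char *zIn, char *zOut, int nOut){
  int nIn = (int)strlen(zIn);
  int n = 0;
  assert( nOut>nIn );
  copy_stemmer(zIn, nIn, zOut, &n);
  return n;
}
```

With a 64-byte buffer, fold_word("Hello", ...) yields "hello" (short word, only case folding). "SQLite3" contains a digit and has 7 bytes, which is more than 6, so only the first 3 and last 3 bytes survive: "sqlte3". "abc123" is exactly 6 bytes and is left as it is. The 26-letter word "ABCDEFGHIJKLMNOPQRSTUVWXYZ" has no digit and exceeds 20 bytes, so it becomes "abcdefghijqrstuvwxyz".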