Google Native Input Method LatinIME Dictionary Build Process Analysis: Related Data Structures

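The data structures used by the input method's dictionary are almost all defined in the header file dictdef.h, under the cpp directory of the source tree.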

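Dictionary initialization begins by reading the contents of a txt file into the data structures lemma_arr and valid_hzs. lemma_arr is an array whose elements are of type LemmaEntry, defined as follows (cpp/include/dictdef.h):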

// Each line of rawdict_utf16_65105_freq.txt becomes one LemmaEntry.
// When pinyin is recorded (in pinyin_str), letters are converted to upper case by default;
// only the h of a two-letter initial (Zh, Ch, Sh) stays lower case.
struct LemmaEntry {
  LemmaIdType idx_by_py;
  LemmaIdType idx_by_hz;
  char16 hanzi_str[kMaxLemmaSize + 1];

  // The SingleCharItem id for each Hanzi.
  uint16 hanzi_scis_ids[kMaxLemmaSize];

  uint16 spl_idx_arr[kMaxLemmaSize + 1];
  char pinyin_str[kMaxLemmaSize][kMaxPinyinSize + 1];
  unsigned char hz_str_len;
  float freq;
};

First, here is what rawdict_utf16_65105_freq.txt contains:

鼥 0.750684002197 1 ba
釛 0.781224156844 1 ba
軷 0.9691786136 1 ba
釟 0.9691786136 1 ba
蚆 1.15534975655 1 ba
......

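The file has 65105 lines, each in the format: Chinese characters, frequency, a third field whose meaning is not identified here (1 in the sample above), pinyin. In the struct, freq is the frequency and hz_str_len is the number of Chinese characters in the entry. The two-dimensional array pinyin_str[8][7] holds the pinyin: an entry is limited to 8 characters, and the pinyin of each character to 6 letters plus the terminating '\0' (7 bytes). hanzi_str[8+1] holds the UTF-16 code units of the characters; for example, the first character "鼥" is stored as 40741 (U+9F25). Viewing the first element of lemma_arr_ in gdb gives: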

{idx_by_py = 0, idx_by_hz = 0, hanzi_str = {40741, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {0, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"BA\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 0.750684023}
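As a rough illustration of how the first line of the raw dictionary maps onto these fields, the sketch below fills a cut-down entry by hand. MiniLemmaEntry and make_first_entry are hypothetical names, and the constants 8 and 6 are assumed from the array sizes pinyin_str[8][7] discussed above. The real LemmaEntry has more fields and is filled by the dictionary builder while it parses the file, not like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed values, inferred from hanzi_str[8+1] and pinyin_str[8][7].
const std::size_t kMaxLemmaSize = 8;
const std::size_t kMaxPinyinSize = 6;

// Hypothetical, simplified stand-in for LemmaEntry (only four fields kept).
struct MiniLemmaEntry {
  char16_t hanzi_str[kMaxLemmaSize + 1];               // UTF-16 code units
  char pinyin_str[kMaxLemmaSize][kMaxPinyinSize + 1];  // one syllable per character
  unsigned char hz_str_len;                            // number of characters
  float freq;
};

// Fill an entry for the single-character line "鼥 0.750684002197 1 ba".
MiniLemmaEntry make_first_entry() {
  MiniLemmaEntry e = MiniLemmaEntry();  // zero-initialise every field
  e.hanzi_str[0] = u'\u9F25';           // "鼥", 40741 in decimal
  std::strcpy(e.pinyin_str[0], "BA");   // stored in upper case
  e.hz_str_len = 1;
  e.freq = 0.750684002197f;
  return e;
}
```

The zero-initialised remainder of each array corresponds to the runs of 0 and \000 in the gdb output.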

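The pinyin has been converted to upper case, except for the h in two-letter initials. The hanzi_scis_ids field holds, for each Chinese character of the lemma, its id in the single-character table scis. For example, the id in scis of "鼥" from the first line is assigned to hanzi_scis_ids[0]. The last line is "欧洲市场", so its array stores the ids of "欧", "洲", "市" and "场" in order: supposing those ids are 34, 46, 29 and 200, then hanzi_scis_ids[0] = 34, hanzi_scis_ids[1] = 46, hanzi_scis_ids[2] = 29 and hanzi_scis_ids[3] = 200, and the remaining elements keep their initial value 0. The spl_idx_arr field holds the spelling id of each character in the LemmaEntry. After skipping 50000 breakpoint hits in gdb, execution stops exactly at the word "叫声"; printing the relevant lemma_arr_ entries shows: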

(gdb) p lemma_arr_[8117]
$31 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"JIAO\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 55685.8672}
(gdb) p lemma_arr_[12781]
$32 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {22768, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {337, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 7169.11719}
(gdb) p lemma_arr_[i-1]
$33 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 22768, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 337, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"JIAO\000\000", "ShENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 2 '\002', freq = 564.171448}
(gdb) p i
$34 = 33619
(gdb)

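Index 8117 is where the single character "叫" is stored in lemma_arr_, and 12781 is where "声" is stored. With i - 1 = 33618, lemma_arr_[33618] is "叫声", and its hanzi_str and spl_idx_arr are the combination of those of its individual characters: {21483, 22768} and {166, 337}. The output also shows that the h in the two-letter initial of "ShENG" is kept in lower case. The fields idx_by_py and idx_by_hz are the ids of the lemma computed from its spelling id array and its Hanzi id array respectively; both are still 0 in the output above.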
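The casing rule seen in "JIAO" and "ShENG" can be sketched as a small helper. format_pinyin is a hypothetical name used only for illustration, and the rule is reconstructed from the behaviour described here rather than copied from dictbuilder.cpp.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: upper-case a pinyin syllable, but keep the 'h' of the
// two-letter initials zh/ch/sh in lower case.
std::string format_pinyin(const std::string &raw) {
  std::string out = raw;
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i])));
  }
  // Only an 'H' in second position after Z, C or S belongs to a two-letter initial.
  if (out.size() >= 2 && out[1] == 'H' &&
      (out[0] == 'Z' || out[0] == 'C' || out[0] == 'S')) {
    out[1] = 'h';
  }
  return out;
}
```

With this rule, "sheng" becomes "ShENG" while "he" becomes "HE", because there the h is itself the initial.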

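As mentioned above, hanzi_scis_ids in LemmaEntry gives the id, in the single-character table, of each character of the lemma (call it a Hanzi string). The single-character table scis is also an array; its element type is the struct SingleCharItem (cpp/include/dictdef.h):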

#ifdef ___BUILD_MODEL___
struct SingleCharItem {
  float freq;
  char16 hz;
  SpellingId splid;
};
// (excerpt; the matching #endif is not shown)

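The splid field is the spelling id of the single character, hz is the character itself, and freq is the frequency of that single character. The code that sets freq is in dictbuilder.cpp: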

size_t hz_num = lemma_arr_[pos].hz_str_len;
...
if (1 == hz_num)
  scis_[scis_num_].freq = lemma_arr_[pos].freq;
else
  scis_[scis_num_].freq = 0.000001;

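That is, when hz_num is 1 (the lemma consists of a single Chinese character) the item takes the lemma's frequency; otherwise freq is set to 0.000001. The type of splid is SpellingId, which is also a struct: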

typedef struct {
  uint16 half_splid:5;
  uint16 full_splid:11;
} SpellingId, *PSpellingId;

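half_splid and full_splid are declared as bit-fields of an unsigned 16-bit type. half_splid has 5 bits, so it holds values from 0 to 31 (binary 11111). full_splid has 11 bits, so it holds values from 0 to 2047 (2^0 + 2^1 + ... + 2^10 = 2^11 - 1).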
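The ranges of the two bit-fields can be checked with a short sketch. SpellingIdSketch and make_id are hypothetical names, and uint16_t stands in for the project's uint16 typedef, which is assumed to be a 16-bit unsigned type.

```cpp
#include <cassert>
#include <cstdint>

// Same field widths as SpellingId, with uint16_t assumed for uint16.
typedef struct {
  uint16_t half_splid : 5;   // 0 .. 31
  uint16_t full_splid : 11;  // 0 .. 2047
} SpellingIdSketch;

// Store a value into each field. Because the fields are unsigned, a value that
// does not fit is reduced modulo 2^width (32 becomes 0, 2048 becomes 0).
SpellingIdSketch make_id(unsigned half, unsigned full) {
  SpellingIdSketch id;
  id.half_splid = half;
  id.full_splid = full;
  return id;
}
```

The two widths add up to 16 bits, so on common compilers the whole struct fits in a single 16-bit word, although the exact layout of bit-fields is implementation-defined.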

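Next are the data structures used to build the dictionary trie: LmaNodeLE0 and LmaNodeGE1. They represent nodes at levels less than or equal to 0 (the root and level 0) and nodes at levels greater than or equal to 1, respectively. Their definitions are as follows (cpp/include/dictdef.h):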

/**
 * We use different node types for different layers
 * Statistical data of the building result for a testing dictionary:
 *                              root,   level 0,   level 1,   level 2,   level 3
 * max son num of one node:     406        280         41          2          -
 * max homo num of one node:      0         90         23          2          2
 * total node num of a layer:     1        406      31766      13516        993
 * total homo num of a layer:     9       5674      44609      12667        995
 *
 * The node number for root and level 0 won't be larger than 500
 * According to the information above, two kinds of nodes can be used; one for
 * root and level 0, the other for these layers deeper than 0.
 *
 * LE = less and equal,
 * A node occupies 16 bytes. so, totallly less than 16 * 500 = 8K
 */
struct LmaNodeLE0 {
  uint32 son_1st_off;
  uint32 homo_idx_buf_off;
  uint16 spl_idx;
  uint16 num_of_son;
  uint16 num_of_homo;
};

/**
 * GE = great and equal
 * A node occupies 8 bytes.
 */
struct LmaNodeGE1 {
  uint16 son_1st_off_l;        // Low bits of the son_1st_off
  uint16 homo_idx_buf_off_l;   // Low bits of the homo_idx_buf_off_1
  uint16 spl_idx;
  unsigned char num_of_son;            // number of son nodes
  unsigned char num_of_homo;           // number of homo words
  unsigned char son_1st_off_h;         // high bits of the son_1st_off
  unsigned char homo_idx_buf_off_h;    // high bits of the homo_idx_buf_off
};

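The structs LmaNodeLE0 and LmaNodeGE1 are used mainly in DictBuilder::construct_subset(...). build_dict calls it with item_start = 0 and item_end = 65101, which covers rawdict_utf16_65105_freq.txt from its first line to its last; in other words, it traverses the lemma_arr_ array to generate the layered trie. construct_subset calls itself recursively to add nodes on level 0 and level 1. I have not yet fully worked out the exact layout here, and will describe these two structs in more detail once it is clear.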
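One detail that can be read directly from the LmaNodeGE1 definition is that each offset is split into a 16-bit low part and an 8-bit high part. The sketch below shows one plausible way to pack and unpack such a 24-bit offset. LmaNodeGE1Sketch, set_son_offset and get_son_offset are hypothetical names for illustration, and the assumption that the high byte sits directly above the low 16 bits comes from the field comments, not from the project's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same fields as LmaNodeGE1, with uint16_t assumed for uint16.
struct LmaNodeGE1Sketch {
  uint16_t son_1st_off_l;
  uint16_t homo_idx_buf_off_l;
  uint16_t spl_idx;
  unsigned char num_of_son;
  unsigned char num_of_homo;
  unsigned char son_1st_off_h;
  unsigned char homo_idx_buf_off_h;
};

// Split a 24-bit offset into its low 16 bits and high 8 bits.
void set_son_offset(LmaNodeGE1Sketch *node, std::size_t offset) {
  node->son_1st_off_l = static_cast<uint16_t>(offset & 0xFFFF);
  node->son_1st_off_h = static_cast<unsigned char>((offset >> 16) & 0xFF);
}

// Recombine the two parts into the full offset.
std::size_t get_son_offset(const LmaNodeGE1Sketch *node) {
  return static_cast<std::size_t>(node->son_1st_off_l) |
         (static_cast<std::size_t>(node->son_1st_off_h) << 16);
}
```

A split like this lets a node address up to 2^24 entries without widening the field to 32 bits. That matters for the deeper layers, where the statistics in the header comment show tens of thousands of nodes.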

// Node used for the trie of spellings
struct SpellingNode {
  SpellingNode *first_son;
  // The spelling id for each node. If you need more bits to store
  // spelling id, please adjust this structure.
  uint16 spelling_idx:11;
  uint16  num_of_son:5;
  char char_this_node;
  unsigned char score;
};

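The struct SpellingNode describes each node of the pinyin spelling trie and is defined in cpp/include/spellingtrie.h. first_son points to the first element of an array of child nodes, each of type SpellingNode. Every node of the syllable tree built by the construct method in spellingtrie.cpp is of this type, and the first_son pointer of the root_ node points to the first element of level1_sons. num_of_son is a 5-bit bit-field, so it holds values from 0 to 31; it gives the number of child nodes that can follow the character in char_this_node. For example, when char_this_node is 'A' there are 3 child nodes, corresponding to ai, an and ao. The score field is the score of this character: the smaller the score, the higher the search priority, and the root_ node has score = 0. The bit-field spelling_idx gives the id, in the spelling list, of each letter that can form a syllable: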

{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
{first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
{first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
{first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
{first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
{first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
{first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
{first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
{first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
{first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
{first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
{first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
{first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
{first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
{first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
{first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
{first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
{first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
{first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
{first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
{first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
{first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
{first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}

The spelling_idx field, however, is 11 bits wide, so it can hold values from 0 to 2047 (2^11 - 1).
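To make the roles of first_son and num_of_son concrete, here is a minimal lookup sketch over nodes of this shape. SpellingNodeSketch, find_son and walk are hypothetical names, and the linear scan over children is an assumption made for simplicity. It is not taken from spellingtrie.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Same fields as SpellingNode, with uint16_t assumed for uint16.
struct SpellingNodeSketch {
  SpellingNodeSketch *first_son;  // first element of the child array
  uint16_t spelling_idx : 11;
  uint16_t num_of_son : 5;        // number of elements in that array
  char char_this_node;
  unsigned char score;
};

// Scan the num_of_son children stored contiguously at first_son.
const SpellingNodeSketch *find_son(const SpellingNodeSketch *parent, char ch) {
  for (unsigned i = 0; i < parent->num_of_son; ++i) {
    if (parent->first_son[i].char_this_node == ch) return &parent->first_son[i];
  }
  return nullptr;
}

// Follow a spelling letter by letter; returns nullptr if it leaves the trie.
const SpellingNodeSketch *walk(const SpellingNodeSketch *root, const char *spl) {
  const SpellingNodeSketch *node = root;
  for (; node != nullptr && *spl != '\0'; ++spl) {
    node = find_son(node, *spl);
  }
  return node;
}
```

For the 'A' node shown above, with num_of_son = 3, a call such as walk(root, "AO") would visit the 'A' node first and then scan its three children for 'O'.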
