Go into the cpp directory (pwd=.../cpp/). The command directory contains pinyinime_dictbuilder.cpp, whose main function is the entry point for dictionary construction. Here is its source:

 25 /**
 26  * Build binary dictionary model. Make sure that ___BUILD_MODEL___ is defined
 27  * in dictdef.h.
 28  */
 29 int main(int argc, char* argv[]) {
 30   DictTrie* dict_trie = new DictTrie();
 31   bool success;
 32   if (argc >= 3)
 33     success = dict_trie->build_dict(argv[1], argv[2]);
 34   else
 35     success = dict_trie->build_dict("../data/rawdict_utf16_65105_freq.txt",
 36                                     "../data/valid_utf16.txt");
 37
 38   if (success) {
 39     printf("Build dictionary successfully.\n");
 40   } else {
 41     printf("Build dictionary unsuccessfully.\n");
 42     return -1;
 43   }
 44
 45   success = dict_trie->save_dict("../../res/raw/dict_pinyin.dat");
 46
 47   if (success) {
 48     printf("Save dictionary successfully.\n");
 49   } else {
 50     printf("Save dictionary unsuccessfully.\n");
 51     return -1;
 52   }
 53
 54   return 0;
 55 }

As the comment says, this function builds the dictionary model. By default line 35 executes, using the two UTF-16 txt files under data as the build sources. If the build succeeds, line 45 saves the result; the output path and file name are easy to read off the code. dict_pinyin.dat is a binary file that is really a series of data structures written out one after another. The core of the model construction is the build_dict method called on line 35, so let's follow it into ./share/dicttrie.cpp:

103 #ifdef ___BUILD_MODEL___
104 bool DictTrie::build_dict(const char* fn_raw, const char* fn_validhzs) {
105   DictBuilder* dict_builder = new DictBuilder();
106
107   free_resource(true);
108
109   return dict_builder->build_dict(fn_raw, fn_validhzs, this);
110 }

Line 105 creates a dict_builder, the object that actually performs the construction. After creating it, free_resource is called to release the data structures used while building the dictionary; these structures are exactly the objects that get saved when the dictionary is written out, each dumped into dict_pinyin.dat with fwrite, roughly as sketched below.
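A minimal sketch of that write-out idea — not the actual DictTrie::save_dict; FakeHeader and its fields are invented here purely for illustration:

#include <cstddef>
#include <cstdio>

// Hypothetical header; the real file begins with DictTrie's own structures.
struct FakeHeader {
  unsigned lemma_num;
  unsigned node_num;
};

// Dump the structures back-to-back with fwrite, in a fixed order that the
// loader can later fread() back verbatim.
bool save_sketch(const char *filename, const FakeHeader &header,
                 const unsigned short *buf, size_t buf_len) {
  FILE *fp = fopen(filename, "wb");
  if (NULL == fp)
    return false;

  bool ok = fwrite(&header, sizeof(header), 1, fp) == 1 &&
            fwrite(buf, sizeof(buf[0]), buf_len, fp) == buf_len;

  fclose(fp);
  return ok;
}

With the resources released, the builder's build_dict method is then called and the construction proper begins (./share/dictbuilder.cpp):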

bool DictBuilder::build_dict(const char *fn_raw,
                             const char *fn_validhzs,
                             DictTrie *dict_trie) {
  ...
  lemma_num_ = read_raw_dict(fn_raw, fn_validhzs, 240000);
  ...
  spl_buf = spl_table_->arrange(&spl_item_size, &spl_num);
  ...
  // Organize all valid syllables into a trie (the construct method)
  if (!spl_trie.construct(spl_buf, spl_item_size, spl_num,
                          spl_table_->get_score_amplifier(),
                          spl_table_->get_average_score())) {
    free_resource();
    return false;
  }

  printf("spelling tree construct successfully.\n");

  // Fill the spl_idx_arr field of every element in lemma_arr_: the spl_id
  // corresponding to each hanzi's pronunciation
  // Convert the spelling string to idxs
  for (size_t i = 0; i < lemma_num_; i++) {
    for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
         hz_pos++) {
      ...
      int spl_idx_num =
          spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
                                      strlen(lemma_arr_[i].pinyin_str[hz_pos]),
                                      spl_idxs, spl_start_pos, 2, is_pre);
      ...
      if (spl_trie.is_half_id(spl_idxs[0])) {
        uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
        assert(0 != num);
      }
      lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
    }
  }
  ...
  // Sort the lemma items according to the hanzi, and give each unique item
  // an id
  // (sorts by hanzi string, updates the idx_by_hz field, assigns each word
  // a unique id)
  sort_lemmas_by_hz();

  // Build the single-character table into scis_, and update the
  // hanzi_scis_ids field of lemma_arr_ from it
  scis_num_ = build_scis();

  // Construct the dict list
  dict_trie->dict_list_ = new DictList();
  bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
                                                     lemma_arr_, lemma_num_);
  assert(dl_success);

  // Construct the NGram information
  NGram& ngram = NGram::get_instance();
  ngram.build_unigram(lemma_arr_, lemma_num_,
                      lemma_arr_[lemma_num_ - 1].idx_by_hz + 1);
  ...
  lma_nds_used_num_le0_ = 1;  // The root node
  bool dt_success = construct_subset(static_cast<void*>(lma_nodes_le0_),
                                     lemma_arr_, 0, lemma_num_, 0);
  ...
  if (kPrintDebug0) {
    printf("Building dict succeds\n");
  }
  return dt_success;
}

The main logic of dictionary construction lives in this method; the listing above keeps only the key calls so the build steps are easy to follow. First the two files are read into their data structures: the raw file is loaded into the lemma_arr_ array. Printing lemma_num_, however, shows lemma_arr_ actually holds 65101 elements, while rawdict_utf16_65105_freq.txt has 65105 lines. Why four fewer? Setting a breakpoint on the continue inside the for loop points at the read_raw_dict function:

// The whole line must have been parsed fully, otherwise discard this one.
token = utf16_strtok(to_tokenize, &token_size, &to_tokenize);
if (spelling_not_support || NULL != token) {
  i--;
  continue;
}

Here spelling_not_support is true. Printing the current index i shows the file really does contain unsupported pinyin, e.g. line 6557:

哼 2072.17903804 0 hng

and lines 17035, 17036, and 17037:

噷 6.18262663209 1 hm
唔 1126.6237397 0 ng
嗯 31982.2903695 0 ng

That accounts for exactly the four missing entries.
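For intuition, here is a simplified sketch of that filter — the real read_raw_dict tokenizes UTF-16 with utf16_strtok, while this sketch uses plain streams and an invented valid-syllable set:

#include <set>
#include <sstream>
#include <string>

// Returns false for a line whose pinyin is not a valid syllable, mirroring
// the spelling_not_support branch; the caller then does i-- and continue.
bool line_is_supported(const std::string &line,
                       const std::set<std::string> &valid_syllables) {
  std::istringstream iss(line);
  std::string hanzi, pinyin;
  double freq;
  int flag;
  if (!(iss >> hanzi >> freq >> flag))
    return false;                       // malformed line

  while (iss >> pinyin) {
    if (valid_syllables.count(pinyin) == 0)
      return false;                     // e.g. "hng", "hm", "ng"
  }
  return true;
}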

The valid-hanzi file is loaded into the valid_hzs array, which prints as:

(gdb) ptype valid_hzs
type = unsigned short *
(gdb) p valid_hzs
$1 = (ime_pinyin::char16 *) 0x627190
(gdb) p *valid_hzs@10
$2 = {12295, 19968, 19969, 19971, 19975, 19976, 19977, 19978, 19979, 19980}
(gdb)

It is an array behind a pointer; only the first ten elements are shown here. The first value, 12295, is the Unicode code point of the hanzi "〇", which is easy to verify. lemma_arr_ was already printed in the earlier post "Google原生输入法LatinIME词库构建流程分析--相关数据结构分析". These two arrays are the foundation of everything that follows. After read_raw_dict() finishes reading, spl_table_->arrange is called; it returns a pointer spl_buf to the table of all valid Chinese syllables, 413 entries long and already sorted. Then spl_trie.construct builds a trie over those 413 valid syllables. Its arguments are, in order: the syllable buffer spl_buf, the size of each element, the number of elements, the amplifier used to compute each syllable's score, and an average score. The concrete values:

(gdb) p spl_num
$7 = 413
(gdb) p spl_item_size
$8 = 8
(gdb) p spl_table_->get_score_amplifier()
$9 = -14.1073904
(gdb) p spl_table_->get_average_score()
$10 = 100 'd'
(gdb)
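The exact scoring formula (in SpellingTable) is not shown here, but the printed values suggest log-domain compression: a negative amplifier turns a log-probability into a small positive unsigned char clustering around the average of 100. A sketch under that assumption — the probability value is hypothetical:

#include <cmath>
#include <cstdio>

int main() {
  const float score_amplifier = -14.1073904f;  // value printed by gdb above
  float p = 0.0022f;  // hypothetical probability mass of syllables with 'A'
  unsigned char score =
      static_cast<unsigned char>(std::log(p) * score_amplifier);
  float p_back = std::exp(score / score_amplifier);
  printf("score=%d, recovered p=%f\n", score, p_back);  // score=86
  return 0;
}

With this encoding an entire probability fits in one byte per node, which matters because every trie node carries a score field.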

Each node is described by the struct SpellingNode:

(gdb) ptype first_son
type = struct ime_pinyin::SpellingNode {
    ime_pinyin::SpellingNode *first_son;
    ime_pinyin::uint16 spelling_idx : 11;
    ime_pinyin::uint16 num_of_son : 5;
    char char_this_node;
    unsigned char score;
} *
(gdb) 
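Note the two bit-fields: spelling_idx : 11 and num_of_son : 5 pack into a single uint16 — 11 bits are plenty for the 413 full syllable ids plus the half ids, and 5 bits cover at most 31 children (26 letters suffice). A quick check of the resulting layout:

#include <cstdio>

typedef unsigned short uint16;

struct SpellingNode {
  SpellingNode *first_son;
  uint16 spelling_idx : 11;  // 2^11 = 2048 possible ids
  uint16 num_of_son : 5;     // at most 31 children
  char char_this_node;
  unsigned char score;
};

int main() {
  // 8-byte pointer + 2 packed bytes + 1 + 1, padded to 16 on x86-64 --
  // matching the 0x10 spacing of the node addresses in level1_sons_ below.
  printf("sizeof(SpellingNode) = %zu\n", sizeof(SpellingNode));
  return 0;
}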

The trie built by spl_trie has two levels: level 0 is the root_ node, and level 1 consists of root_'s children. Printing spl_trie gives:

spl_trie = @0x6160a0: {
  static kMaxYmNum = 64,
  static kValidSplCharNum = 26,
  static kHalfIdShengmuMask = 1,
  static kHalfIdYunmuMask = 2,
  static kHalfIdSzmMask = 4,
  static kHalfId2Sc_ = "0ABCcDEFGHIJKLMNOPQRSsTUVWXYZz",
  static char_flags_ = 0x615140 <ime_pinyin::SpellingTrie::char_flags_> "\006\005\005\005\006\005\005\005",
  static instance_ = 0x6160a0,
  spelling_buf_ = 0x616510 "A",
  spelling_size_ = 8,
  spelling_num_ = 413,
  score_amplifier_ = -14.1073904,
  average_score_ = 100 'd',
  spl_ym_ids_ = 0x628350 "",
  ym_buf_ = 0x628eb0 "A",
  ym_size_ = 6,
  ym_num_ = 33,
  splstr_queried_ = 0x617200 "ZhUO",
  splstr16_queried_ = 0x617220,
  root_ = 0x617240,
  dumb_node_ = 0x617260,
  splitter_node_ = 0x617280,
  level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0, 0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400},
  h2f_start_ = {0, 30, 35, 51, 67, 86, 109, 114, 124, 143, 0, 162, 176, 195, 221, 241, 266, 268, 285, 299, 313, 329, 348, 0, 0, 368, 377, 391, 406, 423},
  h2f_num_ = {0, 5, 16, 35, 19, 23, 5, 10, 19, 19, 0, 14, 19, 26, 20, 25, 2, 17, 14, 14, 35, 19, 20, 0, 0, 9, 14, 15, 37, 20},
  f2h_ = 0x627fb0,
  node_num_ = 496
}
__PRETTY_FUNCTION__ = "bool ime_pinyin::DictBuilder::build_dict(const char*, const char*, ime_pinyin::DictTrie*)"

root_'s address is 0x617240, and root_ is itself a SpellingNode struct. Printing its first_son with gdb:

(gdb) p spl_trie->root_.first_son
$58 = (ime_pinyin::SpellingNode *) 0x6172a0
(gdb) 
level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0, 0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400}

root_'s first_son address is 0x6172a0, which is exactly the address of the first element of level1_sons_: the root points straight at the first level-1 node. level1_sons_ has length 26, and its char_this_node values are exactly the uppercase letters A through Z:

(gdb) p spl_trie->level1_sons_[0].char_this_node
$60 = 65 'A'
(gdb) p spl_trie->level1_sons_[1].char_this_node
$61 = 66 'B'
(gdb) p spl_trie->level1_sons_[3].char_this_node
$62 = 68 'D'
(gdb) p spl_trie->level1_sons_[2].char_this_node
$63 = 67 'C'
(gdb) p spl_trie->level1_sons_[4].char_this_node
$64 = 69 'E'
(gdb) p spl_trie->level1_sons_[25].char_this_node
$65 = 90 'Z'
(gdb) p spl_trie->level1_sons_[26].char_this_node
Cannot access memory at address 0x330023001e000a
(gdb) 

But not every letter can begin a syllable: the ninth element of level1_sons_ corresponds to 'I', so its pointer is 0x0 (the dump shows the same for 'U' and 'V').

Next, how many children does the first element of level1_sons_ have?

(gdb) p spl_trie->level1_sons_[0].num_of_son
$67 = 3
(gdb) 

Three. Which three? The answer is in the spl_buf_ array, and verifying it is easy — just keep following the pointers:

(gdb) p spl_trie->level1_sons_[0].first_son[0].char_this_node
$70 = 73 'I'
(gdb) p spl_trie->level1_sons_[0].first_son[1].char_this_node
$71 = 78 'N'
(gdb) p spl_trie->level1_sons_[0].first_son[2].char_this_node
$72 = 79 'O'

They are AI, AN, and AO (AN actually continues one level further, to ANG). The tree structure built inside spl_trie is now taking shape. Looking back at the root_ node's first_son pointer: it in fact points at an array of nodes, whose contents are:

{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
{first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
{first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
{first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
{first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
{first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
{first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
{first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
{first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
{first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
{first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
{first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
{first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
{first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
{first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
{first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
{first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
{first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
{first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
{first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
{first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
{first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
{first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}

The nodes behind root_->first_son are the same nodes that level1_sons_ refers to: level1_sons_ is a separate pointer array, but its pointers lead into this same contiguous block of nodes, so the contents are identical either way. With that, the structure of the tree built by spl_trie is clear.
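To make the traversal concrete, here is a minimal self-contained sketch (simplified from SpellingTrie, which also handles half ids and two-character initials like Zh): pick the level-1 slot by letter, then scan each node's contiguous first_son array character by character.

#include <cstddef>

typedef unsigned short uint16;

struct SpellingNode {
  SpellingNode *first_son;
  uint16 spelling_idx : 11;
  uint16 num_of_son : 5;
  char char_this_node;
  unsigned char score;
};

// Returns the spelling id of an uppercase syllable such as "AN",
// or 0 when the path does not exist in the trie.
uint16 lookup(SpellingNode *const level1_sons[26], const char *syllable) {
  if (NULL == syllable || syllable[0] < 'A' || syllable[0] > 'Z')
    return 0;

  SpellingNode *node = level1_sons[syllable[0] - 'A'];  // NULL for 'I' etc.
  for (size_t pos = 1; NULL != node && syllable[pos] != '\0'; pos++) {
    SpellingNode *next = NULL;
    // The children of a node sit contiguously behind first_son.
    for (uint16 i = 0; i < node->num_of_son; i++) {
      if (node->first_son[i].char_this_node == syllable[pos]) {
        next = &node->first_son[i];
        break;
      }
    }
    node = next;
  }
  return (NULL != node) ? node->spelling_idx : 0;
}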

The layout above is summarized by char_this_node alone; each node is really a complete SpellingNode object. Now continue down to the for loop:

// Fill the spl_idx_arr field of every element in lemma_arr_: the spl_id
// corresponding to each hanzi's pronunciation
// Convert the spelling string to idxs
for (size_t i = 0; i < lemma_num_; i++) {
  for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
       hz_pos++) {
    ...
    int spl_idx_num =
        spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
                                    strlen(lemma_arr_[i].pinyin_str[hz_pos]),
                                    spl_idxs, spl_start_pos, 2, is_pre);
    ...
    if (spl_trie.is_half_id(spl_idxs[0])) {
      uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
      assert(0 != num);
    }
    lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
  }
}

The outer loop runs from 0 up to lemma_num_ (65101), i.e. over lemma_arr_; the inner loop walks the pinyin of each hanzi in the lemma and calls spl_parser_->splstr_to_idxs() to map each hanzi's pinyin string to spelling ids. The details of that mapping are left for the next post.
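One detail of the loop deserves a note: is_half_id / half_to_full. Judging from the h2f_start_ and h2f_num_ arrays in the spl_trie dump above (an assumption, not verified against the source here), a half id — an initial-only spelling such as 'B' — expands to a contiguous range of full syllable ids, and the loop keeps the first one:

typedef unsigned short uint16;

// Values copied from the gdb dump of spl_trie above.
static const uint16 h2f_start[30] = {
    0, 30, 35, 51, 67, 86, 109, 114, 124, 143, 0, 162, 176, 195, 221,
    241, 266, 268, 285, 299, 313, 329, 348, 0, 0, 368, 377, 391, 406, 423};
static const uint16 h2f_num[30] = {
    0, 5, 16, 35, 19, 23, 5, 10, 19, 19, 0, 14, 19, 26, 20,
    25, 2, 17, 14, 14, 35, 19, 20, 0, 0, 9, 14, 15, 37, 20};

// Sketch of the half_to_full idea: write the full ids for half id `half`
// into spl_ids and return how many there are (0 if none).
uint16 half_to_full_sketch(uint16 half, uint16 *spl_ids, uint16 max_num) {
  if (half >= 30 || 0 == h2f_num[half])
    return 0;
  uint16 num = h2f_num[half];
  for (uint16 i = 0; i < num && i < max_num; i++)
    spl_ids[i] = h2f_start[half] + i;   // full ids are contiguous
  return num;
}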

Once the loop finishes, the single-hanzi table is built. Before that, lemma_arr_ is sorted by hanzi and every word receives a unique id:

// Sort the lemma items according to the hanzi, and give each unique item
// an id
// (sorts by hanzi string, updates the idx_by_hz field, assigns each word
// a unique id)
sort_lemmas_by_hz();

// Build the single-character table into scis_, and update the
// hanzi_scis_ids field of lemma_arr_ from it
scis_num_ = build_scis();

Then the single-hanzi table scis_ is built. Note that lemma_arr_ has already been sorted by the character's Unicode code point (in decimal); here are just its first ten elements:

{
  {idx_by_py = 0, idx_by_hz = 1, hanzi_str = {12295, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {1, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {210, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"LING\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 248.484543},
  {idx_by_py = 0, idx_by_hz = 2, hanzi_str = {19968, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {2, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {396, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"YI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 134392.703},
  {idx_by_py = 0, idx_by_hz = 3, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {3, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {100, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"DING\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 4011.11377},
  {idx_by_py = 0, idx_by_hz = 4, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {4, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {431, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 3.37402463},
  {idx_by_py = 0, idx_by_hz = 5, hanzi_str = {19971, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {5, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {285, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"QI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 6313.39502},
  {idx_by_py = 0, idx_by_hz = 6, hanzi_str = {19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {6, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {238, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"MO\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 4.85489225},
  {idx_by_py = 0, idx_by_hz = 7, hanzi_str = {19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {7, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {370, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"WAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 25941.043},
  {idx_by_py = 0, idx_by_hz = 8, hanzi_str = {19976, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {8, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {426, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 305.971039},
  {idx_by_py = 0, idx_by_hz = 9, hanzi_str = {19977, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {9, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {315, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"SAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 26761.9336},
  {idx_by_py = 0, idx_by_hz = 10, hanzi_str = {19978, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {10, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {332, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 284918.875}
}

After scis_ is built, every single-hanzi reading has an id, and that id is used to update the hanzi_scis_ids field in lemma_arr_. The size and contents of the finished scis table (first ten entries only):

(gdb) p scis_num
$141 = 17038
(gdb) p *scis@10
$142 = {
  {freq = 0, hz = 0, splid = {half_splid = 0, full_splid = 0}},
  {freq = 248.484543, hz = 12295, splid = {half_splid = 13, full_splid = 210}},
  {freq = 134392.703, hz = 19968, splid = {half_splid = 27, full_splid = 396}},
  {freq = 4011.11377, hz = 19969, splid = {half_splid = 5, full_splid = 100}},
  {freq = 3.37402463, hz = 19969, splid = {half_splid = 29, full_splid = 431}},
  {freq = 6313.39502, hz = 19971, splid = {half_splid = 18, full_splid = 285}},
  {freq = 4.85489225, hz = 19975, splid = {half_splid = 14, full_splid = 238}},
  {freq = 25941.043, hz = 19975, splid = {half_splid = 25, full_splid = 370}},
  {freq = 305.971039, hz = 19976, splid = {half_splid = 29, full_splid = 426}},
  {freq = 26761.9336, hz = 19977, splid = {half_splid = 20, full_splid = 315}}
}
(gdb)

valid_utf16.txt contains 16466 hanzi in total, yet scis holds 17037 entries besides entry 0. Why more than valid_utf16.txt? Because some characters have several readings: '丨', for example, can be read 'gun', 'e', or 'shu', so the scis total exceeds the character count of valid_utf16.txt, as the small example below illustrates.
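scis entries are keyed by the (hanzi, spelling-id) pair rather than by the hanzi alone. This is already visible in the dump above, where hz = 19969 (丁) appears twice, once with full_splid 100 (DING) and once with 431 (ZhENG):

#include <cstdio>
#include <set>
#include <utility>

int main() {
  std::set<std::pair<unsigned short, unsigned short> > entries;
  entries.insert(std::make_pair(19969, 100));  // 丁 read DING  (from the dump)
  entries.insert(std::make_pair(19969, 431));  // 丁 read ZhENG (from the dump)
  printf("scis entries for hz 19969: %zu\n", entries.size());  // prints 2
  return 0;
}

Next, the dictionary list is initialized from lemma_arr_, which at this point is sorted by hanzi with ids assigned from 1 onward; the call is: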

// Construct the dict list
dict_trie->dict_list_ = new DictList();
bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
                                                   lemma_arr_, lemma_num_);
assert(dl_success);

init_list in turn calls fill_scis and fill_list:

#ifdef ___BUILD_MODEL___
bool DictList::init_list(const SingleCharItem *scis, size_t scis_num,
                         const LemmaEntry *lemma_arr, size_t lemma_num) {
  if (NULL == scis || 0 == scis_num || NULL == lemma_arr || 0 == lemma_num)
    return false;

  initialized_ = false;

  if (NULL != buf_)
    free(buf_);

  // calculate the size
  size_t buf_size = calculate_size(lemma_arr, lemma_num);
  if (0 == buf_size)
    return false;

  // Allocate resources
  if (!alloc_resource(buf_size, scis_num))
    return false;

  // Fill the scis_hz_ and scis_splid_ arrays from the scis array
  fill_scis(scis, scis_num);

  // Copy the related content from the array to inner buffer
  fill_list(lemma_arr, lemma_num);

  initialized_ = true;
  return true;
}

fill_scis is just a for loop that copies each scis entry's hz field into scis_hz_ and its splid field into scis_splid_:

void DictList::fill_scis(const SingleCharItem *scis, size_t scis_num) {
  assert(scis_num_ == scis_num);

  for (size_t pos = 0; pos < scis_num_; pos++) {
    scis_hz_[pos] = scis[pos].hz;
    scis_splid_[pos] = scis[pos].splid;
  }
}

The initialized scis_hz_ array (first ten elements shown):

(gdb) p *scis_hz_@10
$152 =   {0,12295,19968,19969,19969,19971,19975,19975,19976,19977}

and the scis_splid_ array:

(gdb) p *scis_splid_@10
$154 = {
  {half_splid = 0, full_splid = 0},
  {half_splid = 13, full_splid = 210},
  {half_splid = 27, full_splid = 396},
  {half_splid = 5, full_splid = 100},
  {half_splid = 29, full_splid = 431},
  {half_splid = 18, full_splid = 285},
  {half_splid = 14, full_splid = 238},
  {half_splid = 25, full_splid = 370},
  {half_splid = 29, full_splid = 426},
  {half_splid = 20, full_splid = 315}
}
(gdb) 

Both have the same length as scis. After fill_scis, init_list calls fill_list, which initializes the buf_ array:

void DictList::fill_list(const LemmaEntry* lemma_arr, size_t lemma_num) {
  size_t current_pos = 0;
  utf16_strncpy(buf_, lemma_arr[0].hanzi_str,
                lemma_arr[0].hz_str_len);
  current_pos = lemma_arr[0].hz_str_len;
  size_t id_num = 1;

  for (size_t i = 1; i < lemma_num; i++) {
    utf16_strncpy(buf_ + current_pos, lemma_arr[i].hanzi_str,
                  lemma_arr[i].hz_str_len);
    id_num++;
    current_pos += lemma_arr[i].hz_str_len;
  }

  assert(current_pos == start_pos_[kMaxLemmaSize]);
  assert(id_num == start_id_[kMaxLemmaSize]);
}

The parameters are the lemma_arr array and its length; lemma_arr was already sorted by hanzi in the earlier step. The final buf_ array is 150837 elements long and stores:

(gdb) p *buf_@10
$230 = {12295, 19968, 19969, 19969, 19971, 19975, 19975, 19976, 19977, 19978}

These values correspond to the hanzi in rawdict_utf16_65105_freq.txt: the hanzi_str of every (sorted) lemma_arr entry is appended to buf_ in order. hanzi_str may hold one, two, three, or four hanzi, but no matter — they are simply concatenated into buf_ one after another; presumably lookup tells them apart later, and the asserts on start_pos_ and start_id_ hint at how (see the sketch below). This step initializes three arrays — scis_hz_, scis_splid_, and buf_ — and only when all three succeed (dl_success is true) does the assertion pass.
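What the two asserts in fill_list suggest — and this is an assumption, not something verified here — is that lemmas of length 1..kMaxLemmaSize are laid out as consecutive blocks in buf_, with start_pos_ recording each block's first utf16 offset and start_id_ its first lemma id. Under that layout an id maps to its hanzi string by plain arithmetic:

#include <cstddef>

typedef unsigned short char16;

const size_t kMaxLemmaSize = 8;  // as in dictdef.h

// Sketch: return a pointer to the id-th lemma's hanzi chars inside buf,
// writing its length to *len; NULL if the id is out of range.
const char16* get_hanzi_str(const char16 *buf,
                            const unsigned *start_pos,  // kMaxLemmaSize + 1 entries
                            const unsigned *start_id,   // kMaxLemmaSize + 1 entries
                            unsigned id, size_t *len) {
  if (id < start_id[0])
    return NULL;
  for (size_t l = 1; l <= kMaxLemmaSize; l++) {
    if (id < start_id[l]) {  // id falls inside the block of length-l lemmas
      *len = l;
      return buf + start_pos[l - 1] + (id - start_id[l - 1]) * l;
    }
  }
  return NULL;
}

Continuing in build_dict, the N-gram information is constructed next; it mainly supports the later prediction feature and is examined in a separate post. Then comes DictBuilder::construct_subset. Called from build_dict with item_start=0 and item_end=65101, it walks every surviving entry of rawdict_utf16_65105_freq.txt and builds the dictionary tree from the root node, starting at level 0: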

// 1. Scan for how many sons
size_t parent_son_num = 0;
// LemmaNode *son_1st = NULL;
// parent.num_of_son = 0;

LemmaEntry *lma_last_start = lemma_arr_ + item_start;
uint16 spl_idx_node = lma_last_start->spl_idx_arr[level];

// Scan for how many sons to be allocated
for (size_t i = item_start + 1; i < item_end; i++) {
  LemmaEntry *lma_current = lemma_arr + i;
  uint16 spl_idx_current = lma_current->spl_idx_arr[level];
  if (spl_idx_current != spl_idx_node) {
    parent_son_num++;
    spl_idx_node = spl_idx_current;
  }
}
parent_son_num++;

This for loop walks lemma_arr_ to count how many child nodes are needed. When does the count increase? Exactly when spl_idx_current != spl_idx_node holds. If you read the first data-structure post, you will remember that LemmaEntry has a spl_idx_arr field holding the id of each hanzi's pinyin in the syllable table (spl_buf_); whenever two neighboring ids differ, one more node is needed. Stepping past the loop and printing parent_son_num gives exactly 413 — the number of entries in spl_buf_, i.e. the total number of valid Chinese syllables.
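The counting idea is easy to see in isolation: since lemma_arr_ is sorted, the number of children equals the number of distinct runs of spl_idx values. A toy run:

#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical spl_idx_arr[level] values of consecutive, sorted lemmas.
  unsigned short spl_idx[] = {1, 1, 1, 2, 2, 5, 5, 5, 5, 7};
  size_t item_start = 0;
  size_t item_end = sizeof(spl_idx) / sizeof(spl_idx[0]);

  size_t parent_son_num = 0;
  unsigned short spl_idx_node = spl_idx[item_start];
  for (size_t i = item_start + 1; i < item_end; i++) {
    if (spl_idx[i] != spl_idx_node) {  // a new run => one more child node
      parent_son_num++;
      spl_idx_node = spl_idx[i];
    }
  }
  parent_son_num++;  // count the final run

  printf("parent_son_num = %zu\n", parent_son_num);  // prints 4
  return 0;
}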

The analysis of the dictionary build process is not finished here; for reasons of length it continues in part (二). If you spot omissions or mistakes, corrections are welcome!
