Analysis of the Dictionary Build Process in Google's Native Input Method LatinIME (Part 2)

Part 1 of this series followed the LatinIME dictionary build up to the dict_trie->dict_list_->init_list step. The step after that builds the N-gram information, which is analyzed in Part 3 (N-gram information construction). This part continues from there:

bool DictBuilder::build_dict(const char *fn_raw,
                             const char *fn_validhzs,
                             DictTrie *dict_trie) {
  ...
  // Construct the NGram information
  NGram& ngram = NGram::get_instance();
  ngram.build_unigram(lemma_arr_, lemma_num_,
                      lemma_arr_[lemma_num_ - 1].idx_by_hz + 1);

  // Sort by spl_idx_arr; entries with identical ids are ordered by freq (compare_py)
  // sort the lemma items according to the spelling idx string
  myqsort(lemma_arr_, lemma_num_, sizeof(LemmaEntry), compare_py);

  get_top_lemmas();

#ifdef ___DO_STATISTICS___
  stat_init();
#endif

  lma_nds_used_num_le0_ = 1;  // The root node
  bool dt_success = construct_subset(static_cast<void*>(lma_nodes_le0_),
                                     lemma_arr_, 0, lemma_num_, 0);
  if (!dt_success) {
    free_resource();
    return false;
  }
  ...
}

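The myqsort call sorts the lemma_arr_ array. Entries are compared by spl_idx_arr first; if those are equal, they are ordered by the freq field. Next, get_top_lemmas() is called to initialize the top_lmas_ array, which has length 10. "Top" here means the 10 entries with the largest freq values, in descending order. The contents of top_lmas_ after this call: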

Every entry has idx_by_py = 0 and hz_str_len = 1. In hanzi_str, hanzi_scis_ids, spl_idx_arr and pinyin_str only element [0] is set; the remaining elements are 0 (empty strings for pinyin_str).

idx_by_hz  hanzi_str[0]  hanzi_scis_ids[0]  spl_idx_arr[0]  pinyin_str[0]  freq
8505       30340         8508               91              "DE"           4828294.5
114        20102         114                200             "LE"           1500186
4196       25105         4198               375             "WO"           1192789.25
5084       26159         5087               338             "ShI"          1180957
1955       22312         1957               407             "ZAI"          974740.062
308        20320         308                251             "NI"           973526.125
1406       21644         1407               148             "HE"           664874.125
5254       26377         5257               401             "YOU"          613906.75
13         19981         13                 50              "BU"           590643.062
2961       23601         2963               171             "JIU"          558432.875
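
The ordering and the top-10 selection described above can be sketched in isolation. This is a simplified illustration, not the LatinIME code itself: the Lemma struct and the functions spelling_then_freq and top_by_freq are hypothetical names, std::sort and std::partial_sort stand in for myqsort and get_top_lemmas(), and ties on the spelling ids are assumed to be broken by descending freq.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, trimmed-down stand-in for LemmaEntry.
struct Lemma {
  std::array<uint16_t, 9> spl_idx_arr;  // spelling ids, zero-padded
  float freq;
};

// Compare spelling ids first (lexicographically); on a tie,
// put the entry with the higher freq first (assumed direction).
bool spelling_then_freq(const Lemma& a, const Lemma& b) {
  if (a.spl_idx_arr != b.spl_idx_arr) return a.spl_idx_arr < b.spl_idx_arr;
  return a.freq > b.freq;
}

// Return the n entries with the largest freq, highest first.
// Takes the vector by value so the caller's ordering is untouched.
std::vector<Lemma> top_by_freq(std::vector<Lemma> lemmas, std::size_t n) {
  n = std::min(n, lemmas.size());
  std::partial_sort(
      lemmas.begin(), lemmas.begin() + n, lemmas.end(),
      [](const Lemma& a, const Lemma& b) { return a.freq > b.freq; });
  lemmas.resize(n);
  return lemmas;
}
```

Calling std::sort(v.begin(), v.end(), spelling_then_freq) on a std::vector<Lemma> mirrors the myqsort step, and top_by_freq(v, 10) mirrors the selection of the ten highest-frequency entries without disturbing the spelling order of v.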

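Once top_lmas_ has been initialized, stat_init() is called (only when ___DO_STATISTICS___ is defined) to initialize some data structures used by the next step, construct_subset. stat_init: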

#ifdef ___DO_STATISTICS___
void DictBuilder::stat_init() {
  memset(max_sonbuf_len_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(max_homobuf_len_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_son_num_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_node_hasson_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_sonbuf_num_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_sonbuf_allnoson_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_node_in_sonbuf_allnoson_, 0, sizeof(size_t) * kMaxLemmaSize);
  memset(total_homo_num_, 0, sizeof(size_t) * kMaxLemmaSize);

  sonbufs_num1_ = 0;
  sonbufs_numgt1_ = 0;
  total_lma_node_num_ = 0;
}

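Clearly, this function sets the relevant array elements and variables to 0, which completes the initialization of these data structures. The key logic is in the construct_subset() method that follows.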
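
For reference, the reset pattern used in stat_init() can be reproduced in a small standalone sketch. The Stats struct and its members are hypothetical stand-ins for a few of the DictBuilder statistics fields, and kMaxLemmaSize = 8 is an assumed value for this example. It shows the same memset call next to a variant that takes the byte count from the array itself, which stays correct if the array length ever changes.

```cpp
#include <cstddef>
#include <cstring>

const std::size_t kMaxLemmaSize = 8;  // assumed value for this sketch

// Hypothetical stand-in for a few of the statistics members.
struct Stats {
  std::size_t max_sonbuf_len[kMaxLemmaSize];
  std::size_t total_son_num[kMaxLemmaSize];
  std::size_t total_lma_node_num;

  void init() {
    // Same pattern as stat_init(): element size times element count.
    std::memset(max_sonbuf_len, 0, sizeof(std::size_t) * kMaxLemmaSize);
    // Variant: sizeof applied to the array yields its full byte size.
    std::memset(total_son_num, 0, sizeof(total_son_num));
    // Scalar counters are simply assigned.
    total_lma_node_num = 0;
  }
};
```

Both memset calls clear exactly kMaxLemmaSize elements here; the second form only works on a real array, not on a pointer to one.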
