Python input method engine: a Bigram-MLE language model and a simulated input method in Python

import re
import jsonlines  # third-party package: pip install jsonlines

# Training corpus: metadata.txt
# Generated files: 1. d.jsonl  2. output.jsonl  3. generator_5possible_value.jsonl  4. data.txt

new_ll = []          # vocabulary: every distinct word, stored once
dict_file = {}       # built by cut() from each line: maps a word to the list of words that follow it, e.g. {the: [boy, boy, girl, apple, apple, ...], boy: [like, like, eat, play, play, ...], like: [eating, playing, ...], eating: [apple, apple, ...], ...}
total_list = []      # every word in the corpus, duplicates included
d = {}               # each word and its frequency, e.g. {read: 13, the: 10, a: 12, book: 15, ...}
frequency_dict = {}  # intermediate storage used to build key_value
key_value = {}       # the dict written to output.jsonl, one per line
###############################################################################################################
key_5value = {}      # intermediate storage used to wrap value
word_num = 5         # how many of the most likely next words to keep for each word
value = {}           # the dict written to generator_5possible_value.jsonl, one per line

# count_list(): adds the words of one sentence to the vocabulary.
# new_ll holds the vocabulary words.
def count_list(list_file):
    # Keep one list entry per distinct word.
    global new_ll
    for i in list_file:
        if i not in new_ll:
            new_ll.append(i)

# Save the dict d (word/count pairs, e.g. {read: 13, the: 10, a: 12, book: 15, ...}) to d.jsonl.
def save_dict(line):
    # Raw string so the backslash in the Windows path is not read as an escape.
    with jsonlines.open(r'i:\d.jsonl', mode='w') as writer:
        writer.write(line)

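# Count how often each vocabulary word occurs in total_list and store the result in d, e.g. {read: 20, the: 30, ...}.
def count_total_list(total_list):
    # Record the result in a dict: for every vocabulary word, call count() on the full word list.
    global new_ll
    for i in new_ll:
        d[i] = total_list.count(i)  # frequency of the word in the training corpus
    save_dict(d)  # save d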

# Split one line into adjacent word pairs, e.g. "the boy like eating apple" becomes {the: [boy], boy: [like], like: [eating], eating: [apple]}.
# If dict_file already has the key result_list[i], append result_list[i+1] to the list dict_file[result_list[i]].
# Final format of dict_file: {the: [boy, boy, girl, apple, apple, ...], boy: [like, like, eat, play, play, ...], like: [eating, playing, ...], eating: [apple, apple, ...], ...}
def cut(result_list):
    global dict_file
    for i in range(len(result_list) - 1):
        if result_list[i] not in dict_file:
            dict_file[result_list[i]] = [result_list[i + 1]]
        else:
            dict_file[result_list[i]].append(result_list[i + 1])

# Save the key_value dict produced by frequency_count().
# Each dict is stored as one JSON line of the form {read: {the: 10, a: 12, book: 15}}, i.e. the bigrams that start with read and their counts.
# Generates output.jsonl.
def save_key_value(line):
    # Raw string so the backslash in the Windows path is not read as an escape.
    with jsonlines.open(r'i:\output.jsonl', mode='a') as writer:
        writer.write(line)

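# For each word in new_ll (the vocabulary), pick the five words most likely to follow it and write them to generator_5possible_value.jsonl, which the simulated input method program uses.
# Format of the value dict: {the: 30, a: 20, book: 15, pen: 12, apple: 2}
def generator_5possible_value(i, data):
    global value, key_5value
    data_list = list(data.keys())
    length = len(data_list)
    # data is already sorted by descending count, so the first entries are the
    # most frequent followers; keep all of them when there are fewer than word_num.
    for count in range(min(length, word_num)):
        value[data_list[count]] = data[data_list[count]]
    key_5value[i] = value
    # Assumption: the output path follows the same i: drive convention as the other files.
    with jsonlines.open(r'i:\generator_5possible_value.jsonl', mode="a") as writer:
        writer.write(key_5value)
    key_5value = {}  # intermediate storage used to wrap value
    value = {}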

# For each word i in new_ll (e.g. read), check whether dict_file has the key i. If it does, then for each j in dict_file[i] set frequency_dict[name] = key, where name = str(j) and key = dict_file[i].count(j).
# Sort the entries of frequency_dict by count.
# Then wrap frequency_dict: key_value[i] = data.
# Final format of key_value: {read: {a: 12, the: 10, book: 5, ...}}, i.e. the words that follow read in the bigrams, with their counts.
def frequency_count():
    global key_value, frequency_dict, new_ll
    for i in new_ll:  # each word in the vocabulary
        if i in dict_file:  # does dict_file have the key i?
            for j in dict_file[i]:
                name = str(j)
                key = dict_file[i].count(j)
                frequency_dict[name] = key
            # Sort the entries of frequency_dict by count, highest first.
            data = dict(sorted(frequency_dict.items(), key=lambda item: item[1], reverse=True))
            # print(data)
            key_value[i] = data  # wrap frequency_dict
            generator_5possible_value(i, data)  # keep the five most likely next words for i
            save_key_value(key_value)  # save the wrapped key_value
            frequency_dict = {}
            key_value = {}
            # frequency.append(frequency_dict)
            # frequency_dict = {}


def main():
    with open(r'i:\metadata.txt', "r", encoding='utf-8') as f:  # open the corpus file
        with open(r'i:\data.txt', 'w') as out:
            for line in f.readlines():  # read the lines one by one
                # Keep the text after the last "|", i.e. the third column of metadata.txt.
                str_ = str(line).rsplit("|", 1)[-1]
                # print(str_)
                result_list = re.findall('[a-zA-Z0-9]+', str_)
                result_list.insert(0, "BOS")
                result_list.append("EOS")
                # The two lines above turn each sentence into the form ['BOS', 'the', 'boy', 'is', 'a', 'very', 'great', 'child', 'EOS'].
                for i in result_list:
                    total_list.append(i)  # collect every word, duplicates included
                # print(result_list)
                count_list(result_list)
                cut(result_list)
                # data.txt stores the processed sentences, one per line, tokens separated by spaces.
                out.write(" ".join(result_list) + "\n")
    # The code above processes the third column of metadata.txt; total_list now holds every token, duplicates included.
    count_total_list(total_list)  # count each word in total_list and store the result in d, e.g. {read: 20, the: 30, ...}
    print("dict: " + str(d))
    print(len(d))
    frequency_count()  # build key_value, e.g. {read: {a: 12, the: 10, book: 5, ...}}: the words that follow read in the bigrams, with their counts
    print("vocabulary: " + str(new_ll))
    print(len(new_ll))
    print("key-value: " + str(dict_file))
    print(len(dict_file))
    print(len(total_list))
    # No explicit close() calls: the with blocks have already closed both files.


if __name__ == "__main__":
    main()
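
The script above stores raw bigram counts, while the title refers to maximum likelihood estimation. As a minimal sketch of how the two connect, the block below estimates P(next | prev) = count(prev, next) / count(prev) from the same kind of counts and returns the top five followers, which is what a simulated input method would offer as candidates. It is an illustration only, not the author's code: the names `train_bigram`, `mle_prob` and `suggest` and the toy corpus are hypothetical, and `collections.Counter` replaces the hand-written list counting.

```python
from collections import Counter, defaultdict

BOS, EOS = "BOS", "EOS"  # same sentence markers as the script above


def train_bigram(sentences):
    """Count, for each word, how often each other word directly follows it."""
    follow = defaultdict(Counter)
    for tokens in sentences:
        padded = [BOS, *tokens, EOS]
        for prev, nxt in zip(padded, padded[1:]):
            follow[prev][nxt] += 1
    return follow


def mle_prob(follow, prev, nxt):
    """MLE estimate: count(prev, nxt) divided by the number of bigrams starting with prev."""
    followers = follow.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


def suggest(follow, prev, k=5):
    """Return up to k words that most often follow prev, most frequent first."""
    return [word for word, _ in follow.get(prev, Counter()).most_common(k)]


# Toy corpus (hypothetical, not taken from metadata.txt).
corpus = [
    ["the", "boy", "reads", "the", "book"],
    ["the", "boy", "likes", "the", "apple"],
    ["the", "girl", "reads", "a", "book"],
]
follow = train_bigram(corpus)
print(mle_prob(follow, "the", "boy"))  # 2 of the 5 bigrams starting with "the": 0.4
print(suggest(follow, "the"))          # candidates to show after the user types "the"
```

Because every word except EOS is followed by exactly one token, summing a word's follower counts gives the same denominator as the unigram count kept in `d`. An unseen word yields probability 0.0 and an empty candidate list; handling that case better would need smoothing, which this sketch leaves out.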
