Extracting data from an mdx dictionary file

1. Use GetDict to convert the .mdx file to a .txt file

The resulting file:
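A quick way to inspect the converted file before parsing it is to print its first few lines. The sketch below assumes, as the script in step 3 does, that each line of the .txt file holds one dictionary entry as HTML. The helper name `peek_entries` and the truncation width are illustrative and are not part of GetDict.

```python
from itertools import islice


def peek_entries(path, count=3, width=120):
    """Return the first `count` non-empty lines of `path`, each cut to `width` characters."""
    with open(path, "r", encoding="utf-8") as f:
        # Strip the trailing newline and skip lines that are empty or whitespace only.
        lines = (line.rstrip("\n") for line in f)
        non_empty = (line for line in lines if line.strip())
        # Build the list while the file is still open.
        return [line[:width] for line in islice(non_empty, count)]
```

Calling `peek_entries("kelinsi.txt")` and printing each returned string should show whether every line starts with the markup of a single entry.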

2. Database design

CREATE TABLE `word` (
  `wid` int(11) NOT NULL AUTO_INCREMENT,
  `word_en` varchar(255) DEFAULT NULL,
  `star` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`wid`)
) ENGINE=InnoDB AUTO_INCREMENT=34415 DEFAULT CHARSET=utf8;

CREATE TABLE `sentences` (
  `sid` int(11) NOT NULL AUTO_INCREMENT,
  `sentence_ch` text,
  `sentence_en` text,
  `iid` int(11) DEFAULT NULL,
  PRIMARY KEY (`sid`)
) ENGINE=InnoDB AUTO_INCREMENT=90608 DEFAULT CHARSET=utf8;

CREATE TABLE `see_also` (
  `sid` int(11) NOT NULL AUTO_INCREMENT,
  `number` varchar(255) DEFAULT NULL,
  `word_en` varchar(255) DEFAULT NULL,
  `wid` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`sid`)
) ENGINE=InnoDB AUTO_INCREMENT=2159 DEFAULT CHARSET=utf8;

CREATE TABLE `items` (
  `iid` int(11) NOT NULL AUTO_INCREMENT,
  `number` int(11) DEFAULT NULL,
  `label` varchar(255) DEFAULT NULL,
  `word_ch` varchar(255) DEFAULT NULL,
  `explanation` text,
  `gram` varchar(255) DEFAULT NULL,
  `wid` int(11) DEFAULT NULL,
  PRIMARY KEY (`iid`)
) ENGINE=InnoDB AUTO_INCREMENT=64250 DEFAULT CHARSET=utf8;

CREATE TABLE `en_tip` (
  `eid` int(11) NOT NULL AUTO_INCREMENT,
  `tip` text,
  `iid` int(11) DEFAULT NULL,
  PRIMARY KEY (`eid`)
) ENGINE=InnoDB AUTO_INCREMENT=2460 DEFAULT CHARSET=utf8;
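The tables are linked through `wid` and `iid`: each row in `items` points to a `word`, and each row in `sentences` or `en_tip` points to an item. As a rough illustration of how an entry could be read back, the sketch below recreates simplified versions of three of the tables in an in-memory SQLite database and joins them. SQLite is only a stand-in so the example runs without a MySQL server, the column types are simplified, and the sample rows and the `lookup` helper are made up for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one hypothetical word, item, and sentence."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE word (wid INTEGER PRIMARY KEY AUTOINCREMENT, word_en TEXT, star TEXT);
        CREATE TABLE items (iid INTEGER PRIMARY KEY AUTOINCREMENT, number INTEGER, label TEXT,
                            word_ch TEXT, explanation TEXT, gram TEXT, wid INTEGER);
        CREATE TABLE sentences (sid INTEGER PRIMARY KEY AUTOINCREMENT, sentence_ch TEXT,
                                sentence_en TEXT, iid INTEGER);
    """)
    cur = con.cursor()
    # Insert the parent row first, then reuse its generated key in the child rows.
    cur.execute("INSERT INTO word (word_en, star) VALUES (?, ?)", ("apple", "***"))
    wid = cur.lastrowid
    cur.execute(
        "INSERT INTO items (number, label, word_ch, explanation, gram, wid) VALUES (?, ?, ?, ?, ?, ?)",
        (1, "N-COUNT", "苹果", "An apple is a round fruit.", "", wid),
    )
    iid = cur.lastrowid
    cur.execute(
        "INSERT INTO sentences (sentence_ch, sentence_en, iid) VALUES (?, ?, ?)",
        ("我吃了一个苹果。", "I ate an apple.", iid),
    )
    con.commit()
    return con


def lookup(con, word_en):
    """Return (sense number, Chinese meaning, English example) rows for one headword."""
    sql = """
        SELECT i.number, i.word_ch, s.sentence_en
        FROM word AS w
        JOIN items AS i ON i.wid = w.wid
        LEFT JOIN sentences AS s ON s.iid = i.iid
        WHERE w.word_en = ?
        ORDER BY i.number, s.sid
    """
    return con.execute(sql, (word_en,)).fetchall()
```

The same join should carry over to the MySQL schema with `%s` placeholders in place of `?`.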

3. Use Python to extract the data and store it in the database

# -*- coding:utf-8 -*-
import pymysql
from lxml import etree

word_list = []


def first_or_empty(results):
    # Return the first XPath result, or "" when nothing matched.
    return results[0] if results else ""


def getdata(filename):
    num = 0
    # Each line of the converted file holds the HTML of one dictionary entry.
    with open(filename, 'r', encoding='utf-8') as f:
        for s in f:
            if not s.strip():
                continue  # skip blank lines
            doc = etree.HTML(s)
            word = {}
            word_en = doc.xpath('//span[@class="C1_word_header_word"]/text()')[0]
            print("Parsing:", word_en)
            word["word_en"] = word_en
            word["star"] = first_or_empty(doc.xpath('//span[@class="C1_word_header_star"]/text()'))
            items = []
            for explanation_item in doc.xpath('//div[@class="C1_explanation_item"]'):
                item = {}
                explanation_box = explanation_item.xpath('div[@class="C1_explanation_box"]')
                if len(explanation_box) == 0:
                    continue
                explanation_box = explanation_box[0]
                # Sense number
                item["number"] = first_or_empty(explanation_box.xpath('span[@class="C1_item_number"]//text()'))
                # Label list
                explanation_label_list = explanation_box.xpath('span[@class="C1_explanation_label"]/text()')
                if len(explanation_label_list) != 0:  # a regular item
                    # Label
                    item["label"] = explanation_label_list[0]
                    # Chinese meaning of the word
                    item["word_ch"] = first_or_empty(explanation_box.xpath('span[@class="C1_text_blue"]/text()'))
                    # Explanation of the word
                    item["explanation"] = ''.join(explanation_box.xpath('text()|span[@class="C1_inline_word"]/text()'))
                    # Grammar information
                    item["word_gram"] = first_or_empty(explanation_box.xpath('span[@class="C1_word_gram"]/text()'))
                    # Example sentences
                    sentence_list = []
                    en_tip_list = []
                    for sentence in explanation_item.xpath('ul/li'):
                        paragraphs = sentence.xpath('p')
                        if len(paragraphs) == 0:
                            # An <li> without <p> children is an English tip
                            en_tip_list.append(''.join(sentence.xpath('.//text()')))
                        elif len(paragraphs) == 2:
                            sentence_list.append({
                                # English example sentence
                                "sentence_en": ''.join(sentence.xpath('p[@class="C1_sentence_en"]//text()')),
                                # Chinese translation
                                "sentence_ch": first_or_empty(sentence.xpath('p[2]//text()')),
                            })
                    item["sentences"] = sentence_list
                    item["en_tip"] = en_tip_list
                else:  # a "See also" cross-reference
                    item["see_also"] = explanation_box.xpath('b[@class="C1_text_blue"]//text()')
                items.append(item)
            word["items"] = items
            num = num + 1
            print("Parsed:", num)
            word_list.append(word)


def import_data():
    # Connect to the database and get a cursor
    con = pymysql.connect(host='localhost', port=3306, user='root', password='root',
                          database='kelinsi_dict', charset='utf8')
    cur = con.cursor()
    num2 = 0
    for word in word_list:
        print("Storing:", word.get("word_en"))
        # Parameterized queries let the driver escape the values;
        # PyMySQL 1.0 removed escape_string from the top-level pymysql module.
        cur.execute("insert into word(word_en, star) values (%s, %s)",
                    (word.get("word_en"), word.get("star")))
        con.commit()
        last_wid = cur.lastrowid
        for item in word.get("items"):
            if item.get("see_also"):
                for see_word in item.get("see_also"):
                    cur.execute("insert into see_also(number, word_en, wid) values (%s, %s, %s)",
                                (item.get("number"), see_word, last_wid))
                    con.commit()
            elif item.get("word_ch"):
                cur.execute("insert into items(number, label, word_ch, explanation, gram, wid) "
                            "values (%s, %s, %s, %s, %s, %s)",
                            (item.get("number"), item.get("label"), item.get("word_ch"),
                             item.get("explanation"), item.get("word_gram"), last_wid))
                con.commit()
                last_iid = cur.lastrowid
                # Example sentences
                for sentence in item.get("sentences"):
                    cur.execute("insert into sentences(sentence_ch, sentence_en, iid) values (%s, %s, %s)",
                                (sentence.get("sentence_ch"), sentence.get("sentence_en"), last_iid))
                    con.commit()
                # English tips
                for en_tip in item.get("en_tip"):
                    cur.execute("insert into en_tip(tip, iid) values (%s, %s)", (en_tip, last_iid))
                    con.commit()
        num2 = num2 + 1
        print("Stored:", num2)
    cur.close()
    con.close()


if __name__ == '__main__':
    getdata("kelinsi.txt")
    import_data()
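The script above commits after every INSERT, which can be slow when tens of thousands of entries are imported. One common alternative is to group rows and commit once per group. The helper below shows only the grouping step and does not depend on any database driver; the name `chunks` and the chunk size are illustrative and do not come from the original script.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        # Pull up to `size` rows; an empty list means the input is exhausted.
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With PyMySQL, each chunk could be passed to `cursor.executemany(sql, chunk)` followed by a single `con.commit()`. This only suits rows that do not need `lastrowid` afterwards, such as the `sentences` and `en_tip` rows of one item.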

4. Dictionary file and database file

Link: https://pan.baidu.com/s/1e4TwCAwcioBGzYP2j76AXg
Extraction code: 128q
