Holiday Study [11]: Word Segmentation in Python, and Scraping a Dictionary from Baidu

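Today I mainly used jieba to segment the titles I had scraped from CSDN. During segmentation, however, I found that domain terms were being broken apart: 大数据 ("big data") was split into 大/数据, and 云计算 ("cloud computing") into 云/计算.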

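So I then scraped the related terms from the information-technology category of Baidu Baike to use as a dictionary. I expect to finish the segmentation task tonight.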
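
One way to keep jieba from splitting such terms is to register them as dictionary words before segmenting. The following is a minimal sketch under assumptions, not the finished code for this task: the helper names `normalize_terms` and `register_terms` are hypothetical, while `jieba.add_word` is part of jieba's public API.

```python
def normalize_terms(raw_lines):
    """Strip whitespace, drop empty lines, and de-duplicate while keeping order."""
    seen = set()
    terms = []
    for line in raw_lines:
        term = line.strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


def register_terms(terms):
    """Add each term to jieba's in-memory dictionary (hypothetical helper)."""
    import jieba  # imported here so normalize_terms can be used without jieba

    for term in terms:
        jieba.add_word(term)
```

After calling `register_terms(normalize_terms(lines))` on the scraped terms, `jieba.cut` should keep the registered terms in one piece. If the terms are already stored one per line in a UTF-8 file, `jieba.load_userdict(path)` is an alternative that reads the file directly.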

The segmentation code is as follows:

import jieba

# Segment each line of the input file and append the tokens to the output file
def cut():
    with open("E://luntan.txt", "r", encoding="utf-8") as f:
        for line in f:
            seg_list = jieba.cut(line)
            for i in seg_list:
                print(i)
                write(i + " ")

# Append segmented text to the output file
def write(contents):
    with open("E://luntan_cut.txt", "a+", encoding="utf-8") as f:
        f.write(contents)
    print("Write succeeded!")

# Load the stop-word list (one word per line)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Remove stop words from one sentence
def seg_sentence(sentence, stopwords):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    return outstr

# Remove stop words from every line of the segmented file
def cut_all():
    # Load the stop words once instead of once per sentence
    stopwords = set(stopwordslist('E://stop.txt'))  # path to the stop-word file
    # The original output path 'E//luntan_stop' was missing the colon after the drive letter
    with open('E://luntan_cut.txt', 'r', encoding='utf-8') as inputs, \
         open('E://luntan_stop', 'w', encoding='utf-8') as outputs:
        for line in inputs:
            line_seg = seg_sentence(line, stopwords)  # the return value is a string
            outputs.write(line_seg + '\n')

if __name__ == "__main__":
    cut()

The text after segmentation:

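To scrape the dictionary from Baidu, open the Baidu Baike page https://baike.baidu.com/wikitag/taglist?tagId=76607, scroll all the way to the bottom of the page, and save it locally in mhtml format. Then open the saved file in a browser, right-click and choose "view source", and save the source as a txt file.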

The code is as follows:

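import re

# Capture the value of each title="..." attribute. A non-greedy character class
# is used because title=".*" would run on to the last quote on the line.
pattern = re.compile(r'title="([^"]*)"')

# Read the saved page source and write out every title value found
def read():
    with open("E://mhtml.txt", "r", encoding="utf-8") as f:
        for line in f:
            m = pattern.findall(line)
            if m:
                print(m)
                # One term per line; the capture group replaces the earlier
                # lstrip/rstrip calls, which strip character sets, not prefixes
                for title in m:
                    write(title + "\n")

# Append one extracted term to the dictionary file
def write(contents):
    with open("E://xinxi.txt", "a+", encoding="utf-8") as f:
        f.write(contents)
    print("Write succeeded!")

if __name__ == "__main__":
    read()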

Result:
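
The extraction step depends on pulling the value of each `title` attribute out of the saved source, and two details are easy to get wrong: greedy matching, and stripping a prefix with `str.lstrip`. The sketch below checks both on an in-memory line. The sample markup and the names `TITLE_RE` and `extract_titles` are made up for illustration; the real page source may look different.

```python
import re

# Capture only the text between the quotes of each title attribute.
TITLE_RE = re.compile(r'title="([^"]*)"')


def extract_titles(html_line):
    """Return every title="..." value found in one line of HTML source."""
    return TITLE_RE.findall(html_line)


# Hypothetical sample line; the real Baidu Baike markup may differ.
sample = '<a title="大数据" href="/item/1">x</a><a title="云计算" href="/item/2">y</a>'
print(extract_titles(sample))  # ['大数据', '云计算']

# A greedy pattern matches from the first title=" to the LAST double quote
# on the line, so both links collapse into a single match.
greedy = re.findall(r'title=".*"', sample)
print(len(greedy))  # 1

# str.lstrip removes any leading characters from the given set, not a prefix,
# so a value that starts with one of those characters loses part of itself.
print('title="table"'.lstrip('title="'))  # able"
```

Using a capture group avoids both problems, since the quotes and the `title=` prefix never become part of the extracted value.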
