Python: scraping article titles, dates, and content from the Shanghai Sports Lottery site

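Python final project: scrape article titles, dates, and content from the Shanghai Sports Lottery site, compute word frequencies, and generate a word cloud in a custom shape.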

1. Code for scraping the content with selenium:

# https://www.shsportslottery.com/shsportsweb/html/tycp/lottery_shxw/List/list_0.htm
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import csv
import time
from bs4 import BeautifulSoup

url1 = "https://www.shsportslottery.com/shsportsweb/html/tycp/lottery_shxw/List/list_0.htm"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 Edg/91.0.864.41'}

driver = webdriver.Chrome()
driver.get(url1)
while True:
    time.sleep(3)  # give the list page time to load
    content = driver.page_source
    soup = BeautifulSoup(content, "lxml")
    news = soup.find(class_="news_list")
    for new in news.find_all("li"):
        times = new.find(class_="color").get_text()
        title = new.find(name="a").get_text()
        link = new.find(name="a").get("href")
        # Example detail URL:
        # https://www.shsportslottery.com/shsportsweb/html/tycp/lottery_shxw/2021-06-10/Detail_160252.htm
        url2 = "https://www.shsportslottery.com/shsportsweb/html/tycp/" + link[6:]
        print(times, title, url2)
        # Fetch the article page itself with requests
        con = requests.get(url=url2, headers=headers)
        con.encoding = "utf-8"
        html2 = con.text
        soup2 = BeautifulSoup(html2, "lxml")
        # One CSV row for the date and title
        with open("shSportslottery.csv", "a", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow((times, title))
        try:
            contents = soup2.find(class_="news_l").find("tr")
            for span in contents.find_all("span"):
                content = span.get_text()
                # Plain text goes to SContent.txt for the word frequency step
                with open("SContent.txt", "a+", encoding="utf-8") as file:
                    file.write(content)
                with open("shSportslottery.csv", "a", newline="", encoding="utf-8") as file:
                    writer = csv.writer(file)
                    writer.writerow([content])
        except AttributeError:
            # The try/except matters: some articles are written as H5 pages or
            # contain only video or images, so the expected tags are missing and
            # the lookup raises an error (learned the hard way).
            content = "none"
    # Follow the "后一页" (next page) link until it no longer has an href
    next_link = driver.find_element(By.XPATH, '//a[contains(text(),"后一页")]')
    if next_link.get_attribute("href"):
        next_link.click()
    else:
        break
driver.quit()
print("Scraping finished!")
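The scraper builds each article URL by slicing the first six characters off the link and prepending a fixed prefix. As a hedged alternative, the sketch below resolves the link against the list page URL with the standard library's urljoin. It assumes the href is a relative path starting with "../../" (which is what a six-character slice suggests); the sample href and the detail_url helper are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin

LIST_URL = "https://www.shsportslottery.com/shsportsweb/html/tycp/lottery_shxw/List/list_0.htm"


def detail_url(href, base=LIST_URL):
    """Resolve an href found on the list page against the list page's own URL."""
    return urljoin(base, href)


# Hypothetical href, assumed to start with "../../" (6 characters)
sample_href = "../../lottery_shxw/2021-06-10/Detail_160252.htm"
print(detail_url(sample_href))
```

Under that assumption the result is the same as the string slice, but urljoin would keep working if the site changed the number of "../" segments.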

2. Word frequency count, screenshot of the run:

Code:

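# Word frequency count
import jieba

excludes = {"体育", "记者", "编辑", "据了解"}
with open("SContent.txt", "r", encoding="utf-8-sig") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens and the excluded words
    if len(word) == 1 or word in excludes:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the 15 most frequent words (fewer if the text is short)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))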
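The counting step can also be written with collections.Counter from the standard library. This is a minimal sketch, not the original code: top_words is a hypothetical helper, and the sample list stands in for the output of jieba.lcut so the example runs without jieba or SContent.txt.

```python
from collections import Counter


def top_words(tokens, excludes=frozenset(), n=15):
    """Return the n most frequent tokens, skipping single characters and excluded words."""
    counts = Counter(
        tok for tok in tokens
        if len(tok) > 1 and tok not in excludes
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Stand-in for jieba.lcut(txt)
    sample = ["彩票", "的", "上海", "彩票", "体育", "上海", "彩票"]
    for word, count in top_words(sample, excludes={"体育"}):
        print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) returns fewer than n pairs when there are fewer distinct words, so no index check is needed.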

3. Code for generating the word cloud:

# Generate the word cloud
import jieba
import wordcloud
from imageio import imread

excludes = {"体育", "记者", "编辑", "据了解"}
with open("SContent.txt", "r", encoding="utf-8") as f:
    t = f.read()
ls = jieba.lcut(t)
txt = " ".join(ls)  # separate the tokens with spaces so WordCloud can split them
mask = imread("play.png")  # image that gives the cloud its shape
w = wordcloud.WordCloud(width=1000, height=700, background_color="white", stopwords=excludes, font_path="ssdf.ttf", mask=mask)
w.generate(txt)
w.to_file("STCPcon.png")
print("Done!")
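WordCloud splits its input text with a regular expression, so how the jieba tokens are joined decides what counts as a word. The sketch below illustrates this with the standard library only. It assumes the default token pattern given in the wordcloud documentation for the regexp parameter; cloud_tokens is a hypothetical helper and the token list stands in for jieba.lcut output.

```python
import re

# Assumption: the default pattern documented for WordCloud's `regexp` parameter
DEFAULT_PATTERN = r"\w[\w']+"


def cloud_tokens(text):
    """Split text the way the assumed default pattern would."""
    return re.findall(DEFAULT_PATTERN, text)


tokens = ["上海", "体育", "彩票"]  # stand-in for jieba.lcut(t)
print(cloud_tokens("".join(tokens)))   # joined without a separator
print(cloud_tokens(" ".join(tokens)))  # joined with spaces
```

Without a separator the whole run of Chinese characters matches as a single token, so the stopword set can never match the individual words. Joined with spaces, each jieba token is seen separately.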

Generated word cloud image:

Summary video:
