Web Scraping in Practice (3): Fetching News and Poems from List Pages

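The first two parts of this article aim to do the following: given a link, fetch the titles and contents of paginated news. Some of the code draws on Web Scraping in Practice (2): Sina News Content Details. The site being scraped is Sina's international news section.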

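1. A single news article

Fetch the content of a single item under the latest international news.
Starting from the link above, I made small changes to the program's parameters, mainly to how the comment count is fetched.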

# Given a news URL, return its comment count. The comment API URLs differ only in the news id.
import re
import json
import requests

commentURL = (
    "https://comment.sina.com.cn/page/info?version=1&format=json"
    "&channel=gj&newsid=comos-i{}&group=0&compress=0&ie=utf-8&oe=utf-8&page=1"
    "&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user"
    "&callback=jsonp_1601956837238&_=1601956837238"
)

def getCommentCounts(newsurl):
    m = re.search(r'doc-ii(.+)\.shtml', newsurl)
    newsid = m.group(1)  # news id taken from the article URL
    comments = requests.get(commentURL.format(newsid))
    # The response is JSONP: keep only the JSON between the outer parentheses
    text = comments.text
    jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
    return jd["result"]["count"]["total"]  # comment count
import requests
from datetime import datetime
from bs4 import BeautifulSoup

# Input: article URL; output: body text, title, comment count, source
def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select(".main-title")[0].text
    result['newssource'] = soup.select(".source")[0].text
    timesource = soup.select(".date")[0].text
    result['dt'] = datetime.strptime(timesource, "%Y年%m月%d日 %H:%M")
    paragraphs = soup.select("#article p")
    # Every paragraph except the last is body text
    result['article'] = '\n'.join(p.text.strip() for p in paragraphs[:-1])
    # The last paragraph is the editor line; remove its "责任编辑：" label
    result['editor'] = paragraphs[-1].text.strip().replace('责任编辑：', '')
    result['comments'] = getCommentCounts(newsurl)
    return result

news = "https://news.sina.com.cn/w/2020-10-06/doc-iivhvpwz0572161.shtml"
getNewsDetail(news)
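Both the comment API and the list API return JSONP, that is, JSON wrapped in a callback call. Note that `str.strip` removes any of the given characters rather than a literal prefix, so it only works by accident of the payload's first and last characters. The sketch below shows one way to unwrap the payload by position instead, together with the id extraction. The helper names `unwrap_jsonp` and `extract_news_id` and the sample response body are illustrative assumptions, not part of the original program.

```python
import json
import re

def unwrap_jsonp(text):
    """Return the JSON payload of a JSONP response such as 'cb({...});'."""
    start = text.index('(') + 1   # just after the first opening parenthesis
    end = text.rindex(')')        # the last closing parenthesis
    return json.loads(text[start:end])

def extract_news_id(newsurl):
    """Pull the id out of a '.../doc-ii<id>.shtml' article URL."""
    m = re.search(r'doc-ii(.+)\.shtml', newsurl)
    if m is None:
        raise ValueError('no news id in URL: ' + newsurl)
    return m.group(1)

# Hypothetical response body, shaped like the comment API's reply
sample = 'jsonp_1601956837238({"result": {"count": {"total": 12}}})'
print(unwrap_jsonp(sample)["result"]["count"]["total"])   # 12
print(extract_news_id(
    "https://news.sina.com.cn/w/2020-10-06/doc-iivhvpwz0572161.shtml"
))                                                        # vhvpwz0572161
```

Raising `ValueError` on a URL without an id makes a bad link fail with a clear message instead of an `AttributeError` on `None`.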

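2. News list

Approach:
First, find the URL that controls the page's pagination.
Then get the links to all the news items on each page.
Next, fetch the content behind each link.
Finally, change the page number in the pagination URL.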
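The steps above can be sketched without any network access by passing the two fetching steps in as functions. Everything here is an assumption made for illustration: the name `crawl_pages`, its parameters, and the `example.com` URLs are hypothetical. The sketch also skips links it has already seen, which is one possible way to avoid fetching the same article twice.

```python
def crawl_pages(url_template, first_page, last_page, list_links, fetch_detail):
    """Collect details for every article linked from pages first_page..last_page."""
    details = []
    seen = set()
    for page in range(first_page, last_page + 1):
        page_url = url_template.format(page)      # change the page number
        for link in list_links(page_url):         # links on this page
            if link in seen:                       # article already fetched
                continue
            seen.add(link)
            details.append(fetch_detail(link))    # content behind each link
    return details

# Stand-ins for real network calls (hypothetical data)
fake_lists = {
    "https://example.com/list?page=1": ["a", "b"],
    "https://example.com/list?page=2": ["b", "c"],
}
result = crawl_pages(
    "https://example.com/list?page={}", 1, 2,
    list_links=lambda u: fake_lists[u],
    fetch_detail=lambda link: {"url": link},
)
print(result)   # [{'url': 'a'}, {'url': 'b'}, {'url': 'c'}]
```

Note that `range(first_page, last_page + 1)` includes the last page; a plain `range(1, 5)` would stop at page 4.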

import json
import requests
import pandas as pd

# Get the links on one list page, then call getNewsDetail above for each link
def parselistlink(url):
    newsdetails = []
    res = requests.get(url)
    # Drop the JSONP wrapper "newsloadercallback(...);" so the payload parses as JSON
    text = res.text
    jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
    for ent in jd['result']['data']:
        # Pass each article link on the page to getNewsDetail to get its content
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails

url = 'https://interface.sina.cn/news/get_news_by_channel_new_v2018.d.html?cat_1=51923&show_num=27&level=1,2&page={}&callback=newsloadercallback&_=1601968313565'
result = pd.DataFrame()
# Fetch the first 5 pages
for i in range(1, 6):
    newsurl = url.format(i)
    newsary = parselistlink(newsurl)
    result = pd.concat([result, pd.DataFrame(newsary)], axis=0)
print(result)
# Remove duplicate rows and renumber the index
result1 = result.drop_duplicates(keep='first')
result1 = result1.reset_index(drop=True)
print(result1)

3. Poem list

Poem link: https://www.shicimingju.com/chaxun/zuozhe/9_2.html

1. First, get the poem links on a single page

url = 'http://www.shicimingju.com/chaxun/zuozhe/9.html'
base = 'https://www.shicimingju.com'
# Send browser-like headers (client information) so the server
# does not trivially identify the request as a crawler
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
r = requests.get(url, headers=headers)
html = r.content.decode('utf-8')  # decode the raw bytes as UTF-8
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', attrs={'class': 'card shici_card'})
# Each poem title is an <h3> containing a relative link
hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
hrefs = [base + i for i in hrefs]
hrefs
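Prefixing each href with the site root works as long as every href starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a link against the page it came from and also copes with hrefs that are already absolute. The href values below are hypothetical examples, not taken from the real page.

```python
from urllib.parse import urljoin

page_url = 'https://www.shicimingju.com/chaxun/zuozhe/9.html'

# Hypothetical href values as they might appear in the page's <a> tags
raw_hrefs = [
    '/chaxun/list/1.html',                              # root-relative
    'https://www.shicimingju.com/chaxun/list/2.html',   # already absolute
    '9_2.html',                                         # relative to the current page
]
links = [urljoin(page_url, h) for h in raw_hrefs]
for link in links:
    print(link)
# https://www.shicimingju.com/chaxun/list/1.html
# https://www.shicimingju.com/chaxun/list/2.html
# https://www.shicimingju.com/chaxun/zuozhe/9_2.html
```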

2. Then get the links to all poems across all pages

def gethrefs(url):
    # Send browser-like headers (client information) so the server
    # does not trivially identify the request as a crawler
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    base = 'https://www.shicimingju.com'
    nexturl = url
    ans = []
    while nexturl != 0:
        r = requests.get(nexturl, headers=headers)
        html = r.content.decode('utf-8')  # decode the raw bytes as UTF-8
        soup = BeautifulSoup(html, 'lxml')
        div = soup.find('div', attrs={'class': 'card shici_card'})
        hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
        hrefs = [base + i for i in hrefs]
        try:
            # The "下一页" (next page) link is missing on the last page
            nexturl = base + soup.find('a', string='下一页')['href']
            print('Reading the next page')
        except (TypeError, KeyError):
            print('This is the last page')
            nexturl = 0
        ans.append(hrefs)  # one list of links per page
    return ans
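The loop pattern used here, "collect the items, then follow the next-page link until there is none", can be isolated from the HTML parsing. The following is a minimal sketch under stated assumptions: `follow_next_links`, its `get_page` callback, and the `fake_site` data are all hypothetical. It returns one flat list and adds two guards the pattern benefits from: a visited set against pages that link in a circle, and a cap on the number of pages.

```python
def follow_next_links(start_url, get_page, max_pages=100):
    """Walk a chain of pages; get_page(url) returns (items, next_url_or_None)."""
    collected = []
    visited = set()
    url = start_url
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        items, url = get_page(url)   # url becomes the next page, or None
        collected.extend(items)      # flat list of all items seen so far
    return collected

# Stand-in for the real site: page key -> (items on page, next page key)
fake_site = {
    'p1': (['poem-1', 'poem-2'], 'p2'),
    'p2': (['poem-3'], 'p3'),
    'p3': (['poem-4'], None),
}
all_items = follow_next_links('p1', fake_site.__getitem__)
print(all_items)   # ['poem-1', 'poem-2', 'poem-3', 'poem-4']
```

Using `None` rather than `0` as the "no next page" marker keeps the sentinel distinct from any value a URL could take.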

3. Get the poem text behind each link

import os

def writeotxt(url, i):
    # i: running number of the poem, used only in the progress message
    # Send browser-like headers (client information) so the server
    # does not trivially identify the request as a crawler
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # Data cleaning
    title = soup.find('h1', id='zs_title').text
    content = soup.find('div', class_='item_content').text.strip()
    # Create the output folder first
    firedir = os.path.join(os.getcwd(), '苏轼的词')
    if not os.path.exists(firedir):
        os.mkdir(firedir)
    with open(os.path.join(firedir, '%s.txt' % title), mode='w+', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n')
    print('Saving poem %d...' % i)
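Because the poem title is used directly as the file name, a title containing `/`, `:` or similar characters would make the write fail. The sketch below shows one possible way to clean the title first and to build the path with `pathlib`. The helpers `safe_filename` and `save_poem` are hypothetical names introduced for illustration, and the demo writes into a temporary directory rather than the working directory.

```python
import re
import tempfile
from pathlib import Path

def safe_filename(title):
    """Replace characters that Windows or POSIX file names cannot contain."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_')
    return cleaned or 'untitled'

def save_poem(folder, title, content):
    """Write title and content to <folder>/<safe title>.txt and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    path = folder / (safe_filename(title) + '.txt')
    path.write_text(title + '\n' + content + '\n', encoding='utf-8')
    return path

# Demo data only; a real run would pass the scraped title and content
demo_dir = Path(tempfile.mkdtemp())
saved = save_poem(demo_dir / '苏轼的词', '水调歌头/明月几时有', '明月几时有，把酒问青天。')
print(saved.name)   # 水调歌头_明月几时有.txt
```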
