Example: Scraping Articles from Sanwen.net with Python

This article walks through scraping articles from Sanwen.net (散文网, a Chinese prose and essay site) with Python. It is shared for reference and study, and may be useful to readers working on something similar. The details follow.

Setup: Python 2.7

Libraries: bs4, requests

Install both with pip:

sudo pip install bs4

sudo pip install requests

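A brief note on using bs4: since we are scraping web pages, only find and find_all are covered here.

The two differ in what they return. find returns the first matching tag together with its contents,

while find_all returns a list of every matching tag.

To see the difference, write a test.html file for testing find and find_all.

Its contents are: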

<html>
<head>
</head>
<body>
<div id="one"><a></a></div>
<div id="two"><a href="#" rel="external nofollow">abc</a></div>
<div id="three"><a href="#" rel="external nofollow">three a</a><a href="#" rel="external nofollow">three a</a><a href="#" rel="external nofollow">three a</a></div>
<div id="four"><a href="#" rel="external nofollow">four<p>four p</p><p>four p</p><p>four p</p> a</a></div>
</body>
</html>

The code in test.py is then:

from bs4 import BeautifulSoup
import lxml  # not used directly; importing it confirms the lxml parser is installed

if __name__ == '__main__':
    s = BeautifulSoup(open('test.html'), 'lxml')
    print(s.prettify())
    print("------------------------------")
    # No filter: find gives the first div, find_all gives all four
    print(s.find('div'))
    print(s.find_all('div'))
    print("------------------------------")
    print(s.find('div', id='one'))
    print(s.find_all('div', id='one'))
    print("------------------------------")
    print(s.find('div', id="two"))
    print(s.find_all('div', id="two"))
    print("------------------------------")
    print(s.find('div', id="three"))
    print(s.find_all('div', id="three"))
    print("------------------------------")
    print(s.find('div', id="four"))
    print(s.find_all('div', id="four"))
    print("------------------------------")

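Running it shows that when a single, specific tag is requested the two behave much the same; the difference shows up when a group of tags matches.

So when using them, be clear about which result you actually want (one tag or a list of tags), otherwise the code will raise errors.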
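The kind of error meant here can be sketched as follows. This is a minimal illustration using a made-up HTML snippet and the built-in html.parser (an assumption, chosen so the sketch does not depend on lxml); it is not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the site being scraped.
HTML = '<div id="three"><a href="#">x</a><a href="#">y</a></div>'

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")      # a single Tag, or None if nothing matches
all_links = soup.find_all("a")   # a list-like result, possibly empty
missing = soup.find("p")         # the snippet has no <p>, so this is None

# A find_all result has to be iterated; it is not a single tag.
link_texts = [a.text for a in all_links]

# Calling .text on None raises AttributeError, so check before using it.
missing_text = missing.text if missing is not None else ""

print(first_link.text)
print(link_texts)
print(repr(missing_text))
```

Treating the find_all result as if it were one tag (for example asking it for .text) also fails with an AttributeError, which is the usual way this mix-up shows itself.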

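Next, the page content is fetched with requests. I don't really understand why other people set headers and other extras.

I simply request the pages directly: a GET request fetches the second-level pages of several Sanwen.net categories, and a loop over that group of categories crawls all of the pages.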

import requests

def get_html():
    url = "https://www.sanwen.net/"
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        i = 1
        if doc == 'sanwen':
            print("running sanwen -----------------------------")
        if doc == 'shige':
            print("running shige ------------------------------")
        if doc == 'zawen':
            print('running zawen -------------------------------')
        if doc == 'suibi':
            print('running suibi -------------------------------')
        if doc == 'rizhi':
            print('running rizhi -------------------------------')
        if doc == 'novel':  # was 'nove', which never matched
            print('running xiaoxiaoshuo -------------------------')
        while i < 10:
            par = {'p': i}
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)
            # advance one page at a time; i += i would double the counter and skip pages
            i += 1
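On the headers question raised earlier: a common reason for setting them is that some sites respond differently to clients that do not identify themselves like a browser. Whether Sanwen.net does this is not verified here. The sketch below only prepares a request without sending it, to show what requests would put on the wire; the User-Agent value and the build_request helper are hypothetical.

```python
import requests

BASE_URL = "https://www.sanwen.net/"

# Hypothetical header value; the article's code sends no custom headers.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(category, page):
    """Prepare (but do not send) a GET request for one listing page."""
    req = requests.Request(
        "GET",
        BASE_URL + category + "/",
        params={"p": page},
        headers=HEADERS,
    )
    return req.prepare()


prepared = build_request("sanwen", 1)
print(prepared.url)                    # the params dict becomes a query string
print(prepared.headers["User-Agent"])
```

To actually send the same request, the same params and headers arguments can be passed to requests.get.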

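This code does nothing when res.status_code is not 200. As a result no error is ever shown, and some of the content is silently lost. Looking at Sanwen.net's pages, the listing URLs have the form www.sanwen.net/rizhi/&p=1

The maximum value of p is 10, which I don't quite understand; last time, when I scraped Panduoduo (盘多多), there were 100 pages. Never mind, I'll look into it later. Each page's content is then fetched with the get method.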
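One way to stop non-200 responses from disappearing silently is to record them. The sketch below is only an illustration: crawl_category and the fetch callable are hypothetical names, and the network is replaced by a stand-in function so the example runs offline.

```python
def crawl_category(fetch, category, max_page=9):
    """Fetch pages 1..max_page of one category.

    `fetch(category, page)` is a hypothetical callable returning
    (status_code, text); in the article's setup it would wrap requests.get.
    """
    pages = {}    # page number -> HTML text
    failed = []   # (page number, status code) for every non-200 response
    for page in range(1, max_page + 1):
        status, text = fetch(category, page)
        if status == 200:
            pages[page] = text
        else:
            failed.append((page, status))
    return pages, failed


# Stand-in for the network: page 3 pretends to fail.
def fake_fetch(category, page):
    if page == 3:
        return 500, ""
    return 200, "<html>%s page %d</html>" % (category, page)


pages, failed = crawl_category(fake_fetch, "rizhi", max_page=4)
print(sorted(pages))
print(failed)
```

Keeping the failed list makes it possible to retry those pages later, or at least to see how much was lost.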

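Once a page has been fetched, the next step is to extract the author and title. The code looks like this:

import re
from bs4 import BeautifulSoup

def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    link = s.find('div', class_='categorylist').find_all('li')
    for i in link:
        # skip the pagination item at the end of the list
        if i != s.find('li', class_='page'):
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            # a double slash or a single slash in the title
            sign = re.compile(r'(//)|/')
            match = sign.search(title.text)
            file_name = title.text
            if match:
                # substitute on the text directly; str() can fail on
                # non-ASCII titles under Python 2
                file_name = sign.sub('a', title.text)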

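Extracting the titles turned up an annoying surprise: why would anyone put slashes in the title of an essay? And not just one slash, some titles have two. This directly broke the file names when I later wrote the files, so I wrote a regular expression to replace the slashes.

The last step is fetching the essay text. Parsing each listing page yields the article URLs, so the content can be fetched directly. I had originally planned to walk through the articles one by one by altering the URL, so this saves some effort.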
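The slash replacement can be tried in isolation. The sketch below is a small variation rather than the exact pattern used in the script: it matches any run of slashes, and safe_file_name is a hypothetical helper name.

```python
import re

# One or more consecutive slashes, so "/" and "//" are each replaced once.
SLASHES = re.compile(r"/+")


def safe_file_name(title, replacement="a"):
    """Return the title with every run of slashes replaced."""
    return SLASHES.sub(replacement, title)


print(safe_file_name("spring/autumn"))
print(safe_file_name("day//night"))
print(safe_file_name("x/y", "_"))
```

For titles with one or two slashes this gives the same result as the pattern (//)|/ with the replacement 'a'; it differs only for three or more slashes in a row, which collapse to a single replacement here.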

def get_content(url):
    content = ''  # returned unchanged if the request does not succeed
    res = requests.get('https://www.sanwen.net' + url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        # the article body is the <p> tags inside div.content
        contents = soup.find('div', class_='content').find_all('p')
        for i in contents:
            content += i.text + '\n'
    return content

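Last, write everything to a file and we're done.

# uses file_name, title, author and url from the loop in soup()
f = open(file_name + '.txt', 'w')
print('running w txt' + file_name + '.txt')
f.write(title.text + '\n')
f.write(author + '\n')
content = get_content(url)
f.write(content)
f.close()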
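Since the titles and text are Chinese, it may be safer to open the file with an explicit encoding and let a with-block close it. The following is a sketch under that assumption; save_article is a hypothetical helper, and the values written are made up.

```python
import io
import os
import tempfile


def save_article(directory, file_name, title, author, content):
    """Write title, author and body to <directory>/<file_name>.txt as UTF-8."""
    path = os.path.join(directory, file_name + ".txt")
    # io.open takes an explicit encoding; the with-block closes the file
    # even if one of the writes raises.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(title + u"\n")
        f.write(author + u"\n")
        f.write(content)
    return path


# Usage with made-up values, written to a temporary directory.
out_dir = tempfile.mkdtemp()
path = save_article(out_dir, u"sample", u"标题", u"作者", u"第一段\n")
with io.open(path, encoding="utf-8") as f:
    saved = f.read()
print(path)
```

Returning the path makes it easy to log which files were written, and to notice when two articles with the same title end up overwriting each other.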

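These three functions fetch the essays from Sanwen.net, but there is a problem: for reasons I don't understand, some essays go missing. I only end up with a little over 400 articles, far fewer than the site actually has, even though the pages really were fetched one by one. I hope someone more experienced can take a look.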
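One thing worth checking is how the page counter advances. In get_html as originally published the counter was advanced with i+=i, which doubles it instead of adding one. The sketch below compares the two increments; visited_pages is a hypothetical helper written only for this comparison.

```python
def visited_pages(step, limit=10):
    """Return the page numbers a `while i < limit` loop visits, starting at 1."""
    pages = []
    i = 1
    while i < limit:
        pages.append(i)
        i = step(i)
    return pages


doubling = visited_pages(lambda i: i + i)   # what `i += i` does
plus_one = visited_pages(lambda i: i + 1)   # what `i += 1` does

print(doubling)
print(plus_one)
```

With doubling, only pages 1, 2, 4 and 8 of each category are requested, which would explain a large share of the missing articles. It may not be the only cause: unhandled non-200 responses and articles with identical titles overwriting each other's files could also contribute.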


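Summary

That is everything for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a comment, and thank you for your support.

Author's WeChat: CSJH2209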
