Python Crawler Packages: BeautifulSoup Examples (Part 3)

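Build a crawler example step by step that scrapes the jokes posted on Qiushibaike (糗事百科). The listings are written for Python 2: they use urllib2 and the print statement.

To begin with, do the parsing without the BeautifulSoup package.

Step 1: Request the URL and fetch the page source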

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()
    print content.decode('utf-8')
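For readers on Python 3, where urllib2 no longer exists, the same request can be sketched with urllib.request. This is only a minimal sketch under stated assumptions: the helper names build_request and fetch are made up for illustration, and the timeout value is an assumption rather than something taken from the listing above.

```python
import urllib.error
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers={'User-Agent': USER_AGENT})


def fetch(url, timeout=10):
    """Fetch url and return the decoded page source, or None on failure.

    The 10-second timeout is an assumed default, not part of the article.
    """
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print(e)
        return None
```

Returning None instead of calling exit() is a design choice of this sketch: it lets the caller decide whether a failed page should stop the whole run.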

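Step 2: Extract the information with a regular expression

First look through the page source to find where the content you need sits and how it can be identified.

Then write a regular expression that matches and captures it.

Note that . in a regular expression does not match \n by default, so the match mode has to be set accordingly (the code below passes re.S).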

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data
    # Mind the newlines: re.S lets . match newline characters as well
    # The pattern assumes each joke sits in <div class="content"><span>...</span></div>
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)
    for item in items:
        print item
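The effect of re.S is easy to check on a small string. The sketch below is written for Python 3 and uses a made-up HTML fragment; it assumes each joke is wrapped in a div with class "content" that holds a span, which is the structure the BeautifulSoup listing later in this article relies on.

```python
import re

# A made-up fragment that mimics one joke block spread over several lines.
html = '<div class="content">\n<span>\nline one<br/>line two\n</span>\n</div>'

pattern = r'<div class="content">.*?<span>(.*?)</span>.*?</div>'

without_flag = re.findall(pattern, html)       # . stops at every newline
with_flag = re.findall(pattern, html, re.S)    # . also matches newlines

print(without_flag)
print(with_flag)
```

Without the flag the pattern finds nothing, because the very first .*? cannot cross the newline between the div and the span. With re.S the whole span body is captured, newlines included.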

Step 3: Clean up the data and save it to files

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:41:32

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data
    # Mind the newlines: re.S lets . match newline characters as well
    # The pattern assumes each joke sits in <div class="content"><span>...</span></div>
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)

    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    count = 1
    for item in items:
        # Tidy the data: drop \n, then turn each <br/> into \n
        item = item.replace('\n', '').replace('<br/>', '\n')
        filepath = path + '/' + str(count) + '.txt'
        with open(filepath, 'w') as f:
            f.write(item)
        count += 1
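The clean-up and save step can be pulled into two small functions, which makes it easy to try out without touching the network. This is a Python 3 sketch: the function names clean_item and save_items are hypothetical, the sample strings are made up, and it assumes the line-break tag appears in the page source exactly as <br/>.

```python
import os
import tempfile


def clean_item(item):
    """Drop raw newlines, then turn each <br/> tag into a newline."""
    return item.replace('\n', '').replace('<br/>', '\n')


def save_items(items, path):
    """Write each cleaned item to path/1.txt, path/2.txt, ... and return the paths."""
    os.makedirs(path, exist_ok=True)
    saved = []
    for count, item in enumerate(items, start=1):
        filepath = os.path.join(path, str(count) + '.txt')
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(clean_item(item))
        saved.append(filepath)
    return saved


# Usage with made-up data, written to a temporary directory.
demo_dir = os.path.join(tempfile.mkdtemp(), 'qiubai')
paths = save_items(['\nfirst<br/>joke\n', '\nsecond joke\n'], demo_dir)
```

The order of the two replace calls matters: the raw newlines are layout noise from the HTML and go first, so that the only newlines left in the saved text are the ones that stood for real line breaks.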

Step 4: Scrape the content of multiple pages

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 20:17:13

import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    # The pattern assumes each joke sits in <div class="content"><span>...</span></div>
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)

    count = 1
    for cnt in range(1, 35):
        print 'Round ' + str(cnt)
        # Request the URL of page cnt and fetch the page source
        url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
        try:
            request = urllib2.Request(url=url, headers=headers)
            response = urllib2.urlopen(request)
            content = response.read()
        except urllib2.HTTPError as e:
            print e
            exit()
        except urllib2.URLError as e:
            print e
            exit()
        # print content

        # Extract the data
        # Mind the newlines: re.S lets . match newline characters as well
        items = re.findall(regex, content)

        # Save the information
        for item in items:
            # print item
            # Tidy the data: drop \n, then turn each <br/> into \n
            item = item.replace('\n', '').replace('<br/>', '\n')
            filepath = path + '/' + str(count) + '.txt'
            with open(filepath, 'w') as f:
                f.write(item)
            count += 1

    print 'Done'
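The page loop can be separated from the fetching and saving logic, as in the Python 3 sketch below. Everything named here is an assumption for illustration: page_urls and crawl are hypothetical helpers, fetch_page and handle_page are callables supplied by the caller, and the one-second pause between requests is an added courtesy that the listing above does not have. Unlike the listing, which exits on the first error, this sketch skips a page that fails.

```python
import time

BASE_URL = 'http://www.qiushibaike.com/textnew/page/{page}/?s=4941357'


def page_urls(first=1, last=34):
    """Yield the listing URL for every page from first to last, inclusive."""
    for page in range(first, last + 1):
        yield BASE_URL.format(page=page)


def crawl(fetch_page, handle_page, delay=1.0, first=1, last=34):
    """Call fetch_page on each URL and pass non-empty results to handle_page.

    fetch_page and handle_page are caller-supplied callables (hypothetical
    names); delay is an assumed pause in seconds between requests.
    Returns the number of pages that were handled.
    """
    handled = 0
    for url in page_urls(first, last):
        content = fetch_page(url)
        if content is None:
            continue  # skip a failed page instead of aborting the whole run
        handle_page(content)
        handled += 1
        time.sleep(delay)
    return handled
```

The default range of 1 to 34 mirrors range(1, 35) in the listing, which visits pages 1 through 34.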

Parsing the page source with BeautifulSoup

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 16:16:08
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:34:02

import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(request)
    # print response.read()

    # Parse the response with the lxml parser
    soup_packetpage = BeautifulSoup(response, 'lxml')
    # Every joke sits in a <div class="content"> element
    items = soup_packetpage.find_all("div", class_="content")

    for item in items:
        try:
            # Text of the <span> inside the div
            content = item.span.string
        except AttributeError as e:
            print e
            exit()
        if content:
            print content + "\n"
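One detail of item.span.string is worth checking on a small example: .string returns None when a tag has more than one child node, so a joke whose span contains a line-break tag fails the if content test and is never printed. The Python 3 sketch below uses made-up markup and the built-in html.parser (rather than lxml) so that it runs without extra parser packages; get_text() is shown as one possible alternative, not as the article's method.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics two joke blocks.
html = '''
<div class="content"><span>a one-line joke</span></div>
<div class="content"><span>first line<br/>second line</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')  # html.parser ships with Python
for item in soup.find_all('div', class_='content'):
    # .string is None when the tag has more than one child node
    print(repr(item.span.string))
    # get_text() joins all text found inside the tag with the given separator
    print(repr(item.span.get_text('\n')))
```

For the second block, .string is None while get_text('\n') still returns both lines, which is closer to what the regular-expression version saves to disk.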

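The following code uses BeautifulSoup to scrape book titles and their prices.

Comparing it with the previous listing shows how bs4 reads tags and how it reads the content inside them.

(I have not studied this part properly yet myself, so for now I can only write it by imitating existing examples.)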

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-22 20:37:38
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-22 21:27:30

import urllib2
import urllib
import re
from bs4 import BeautifulSoup

url = "https://www.packtpub.com/all"
try:
    html = urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e
    exit()

soup_packtpage = BeautifulSoup(html, 'lxml')
# Each book title sits in a <div class="book-block-title"> element
all_book_title = soup_packtpage.find_all("div", class_="book-block-title")
# A price looks like whitespace, "$", one whitespace character, then digits.digits
price_regexp = re.compile(u"\s+\$\s\d+\.\d+")

for book_title in all_book_title:
    try:
        print "Book's name is " + book_title.string.strip()
    except AttributeError as e:
        print e
        exit()
    # First text node after the title that matches the price pattern
    book_price = book_title.find_next(text=price_regexp)
    try:
        print "Book's price is " + book_price.strip()
    except AttributeError as e:
        print e
        exit()
    print ""
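How find_next pairs a title with its price can be tried offline. The Python 3 sketch below runs on made-up markup: the book title, the price, and the class name book-block-price are all hypothetical, and html.parser stands in for lxml. It passes string=, the newer name of the text= argument used in the listing above (available since Beautiful Soup 4.4.0).

```python
import re
from bs4 import BeautifulSoup

# Made-up markup; the class name of the price block is hypothetical.
html = '''
<div class="book-block-title">
    Learning Python
</div>
<div class="book-block-price">
    $ 39.99
</div>
'''

price_regexp = re.compile(r'\s+\$\s\d+\.\d+')
soup = BeautifulSoup(html, 'html.parser')

for book_title in soup.find_all('div', class_='book-block-title'):
    name = book_title.string.strip()
    # find_next walks forward through the document and returns the first
    # text node that the regular expression matches
    price = book_title.find_next(string=price_regexp)
    print(name, price.strip() if price else 'price not found')
```

Because find_next searches everything that follows the title in document order, it relies on the price appearing after its own title and before the next title. If a book had no price, the match would be taken from a later book.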

That is all for this article. I hope it helps with your studies, and I hope you will keep supporting 脚本之家.
