Scraping Qiushibaike Images with Python, BeautifulSoup, and Multiprocessing

Libraries used:

import requests
import os
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool

Define the directory where the images are saved:

    path = r'E:\爬虫\0805\\'
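The download code writes files into this directory with `open`, which fails if the directory does not exist yet. A minimal sketch of one way to guard against that, using only the standard library; the helper names `ensure_dir` and `image_path` are hypothetical and not part of the original script:

```python
import os


def ensure_dir(directory):
    """Create the directory (and any missing parents), then return its path."""
    # exist_ok=True makes the call safe to repeat on every run.
    os.makedirs(directory, exist_ok=True)
    return directory


def image_path(directory, filename):
    """Join directory and file name without relying on a trailing separator."""
    return os.path.join(directory, filename)
```

Calling `ensure_dir(path)` once before the crawl starts would be enough. Using `os.path.join` also removes the need for the trailing backslashes in the raw string.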

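Set the request headers so that the request looks like it comes from a browser.

To find the User-Agent value in your own browser, press F12 to open the developer tools and look at the request headers of any request: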

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

The main function:

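def get_images(url):
    # The src attributes on the page are protocol-relative, so prepend the scheme
    data = 'https:'
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    # Select every image inside a div with class "thumb"
    url_infos = soup.select('div.thumb > a > img')
    # print(url_infos)
    for url_info in url_infos:
        try:
            urls = data + url_info.get('src')
            # Use the last path segment of the URL as the file name
            if os.path.exists(path + urls.split('/')[-1]):
                print('Image already downloaded')
            else:
                image = requests.get(urls, headers=headers)
                with open(path + urls.split('/')[-1], 'wb') as fp:
                    fp.write(image.content)
                print('Downloading: ' + urls)
                # Pause briefly between downloads
                time.sleep(0.5)
        except Exception as e:
            print(e)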
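The function above builds the image URL by prepending `https:` to the `src` attribute and takes the file name from the last segment after splitting on `/`. As a sketch of a slightly more defensive variant, the standard library can do both steps. The helper names and the `pic.example.com` URLs are made up for illustration; the real `src` values on the site may look different:

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve src against the page URL.

    Handles protocol-relative ("//host/a.jpg"), relative, and absolute values.
    """
    return urljoin(page_url, src)


def filename_from_url(image_url):
    """Return the last path segment, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


if __name__ == '__main__':
    page = 'https://www.qiushibaike.com/imgrank/page/1/'
    # Hypothetical src value, for illustration only
    full = absolute_image_url(page, '//pic.example.com/a/b/app1.jpg?v=2')
    print(full)
    print(filename_from_url(full))
```

Resolving against the page URL means the code keeps working if the site switches between protocol-relative and absolute `src` values, and parsing the path avoids query strings ending up in file names.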

Start the crawler:

if __name__ == '__main__':
    # List of page URLs (pages 1 to 13)
    urls = ['https://www.qiushibaike.com/imgrank/page/{}/'.format(i) for i in range(1, 14)]
    # Crawl the pages with a pool of worker processes
    pool = Pool()
    pool.map(get_images, urls)
    print('Crawl finished')
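The startup code creates a `Pool` and never closes it. A minimal sketch of the same idea with the pool wrapped in a `with` block, so the worker processes are cleaned up once `map` returns. The function names `build_page_urls` and `crawl_pages` and the default of 4 processes are assumptions for illustration, not part of the original script:

```python
from multiprocessing import Pool


def build_page_urls(first_page, last_page):
    """Return the list page URLs for first_page..last_page inclusive."""
    template = 'https://www.qiushibaike.com/imgrank/page/{}/'
    return [template.format(i) for i in range(first_page, last_page + 1)]


def crawl_pages(worker, urls, processes=4):
    """Run worker(url) for every URL across a pool of processes.

    worker must be a module-level function so it can be pickled and
    sent to the child processes.
    """
    # pool.map blocks until every URL has been processed, so leaving the
    # with block afterwards only shuts down idle workers.
    with Pool(processes=processes) as pool:
        return pool.map(worker, urls)
```

In the script this would be called as `crawl_pages(get_images, build_page_urls(1, 13))` inside the `if __name__ == '__main__':` guard. The guard matters on Windows, where child processes re-import the main module.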

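While the crawl is running, the console prints each image URL as it is downloaded.
When it finishes, open the folder to check the results.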

Complete code:

import requests
import os
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool
"""
Common scraping libraries: requests, BeautifulSoup, pyquery, lxml
Scraping framework: scrapy
Three main parsing approaches: re, css, xpath
"""
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
path = r'E:\爬虫\0805\\'
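def get_images(url):
    # The src attributes on the page are protocol-relative, so prepend the scheme
    data = 'https:'
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    # Select every image inside a div with class "thumb"
    url_infos = soup.select('div.thumb > a > img')
    # print(url_infos)
    for url_info in url_infos:
        try:
            urls = data + url_info.get('src')
            # Use the last path segment of the URL as the file name
            if os.path.exists(path + urls.split('/')[-1]):
                print('Image already downloaded')
            else:
                image = requests.get(urls, headers=headers)
                with open(path + urls.split('/')[-1], 'wb') as fp:
                    fp.write(image.content)
                print('Downloading: ' + urls)
                # Pause briefly between downloads
                time.sleep(0.5)
        except Exception as e:
            print(e)


if __name__ == '__main__':
    # List of page URLs (pages 1 to 13)
    urls = ['https://www.qiushibaike.com/imgrank/page/{}/'.format(i) for i in range(1, 14)]
    # Crawl the pages with a pool of worker processes
    pool = Pool()
    pool.map(get_images, urls)
    print('Crawl finished')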
