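This post shows how to write a high-performance crawler that fetches and parses page data quickly. If a crawler has to request many pages and sends the requests one after another, most of the run time is spent waiting for responses. Instead, we can issue the requests concurrently from a pool of workers, so the same data arrives in far less time. The post walks through scraping the Douban Movie Top 250 list as a worked example.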

1. Page analysis

We start by inspecting the HTML of the Douban Movie Top 250 page. For this example we want two fields for each film: the title and the rating (star).

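All of the film entries sit inside the <ol class="grid_view"> tag, so every XPath expression can be anchored on that element. Each title appears in two places: in the alt text of the poster image, and in span elements with class="title". A single entry can contain more than one span title, so to avoid ambiguity we take the title from the poster image's alt attribute: //ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt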

The rating is even simpler. Each score sits in a span with class="rating_num" inside a <div class="star"> element, so the XPath expression is: //ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()
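As a rough illustration of both queries, the sketch below runs them against a tiny hand-written fragment that imitates the page structure. The fragment and the film names are made up for this example. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the attribute and text are read with .get("alt") and .text instead of /@alt and /text(). With lxml, the full expressions from the text can be passed to tree.xpath() directly.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the real page (assumption: simplified structure).
SAMPLE = """<html><body>
<ol class="grid_view">
  <li>
    <div class="pic"><a href="#"><img alt="Film A" src="a.jpg"/></a></div>
    <div class="star"><span class="rating5-t"></span><span class="rating_num">9.7</span></div>
  </li>
  <li>
    <div class="pic"><a href="#"><img alt="Film B" src="b.jpg"/></a></div>
    <div class="star"><span class="rating5-t"></span><span class="rating_num">9.6</span></div>
  </li>
</ol>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Title: alt attribute of the poster image inside div.pic
title_path = './/ol[@class="grid_view"]//div[@class="pic"]//a/img'
titles = [img.get("alt") for img in root.findall(title_path)]

# Rating: text of span.rating_num inside div.star
star_path = './/ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]'
stars = [span.text for span in root.findall(star_path)]

for title, star in zip(titles, stars):
    print(title, star)
```

Anchoring both queries on the same ol element keeps the two result lists aligned entry by entry.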

This XPath returns every rating on one page. It only covers the first page, though. How do we get the data on the second page?

2. Pagination

Pagination on Douban Movie is straightforward. The URLs of the first, second, and third pages are:

https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

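The parameters travel in the query string after the question mark, so each page is fetched with a plain GET request. There is no POST body and no obfuscated parameter to reconstruct.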

Comparing the URLs, start changes from page to page while filter stays empty. The only thing left to work out is the pattern behind start.
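To make the comparison concrete, here is a small sketch that splits the three URLs above into their query parameters using only the standard library. The helper name query_params is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

PAGE_URLS = [
    "https://movie.douban.com/top250?start=0&filter=",
    "https://movie.douban.com/top250?start=25&filter=",
    "https://movie.douban.com/top250?start=50&filter=",
]


def query_params(url):
    """Return the query string of `url` as a dict of value lists."""
    # keep_blank_values=True keeps the empty `filter=` parameter
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


params = [query_params(u) for u in PAGE_URLS]
starts = [int(p["start"][0]) for p in params]   # changes on every page
filters = {p["filter"][0] for p in params}      # the same empty string each time

print(starts)
print(filters)
```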

start grows by 25 on each page because every page lists exactly 25 films. That is all we need to generate every page URL and start writing the crawler.
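The rule can be written down in a few lines. This is a minimal sketch that assumes the list has exactly 250 entries at 25 per page; the constant and function names are chosen for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25     # films per page
TOTAL_FILMS = 250  # assumption: the list always has 250 entries


def page_urls():
    """Build one URL per page: start=0, 25, ..., 225."""
    urls = []
    for start in range(0, TOTAL_FILMS, PAGE_SIZE):
        query = urlencode({"start": start, "filter": ""})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

When the request is sent with requests, the same dictionary can be passed through the params argument instead of building the URL by hand, which is what the full listing in this post does.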

3. Writing the crawler

The code imports a pool class from the multiprocessing package. Using it takes only two lines on top of the ordinary serial code:

pool=Pool(4)
pool.map(get_information,number_ls)

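In the first line, 4 is the number of workers in the pool. Because the full listing imports Pool from multiprocessing.dummy, these workers are threads rather than processes. Fetching pages is I/O-bound, so the threads spend most of their time waiting on the network and the pool size is not limited by the number of CPU cores. A larger pool usually finishes sooner up to a point, but it also sends more simultaneous requests to the site, which raises the chance of being throttled or blocked.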

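The second line calls map. Its first argument is the function that scrapes one page, and its second is the collection of arguments for that function. map calls the function once per element, spreads the calls across the workers, and returns once all of them have finished.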
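To see why a pool helps with work that mostly waits, the sketch below compares a serial loop with pool.map. slow_task is a made-up stand-in that sleeps instead of making a network request, and the measured times will vary from machine to machine.

```python
import time
from multiprocessing.dummy import Pool
from multiprocessing.pool import ThreadPool


def slow_task(n):
    """Stand-in for one page request: wait, then return a value."""
    time.sleep(0.1)  # simulated network wait
    return n * 2


items = list(range(8))

# Serial: each call waits for the previous one to finish.
t0 = time.perf_counter()
serial_results = [slow_task(n) for n in items]
serial_seconds = time.perf_counter() - t0

# Pooled: up to four calls wait at the same time.
t0 = time.perf_counter()
with Pool(4) as pool:
    is_thread_pool = isinstance(pool, ThreadPool)  # multiprocessing.dummy is thread-backed
    pooled_results = pool.map(slow_task, items)
pooled_seconds = time.perf_counter() - t0

print("thread pool:", is_thread_pool)
print("same results:", serial_results == pooled_results)
print(f"serial {serial_seconds:.2f}s, pooled {pooled_seconds:.2f}s")
```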

Remark:
The argument number_ls passed to map must be an iterable such as a list. Inside get_information, the parameter holds a single element of that list on each call, not the whole list.
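A quick way to check this behaviour is to have the worker report what it received. In this sketch, describe is a hypothetical function used only for illustration.

```python
from multiprocessing.dummy import Pool


def describe(start):
    """Report the value and type of the single argument this call received."""
    return (start, type(start).__name__)


number_ls = list(range(0, 100, 25))  # [0, 25, 50, 75]

with Pool(2) as pool:
    calls = pool.map(describe, number_ls)

# One tuple per element, in the same order as number_ls
print(calls)
```

Each call sees one integer, and map returns the results in input order even though the calls may finish in a different order.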

With that in mind, the complete code is as follows:

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

# Cookie copied from a browser session. It is defined here but not sent with the requests below.
cookie = 'bid=N3Zqe_FFUKc; douban-fav-remind=1; viewed="27093751"; _vwo_uuid_v2=D401F17C96234AE149C4E04B78C3C8066|6fcc3cefe576bff2b89cdf28c4c5f597; __gads=ID=21cdec44606b00df-2250ba4d7ac4009b:T=1604034713:RT=1604034713:S=ALNI_Mb6iYJKYfbUjLxlisTQX5HCODTGKg; gr_user_id=fb6ac40c-94c3-400e-b170-47e126a9b78a; _gid=GA1.2.1520341169.1612004212; _ga=GA1.2.645228582.1602221486; ll="108288"; UM_distinctid=17752f076e4530-0b6eef25ebabba-f7b1332-1fa400-17752f076e57f0; Hm_lvt_19fc7b106453f97b6a84d64302f21a04=1612004228; Hm_lpvt_19fc7b106453f97b6a84d64302f21a04=1612004253; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1612004299%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.645228582.1602221486.1611225800.1612004300.9; __utmb=30149280.0.10.1612004300; __utmc=30149280; __utmz=30149280.1612004300.9.9.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.645228582.1602221486.1612004300.1612004300.1; __utmb=223695111.0.10.1612004300; __utmc=223695111; __utmz=223695111.1612004300.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _pk_id.100001.4cf6=9a1bb1df4597b334.1612004299.1.1612005471.1612004299.'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
}
url = 'https://movie.douban.com/top250'

# One start offset per page. start=250 is past the last page and yields no entries,
# so range(0, 250, 25) would be enough.
number_ls = []
for i in range(0, 251, 25):
    number_ls.append(i)
print(number_ls)


def get_information(start):
    """Fetch one Top 250 page and print the title and rating of each film on it."""
    param = {'start': start, 'filter': ''}
    page_content = requests.get(url=url, headers=headers, params=param).text
    # Save a copy of the page, one file per page so that threads do not overwrite each other.
    with open('douban_%d.html' % start, 'w', encoding='utf-8') as fp:
        fp.write(page_content)
    tree = etree.HTML(page_content)
    # Title: alt text of the poster image. Rating: text of span.rating_num.
    vedio_title = tree.xpath('//ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt')
    star = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()')
    for title, rating in zip(vedio_title, star):
        print("the movie is ", title)
        print("the star is ", rating)
        print()


# Four worker threads fetch the pages concurrently.
pool = Pool(4)
pool.map(get_information, number_ls)

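4. Output

The run prints all 250 films. Because the pages are fetched concurrently, the entries appear in the order the workers finish rather than in rank order: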

[0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
the movie is  搏击俱乐部
the star is  9.0

the movie is  教父2
the star is  9.2

the movie is  狮子王
the star is  9.0

the movie is  指环王2:双塔奇兵
the star is  9.1

the movie is  死亡诗社
the star is  9.1

the movie is  钢琴家
the star is  9.2

the movie is  黑客帝国
the star is  9.0

the movie is  指环王1:魔戒再现
the star is  9.0

the movie is  饮食男女
the star is  9.1

the movie is  窃听风暴
the star is  9.1

the movie is  美丽心灵
the star is  9.0

the movie is  让子弹飞
the star is  8.8

the movie is  绿皮书
the star is  8.9

the movie is  两杆大烟枪
the star is  9.1

the movie is  本杰明·巴顿奇事
the star is  8.9

the movie is  海蒂和爷爷
the star is  9.2

the movie is  飞越疯人院
the star is  9.1

the movie is  看不见的客人
the star is  8.8

the movie is  西西里的美丽传说
the star is  8.9

the movie is  拯救大兵瑞恩
the star is  9.0

the movie is  穿条纹睡衣的男孩
the star is  9.1

the movie is  小鞋子
the star is  9.2

the movie is  音乐之声
the star is  9.0

the movie is  情书
the star is  8.9

the movie is  海豚湾
the star is  9.3

the movie is  美国往事
the star is  9.2

the movie is  致命魔术
the star is  8.9

the movie is  沉默的羔羊
the star is  8.9

the movie is  低俗小说
the star is  8.9

the movie is  禁闭岛
the star is  8.8

the movie is  蝴蝶效应
the star is  8.8

the movie is  七宗罪
the star is  8.8

the movie is  心灵捕手
the star is  8.9

the movie is  布达佩斯大饭店
the star is  8.9

the movie is  春光乍泄
the star is  8.9

the movie is  摩登时代
the star is  9.3

the movie is  被嫌弃的松子的一生
the star is  8.9

the movie is  哈利·波特与死亡圣器(下)
the star is  8.9

the movie is  阿凡达
the star is  8.7

the movie is  喜剧之王
the star is  8.8

the movie is  致命ID
the star is  8.8

the movie is  剪刀手爱德华
the star is  8.7

the movie is  勇敢的心
the star is  8.9

the movie is  加勒比海盗
the star is  8.8

the movie is  杀人回忆
the star is  8.9

the movie is  狩猎
the star is  9.1

the movie is  请以你的名字呼唤我
the star is  8.9

the movie is  天使爱美丽
the star is  8.7

the movie is  断背山
the star is  8.8

the movie is  红辣椒
the star is  9.0

the movie is  触不可及
the star is  9.2

the movie is  蝙蝠侠:黑暗骑士
the star is  9.2

the movie is  末代皇帝
the star is  9.3

the movie is  活着
the star is  9.3

the movie is  寻梦环游记
the star is  9.1

the movie is  乱世佳人
the star is  9.3

the movie is  何以为家
the star is  9.1

the movie is  指环王3:王者无敌
the star is  9.2

the movie is  飞屋环游记
the star is  9.0

the movie is  摔跤吧!爸爸
the star is  9.0

the movie is  哈利·波特与魔法石
the star is  9.1

the movie is  素媛
the star is  9.3

the movie is  少年派的奇幻漂流
the star is  9.1

the movie is  十二怒汉
the star is  9.4

the movie is  哈尔的移动城堡
the star is  9.1

the movie is  鬼子来了
the star is  9.3

the movie is  天空之城
the star is  9.1

the movie is  大话西游之月光宝盒
the star is  9.0

the movie is  我不是药神
the star is  9.0

the movie is  闻香识女人
the star is  9.1

the movie is  罗马假日
the star is  9.0

the movie is  天堂电影院
the star is  9.2

the movie is  辩护人
the star is  9.2

the movie is  猫鼠游戏
the star is  9.0

the movie is  大闹天宫
the star is  9.4

the movie is  肖申克的救赎
the star is  9.7

the movie is  霸王别姬
the star is  9.6

the movie is  阿甘正传
the star is  9.5

the movie is  这个杀手不太冷
the star is  9.4

the movie is  泰坦尼克号
the star is  9.4

the movie is  美丽人生
the star is  9.5

the movie is  千与千寻
the star is  9.4

the movie is  辛德勒的名单
the star is  9.5

the movie is  盗梦空间
the star is  9.3

the movie is  忠犬八公的故事
the star is  9.4

the movie is  星际穿越
the star is  9.3

the movie is  海上钢琴师
the star is  9.3

the movie is  楚门的世界
the star is  9.3

the movie is  三傻大闹宝莱坞
the star is  9.2

the movie is  机器人总动员
the star is  9.3

the movie is  放牛班的春天
the star is  9.3

the movie is  大话西游之大圣娶亲
the star is  9.2

the movie is  疯狂动物城
the star is  9.2

the movie is  无间道
the star is  9.2

the movie is  熔炉
the star is  9.3

the movie is  教父
the star is  9.3

the movie is  当幸福来敲门
the star is  9.1

the movie is  龙猫
the star is  9.2

the movie is  怦然心动
the star is  9.1

the movie is  控方证人
the star is  9.6

the movie is  7号房的礼物
the star is  8.9

the movie is  幽灵公主
the star is  8.9

the movie is  小森林 夏秋篇
the star is  9.0

the movie is  阳光灿烂的日子
the star is  8.8

the movie is  第六感
the star is  8.9

the movie is  重庆森林
the star is  8.8

the movie is  入殓师
the star is  8.9

the movie is  唐伯虎点秋香
the star is  8.7

the movie is  小森林 冬春篇
the star is  9.0

the movie is  爱在黎明破晓前
the star is  8.8

the movie is  超脱
the star is  8.9

the movie is  消失的爱人
the star is  8.7

the movie is  一一
the star is  9.0

the movie is  菊次郎的夏天
the star is  8.8

the movie is  蝙蝠侠:黑暗骑士崛起
the star is  8.8

the movie is  侧耳倾听
the star is  8.9

the movie is  倩女幽魂
the star is  8.7

the movie is  功夫
the star is  8.6

the movie is  超能陆战队
the star is  8.7

the movie is  无人知晓
the star is  9.1

the movie is  人生果实
the star is  9.5

the movie is  萤火之森
the star is  8.9

the movie is  甜蜜蜜
the star is  8.8

the movie is  借东西的小人阿莉埃蒂
the star is  8.8

the movie is  玛丽和马克思
the star is  8.9

the movie is  爱在日落黄昏时
the star is  8.8

the movie is  驯龙高手
the star is  8.7

the movie is  完美的世界
the star is  9.1

the movie is  幸福终点站
the star is  8.8

the movie is  告白
the star is  8.7

the movie is  大鱼
the star is  8.8

the movie is  阳光姐妹淘
the star is  8.8

the movie is  射雕英雄传之东成西就
the star is  8.7

the movie is  哈利·波特与阿兹卡班的囚徒
the star is  8.8

the movie is  恐怖直播
the star is  8.8

the movie is  天书奇谭
the star is  9.2

the movie is  怪兽电力公司
the star is  8.7

the movie is  神偷奶爸
the star is  8.6

the movie is  玩具总动员3
the star is  8.8

the movie is  傲慢与偏见
the star is  8.6

the movie is  时空恋旅人
the star is  8.8

the movie is  哈利·波特与密室
the star is  8.7

the movie is  教父3
the star is  8.9

the movie is  釜山行
the star is  8.6

the movie is  血战钢锯岭
the star is  8.7

the movie is  哪吒闹海
the star is  9.1

the movie is  被解救的姜戈
the star is  8.7

the movie is  七武士
the star is  9.3

the movie is  喜宴
the star is  8.9

the movie is  电锯惊魂
the star is  8.7

the movie is  爆裂鼓手
the star is  8.7

the movie is  贫民窟的百万富翁
the star is  8.6

the movie is  萤火虫之墓
the star is  8.7

the movie is  东邪西毒
the star is  8.6

the movie is  海街日记
the star is  8.8

the movie is  黑天鹅
the star is  8.6

the movie is  惊魂记
the star is  9.0

the movie is  无敌破坏王
the star is  8.7

the movie is  你看起来好像很好吃
the star is  8.9

the movie is  冰川时代
the star is  8.6

the movie is  雨人
the star is  8.7

the movie is  小偷家族
the star is  8.7

the movie is  绿里奇迹
the star is  8.9

the movie is  恋恋笔记本
the star is  8.5

the movie is  爱在午夜降临前
the star is  8.8

the movie is  疯狂的石头
the star is  8.5

the movie is  哈利·波特与火焰杯
the star is  8.6

the movie is  寄生虫
the star is  8.7

the movie is  恐怖游轮
the star is  8.5

the movie is  奇迹男孩
the star is  8.6

the movie is  雨中曲
the star is  9.0

the movie is  魔女宅急便
the star is  8.7

the movie is  二十二
the star is  8.7

the movie is  海边的曼彻斯特
the star is  8.6

the movie is  房间
the star is  8.8

the movie is  风之谷
the star is  8.9

the movie is  一个叫欧维的男人决定去死
the star is  8.9

the movie is  我是山姆
the star is  8.9

the movie is  头号玩家
the star is  8.7

the movie is  英雄本色
the star is  8.7

the movie is  上帝之城
the star is  9.0

the movie is  谍影重重3
the star is  8.8

the movie is  疯狂原始人
the star is  8.7

the movie is  未麻的部屋
the star is  9.0

the movie is  岁月神偷
the star is  8.7

the movie is  卢旺达饭店
the star is  8.9

the movie is  纵横四海
the star is  8.8

the movie is  三块广告牌
the star is  8.7

the movie is  达拉斯买家俱乐部
the star is  8.8

the movie is  花样年华
the star is  8.7

the movie is  心迷宫
the star is  8.7

the movie is  记忆碎片
the star is  8.6

the movie is  模仿游戏
the star is  8.7

the movie is  黑客帝国3:矩阵革命
the star is  8.8

the movie is  新世界
the star is  8.8

the movie is  头脑特工队
the star is  8.7

the movie is  荒蛮故事
the star is  8.8

the movie is  你的名字。
the star is  8.4

the movie is  真爱至上
the star is  8.6

the movie is  忠犬八公物语
the star is  9.2

the movie is  谍影重重2
the star is  8.7

the movie is  阿飞正传
the star is  8.5

the movie is  地球上的星星
the star is  8.9

the movie is  彗星来的那一夜
the star is  8.5

the movie is  完美陌生人
the star is  8.5

the movie is  战争之王
the star is  8.7

the movie is  谍影重重
the star is  8.6

the movie is  香水
the star is  8.5

the movie is  东京教父
the star is  9.0

the movie is  东京物语
the star is  9.2

the movie is  朗读者
the star is  8.6

the movie is  千钧一发
the star is  8.8

the movie is  再次出发之纽约遇见你
the star is  8.6

the movie is  驴得水
the star is  8.3

the movie is  猜火车
the star is  8.5

the movie is  黑客帝国2:重装上阵
the star is  8.6

the movie is  无间道2
the star is  8.6

the movie is  我爱你
the star is  9.1

the movie is  浪潮
the star is  8.7

the movie is  崖上的波妞
the star is  8.5

the movie is  聚焦
the star is  8.8

the movie is  小萝莉的猴神大叔
the star is  8.4

the movie is  追随
the star is  8.9

the movie is  黑鹰坠落
the star is  8.7

the movie is  网络谜踪
the star is  8.6

the movie is  虎口脱险
the star is  8.9

the movie is  人工智能
the star is  8.7

the movie is  九品芝麻官
the star is  8.6

the movie is  2001太空漫游
the star is  8.8

the movie is  可可西里
the star is  8.8

the movie is  罗生门
the star is  8.8

the movie is  色,戒
the star is  8.5

the movie is  终结者2:审判日
the star is  8.7

the movie is  城市之光
the star is  9.3

the movie is  初恋这件小事
the star is  8.4

the movie is  魂断蓝桥
the star is  8.8

the movie is  牯岭街少年杀人事件
the star is  8.9

the movie is  遗愿清单
the star is  8.7

the movie is  大佛普拉斯
the star is  8.7

the movie is  新龙门客栈
the star is  8.6

the movie is  波西米亚狂想曲
the star is  8.7

the movie is  源代码
the star is  8.5

the movie is  青蛇
the star is  8.6

the movie is  海洋
the star is  9.1

the movie is  燃情岁月
the star is  8.8

the movie is  无耻混蛋
the star is  8.6

the movie is  疯狂的麦克斯4:狂暴之路
the star is  8.6

the movie is  血钻
the star is  8.7

the movie is  穿越时空的少女
the star is  8.6

the movie is  步履不停
the star is  8.8
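
The listing is not in rank order because each worker prints its page as soon as the response arrives. One possible way to get ordered output is for the worker to return its rows instead of printing them: pool.map returns results in the order of its inputs, so the main thread can flatten and print them afterwards. The sketch below shows the idea under that assumption. fake_page is a hypothetical stand-in for the request-and-parse step and produces made-up rows.

```python
from itertools import chain
from multiprocessing.dummy import Pool


def fake_page(start):
    """Hypothetical stand-in for fetching and parsing one page.

    Returns a list of (title, rating) tuples for the films on that page.
    """
    return [(f"film {start + i + 1}", "9.0") for i in range(3)]


starts = [0, 25, 50]

with Pool(4) as pool:
    pages = pool.map(fake_page, starts)  # results follow the order of `starts`

# Flatten the per-page lists into one list of rows, still in page order.
rows = list(chain.from_iterable(pages))

for title, rating in rows:
    print("the movie is ", title)
    print("the star is ", rating)
    print()
```

Printing from a single thread also keeps the lines of different entries from interleaving.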
