Python3 Crawler Project Collection: Douban Movie Top 250

Contents

Preface
Crawler overview
Parsing
Code example
Data storage

GitHub repository: https://github.com/pasca520/Python3SpiderSet

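Preface

This series collects small crawler exercises from my day-to-day practice; feel free to use them for learning.

The projects are meant for learning, so they try out as many modules as possible rather than aiming for the optimal solution.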

Crawler overview

Item	Python library
Fetching module	requests
Parsing module	BeautifulSoup
Storage type	list (easy to insert into a database)

Parsing

I summarized the BeautifulSoup parameters in a separate article: https://blog.csdn.net/qinglianchen0851/article/details/102860741
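
As a quick illustration of the two BeautifulSoup calls this project relies on, find and find_all with the name and attrs parameters, here is a minimal sketch. The HTML snippet, the movie names and the helper extract_movies are made up for the example; only the tag and class names (ol.grid_view, em, span.title, span.rating_num, span.inq) are taken from the crawler code in this article. It uses the built-in html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for one results page
SAMPLE_HTML = """
<ol class="grid_view">
  <li><em>1</em><span class="title">Movie A</span>
      <span class="rating_num">9.7</span><span class="inq">A quote</span></li>
  <li><em>2</em><span class="title">Movie B</span>
      <span class="rating_num">9.6</span></li>
</ol>
"""


def extract_movies(html):
    """Return one dictionary per <li> inside the ol.grid_view list."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching tag, or None if nothing matches
    grid = soup.find(name="ol", attrs={"class": "grid_view"})
    movies = []
    # find_all() returns every matching descendant tag
    for item in grid.find_all("li"):
        inq = item.find(name="span", attrs={"class": "inq"})
        movies.append({
            "rank": item.find(name="em").get_text(),
            "name": item.find(name="span", attrs={"class": "title"}).get_text(),
            "rating_num": item.find(name="span", attrs={"class": "rating_num"}).get_text(),
            # The quote is optional, so guard against a missing tag
            "inq": inq.get_text() if inq is not None else None,
        })
    return movies


sample_movies = extract_movies(SAMPLE_HTML)
print(sample_movies)
```

The point to notice is that find returns None when a tag is missing, which is why the optional quote needs an explicit check before get_text is called.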

Code example

# -*- coding: utf-8 -*-
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
from bs4 import BeautifulSoup


# Crawler body: download one page and return its HTML, or None on failure
def get_page(url):
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Referer': 'https://movie.douban.com/',
    }
    try:
        # Without a timeout, requests waits indefinitely and ReadTimeout is never raised
        response = requests.get(url=url, headers=headers, timeout=10).text
        return response
    except ReadTimeout:  # the request timed out
        print('Timeout')
    except ConnectionError:  # the network connection failed
        print('Connect error')
    except RequestException:  # base class of all requests errors
        print('Error')
    return None


# Parse one page and return a list with one dictionary per movie
def parse_page(html):
    movies = []
    soup = BeautifulSoup(html, 'lxml')
    grid = soup.find(name="ol", attrs={"class": "grid_view"})
    movie_list = grid.find_all("li")
    for movie in movie_list:
        rank = movie.find(name="em").getText()
        name = movie.find(name="span", attrs={"class": "title"}).getText()
        rating_num = movie.find(name="span", attrs={"class": "rating_num"}).getText()
        # Split the info paragraph with plain string methods (re is deliberately not used
        # in this exercise, which makes this step tedious): strip() trims both ends and the
        # replace() calls turn every separator into '\n', so that director, actors,
        # release time, country and movie types end up as separate list items.
        bd = movie.find(name="p").getText().strip().replace('   ', '\n').replace('...\n                            ', '...\n').replace(' / ', '\n').split('\n')
        # Some Douban entries list no actors; insert a placeholder so the indexes below stay aligned
        if len(bd) == 4:
            bd.insert(1, 'not scraped')
        inq = movie.find(name="span", attrs={"class": "inq"})
        # Handle movies that have no one-line quote
        if inq is None:
            inq = "N/A"
        else:
            inq = inq.getText()
        # Build a new dictionary for every movie so it can go straight into a database.
        # Reusing one shared dictionary would make every list entry refer to the same
        # object, whose values change on each iteration.
        douBanDict = {
            'rank': rank,
            'name': name,
            'director': bd[0],
            'actor': bd[1],
            'release_time': bd[2].strip(),  # some items carry extra spaces
            'country': bd[3],
            'movie_types': bd[4],
            'rating_num': rating_num,
            'inq': inq,
        }
        movies.append(douBanDict)
    return movies


if __name__ == '__main__':
    douBanList = []
    for start in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(start)
        html = get_page(url)
        if html is None:  # skip pages that failed to download
            continue
        douBanList.extend(parse_page(html))
    print(douBanList)
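
The inline comment in parse_page notes that the info paragraph was split with plain string methods because re was not used, and that this turned out to be tedious. For comparison only, here is a rough sketch of how the same split might look with re. This is a swapped-in technique, not the method used in this project. The function split_info and both sample strings are hypothetical, and the layout they assume (people on the first line separated by a run of whitespace, then release time / country / types on the last line) is an assumption about the page text, not something verified against the live site.

```python
import re


def split_info(text):
    """Split an info paragraph into its fields (layout is an assumption)."""
    # Keep only non-empty lines, trimmed of surrounding whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    people, details = lines[0], lines[-1]
    # Director and actors are assumed to be separated by two or more
    # whitespace characters; \s also matches non-breaking spaces
    parts = re.split(r"\s{2,}", people, maxsplit=1)
    director = parts[0]
    actor = parts[1] if len(parts) > 1 else None
    # The last line is assumed to hold exactly three '/'-separated fields
    release_time, country, movie_types = re.split(r"\s*/\s*", details, maxsplit=2)
    return {
        "director": director,
        "actor": actor,
        "release_time": release_time,
        "country": country,
        "movie_types": movie_types,
    }


# Hypothetical sample texts, one with actors and one without
SAMPLE_INFO = "Director: Jane Doe   Actors: John Roe /...\n        1994 / USA / Crime Drama"
SAMPLE_NO_ACTOR = "Director: Jane Doe\n        2001 / Japan / Animation"

info = split_info(SAMPLE_INFO)
info_no_actor = split_info(SAMPLE_NO_ACTOR)
print(info)
print(info_no_actor)
```

A missing actor list yields None here instead of a placeholder string, so no extra length check is needed. Entries whose last line does not contain exactly three fields would still raise an error and need separate handling.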

Data storage

The result is a plain list that contains one dictionary with the information for each movie.
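
Since the list-of-dictionaries shape is chosen to make database storage easy, here is a minimal sketch of what that could look like with the standard-library sqlite3 module. The article itself does not show a database step, so the table name movies, the column types, the helper save_movies and the sample records are all assumptions made for illustration. The dictionary keys match the ones used in the crawler code.

```python
import sqlite3


def save_movies(conn, movies):
    """Insert a list of movie dictionaries into a hypothetical 'movies' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "movie_rank INTEGER PRIMARY KEY, name TEXT, director TEXT, actor TEXT, "
        "release_time TEXT, country TEXT, movie_types TEXT, "
        "rating_num REAL, inq TEXT)"
    )
    # The scraped values are strings, so convert the numeric ones first
    rows = [
        {**m, "rank": int(m["rank"]), "rating_num": float(m["rating_num"])}
        for m in movies
    ]
    # Named placeholders are filled from the dictionary keys
    conn.executemany(
        "INSERT OR REPLACE INTO movies VALUES "
        "(:rank, :name, :director, :actor, :release_time, "
        ":country, :movie_types, :rating_num, :inq)",
        rows,
    )
    conn.commit()


# Made-up records in the same shape as the crawler output
sample_list = [
    {"rank": "1", "name": "Movie A", "director": "Director: Jane Doe",
     "actor": "Actors: John Roe", "release_time": "1994", "country": "USA",
     "movie_types": "Crime Drama", "rating_num": "9.7", "inq": "A quote"},
    {"rank": "2", "name": "Movie B", "director": "Director: Sam Poe",
     "actor": "not scraped", "release_time": "2001", "country": "Japan",
     "movie_types": "Animation", "rating_num": "9.6", "inq": "N/A"},
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
save_movies(conn, sample_list)
stored = conn.execute(
    "SELECT movie_rank, name, rating_num FROM movies ORDER BY movie_rank"
).fetchall()
print(stored)
```

Because every record is a dictionary with the same keys, executemany can take the list almost as it is, which is the convenience the list format is meant to provide.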

Done!
