This is the first reasonably complex Python program I have written, and it was fun; Python's practicality and development efficiency really are excellent, haha.
The full source code is at the end; feel free to try running it.
Crawl target: Douban Movie Top 250 (https://movie.douban.com/top250).

Contents

  • Results
  • Technical points
    • Simple object orientation
    • Downloading images by URL
    • Writing to Excel
    • Using the os module
    • The crawler module
  • Source code

Results

Technical points

Simple object orientation

The heart of it is a plain Movie class that holds one film's fields:

class Movie:
    def __init__(self, rank, name, other_name, directors, actors, year, country, kind, star, persons,
                 quote, img_url):
        self.rank = rank
        self.name = name
        self.other_name = other_name
        self.directors = directors
        self.actors = actors
        self.year = year
        self.country = country
        self.kind = kind
        self.star = star
        self.persons = persons
        self.quote = quote
        self.img_url = img_url

    def __str__(self) -> str:
        return "排名: %s\n电影名: %s\n别名: %s\n导演: %s\n演员: %s\n年份: %s\n国家: %s\n类别: %s\n评分: %s\n评价人数: %s\n评价: %s\n" \
               % (self.rank, self.name, self.other_name, self.directors, self.actors, self.year, self.country,
                  self.kind, self.star, self.persons, self.quote)

    def toAttrList(self) -> List:
        return [self.rank, self.name, self.other_name, self.directors, self.actors, self.year, self.country,
                self.kind, self.star, self.persons, self.quote]
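The Movie class is mostly boilerplate field assignment. As a side note, Python's dataclasses module can express the same structure more compactly; the sketch below is my own alternative, not the article's code, and it relies on the auto-generated __repr__ instead of the custom __str__:

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class Movie:
    rank: str
    name: str
    other_name: str
    directors: str
    actors: str
    year: str
    country: str
    kind: str
    star: str
    persons: str
    quote: str
    img_url: str

    def toAttrList(self) -> List:
        # same field order as the hand-written version; img_url is
        # excluded there too, so drop the last element
        return list(astuple(self))[:-1]
```

astuple returns the field values in declaration order, so the Excel row stays identical to the hand-written toAttrList.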

Downloading images by URL

def download_jpg(img_url, img_name=""):
    res = requests.get(img_url, headers=getHeader(), stream=True)
    if len(img_name) == 0:
        # no name given: derive one from the URL (drop the scheme, turn '/' into '.')
        filename = img_url.split(":", 1)[1]
        filename = filename.replace("/", ".")
    else:
        filename = img_name
    # "wb" opens the file for writing in binary mode
    with open(filename, "wb") as f:
        f.write(res.content)
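Note that res.content loads the whole image into memory before writing. For large files the body can be streamed in chunks instead; the sketch below uses only the standard library's urllib rather than requests, and the helper names (filename_from_url, download_jpg_streamed) are my own, not from the original code:

```python
from urllib.request import Request, urlopen


def filename_from_url(img_url: str) -> str:
    # same fallback naming as download_jpg: drop the scheme, turn '/' into '.'
    name = img_url.split(":", 1)[1]
    return name.replace("/", ".")


def download_jpg_streamed(img_url: str, img_name: str = "", chunk_size: int = 8192) -> None:
    filename = img_name if img_name else filename_from_url(img_url)
    # a minimal User-Agent header; sites often reject the urllib default
    req = Request(img_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as res, open(filename, "wb") as f:
        while True:
            chunk = res.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
```

Streaming keeps memory use bounded by chunk_size no matter how large the image is.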

Writing to Excel

# create the workbook
wb = openpyxl.Workbook()
# create one sheet in the workbook
ws = wb.create_sheet(index=0, title='豆瓣电影Top250')
# write the header row
ws.append(["排名", "电影名", "别名", "导演", "演员", "年份", "国家", "类别", "评分", "评价人数", "评价"])
for movie in movies:
    # append a list as one row of the sheet
    ws.append(movie.toAttrList())
# save the workbook
wb.save("豆瓣电影Top250统计.xlsx")
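To sanity-check the result, openpyxl can read the workbook straight back. This is a small round-trip sketch under a temporary directory; the save_rows helper and the sample row are mine, not part of the article:

```python
import os
import tempfile

import openpyxl


def save_rows(path, title, header, rows):
    wb = openpyxl.Workbook()
    # index=0 puts the new sheet in front of the default "Sheet"
    ws = wb.create_sheet(index=0, title=title)
    ws.append(header)
    for row in rows:
        ws.append(row)
    wb.save(path)


path = os.path.join(tempfile.mkdtemp(), "top250.xlsx")
save_rows(path, "豆瓣电影Top250", ["排名", "电影名"], [["1", "肖申克的救赎"]])
```

Loading the file again with openpyxl.load_workbook and indexing cells like ws["A1"] confirms each appended list landed as one row.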

Using the os module

The os module exposes operating-system APIs, much like working on the command line.

os.mkdir("豆瓣电影Top250统计")
os.chdir("豆瓣电影Top250统计")
# batch-download the images into a subdirectory
os.mkdir("豆瓣电影Top250图片保存")
os.chdir("豆瓣电影Top250图片保存")
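One caveat: os.mkdir raises FileExistsError if the directory already exists, so rerunning the script fails on the second run. os.makedirs with exist_ok=True is a forgiving alternative, and building paths with os.path.join avoids depending on chdir side effects; a small sketch (the temp directory is just a stand-in for the working directory):

```python
import os
import tempfile

base = tempfile.mkdtemp()  # stand-in for the script's working directory
stats_dir = os.path.join(base, "豆瓣电影Top250统计")
img_dir = os.path.join(stats_dir, "豆瓣电影Top250图片保存")

# creates intermediate directories too, and is safe to run repeatedly
os.makedirs(img_dir, exist_ok=True)
os.makedirs(img_dir, exist_ok=True)  # second call is a no-op, not an error
```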

The crawler module

The crawler mainly uses the requests and bs4 libraries to fetch and parse the pages and extract the information we want.
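The parsing pattern can be tried offline on a snippet shaped like one Douban list item. The HTML below is a simplified mock of the real page structure, and I use the built-in html.parser here (the full source uses lxml):

```python
from bs4 import BeautifulSoup

# hypothetical miniature of one <div class="item"> from the list page
html = """
<div class="item">
  <div class="pic"><em>1</em>
    <a href="https://movie.douban.com/subject/1292052/">
      <img alt="肖申克的救赎" src="https://example.com/p1.jpg"/>
    </a>
  </div>
  <div class="info">
    <div class="hd"><a><span class="title">肖申克的救赎</span></a></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="item")
pic_tag = item.find("div", class_="pic")
rank = pic_tag.em.text          # ranking number inside <em>
detail_url = pic_tag.a["href"]  # link to the movie's detail page
img_url = pic_tag.a.img["src"]  # poster image URL
title = item.find("span", class_="title").text
```

find narrows the search to a subtree, and attribute access like pic_tag.em jumps to the first matching descendant, which is exactly how craw() walks each item.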

Source code

import random
import re
import os
from typing import List

import requests
from bs4 import BeautifulSoup
import openpyxl

# pool of User-Agent strings to rotate through
user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]


# build a request header with a randomly chosen User-Agent
def getHeader():
    return {'User-Agent': random.choice(user_agent)}


class Movie:
    def __init__(self, rank, name, other_name, directors, actors, year, country, kind, star, persons,
                 quote, img_url):
        self.rank = rank
        self.name = name
        self.other_name = other_name
        self.directors = directors
        self.actors = actors
        self.year = year
        self.country = country
        self.kind = kind
        self.star = star
        self.persons = persons
        self.quote = quote
        self.img_url = img_url

    def __str__(self) -> str:
        return "排名: %s\n电影名: %s\n别名: %s\n导演: %s\n演员: %s\n年份: %s\n国家: %s\n类别: %s\n评分: %s\n评价人数: %s\n评价: %s\n" \
               % (self.rank, self.name, self.other_name, self.directors, self.actors, self.year, self.country,
                  self.kind, self.star, self.persons, self.quote)

    def toAttrList(self) -> List:
        return [self.rank, self.name, self.other_name, self.directors, self.actors, self.year, self.country,
                self.kind, self.star, self.persons, self.quote]


def download_jpg(img_url, img_name=""):
    res = requests.get(img_url, headers=getHeader(), stream=True)
    if len(img_name) == 0:
        # no name given: derive one from the URL (drop the scheme, turn '/' into '.')
        filename = img_url.split(":", 1)[1]
        filename = filename.replace("/", ".")
    else:
        filename = img_name
    # "wb" opens the file for writing in binary mode
    with open(filename, "wb") as f:
        f.write(res.content)


def craw() -> List:
    top250Movies = []
    link = "https://movie.douban.com/top250"
    for pageNumber in range(10):
        # crawl the 10 list pages one by one; each page holds 25 movies
        url = link + "?start=" + str(pageNumber * 25)
        res = requests.get(url, headers=getHeader(), timeout=1)
        if res.status_code != 200:
            continue
        soup = BeautifulSoup(res.text, 'lxml')
        itemList = soup.find_all("div", class_="item")
        for item in itemList:
            pic_tag = item.find("div", class_="pic")
            rank = pic_tag.em.text
            movie_detailed_page_url = pic_tag.a['href']  # detail-page link, not used further
            movie_img_url = pic_tag.a.img['src']
            info_tag = item.find("div", class_="info")
            hd_tag = info_tag.find("div", class_="hd")
            bd_tag = info_tag.find("div", class_="bd")
            # main title plus alternative titles, joined then split on "/"
            titleList = hd_tag.a.find_all("span", class_=["title", "other"])
            titles = "".join(span.text for span in titleList).split("/")
            name = titles[0]
            other_name = ""
            if len(titles) > 1:
                other_name = "".join(titles[1:])
            first_p_tag = bd_tag.find("p")
            descriptions = first_p_tag.contents
            descriptions.pop(1)  # drop the <br/> between the two info lines
            directors_and_actors = descriptions[0].strip()
            directors = ""
            actors = ""
            director_match = re.search("导演:", directors_and_actors)
            actor_match = re.search("主演:", directors_and_actors)
            if director_match is not None and actor_match is not None:
                directors = directors_and_actors[director_match.span()[1]:actor_match.span()[0]].strip()
                actors = directors_and_actors[actor_match.span()[1]:].strip()
            # the second line reads "year / country / genres"
            description = [part.strip() for part in descriptions[1].split("/")]
            year = description[0]
            country = description[1]
            kind = description[2]
            star_tag = bd_tag.find("div", class_="star")
            star = star_tag.find("span", class_="rating_num").text
            # the rating-count span is the second-to-last child of the star div
            # (the div's children include whitespace nodes rendered as line breaks)
            persons = star_tag.contents[-2].text
            # slice the digits out with a regex
            index_tp = re.search(r"\d+", persons).span()
            persons = persons[index_tp[0]:index_tp[1]]
            quote_tag = bd_tag.find("p", class_="quote")
            if quote_tag is None:
                # a few entries have no one-line quote; skip them
                continue
            quote = quote_tag.text.strip()
            movie = Movie(rank=rank, name=name, other_name=other_name, directors=directors, actors=actors,
                          year=year, country=country, kind=kind, star=star, persons=persons, quote=quote,
                          img_url=movie_img_url)
            top250Movies.append(movie)
    return top250Movies


def save():
    movies = craw()
    # write to Excel
    wb = openpyxl.Workbook()
    ws = wb.create_sheet(index=0, title='豆瓣电影Top250')
    ws.append(["排名", "电影名", "别名", "导演", "演员", "年份", "国家", "类别", "评分", "评价人数", "评价"])
    urls = []
    for movie in movies:
        ws.append(movie.toAttrList())
        urls.append((movie.img_url, movie.name + ".jpg"))
    os.mkdir("豆瓣电影Top250统计")
    os.chdir("豆瓣电影Top250统计")
    wb.save("豆瓣电影Top250统计.xlsx")
    # batch-download the poster images
    os.mkdir("豆瓣电影Top250图片保存")
    os.chdir("豆瓣电影Top250图片保存")
    for url in urls:
        download_jpg(url[0], url[1])


save()
