Web Scraping (7): Scraping the Maoyan Movies Top 100

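I: Analyzing the Site

Target site and target data
Target URL: http://maoyan.com/board/4?offset=20
Target data: the movie list on the target page, including each movie's title, poster image, lead actors, release date, and rating.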

II: The Code

(1): Import the required packages

import requests
from requests.exceptions import RequestException  # handle request exceptions
import re
import pymysql
import json
from multiprocessing import Pool

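(2): Analyze the page

Inspecting the page shows that the content we need sits inside the <dd> tags. Paging through the list shows that only the offset parameter in the URL changes from one page to the next.

(3): Fetch the HTML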
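The paging behaviour described above can be sketched as a small helper. This is only an illustration: the name build_page_urls and its parameters are made up here, and the step of 10 per page is taken from the range(0, 100, 10) used in the main program later in this post.

```python
# Hypothetical helper: builds the ten list-page URLs by stepping the
# offset query parameter. Names and defaults are assumptions.
BASE_URL = 'http://maoyan.com/board/4?offset={}'


def build_page_urls(total=100, page_size=10):
    """Return one URL per results page, stepping offset by page_size."""
    return [BASE_URL.format(offset) for offset in range(0, total, page_size)]


urls = build_page_urls()
print(len(urls))   # 10
print(urls[0])     # http://maoyan.com/board/4?offset=0
print(urls[-1])    # http://maoyan.com/board/4?offset=90
```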

# Fetch one page of data
def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }
    try:  # requests can raise exceptions
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # status code 200 means success
            return response.text
        else:
            return None
    except RequestException:
        return None

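(4): Extract the required information with a regular expression

# Parse the page content
def parse_one_page(html):
    # Raw strings keep "\d" from being treated as a string escape
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)"'
        r'.*?class="name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
        r'.*?fraction">(.*?)</i>.*?</dd>',
        re.S)  # re.S lets "." match any character, including newlines
    items = re.findall(pattern, html)  # one tuple of captured groups per movie
    for item in items:
        yield {  # yield makes this function a generator
            'index': item[0].strip(),
            'title': item[2].strip(),
            'actor': item[3].strip()[3:],  # drop the 3-character "主演：" label
            'score': ''.join([item[5].strip(), item[6].strip()]),  # integer part + fraction part
            'pub_time': item[4].strip()[5:],  # drop the 5-character "上映时间：" label
            'img_url': item[1].strip(),
        }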
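To see what the pattern captures, it can help to run it against a single hand-written entry. The snippet below is a sketch only: SAMPLE_HTML is a made-up <dd> block shaped to fit the regular expression, not markup copied from the live site, and the movie data in it is invented. The function name parse_sample is also hypothetical.

```python
import re

# Same pattern as above, written as raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)"'
    r'.*?class="name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
    r'.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
    r'.*?fraction">(.*?)</i>.*?</dd>',
    re.S)

# Made-up entry for illustration; the real page markup may differ.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_sample(html):
    """Yield one dict per <dd> entry matched by PATTERN."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0].strip(),
            'title': item[2].strip(),
            'actor': item[3].strip()[3:],      # strips the label "主演："
            'score': item[5].strip() + item[6].strip(),
            'pub_time': item[4].strip()[5:],   # strips the label "上映时间："
            'img_url': item[1].strip(),
        }


items = list(parse_sample(SAMPLE_HTML))
print(items[0])
```

The score is split across two <i> tags in this sample ("9." and "5"), which is why the integer and fraction groups are joined back into "9.5".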

(5): Store the extracted data in a MySQL database

# Connect to the database; the database and table must already exist locally
def commit_to_sql(dic):
    conn = pymysql.connect(host='localhost', port=3306, user='mydb', passwd='123456', db='maoyantop100', charset='utf8')
    cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)  # make the cursor return rows as dicts
    # Pass the values as query parameters so that quotes inside a title cannot break the statement
    sql = '''insert into movies_top_100(mid,title,actor,score,pub_time,img_url) values(%s,%s,%s,%s,%s,%s)'''
    cursor.execute(sql, (dic['index'], dic['title'], dic['actor'], dic['score'], dic['pub_time'], dic['img_url']))  # returns the number of affected rows
    conn.commit()  # commit the transaction
    cursor.close()  # close the cursor
    conn.close()  # close the connection
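Since the table has to exist before the script runs, here is a rough sketch of what it might look like. It uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. The column names come from the INSERT statement above, but the column types and the primary key are assumptions, and save_movie is a hypothetical helper. Note that sqlite3 uses ? as its placeholder, whereas pymysql uses %s.

```python
import sqlite3

# Assumed schema: only the column names are taken from the post.
CREATE_TABLE = '''
CREATE TABLE IF NOT EXISTS movies_top_100 (
    mid      INTEGER PRIMARY KEY,
    title    TEXT,
    actor    TEXT,
    score    TEXT,
    pub_time TEXT,
    img_url  TEXT
)
'''

INSERT_ROW = '''
INSERT INTO movies_top_100 (mid, title, actor, score, pub_time, img_url)
VALUES (?, ?, ?, ?, ?, ?)
'''


def save_movie(conn, movie):
    """Insert one parsed movie dict, passing values as query parameters."""
    conn.execute(INSERT_ROW, (
        int(movie['index']), movie['title'], movie['actor'],
        movie['score'], movie['pub_time'], movie['img_url'],
    ))
    conn.commit()


conn = sqlite3.connect(':memory:')  # in-memory database, for the demo only
conn.execute(CREATE_TABLE)

# Invented record; the title contains quotes on purpose.
sample = {
    'index': '1',
    'title': 'He said "hi"',
    'actor': 'Actor A,Actor B',
    'score': '9.5',
    'pub_time': '1993-01-01',
    'img_url': 'http://example.com/poster.jpg',
}
save_movie(conn, sample)
rows = conn.execute('SELECT mid, title, score FROM movies_top_100').fetchall()
print(rows)
conn.close()
```

Passing the values as parameters, rather than formatting them into the SQL text, means a title that contains quotation marks is stored as-is instead of producing a syntax error.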

(6): Main program and running it

def main(url):
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        commit_to_sql(item)


if __name__ == '__main__':
    urls = ['http://maoyan.com/board/4?offset={}'.format(i) for i in range(0, 100, 10)]
    # Use multiple processes
    pool = Pool()
    pool.map(main, urls)

(7): Final results

Complete code:

# -*- coding: utf-8 -*-
# @Author  : FELIX
# @Date    : 2018/4/4 9:29

import requests
from requests.exceptions import RequestException
import re
import pymysql
import json
from multiprocessing import Pool


# Connect to the database and insert one record
def commit_to_sql(dic):
    conn = pymysql.connect(host='localhost', port=3306, user='wang', passwd='123456', db='maoyantop100', charset='utf8')
    cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)  # make the cursor return rows as dicts
    # Pass the values as query parameters so that quotes inside a title cannot break the statement
    sql = '''insert into movies_top_100(mid,title,actor,score,pub_time,img_url) values(%s,%s,%s,%s,%s,%s)'''
    cursor.execute(sql, (dic['index'], dic['title'], dic['actor'], dic['score'], dic['pub_time'], dic['img_url']))  # returns the number of affected rows
    conn.commit()  # commit the transaction
    cursor.close()  # close the cursor
    conn.close()  # close the connection


# Fetch one page of data
def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }
    try:  # requests can raise exceptions
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # status code 200 means success
            return response.text
        else:
            return None
    except RequestException:
        return None


# Parse the page content
def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)"'
        r'.*?class="name"><a.*?>(.*?)</a>.*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
        r'.*?fraction">(.*?)</i>.*?</dd>',
        re.S)  # re.S lets "." match any character, including newlines
    items = re.findall(pattern, html)  # one tuple of captured groups per movie
    for item in items:
        yield {  # yield makes this function a generator
            'index': item[0].strip(),
            'title': item[2].strip(),
            'actor': item[3].strip()[3:],
            'score': ''.join([item[5].strip(), item[6].strip()]),
            'pub_time': item[4].strip()[5:],
            'img_url': item[1].strip(),
        }


# Append one record to result.txt as a line of JSON
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


ii = 0  # counter; each worker process in the Pool keeps its own copy


def main(url):
    global ii
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(ii, item)
        ii = ii + 1
        commit_to_sql(item)
        write_to_file(item)


if __name__ == '__main__':
    urls = ['http://maoyan.com/board/4?offset={}'.format(i) for i in range(0, 100, 10)]
    # Use multiple processes
    pool = Pool()
    pool.map(main, urls)

Reposted from: https://www.cnblogs.com/felixwang2/p/8728889.html