Scraping the Douban Movie Top 250 with Python
Base page: https://movie.douban.com/top250
Code:
from time import sleep
from requests import get
from bs4 import BeautifulSoup
import re
import pymysql
db = pymysql.connect(host='localhost',
                     user='root',
                     password='123456',
                     db='douban',
                     charset='utf8mb4',
                     cursorclass=pymysql.cursors.DictCursor
                     )
try:
    with db.cursor() as cursor:
        sql = "CREATE TABLE IF NOT EXISTS `top250` (" \
              "`id` int(6) NOT NULL AUTO_INCREMENT," \
              "`top` int(6) NOT NULL," \
              "`page-code` int(6) NOT NULL," \
              "`title` varchar(255) NOT NULL," \
              "`origin-title` varchar(255)," \
              "`score` float NOT NULL," \
              "`theme` varchar(255) NOT NULL," \
              "PRIMARY KEY(`id`)" \
              ") ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;"
        cursor.execute(sql,)
finally:
    db.commit()
base_url = 'https://movie.douban.com/top250'
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'xxx',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/chart',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'xxx'
}
def crawler(url=None, headers=None, delay=1):
    r = get(url=url, headers=headers, timeout=3)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Current page number, taken from the pager's "thispage" span
    page_tag = soup.find('span', attrs={'class': 'thispage'})
    page_code = page_tag.get_text(strip=True)
    movie_ranks = soup.find_all('em', attrs={'class': ''})
    movie_titles = soup.find_all('div', attrs={'class': 'hd'})
    movie_scores = soup.find_all('span', attrs={'class': 'rating_num'})
    movie_themes = soup.find_all('span', attrs={'class': 'inq'})
    next_page = soup.find('link', attrs={'rel': 'next'})
    # Note: zip() stops at the shortest list, so the four lists must have equal length
    for ranks, titles, scores, themes in zip(movie_ranks, movie_titles, movie_scores, movie_themes):
        rank = ranks.get_text(strip=True)
        # First "title" span is the Chinese title; a second one, if present, is the original title
        title_spans = titles.find_all('span', attrs={'class': 'title'})
        title = title_spans[0].get_text()
        score = scores.get_text(strip=True)
        theme = themes.get_text(strip=True)
        try:
            # Raises IndexError when the film has no separate original title
            origin_title = title_spans[1].get_text()
            # Strip the leading separator (one character, a slash, one character)
            origin_title = re.compile(r'./.(.+)').findall(origin_title)[0]
            with db.cursor() as cursor:
                sql = "INSERT INTO `top250` (`top`, `page-code`, `title`, `origin-title`, `score`, `theme`)" \
                      " VALUES (%s, %s, %s, %s, %s, %s)"
                cursor.execute(sql, (rank, page_code, title, origin_title, score, theme,))
        except IndexError:
            with db.cursor() as cursor:
                sql = "INSERT INTO `top250` (`top`, `page-code`, `title`, `score`, `theme`)" \
                      " VALUES (%s, %s, %s, %s, %s)"
                cursor.execute(sql, (rank, page_code, title, score, theme,))
        finally:
            db.commit()
    if next_page is not None:
        # Follow the rel="next" link, using the current page as the Referer
        headers['Referer'] = url
        next_url = base_url + next_page['href']
        sleep(delay)
        crawler(url=next_url, headers=headers, delay=3)


crawler(base_url, header, 0)
db.close()
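The crawler above discovers each page by following the rel="next" link. A different approach, sketched here, is to generate the page URLs up front. It assumes the list is split into 10 pages of 25 entries, each addressed by a start offset in the query string; that parameter and the page_urls helper are assumptions for illustration, not something taken from the script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(base_url=BASE_URL, total=250, per_page=25):
    """Yield one listing URL per page (hypothetical helper).

    Assumes the site accepts a `start` offset in the query string.
    """
    for start in range(0, total, per_page):
        yield base_url + '?' + urlencode({'start': start})


urls = list(page_urls())
print(len(urls), urls[0], urls[-1])
```

A plain loop over these URLs would replace the recursive call, at the cost of hard-coding the page size instead of reading it from the page.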
Results:
mysql> select top,title,score from top250 where id = 175;
+-----+--------+-------+
| top | title  | score |
+-----+--------+-------+
| 176 | 罗生门 |   8.7 |
+-----+--------+-------+
1 row in set (0.00 sec)
mysql> select top,title,page-code,score from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'
mysql> select top,page-code,title,score from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'
mysql> select page-code from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'
mysql> describe top250
-> ;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(6)       | NO   | PRI | NULL    | auto_increment |
| top          | int(6)       | NO   |     | NULL    |                |
| page-code    | int(6)       | NO   |     | NULL    |                |
| title        | varchar(255) | NO   |     | NULL    |                |
| origin-title | varchar(255) | YES  |     | NULL    |                |
| score        | float        | NO   |     | NULL    |                |
| theme        | varchar(255) | NO   |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
7 rows in set (0.32 sec)
mysql> select page-code from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'
mysql> select origin-title from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'origin' in 'field list'
mysql> select origin_title from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'origin_title' in 'field list'
mysql> select * from top250 where id = 175;
+-----+-----+-----------+--------+--------------+-------+-------------------+
| id  | top | page-code | title  | origin-title | score | theme             |
+-----+-----+-----------+--------+--------------+-------+-------------------+
| 175 | 176 |         8 | 罗生门 | 羅生門       |   8.7 | 人生的N种可能性。 |
+-----+-----+-----------+--------+--------------+-------+-------------------+
1 row in set (0.00 sec)
mysql> select * from top250 where title = 未麻的部屋;
ERROR 1054 (42S22): Unknown column '未麻的部屋' in 'where clause'
mysql> select * from top250 where top=175;
Empty set (0.00 sec)
mysql>
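The ERROR 1054 messages in this session have a common cause: written bare, page-code is read by MySQL as the expression page minus code, and an unquoted 未麻的部屋 is read as a column name rather than a string. Identifiers with a hyphen need backticks, and string values need quotes or, from Python, a %s placeholder. The sketch below builds such a query; quote_identifier and build_select are hypothetical helpers, not part of pymysql.

```python
def quote_identifier(name):
    """Wrap a MySQL identifier in backticks, doubling any embedded backtick."""
    return '`' + name.replace('`', '``') + '`'


def build_select(table, columns, where_column):
    """Build a SELECT with quoted identifiers and a %s placeholder for the value."""
    column_list = ', '.join(quote_identifier(c) for c in columns)
    return 'SELECT {} FROM {} WHERE {} = %s'.format(
        column_list, quote_identifier(table), quote_identifier(where_column))


sql = build_select('top250', ['top', 'page-code', 'origin-title'], 'title')
print(sql)
# prints: SELECT `top`, `page-code`, `origin-title` FROM `top250` WHERE `title` = %s

# With an open pymysql connection like the one in the script, this could be run as:
#     cursor.execute(sql, ('未麻的部屋',))
```

In the mysql client the same idea is select `page-code` from top250 where id = 175; with the backticks typed by hand.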
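Two small problems:
1. I did not realize that column names containing '-' cannot be written bare in a query, so the page-code and origin-title columns cannot be selected on their own without backtick quoting.
2. I don't know why the film ranked 175, 《未麻的部屋》, was not scraped.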
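A possible explanation for the missing entry is that zip() stops at the shortest of its inputs. If any film on a page has no inq quote, the list of quotes is one short, the later quotes shift onto the wrong films, and the last film on that page is dropped; rank 175 is the last entry on page 7. One way to avoid this is to parse each film's container on its own. The sketch below assumes every entry sits inside a div with class item (an assumption about the page markup) and runs on a made-up HTML sample; parse_items is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up, trimmed-down markup: two entries, the second without a quote.
SAMPLE_HTML = """
<ol class="grid_view">
  <li><div class="item">
    <em class="">174</em>
    <div class="hd"><span class="title">Film A</span></div>
    <span class="rating_num">8.7</span>
    <span class="inq">A one-line quote.</span>
  </div></li>
  <li><div class="item">
    <em class="">175</em>
    <div class="hd"><span class="title">Film B</span></div>
    <span class="rating_num">8.6</span>
  </div></li>
</ol>
"""


def parse_items(html):
    """Parse each film container separately so fields cannot shift between films."""
    soup = BeautifulSoup(html, 'html.parser')
    films = []
    for item in soup.find_all('div', attrs={'class': 'item'}):
        quote = item.find('span', attrs={'class': 'inq'})
        films.append({
            'rank': int(item.find('em').get_text(strip=True)),
            'title': item.find('span', attrs={'class': 'title'}).get_text(strip=True),
            'score': float(item.find('span', attrs={'class': 'rating_num'}).get_text(strip=True)),
            # Some films have no quote; keep the row and store an empty string.
            'theme': quote.get_text(strip=True) if quote is not None else '',
        })
    return films


films = parse_items(SAMPLE_HTML)
print(films)
```

An empty string is used for a missing quote because the theme column in the table above is declared NOT NULL.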