Python crawler: scraping Douban movie reviews, with automatic login via cookies

Logging in to the site too frequently can get your account locked by Douban…

Page requests

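Automatic login is done with cookies. I have masked the cookies here because they are tied to my account. To get your own, let the browser log you in automatically, then find the cookie value in the page's request information (the request headers).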

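import requests

def askURL(url):
    # Browser-like headers; the raw cookie string copied from the browser
    # is sent as the Cookie header (requests' cookies= argument expects a
    # name -> value dict, not the whole string under a "Cookie" key)
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
        "Cookie": ' ***********************',
    }
    # Earlier urllib version, kept for reference:
    # request = urllib.request.Request(url, headers=head)
    # html = ""
    # response = urllib.request.urlopen(request)
    # html = response.read().decode("utf-8")
    html = requests.get(url, headers=head)
    print("Page returned successfully")
    return html.text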
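The cookies argument of requests.get expects a mapping from cookie names to values, while the browser shows the cookie as a single "name=value; name2=value2" string. Below is a minimal sketch of splitting that string, assuming it was copied from the Cookie request header. parse_cookie_string is a hypothetical helper, and the cookie names and values are made-up placeholders, not real credentials.

```python
def parse_cookie_string(raw):
    """Split a raw 'a=1; b=2' Cookie header value into a name -> value dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, since values may contain '=' themselves
        name, _, value = part.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder cookie string (hypothetical names and values)
raw_cookie = 'bid=abc123; ll="108288"; dbcl2="42:xyz="'
cookie_dict = parse_cookie_string(raw_cookie)
print(cookie_dict)
```

The resulting dict could then be passed along as requests.get(url, cookies=cookie_dict, headers=head).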

Code for extracting the data

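While looking at the Douban reviews, I found that I could not retrieve all of the comments.
If I understand correctly, there should be 310,000+ comments here, but in practice nothing came back after page 26.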
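Since pages beyond a certain point come back empty, one option is to stop as soon as a page yields no comments rather than hard-coding the page count. Here is a rough sketch under assumptions: crawl_until_empty and fetch_items are hypothetical names, fetch_items stands in for "download one page and return its comments", and the fake fetcher only simulates a site that stops returning data after three pages.

```python
import time


def crawl_until_empty(fetch_items, base_url, page_size=20, max_pages=100, delay=0.0):
    """Collect items page by page and stop at the first empty page."""
    collected = []
    for page in range(max_pages):
        url = base_url + str(page * page_size)
        items = fetch_items(url)
        if not items:
            break  # nothing came back, so later pages are assumed empty too
        collected.extend(items)
        time.sleep(delay)  # pause between requests to go easy on the site
    return collected


def fake_fetch(url):
    """Simulated fetcher: only offsets 0, 20 and 40 return comments."""
    start = int(url.rsplit("=", 1)[1])
    if start >= 60:
        return []
    return ["comment {}".format(start + n) for n in range(20)]


comments = crawl_until_empty(fake_fetch, "https://example.com/comments?start=")
print(len(comments))
```

With a real fetcher, a non-zero delay between requests is the gentler choice, given that overly frequent access can get an account locked.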

re regular expressions

import re

# Patterns for the comment text, the user name, and the rating class
findCritic = re.compile(r'<span class="short">(.*?)</span>',re.S)
findUser = re.compile(r'<a href=.*? title="(.*?)">',re.S)
findScore = re.compile(r'<span class="(.*?)" title=')
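To see what each pattern captures, they can be run against a small HTML fragment. The sketch below reuses the three patterns as written. The sample markup is invented and simplified for illustration, so real Douban pages may differ.

```python
import re

findCritic = re.compile(r'<span class="short">(.*?)</span>', re.S)
findUser = re.compile(r'<a href=.*? title="(.*?)">', re.S)
findScore = re.compile(r'<span class="(.*?)" title=')

# Invented, simplified fragment; real Douban markup may differ
sample = '''<div class="comment-item">
<a href="https://example.com/people/u1/" title="Alice"><img src="a.jpg"/></a>
<span class="allstar50 rating" title="Highly recommended"></span>
<span class="short">Great film.</span>
</div>'''

users = findUser.findall(sample)      # title attribute of the user link
scores = findScore.findall(sample)    # class attribute of the rating span
critics = findCritic.findall(sample)  # text of the short comment
print(users, scores, critics)
```

Note that findScore captures the class attribute of the span (a string), not a numeric rating, so any conversion to a star count would be a separate step.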

Implementation

from bs4 import BeautifulSoup

def getDate(base_url):
    datelist = []
    # 25 pages of 20 comments each; the offset i*20 is appended to base_url
    for i in range(0, 25):
        url = base_url + str(i * 20)
        html = askURL(url)
        print("Page {0}".format(i + 1))
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="comment-item"):
            date = []
            item = str(item)
            # print(item)
            user = re.findall(findUser, item)
            date.append(user)
            # Guard against a comment with no match instead of indexing an empty list
            score = re.findall(findScore, item)
            date.append(score[0] if score else "")
            critic = re.findall(findCritic, item)
            date.append(critic)
            datelist.append(date)
    return datelist

Saving to the database

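Handling user names that contain single quotes tripped me up here. I use str.replace() to first turn the single quotes in the user name into spaces, then turn the double quotes around the string form into single quotes, and only then does the value fit the format the database insert needs.
If anyone has a better solution, please let me know in the comments.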

import sqlite3

def saveDate_DB(datelist, dbpath):
    init_DB(dbpath)  # init_DB is not shown in this post
    conn = sqlite3.connect(dbpath)
    cursor = conn.cursor()
    for date in datelist:
        for index in range(len(date)):
            date[index] = str(date[index])
            # Single quotes become spaces, then double quotes become single quotes
            date[index] = date[index].replace("'", " ")
            date[index] = date[index].replace('"', "'")
            # Wrap the value in double quotes for the insert statement
            date[index] = '"' + str(date[index]) + '"'
        sql = '''insert into bawangbieji(author ,score ,critics)values(%s)''' % ",".join(date)
        # print(sql)
        cursor.execute(sql)
    conn.commit()
    conn.close()
    print("Saved to database", dbpath)
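One common alternative to the manual quote replacement is to let sqlite3 bind the values through ? placeholders, so quotes inside user names or comments need no escaping. This is a minimal sketch, assuming a table with the same three columns as above. save_rows is a hypothetical helper, the create table statement and the in-memory database are assumptions for illustration (init_DB is not shown in the post), and the sample row is made up.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (author, score, critics) tuples using bound parameters."""
    # Assumed schema; the post's own init_DB may define it differently
    conn.execute(
        "create table if not exists bawangbieji "
        "(author text, score text, critics text)"
    )
    # The ? placeholders let sqlite3 handle quoting of every value
    conn.executemany(
        "insert into bawangbieji (author, score, critics) values (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
rows = [("O'Brien", "allstar50 rating", 'He said "wow"')]
save_rows(conn, rows)
stored = conn.execute("select author, score, critics from bawangbieji").fetchall()
conn.close()
print(stored)
```

Because the values travel separately from the SQL text, both single and double quotes are stored unchanged.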
