Web Scraping Practice (Day 4)

Hands-On Project

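Main project:
Simulate logging in to DXY (丁香园) and scrape, from a forum page, the basic information of every user together with the content of their replies.
DXY forum thread: http://www.dxy.cn/bbs/thread/626626#626626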

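The hardest part of this assignment is the simulated login. It can be handled by sending the cookie of an already logged-in session in the request headers.
The code is as follows: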

import traceback

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch the page with a logged-in cookie; return '' on any failure."""
    try:
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        cookie = '1092bbdd-42b0-4500-a9ff-8e37bab512dd'  # replace with your own cookie
        cookie = cookie.encode('utf-8')
        # The header name is 'User-Agent' (hyphen, not underscore).
        headers = {'User-Agent': user_agent, 'Cookie': cookie}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        traceback.print_exc()
        return ''


def parsePage(text):
    """Collect name, level, user stats and post content, keyed by post index."""
    htmlInfo = {}
    soup = BeautifulSoup(text, 'html.parser')

    # User names: one 'auth' div per post.
    auths = soup.find_all('div', attrs={'class': 'auth'})
    i = 0
    for auth in auths:
        htmlInfo[i] = {}
        htmlInfo[i]['name'] = auth.text
        i += 1

    # User level: the last nested div if there is one, otherwise the <p>.
    i = 0
    levels = soup.find_all('div', attrs={'class': 'info clearfix'})
    for level in levels:
        level1 = level.find_all('div')
        if level1:
            htmlInfo[i]['level'] = level1[-1].text.strip()
        else:
            htmlInfo[i]['level'] = level.find('p').text.strip()
        i += 1

    # User stats: each <li> text ends with a two-character label
    # (积分, 得票, 丁当); the text before it is the value.
    i = 0
    user_attens = soup.find_all('div', attrs={'class': 'user_atten'})
    for user_atten in user_attens:
        for user_attr in user_atten.select('li'):
            user_attr_str = user_attr.text
            htmlInfo[i][user_attr_str[-2:]] = user_attr_str[:-2]
        i += 1

    # Post content: join all text fragments of the post body.
    tds = soup.find_all('td', attrs={'class': 'postbody'})
    i = 0
    for td in tds:
        content = ''
        for string in td.stripped_strings:
            content += string + ' '
        htmlInfo[i]['content'] = content
        i += 1
    return htmlInfo


def printHTMLInfo(htmlInfo):
    print('name\t\tlevel\t\t\tscore\tvote\tdingdang\tcontent')
    # Skip the last entry.
    htmlInfo = list(htmlInfo.values())[:-1]
    for value in htmlInfo:
        print(f"{value['name']:10}\t{value['level']:14}\t{value['积分']}\t{value['得票']}\t{value['丁当']}\t\t{value['content']}", end='\n\n')


def main():
    url = "http://www.dxy.cn/bbs/thread/626626"
    text = getHTMLText(url)
    htmlInfo = parsePage(text)
    printHTMLInfo(htmlInfo)


if __name__ == '__main__':
    main()
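
The script above sends the cookie as one raw header string. When debugging a simulated login it can help to see the individual cookies instead. Below is a minimal sketch of that idea, not part of the original script: `cookie_header_to_dict` is a hypothetical helper, and the cookie names in the example string are made up, so copy the real value from your own logged-in browser session.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a browser 'Cookie' header string into a {name: value} dict."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        # Split on the first '=' only, so values that contain '=' stay intact.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical cookie string, for illustration only.
raw = "SESSIONID=abc123; remember_me=1"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict can be passed to requests through the `cookies=` argument, for example `requests.get(url, headers=headers, cookies=cookies)`, instead of putting the whole string into the `Cookie` header by hand.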
