Scraping 36Kr (36氪) with Selenium

Code
Code walkthrough
(Images are at the bottom)

from selenium import webdriver

# Path to the local chromedriver executable
chrome_driver = r"C:\Users\yandi\AppData\Local\Programs\Python\Python37-32\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('https://36kr.com/')
# Open the news ("information") column
driver.find_element_by_xpath('//*[@id="information"]').click()
#%% Collect the category, author and title of every news item
typelist, authorlist, titlelist = [], [], []
i = 1
while True:
    try:
        # Look up all three fields first, so the lists stay aligned if one lookup fails
        item_type = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//span[@class="kr-flow-bar-motif"]/a').text
        author = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//a[@class="kr-flow-bar-author"]').text
        title = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//a[@class="article-item-title weight-bold"]').text
        typelist.append(item_type)
        authorlist.append(author)
        titlelist.append(title)
        i = i + 1
        if i % 29 == 0:
            print("Page refresh #" + str(i // 29) + ", please wait; " + str(len(titlelist)) + " items collected so far")
            # Click the element that refreshes the list
            driver.find_element_by_xpath('//*[@id="app"]/div/div[1]/div[3]/div/div/div[1]/div/div/div[3]').click()
    except Exception:
        # The row could not be read yet; try the same index again
        print("waiting")
    if i > 301:
        print("Scraping finished, " + str(len(titlelist)) + " items collected in total")
        break
#%% Merge the data
import pandas as pd

frame_title = pd.DataFrame(titlelist, columns=['title'])
frame_type = pd.DataFrame(typelist, columns=['type'])
frame_author = pd.DataFrame(authorlist, columns=['author'])
info_frame = frame_type.join(frame_title).join(frame_author)
#%% Serialize to disk
import pickle

with open(r"C:\Users\yandi\PycharmProjects\MachineLearing\36氪\info_frame.pkl", "wb") as b:
    pickle.dump(info_frame, b)

1. Open the 36Kr home page with Selenium

from selenium import webdriver

# Path to the local chromedriver executable
chrome_driver = r"C:\Users\yandi\AppData\Local\Programs\Python\Python37-32\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('https://36kr.com/')
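The snippet above uses the Selenium 3 API that was current when this post was written. In Selenium 4 the `executable_path` argument of `webdriver.Chrome` was deprecated and later removed in favor of a `Service` object. A rough equivalent, assuming Selenium 4 is installed, might look like the sketch below. `make_driver` is a hypothetical helper name, not part of Selenium or of the original code.

```python
def make_driver(driver_path=None):
    """Create a Chrome driver with the Selenium 4 style API (hypothetical helper)."""
    # Imported inside the function so this file can be loaded
    # on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # Assumption: Selenium 4.6 or newer, where Selenium Manager
        # locates a matching chromedriver on its own.
        return webdriver.Chrome()
    # Otherwise point Selenium at an explicit chromedriver executable.
    return webdriver.Chrome(service=Service(executable_path=driver_path))


# Usage (opens a real browser window, so it is left commented out):
# driver = make_driver()
# driver.get("https://36kr.com/")
```

Passing the same `chromedriver.exe` path as in the original code should also work through the `driver_path` argument.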


2. Open the news column

# Open the news ("information") column
driver.find_element_by_xpath('//*[@id="information"]').click()
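Clicking immediately after `driver.get` can fail if the tab has not been rendered yet. A hedged sketch of an explicit wait is shown below, written against Selenium 4, where `find_element_by_xpath` was removed in favor of `find_element(By.XPATH, ...)`. `click_when_ready` is a hypothetical helper name and the 10 second timeout is an arbitrary assumption.

```python
def click_when_ready(driver, xpath, timeout=10):
    """Wait until the element at `xpath` is clickable, then click it (hypothetical helper)."""
    # Imported inside the function so this file can be loaded without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException after `timeout` seconds.
    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )
    element.click()
    return element


# Usage with the locator from this post (needs a running driver):
# click_when_ready(driver, '//*[@id="information"]')
```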


3. Collect the data

#%% Collect the category, author and title of every news item
typelist, authorlist, titlelist = [], [], []
i = 1
while True:
    try:
        # Look up all three fields first, so the lists stay aligned if one lookup fails
        item_type = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//span[@class="kr-flow-bar-motif"]/a').text
        author = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//a[@class="kr-flow-bar-author"]').text
        title = driver.find_element_by_xpath('//div[@class="information-flow-list"]/div['+str(i)+']//a[@class="article-item-title weight-bold"]').text
        typelist.append(item_type)
        authorlist.append(author)
        titlelist.append(title)
        i = i + 1
        if i % 29 == 0:
            print("Page refresh #" + str(i // 29) + ", please wait; " + str(len(titlelist)) + " items collected so far")
            # Click the element that refreshes the list
            driver.find_element_by_xpath('//*[@id="app"]/div/div[1]/div[3]/div/div/div[1]/div/div/div[3]').click()
    except Exception:
        # The row could not be read yet; try the same index again
        print("waiting")
    if i > 301:
        print("Scraping finished, " + str(len(titlelist)) + " items collected in total")
        break
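The loop above repeats the same row XPath three times and retries forever when a row cannot be read. One possible way to tidy this up is sketched below: the XPaths are built from one template, and each row is read as a complete record with a bounded number of retries. `field_xpath` and `read_row` are hypothetical helper names, and `get_text` stands in for whatever function performs the Selenium lookup.

```python
import time

# Row template taken from the XPaths used in this post.
ROW = '//div[@class="information-flow-list"]/div[{index}]'

FIELD_XPATHS = {
    "type": ROW + '//span[@class="kr-flow-bar-motif"]/a',
    "author": ROW + '//a[@class="kr-flow-bar-author"]',
    "title": ROW + '//a[@class="article-item-title weight-bold"]',
}


def field_xpath(field, index):
    """Return the XPath of one field in the row at position `index` (1-based)."""
    return FIELD_XPATHS[field].format(index=index)


def read_row(get_text, index, retries=3, delay=1.0):
    """Read all fields of one row, or return None if every attempt fails.

    `get_text` is any callable that takes an XPath and returns the element text.
    """
    for _ in range(retries):
        try:
            # The dict is only returned when all three lookups succeed,
            # so a record is never half filled.
            return {f: get_text(field_xpath(f, index)) for f in FIELD_XPATHS}
        except Exception:
            time.sleep(delay)  # give the page time to load, then retry
    return None


# Stand-in lookup used here instead of a real browser.
demo = read_row(lambda xpath: "demo text", 1, retries=1, delay=0)
print(demo)
```

With the Selenium 3 API used in this post, `get_text` could be `lambda xp: driver.find_element_by_xpath(xp).text`. Rows for which `read_row` returns `None` can then be skipped instead of blocking the loop.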


4. Merge the data

#%% Merge the data
import pandas as pd

frame_title = pd.DataFrame(titlelist, columns=['title'])
frame_type = pd.DataFrame(typelist, columns=['type'])
frame_author = pd.DataFrame(authorlist, columns=['author'])
info_frame = frame_type.join(frame_title).join(frame_author)
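`DataFrame.join` aligns rows by index, so if the three lists ever differ in length the join silently drops rows or pads them with `NaN` rather than reporting a problem. A minimal sketch of a stricter alternative follows: check the lengths first, then build a single frame from a dict of columns. `build_columns` and `build_frame` are hypothetical helper names, not part of pandas.

```python
def build_columns(typelist, titlelist, authorlist):
    """Return a dict of columns after checking that all lists have the same length."""
    lengths = {len(typelist), len(titlelist), len(authorlist)}
    if len(lengths) != 1:
        raise ValueError("type, title and author lists differ in length")
    return {
        "type": list(typelist),
        "title": list(titlelist),
        "author": list(authorlist),
    }


def build_frame(typelist, titlelist, authorlist):
    """Build one DataFrame with the columns type, title and author."""
    # Imported inside the function so the length check can be used without pandas.
    import pandas as pd

    return pd.DataFrame(build_columns(typelist, titlelist, authorlist))


# Stand-in values, not real scraped data.
columns = build_columns(["category A"], ["title A"], ["author A"])
print(columns)
```

Calling `build_frame(typelist, titlelist, authorlist)` should give the same column layout as `info_frame` above.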


5. Serialize to disk

#%% Serialize to disk
import pickle

with open(r"C:\Users\yandi\PycharmProjects\MachineLearing\36氪\info_frame.pkl", "wb") as b:
    pickle.dump(info_frame, b)
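To check that the file was written correctly, it helps to load it back. The sketch below shows a pickle round trip with context managers, which close the file even if an error occurs. It uses a temporary directory and a stand-in list instead of the real `info_frame`, and `save_pickle` and `load_pickle` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize `obj` to `path`; the file is closed when the block exits."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in data, not real scraped results.
sample = [{"type": "category A", "title": "title A", "author": "author A"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "info_frame.pkl")
    save_pickle(sample, path)
    restored = load_pickle(path)

print(restored == sample)
```

For a DataFrame, pandas also provides `info_frame.to_pickle(path)` and `pd.read_pickle(path)`, which do the same job in one call each. Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.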

![Image description](https://img-blog.csdnimg.cn/2020102

17393724.png#pic_center)
