Recently, for my own research, I wrote a crawler for The Wall Street Journal. The core idea is to scrape with Selenium while loading a cached browser profile.

To avoid potential legal and copyright risks, this post is for learning and discussion purposes only.

Importing the packages

There are quite a few imports here; some of the packages are left over from earlier experiments and are not all actually used. Import whatever your own setup needs.

For questions about configuring Selenium, see my previous post.
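
Since that earlier post is not reproduced here, below is a minimal sketch of the kind of Selenium setup the rest of this post assumes: Chrome launched with a persistent user profile so that the WSJ login and cookies are reused across runs. The chromedriver location and profile path are placeholders, and this is the same pattern used again in the updated scraping code further down.

# Minimal Selenium setup sketch (assumption: chromedriver is on PATH; the profile path is a placeholder)
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument(r"user-data-dir=path_to_profile/tmp")  # reuse a logged-in Chrome profile
driver = webdriver.Chrome(options=option)
driver.get("https://www.wsj.com/")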

import pyppeteer
import asyncio
import json
from pyppeteer import launch
import nest_asyncio
from pyppeteer.dialog import Dialog
from types import SimpleNamespace
from pyppeteer.connection import CDPSession
import time
from lxml import etree
import csv
import re
from tqdm import tqdm
import requests
import pandas as pd
import unicodedata
from string import punctuation
from requests.exceptions import TooManyRedirects
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
nest_asyncio.apply()
#print(pyppeteer.__chromium_revision__)  # check the Chromium revision number
#print(pyppeteer.executablePath())
import os
import shutil
from selenium import webdriver
from selenium.webdriver.common.by import By
import random
import threading
from threading import Thread

Several ways to get cookies

Method 1: via pyppeteer.

async def main():
    browser = await launch({'headless': False, 'args': ['--no-sandbox'], 'dumpio': True})
    page = await browser.newPage()
    await page.setViewport(viewport={'width': 1280, 'height': 800})
    await page.waitFor(2000)
    await page.goto('WSJ login URL here')
    await page.type('#username', "your username here")
    await page.click("#basic-login > div:nth-child(1) > form > div:nth-child(2) > div:nth-child(6) > div.sign-in.hide-if-one-time-linking > button.solid-button.continue-submit.new-design > span")
    await page.waitFor(2000)
    await page.type('#password', 'your password here')
    await page.click("#password-login > div > form > div > div:nth-child(5) > div.sign-in.hide-if-one-time-linking > button")
    await page.waitFor(2000)
    await asyncio.sleep(30)
    await page.evaluate(
        '''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')
    page2 = await browser.newPage()
    await page2.setViewport(viewport={'width': 1280, 'height': 600})
    await page2.waitFor(1000)
    await page2.goto('https://www.wsj.com/?mod=mh_header')
    await page2.waitFor(3000)
    await asyncio.sleep(60)  # during this pause, manually click the pop-up dialog and agree to the cookie notice
    orcookies = await page2.cookies()
    print(orcookies)
    cookies = {}
    for item in orcookies:
        cookies[item['name']] = item['value']
    with open("enter_your_path_here.txt", "w") as f:  # cookie file reused by the later code
        f.write(json.dumps(cookies))
    await page2.evaluate(
        '''() =>{ Object.defineProperties(navigator,{ webdriver:{ get: () => false } }) }''')

asyncio.get_event_loop().run_until_complete(main())

Method 2: via webdriver.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json

browser = webdriver.Chrome(executable_path='/opt/anaconda3/bin/chromedriver')
#browser.add_argument('-headless')

un = 'your username here'
pw = 'your password here'

browser.get("WSJ login URL here")
browser.find_element(By.XPATH, "//div/input[@name = 'username']").send_keys(un)
browser.find_element(By.XPATH, '//*[@id="basic-login"]/div[1]/form/div[2]/div[6]/div[1]/button[2]').click()
time.sleep(10)
browser.find_element(By.ID, "password-login-password").send_keys(pw)
browser.find_element(By.XPATH, '//*[@id="password-login"]/div/form/div/div[5]/div[1]/button').click()
time.sleep(10)

# Switch to the pop-up frame
# driver.switch_to_frame("sp_message_iframe_490357")
# Click to accept cookie collection
# driver.find_element_by_xpath("//button[@title='YES, I AGREE']").click()
# time.sleep(5)

orcookies = browser.get_cookies()
print(orcookies)
cookies = {}
for item in orcookies:
    cookies[item['name']] = item['value']
with open("enter_your_path_here.txt", "w") as f:
    f.write(json.dumps(cookies))

Method 3: grab them directly.

# You can also grab cookies directly with the browser developer tools or an extension
# In Chrome, on the page whose cookies you need, press Ctrl+Shift+J to open the JS console
# Type console.log(document.cookie) and press Enter to print the cookies
# Define a function that cleans a cookie string copied straight from the page
def cookie_clean(cookie):
    a = []
    b = []
    for item in cookie.split(';'):
        item = ''.join(item.split())
        place = item.find('=')
        a.append(item[:place])
        b.append(item[place+1:])
    cookies = dict(zip(a, b))
    return cookies
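
For reference, here is a minimal usage sketch of cookie_clean (the cookie string and output path below are made-up placeholders, not values from the original post). The result is written out as JSON so the requests-based code in the next section can load it back with json.loads:

import json

# made-up example of what console.log(document.cookie) prints
raw = "wsjregion=na-us; ccpaApplies=true; usr_bkt=abc123"
cookies = cookie_clean(raw)
with open("enter_your_path_here.txt", "w") as f:
    f.write(json.dumps(cookies))  # same JSON format the link-scraping code reads back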

Getting the article links

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
"content-type": "application/json; charset=UTF-8",
"Connection": "keep-alive"
}  # replace with your own request headers

f = open("path_to_your_date_list.txt", "r", encoding="cp936")
# You need to prepare a simple date list file in advance, one date per line in the form 1998/01/01;
# how many dates it covers depends on what you want to scrape
lines = f.readlines()
for line in lines[8596:]:  # if the links change, re-slice and crawl again from wherever you stopped
    date = line.strip().split("\t")[0]
    with open("path_to_your_cookies.txt", "r") as g:
        cookies = g.read()
    cookies = json.loads(cookies)
    session = requests.session()
    url = "https://www.wsj.com/news/archive/" + date
    data = session.get(url, headers=headers, cookies=cookies)
    time.sleep(1)
    try:
        soup = BeautifulSoup(data.content, 'html.parser')
        urls = [i.a['href'] for i in soup.find_all('div', {'class': 'WSJTheme--headline--7VCzo7Ay'})]
        articles = [i.text for i in soup.find_all('div', {'class': 'WSJTheme--headline--7VCzo7Ay'})]
        articles = [unicodedata.normalize('NFD', i).encode('ascii', 'ignore').decode("utf-8").replace("\n", " ").replace('\t', "") for i in articles if len(i) >= 1]
        modu = soup.find_all('div', {'class': 'WSJTheme--overflow-hidden--qJmlzHgO'})
        categories = [i.find('div').text.strip().split("\t")[0] for i in modu]
        categories = [unicodedata.normalize('NFD', i).encode('ascii', 'ignore').decode("utf-8").replace("\n", "").replace('\t', "") for i in categories if len(i) >= 0]
        page_num = int(soup.find('span', {'class': "WSJTheme--pagepicker-total--Kl350I1l"}).text.strip().replace('of ', ''))
        with open("enter_your_path_here.txt", 'a') as j:  # output file for article links
            for k, i in enumerate(categories):
                j.write(articles[k] + '\t' + urls[k] + '\t' + categories[k] + '\t' + date + '\n')
        if page_num == 1:
            print("pn=1")
        else:  # turn the pages
            for pn in range(2, page_num + 1):
                print(pn)
                time.sleep(1)
                new_url = url + '?page=' + str(pn)
                data1 = session.get(new_url, headers=headers, cookies=cookies)
                time.sleep(1)
                soup1 = BeautifulSoup(data1.content, 'html.parser')
                urls1 = [i.a['href'] for i in soup1.find_all('div', {'class': 'WSJTheme--headline--7VCzo7Ay'})]
                articles1 = [i.text for i in soup1.find_all('div', {'class': 'WSJTheme--headline--7VCzo7Ay'})]
                articles1 = [unicodedata.normalize('NFD', i).encode('ascii', 'ignore').decode("utf-8").replace("\n", " ").replace('\t', "") for i in articles1 if len(i) >= 1]
                modu1 = soup1.find_all('div', {'class': 'WSJTheme--overflow-hidden--qJmlzHgO'})
                categories1 = [i.find('div').text.strip().split("\t")[0] for i in modu1]
                categories1 = [unicodedata.normalize('NFD', i).encode('ascii', 'ignore').decode("utf-8").replace("\n", "").replace('\t', "") for i in categories1 if len(i) >= 0]
                with open("enter_your_path_here.txt", 'a') as j:
                    for k, i in enumerate(categories1):
                        j.write(articles1[k] + '\t' + urls1[k] + '\t' + categories1[k] + '\t' + date + '\n')
    except Exception as e:  # log errors
        print(url, e)
        with open("enter_your_path_here.txt", 'a') as l:  # error log file
            l.write(url + '\t' + date + '\n')
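
The date list file itself is not included in the post; as one possible way to generate it (an assumption, not the author's original script), pandas can write one archive date per line in the expected 1998/01/01 format:

import pandas as pd

# write one date per line in the 1998/01/01 format the loop above expects
# (the date range and file name are placeholders)
dates = pd.date_range("1998-01-01", "1999-12-31").strftime("%Y/%m/%d")
with open("path_to_your_date_list.txt", "w", encoding="cp936") as f:
    f.write("\n".join(dates) + "\n")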

Scraping the article content

Previously, the article content could be scraped directly, without loading a tmp profile. But the WSJ site seems to have been updated this year, so the code now has to be a bit more involved.

Here is the earlier code first.

# The old way of fetching article content
def get_headers_and_cookies():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3542.0 Safari/537.36',
        "content-type": "application/json; charset=UTF-8",
        "Connection": "keep-alive"
    }
    with open("wsjcookies.txt", "r") as f:
        cookies = f.read()
    cookies = json.loads(cookies)
    return headers, cookies

def open_file():
    # the cleaned-up list of article links
    f2 = open("path_to_article_links.txt", "r", encoding="cp936")
    lines = f2.readlines()
    return lines

def get_text(url, date, title, headers, cookies, session):  # session is passed in so requests can reuse the connection
    errorlog = []
    try:
        # requesting
        data = session.get(url, headers=headers, cookies=cookies)
        time.sleep(random.random())
        soup = BeautifulSoup(data.content, 'html.parser')
        # parsing
        heads = soup.find('div', {'class': 'wsj-article-headline-wrap'}).text
        content = soup.find('div', {'class': 'article-content'}).text
        content = heads + content
        # saving
        with open('your_output_dir/%s_%s.txt' % (date, title), 'w', encoding='utf-8', errors='ignore') as j:
            j.write(content)
    except Exception as e:
        print(url, e)
        with open('your_output_dir/%s_%s.txt' % (date, title), 'w', encoding='utf-8', errors='ignore') as j:
            j.write(title)
        errorlog.append([url, e])

def main():
    lines = open_file()
    headers, cookies = get_headers_and_cookies()
    session = requests.session()
    for line in lines:
        linenew = line.split('\t')
        title = linenew[0].replace('/', '-')
        url = linenew[1]
        category = linenew[2]
        date = linenew[3].replace('\n', '').replace('/', '-')
        get_text(url, date, title, headers, cookies, session)

if __name__ == '__main__':
    main()

Now, the updated version of the code.

# First load the browser profile; see the previous post for how it is set up
option = webdriver.ChromeOptions()
option.add_argument(r"user-data-dir=配置文件路径/tmp")# 加载配置文件夹,直接command+shift+g 查找文件路径
driver = webdriver.Chrome(options=option)
class crawling():
    def __init__(self, driver, num1, num2, file):
        self.lines = open("path_to_article_list.txt", "r", encoding="cp936").readlines()
        self.driver = driver
        self.num1 = num1
        self.num2 = num2
        self.file = file

    def get_article(self):
        errorlog = []
        driver = self.driver
        url = self.url
        date = self.date
        title = self.title
        newlink = url.split('/')[-1].split('.')[0].replace('?', "---")
        if len(newlink) <= 200:
            try:
                driver.get(url)
                time.sleep(8)
                titlename = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/div[1]/div')
                writername = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/article/div[2]/div[1]/div')
                article = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/article/div[2]/section')
                content = titlename.text + '\n' + writername.text + '\n' + article.text
                time.sleep(1)
                with open('output_dir/' + self.file + '/%s_%s.txt' % (date, newlink), 'w', encoding='utf-8', errors='ignore') as j:
                    j.write(content)
            except Exception as e:
                #print(url, e)
                with open('output_dir/' + self.file + '/%s_%s.txt' % (date, newlink), 'w', encoding='utf-8', errors='ignore') as j:
                    j.write(title)
                errorlog.append([url, e])
        else:
            newlink2 = newlink[:200]  # avoid over-long file names
            try:
                driver.get(url)
                time.sleep(8)
                titlename = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/div[1]/div')
                writername = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/article/div[2]/div[1]/div')
                article = driver.find_element(By.XPATH, '//*[@id="__next"]/div/main/div[2]/article/div[2]/section')
                content = titlename.text + '\n' + writername.text + '\n' + article.text + newlink
                time.sleep(1)
                with open('output_dir/' + self.file + '/%s_%s.txt' % (date, newlink2), 'w', encoding='utf-8', errors='ignore') as j:
                    j.write(content)
            except Exception as e:
                title2 = title + '\n' + newlink
                #print(url, e)
                with open('output_dir/' + self.file + '/%s_%s.txt' % (date, newlink2), 'w', encoding='utf-8', errors='ignore') as j:
                    j.write(title2)
                errorlog.append([url, e])

    def main(self):
        for line in self.lines[self.num1:self.num2]:
            linenew = line.split('\t')
            self.title = linenew[0].replace('/', '-')
            self.url = linenew[1]
            self.category = linenew[2]
            self.date = linenew[3].replace('\n', '').replace('/', '-')
            self.get_article()
# Run the crawler in multiple threads
def muti_crawling(path, n1, n2, file):
    option = webdriver.ChromeOptions()
    option.add_argument(path)  # pass the full "user-data-dir=..." argument, matching the single-driver setup above
    driver = webdriver.Chrome(options=option)
    a = crawling(driver, n1, n2, file)
    a.main()

if __name__ == '__main__':
    # the first two numbers are the start/end positions in the link list, the last string is the year
    t1 = Thread(target=muti_crawling, args=(r'user-data-dir=path_to_profile/tmp', 0, 45945, '1998'))
    t2 = Thread(target=muti_crawling, args=(r'user-data-dir=path_to_profile/tmp2', 45945, 94758, '1999'))
    # add or adjust threads as needed
    # articles with no text and 404 pages cannot be scraped
    # each thread saves into its own folder
    t1.start()
    t2.start()
# Locate split points in the link list (e.g., where a new year starts) to use as start/end indices for the threads above
li = []
lines = open("path_to_article_list.txt", "r", encoding="cp936").readlines()
for line in lines:
    linenew = line.split('\t')
    date = linenew[3].replace('\n', '').replace('/', '-')
    li.append(date)
index = li.index('2004-01-01')
index

That's all.
