Python crawler: using selenium + bs4 to scrape 'bullish' (利好) or 'bearish' (利空) stock information from Xuangubao

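I. Preface

(1) I prefer to see the result before the details, so here is the result (screenshot):

(2) The information is scraped from Xuangubao, https://xuangubao.cn/ (I set the crawler to load 20 pages; only a few entries are shown below):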

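(3) This project mainly uses the following Python tools:

1. Regular expressions;

Python 3 regular expressions: http://www.runoob.com/python3/python3-reg-expressions.html

2. Selenium to simulate browser behavior;

3. BeautifulSoup4 to parse the page:

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html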
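
As a quick illustration of the regular-expression part, here is a minimal sketch that pulls an HH:MM time out of a longer string, using the same pattern that appears in the extraction code in this post. The helper name extract_time and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the extraction step of this post:
# one or two digits, a colon, then one or two digits.
TIME_PATTERN = re.compile(r'[0-9]{1,2}:[0-9]{1,2}')


def extract_time(text):
    """Return the first HH:MM-like substring in text, or None if there is none."""
    match = TIME_PATTERN.search(text)
    return match.group() if match else None


# Hypothetical sample strings, not taken from the real site.
print(extract_time('  10:35 today '))  # 10:35
print(extract_time('no time here'))    # None
```

Checking for None before calling group() avoids an AttributeError when a tag's text contains no time at all.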

(4) Runtime environment and tools:

1》Python 3.6

2》Selenium 3.12.0

3》BeautifulSoup 4.6

4》pip 10.0

5》JetBrains PyCharm Community Edition 2017.3.4 x64

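II. Hands-on

(1) Import the libraries. Tutorials for installing and configuring them are easy to find online (honestly, I no longer remember how I got them installed; it took a lot of searching on Baidu).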

from bs4 import BeautifulSoup
import re
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

(2) Fetching the page source;

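def gethtml(url):
    options = Options()
    options.add_argument('-headless')  # run Firefox without a visible window
    driver = Firefox(options=options)  # Firefox browser; pass the options so headless mode takes effect
    driver.implicitly_wait(10)  # wait up to 10 seconds for the button to appear
    driver.get(url)
    for _ in range(20):  # dynamically load 20 pages of data
        # simulate a mouse click on "Click to load more"
        driver.find_element_by_xpath("//span[@class='home-news-footer-loadmore']").click()
    html = driver.page_source  # get the page source
    driver.quit()  # close the browser
    return html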

This simulates clicking "Click to load more", and the number of clicks is set to 20. O(∩_∩)O haha~
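
One fragile point is that the button may not be there yet, or may disappear once there is nothing left to load, in which case find_element_by_xpath raises an exception. Below is a minimal sketch of a more defensive click loop. It assumes the Selenium 3 API used in this post (find_elements_by_xpath was removed in later Selenium 4 releases); the helper name click_load_more and its pause parameter are my own and not part of Selenium.

```python
import time

LOAD_MORE_XPATH = "//span[@class='home-news-footer-loadmore']"


def click_load_more(driver, times=20, pause=1.0, xpath=LOAD_MORE_XPATH):
    """Click the "load more" button up to `times` times.

    Returns the number of clicks actually performed. `driver` is any object
    with a Selenium 3 style find_elements_by_xpath method.
    """
    clicks = 0
    for _ in range(times):
        # find_elements_* returns an empty list instead of raising
        # when nothing matches the XPath.
        buttons = driver.find_elements_by_xpath(xpath)
        if not buttons:
            break  # button not found: stop early
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the page time to load the next batch
    return clicks
```

Inside gethtml, the for loop could then be replaced by a single call such as click_load_more(driver, times=20).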

(3) Extracting the information;

def getinfor(html_str, str_type, str_char):
    soup = BeautifulSoup(html_str, 'html.parser')
    # find the tags that directly hold the 'bullish' or 'bearish' label
    bu = soup.find_all(class_=str_type)
    for date in bu:
        bu_name = date.parent.parent.find_all(class_="stock-group-item")
        # continue only if the block holding the label lists any stocks
        if bu_name:
            print()
            date_ = date.parent.parent.parent.parent.parent  # the enclosing <li>
            date_month = date_.find(class_="news-item-timeline-date-month")  # month
            print(date_month.string, end='/')
            date_day = date_.find(class_="news-item-timeline-date-day")  # day
            print(date_day.string, end='日/')  # 日 means "day"
            date_time = date_.find(class_=re.compile("news-item-timeline-time .*")).get_text()  # time
            date_time_ = re.compile(r'[0-9]{1,2}:[0-9]{1,2}').search(date_time)
            print(date_time_.group(), end='/')
            print(str_char, end=' ')
            for a in bu_name:
                stock_name = a.find(class_="stock-group-item-name")  # stock name
                print(stock_name.string, end='[')
                stock_rate = a.find(class_="stock-group-item-rate")  # price change
                print(stock_rate.string, end='] ')
            print()

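Explanation:

1》First locate the 'bullish' (利好) or 'bearish' (利空) label through the class attribute of its <span> tag: class="bullish-and-bear bullish" (for bearish, class="bullish-and-bear bear").

2》Continue only if the news item lists stocks (for example '焦作万方'), because some items have none.

3》date_=date.parent.parent.parent.parent.parent walks up to the enclosing <li>; inside it, the find() method can locate the tags that hold the required information.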
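
To make the three steps above concrete, here is a minimal sketch that runs on a small hand-written HTML fragment. The fragment only imitates the class names used in this post; the real page's nesting may differ, and the stock name and figure are invented. It also uses find_parent('li') in place of a chain of five .parent calls and searches for stocks inside that <li>. That is my substitution, not the approach in the code above; it keeps working if the nesting depth changes.

```python
from bs4 import BeautifulSoup

# Hand-written sample; only the class names mirror the ones used in this post.
SAMPLE_HTML = """
<ul>
  <li>
    <div class="news-item-timeline-time"> 10:35 </div>
    <div class="news-item">
      <div><span class="bullish-and-bear bullish">利好</span></div>
      <div class="stock-group-item">
        <span class="stock-group-item-name">DemoCorp</span>
        <span class="stock-group-item-rate">+1.23%</span>
      </div>
    </div>
  </li>
  <li>
    <div class="news-item">
      <div><span class="bullish-and-bear bullish">利好</span></div>
    </div>
  </li>
</ul>
"""


def extract_stocks(html, label_class):
    """Return (name, rate) pairs for news items carrying the given label class."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # Step 1: locate the label by the exact value of its class attribute.
    for label in soup.find_all(class_=label_class):
        # Step 3: climb to the enclosing <li>, whatever the nesting depth.
        item = label.find_parent('li')
        if item is None:
            continue
        # Step 2: items without any stock block simply contribute nothing.
        for stock in item.find_all(class_='stock-group-item'):
            name = stock.find(class_='stock-group-item-name').get_text(strip=True)
            rate = stock.find(class_='stock-group-item-rate').get_text(strip=True)
            results.append((name, rate))
    return results


print(extract_stocks(SAMPLE_HTML, 'bullish-and-bear bullish'))
# [('DemoCorp', '+1.23%')]
```

Returning a list instead of printing directly makes the result easier to check and reuse.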

(4) Calling the main function;

def main1():
    stock_list_url = 'https://xuangubao.cn'
    html = gethtml(stock_list_url)  # page source after loading more items
    getinfor(html, 'bullish-and-bear bullish', ' 利好：')  # bullish
    getinfor(html, 'bullish-and-bear bear', ' 利空：')  # bearish

main1()

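III. Summary

(1) I gained a basic understanding of Python crawlers.

(2) I got to know the relevant libraries, especially while installing them; it really is not as simple as running pip install ***.

(3) I started using PyCharm and learned some of its basic features.

(4) This was my first crawler, written mainly at my teacher's request [dark face]. There is still a lot to learn; it just scrapes some information without a specific goal. Everyone is welcome to exchange ideas, and please point out any problems you find.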
