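A Simple Web Crawler Primer: Scraping Baidu Stock Information (from teacher Song's MOOC videos)

This is just a record of the steps I went through on this project, along with my understanding of them.

1. The import section: bring in the libraries we will use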

import requests
import re
from bs4 import BeautifulSoup

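2. Start with a simple example to illustrate the principle

https://gupiao.baidu.com/stock/sz300059.html is a page from Baidu's stock lookup. We can analyze it by right-clicking and viewing the page source. We take this page as the one whose information we want.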

url='https://gupiao.baidu.com/stock/sz300059.html'
r=requests.get(url)

Once the page has been returned, we define the variable html to hold the entire content of the page.

r.encoding = r.apparent_encoding
html=r.text

3. Make a pot of soup with BeautifulSoup

soup=BeautifulSoup(html,'html.parser')

Let's compare the page without the soup and with the soup.

(1) Without the soup

(2) With the soup

The souped version is visibly more structured.
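As a small illustration of that difference, here is a minimal sketch. The HTML string is a hand-written, hypothetical fragment (not copied from the real Baidu page), and the variable names are my own; it only shows what a parsed soup offers compared with the raw string.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written fragment; not taken from the real page.
html = '<div class="stock-bets"><a class="bets-name">Demo (<span>000000</span>)</a></div>'

# (1) Without the soup: one flat string.
print(html)

# (2) With the soup: a parsed tree that can be indented and navigated.
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

tag = soup.find('a', attrs={'class': 'bets-name'})
name_class = tag['class']    # class is multi-valued, so bs4 returns a list
span_text = soup.span.text   # text of the first <span> in the tree
```

The raw string can only be searched as text, while the soup lets you reach a tag by name or by attribute.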

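4. Parse the source of the page (a static web page)

As the page source shows, the information is organized in blocks, so we parse it as follows:

Also note this part of the page content:

Conveniently, we can fetch the content under this class and build on it from there.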

5. Write the code

Get the name of the stock:

stockName=soup.find(attrs={'class':'bets-name'})

Run as written, this returns the entire matching tag, not just the name.

So the actual name should be obtained like this:

name=stockName.text.split()[0]

Here [0] takes the first item of the split result.
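To see why split()[0] isolates the name, here is a minimal sketch. The text below is hypothetical; it only assumes that the tag text consists of the name, some whitespace, and then the code in parentheses.

```python
# Hypothetical text, shaped like what stockName.text might return:
# the name, then whitespace, then the code in parentheses.
raw_text = '\n    DemoStock (300059)\n  '

parts = raw_text.split()   # no argument: split on any run of whitespace
name = parts[0]            # the first piece is the name

print(parts)
print(name)
```

Because split() with no argument also discards leading and trailing whitespace, no separate strip() is needed.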

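Get today's information for the stock:

Information like this is best stored directly as lists.

So we use find again,

and do the same for the corresponding actual values.

The code for this sequence of steps is as follows: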

infoDict={}
stockName=soup.find(attrs={'class':'bets-name'})
infoDict.update({'股票名称':stockName.text.split()[0]})   # the key means "stock name"
stockInfo=soup.find('div',attrs={'class':'bets-content'})
keyList=stockInfo.find_all('dt')
valueList=stockInfo.find_all('dd')
#to make it a list
for i in range(len(keyList)):
    key=keyList[i].text
    val=valueList[i].text
    infoDict[key]=val   # dictionary assignment: adds a new entry to the existing dict
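As an alternative to the index loop, the pairing of dt and dd tags can also be sketched with zip. This is a substitute technique, not the method from the course; the HTML fragment and the field names Open and Volume are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics a dt/dd layout; not the real page.
html = '''
<div class="bets-content">
  <dl><dt>Open</dt><dd>11.20</dd></dl>
  <dl><dt>Volume</dt><dd>35.6</dd></dl>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
stock_info = soup.find('div', attrs={'class': 'bets-content'})

keys = stock_info.find_all('dt')
values = stock_info.find_all('dd')

# zip pairs the i-th <dt> with the i-th <dd>; strip() drops stray whitespace.
info = {k.text.strip(): v.text.strip() for k, v in zip(keys, values)}
print(info)
```

zip stops at the shorter of the two lists, so a page with mismatched dt and dd counts would not raise an IndexError, though the pairing might then be wrong.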

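In summary, this is one way to extract data from a static web page.

The code below extracts a whole series of items. I wrote it following the teacher's instructions and debugged it until it worked; take it as a reference.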

import requests
from bs4 import BeautifulSoup
import traceback
import re


def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # fix the encoding
        return r.text
    except Exception:
        return ''


def getStockList(lst, stockURL):
    print('List information')
    html = getHTMLText(stockURL)
    # print(html)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    '''
    <li><a target="_blank" href="http://quote.eastmoney.com/sh201000.html">R003(201000)</a></li>
    '''
    for i in a:
        try:
            # store each stock's code in lst
            href = i.attrs['href']
            # print(type(href))
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0])
            # findall returns a list, e.g. ['sh012345']; [0] takes the code string out of it
            # before appending, so lst does not end up as [['sh201000'], ['sh201002']]
        except Exception:
            continue
    # print(lst)
    print('ending list')


def getStockInfo(lst, stockURL, fpath):
    count = 0  # used for the progress display
    print('start information')
    for stock in lst:
        url = stockURL + stock + '.html'
        print(url)
        html = getHTMLText(url)
        try:
            if html == '':
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            # print(soup)
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            print(stockInfo)
            # name=soup.find('div',attrs={'class':'bets-name'})
            name = soup.find('a', attrs={'class': 'bets-name'})
            print(name)
            infoDict.update({'股票名称': name.text.split()[0]})  # the key means "stock name"
            # print(infoDict)
            keyList = stockInfo.find_all('dt')
            # print(keyList)
            valueList = stockInfo.find_all('dd')
            # print(valueList)
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
        except Exception:
            count = count + 1
            print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue


if __name__ == '__main__':
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = r'E:\BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
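The stock-code extraction in getStockList depends on how re.findall behaves, so here is a small offline sketch of just that step. The first href comes from the sample link in the code comment; the other two are hypothetical, and the explicit emptiness check is my own substitute for relying on the except branch.

```python
import re

# First href is the sample from the code comment; the other two are hypothetical.
hrefs = [
    'http://quote.eastmoney.com/sh201000.html',
    'http://quote.eastmoney.com/sz300059.html',
    'http://quote.eastmoney.com/center/',   # no stock code in this one
]

pattern = re.compile(r's[hz]\d{6}')   # "sh" or "sz" followed by six digits

codes = []
for href in hrefs:
    found = pattern.findall(href)   # always a list, possibly empty
    if found:
        codes.append(found[0])      # append the string, not the list

print(codes)
```

In the full program, a link without a code makes findall return an empty list, the [0] raises IndexError, and the except branch skips that link; the check above reaches the same result without using exceptions.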
