Scraping Stock Comments with a Python Crawler and Running a Simple snowNLP Analysis of Investor Sentiment


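1. Background

Retail investors make up a large group of internet users. Their online sentiment reflects, to some extent, how a given stock is doing, and also how the stock market as a whole is fluctuating. As a grad student with plenty of time on my hands, I decided to use my spare time to write a small script that grabs investors' comments and analyzes the trend of their sentiment. The code will still change, because the results are not accurate yet, haha!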

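2. Data source

This project is not for commercial use. The data comes from Eastmoney (东方财富网). Because of hardware limits, I only fetched part of the comments for a single stock. I did not crawl the official posts, only comments from retail investors.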

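3. Data acquisition

Python is a good tool. This time I used selenium together with PhantomJS to scrape the page data. Of course, you still have to analyze the page's DOM structure to pull out the data you need.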
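To make the DOM-analysis step concrete, here is a minimal offline sketch in Python 3. It is not the crawler itself: it uses the standard library's xml.etree.ElementTree on a small, well-formed sample instead of selenium on the live site. The container id articlelistnew and the "link sits in the third span" layout are taken from the crawler's XPath expressions; the sample markup, the href values, and the helper name extract_post_links are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Only the id "articlelistnew" and the
# "third <span> holds the link" layout come from the crawler's XPath;
# every value below is invented for illustration.
SAMPLE_LIST_PAGE = """
<html>
  <body>
    <div id="articlelistnew">
      <div>
        <span>120</span>
        <span>3</span>
        <span><a href="/news,600000,1.html">first post</a></span>
      </div>
      <div>
        <span>read</span>
        <span>reply</span>
        <span>title</span>
      </div>
      <div>
        <span>45</span>
        <span>0</span>
        <span><a href="/news,600000,2.html">second post</a></span>
      </div>
    </div>
  </body>
</html>
"""

BASE_URL = "http://guba.eastmoney.com"


def extract_post_links(page_source):
    """Return absolute post URLs found in the third <span> of each row."""
    root = ET.fromstring(page_source.strip())
    links = []
    # Same idea as the XPath //*[@id="articlelistnew"]/div
    for row in root.findall(".//*[@id='articlelistnew']/div"):
        spans = row.findall("span")
        if len(spans) < 3:
            continue  # not a post row
        for anchor in spans[2].findall("a"):
            href = anchor.get("href")
            if href:
                links.append(BASE_URL + href)
    return links


print(extract_post_links(SAMPLE_LIST_PAGE))
```

Real pages are rarely well-formed XML and are often filled in by JavaScript, which is one reason a browser-driven tool such as selenium is the more practical choice for the live site.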

Crawler:

# Python 2, with a selenium release that still ships the PhantomJS driver
from selenium import webdriver
import time
import json
import re
# from HTMLParser import HTMLParser
from myNLP import *
# from lxml import html
# import requests

class Crawler:
    url = ''
    newurl = set()
    headers = {}
    cookies = {}

    def __init__(self, stocknum, page):
        # List page of one stock's forum, e.g. .../list,600000,5_1.html
        self.url = 'http://guba.eastmoney.com/list,' + stocknum + ',5_' + page + '.html'
        # Per-instance set, so URLs collected for earlier pages are not crawled again
        self.newurl = set()
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        # cap["phantomjs.page.settings.loadImages"] = False
        # cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = True
        self.driver = webdriver.PhantomJS(desired_capabilities=cap)

    def crawAllHtml(self, url):
        # Load the page, then wait briefly for it to render
        self.driver.get(url)
        time.sleep(2)
        # htmlData = requests.get(url).content.decode('utf-8')
        # domTree = html.fromstring(htmlData)
        # return domTree

    def getNewUrl(self, url):
        # Remember a post URL to visit later
        self.newurl.add(url)

    def filterHtmlTag(self, htmlStr):
        self.htmlStr = htmlStr
        # Regex literals restored to match what each comment describes
        # Filter CDATA first
        re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA
        re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
        re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
        re_br = re.compile(r'<br\s*?/?>')  # line breaks
        re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
        re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
        s = re_cdata.sub('', htmlStr)  # remove CDATA
        s = re_script.sub('', s)  # remove script blocks
        s = re_style.sub('', s)  # remove style blocks
        s = re_br.sub('\n', s)  # turn <br> into a newline
        s = re_h.sub('', s)  # remove HTML tags
        s = re_comment.sub('', s)  # remove HTML comments
        # Collapse runs of blank lines
        blank_line = re.compile(r'\n+')
        s = blank_line.sub('\n', s)
        return s

    def getData(self):
        comments = []
        self.crawAllHtml(self.url)
        # Each row of the list page; the third <span> holds the link to the post
        postlist = self.driver.find_elements_by_xpath('//*[@id="articlelistnew"]/div')
        for post in postlist:
            spans = post.find_elements_by_tag_name('span')
            href = spans[2].find_elements_by_tag_name('a') if len(spans) > 2 else []
            if len(href):
                self.getNewUrl(href[0].get_attribute('href'))
            # if len(post.find_elements_by_xpath('./span[3]/a/@href')):
            #     self.getNewUrl('http://guba.eastmoney.com'+post.find_elements_by_xpath('./span[3]/a/@href')[0])
        for url in self.newurl:
            self.crawAllHtml(url)
            # The post itself: publish time, body, and the author's account age
            pubtime = self.driver.find_elements_by_xpath('//*[@id="zwconttb"]/div[2]')
            post = self.driver.find_elements_by_xpath('//*[@id="zwconbody"]/div')
            age = self.driver.find_elements_by_xpath('//*[@id="zwconttbn"]/span/span[2]')
            if len(post) and len(pubtime) and len(age):
                text = self.filterHtmlTag(post[0].text)
                if len(text):
                    tmp = myNLP(text)
                    # 'content' stores the sentiment score, not the raw text
                    comments.append({'time': pubtime[0].text, 'content': tmp.prob, 'age': age[0].text})
            # Replies under the post
            commentlist = self.driver.find_elements_by_xpath('//*[@id="zwlist"]/div')
            if len(commentlist):
                for comment in commentlist:
                    pubtime = comment.find_elements_by_xpath('./div[3]/div[1]/div[2]')
                    post = comment.find_elements_by_xpath('./div[3]/div[1]/div[3]')
                    age = comment.find_elements_by_xpath('./div[3]/div[1]/div[1]/span[2]/span[2]')
                    if len(post) and len(pubtime) and len(age):
                        text = self.filterHtmlTag(post[0].text)
                        if len(text):
                            tmp = myNLP(text)
                            comments.append({'time': pubtime[0].text, 'content': tmp.prob, 'age': age[0].text})
        return json.dumps(comments)

Storage:

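This part could really be done with a database, but since this is only a trial run, I simply store part of the data in JSON files.

import io

class File:
    name = ''
    type = ''
    src = ''
    file = ''

    def __init__(self, name, type, src):
        self.name = name
        self.type = type
        self.src = src
        filename = self.src + self.name + '.' + self.type
        # The target directory (e.g. ./data/) must already exist
        self.file = io.open(filename, 'w+', encoding='utf-8')

    def inputData(self, data):
        # json.dumps returns a byte string in Python 2; io.open expects unicode
        self.file.write(data.decode('utf-8'))
        self.file.close()

    def closeFile(self):
        self.file.close()

Local server for testing: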

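This exists only so that the charts can be viewed in a browser. The page has to read the data files, and JavaScript has no permission to access local files, so a simple server is the only way to go.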
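For readers on Python 3, where SimpleHTTPServer was merged into http.server and SocketServer was renamed socketserver, a rough equivalent might look like the sketch below. The helper names start_static_server and stop_static_server, and the choice to run the server in a background thread, are assumptions for illustration and not part of the original project.

```python
import functools
import http.server
import threading


def start_static_server(directory=".", port=0):
    """Serve `directory` over HTTP on localhost in a background thread.

    port=0 asks the OS for any free port; the port actually chosen is
    available afterwards as server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


def stop_static_server(server):
    """Stop the serving loop and release the socket."""
    server.shutdown()
    server.server_close()
```

Calling start_static_server(".", 8000) from the project directory and opening http://127.0.0.1:8000/ should then serve the chart page and the JSON files. In a standalone script, keep the main thread alive (or call server.serve_forever() directly), because the background thread is a daemon and ends with the program.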

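import SimpleHTTPServer
import SocketServer

PORT = 8000

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler
httpd = SocketServer.TCPServer(("", PORT), Handler)
httpd.serve_forever()

NLP:

The snowNLP package is still more accurate when it is used on reviews of things people buy and sell. I am not an NLP specialist, so I use someone else's library directly. snowNLP lets you train your own model; when I have time I will build one for stock comments.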

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from snownlp import SnowNLP

class myNLP:
    prob = 0.5

    def __init__(self, text):
        # Sentiment score between 0 and 1; values near 1 are positive
        self.prob = SnowNLP(text).sentiments

Main script:

# -*- coding: UTF-8 -*-
'''
Created on 2017-05-17
@author: luhaiya
@id: 2016110274
@description:
'''
# http://data.eastmoney.com/stockcomment/  list of all stocks
# http://guba.eastmoney.com/list,600000,5.html  investors' posts for one stock
# http://quote.eastmoney.com/sh600000.html?stype=stock  look up one stock
from Crawler import *
from File import *
import json
import sys

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)  # Python 2: brings back setdefaultencoding, which site.py removes
    sys.setdefaultencoding(default_encoding)

def main():
    stocknum = str(600000)
    total = dict()
    for i in range(1, 10):
        page = str(i)
        crawler = Crawler(stocknum, page)
        datalist = crawler.getData()
        comments = File(stocknum + '_page_' + page, 'json', './data/')
        comments.inputData(datalist)
        data = open('./data/' + stocknum + '_page_' + page + '.json', 'r').read()
        jsonData = json.loads(data)
        for detail in jsonData:
            # Weight = account age in years; ages without '年' count as 1
            age = detail['age'].encode('utf-8')
            num = float(age.replace('年', '')) if '年' in age else 1.0
            date = detail['time'][4:14].encode('utf-8')
            if date not in total:
                total[date] = {'num': 0, 'content': 0}
            # Running sums per day: total weight and weighted sentiment
            total[date]['num'] += num
            total[date]['content'] += detail['content'] * num
    total = json.dumps(total)
    totalfile = File(stocknum, 'json', './data/')
    totalfile.inputData(total)

if __name__ == "__main__":
    main()

4. Front-end data display

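The charts use Baidu's echarts. A day's sentiment is the weighted average of the sentiment scores of all comments posted that day, and each comment's weight is positively correlated with the user's years of investing experience (the 'age' field).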
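The weighted average described here can be sketched in Python 3 as follows. The main script only stores two running sums per day (total weight and weighted sentiment); the final division presumably happens in the front-end code, which is not shown, so the function below is an assumption about that last step. The helper names, the sample records, and the '发表于 ' prefix of the time strings are also made up for illustration; the [4:14] date slice and the rule that ages without '年' count as 1 come from the main script.

```python
from collections import defaultdict


def parse_age_years(age_text):
    """Turn an age string such as '3.5年' into a float weight.

    Mirrors the main script: strings without '年' (years) count as 1.
    """
    if "年" not in age_text:
        return 1.0
    return float(age_text.replace("年", ""))


def daily_weighted_sentiment(comments):
    """Return {date: weighted mean sentiment} for a list of comment dicts.

    Each comment has 'time', 'content' (a sentiment score between 0 and 1)
    and 'age', the same keys the crawler stores. The date is taken from
    characters 4..13 of 'time', as in the main script.
    """
    weight_sum = defaultdict(float)
    weighted_score_sum = defaultdict(float)
    for comment in comments:
        weight = parse_age_years(comment["age"])
        date = comment["time"][4:14]
        weight_sum[date] += weight
        weighted_score_sum[date] += comment["content"] * weight
    return {
        date: weighted_score_sum[date] / weight_sum[date]
        for date in weight_sum
        if weight_sum[date] > 0
    }


# Made-up sample records, for illustration only.
sample = [
    {"time": "发表于 2017-05-17 09:30:00", "content": 0.9, "age": "3年"},
    {"time": "发表于 2017-05-17 10:00:00", "content": 0.1, "age": "1年"},
    {"time": "发表于 2017-05-18 11:00:00", "content": 0.5, "age": "2个月"},
]
print(daily_weighted_sentiment(sample))
```

In the sample, the positive comment from the three-year user outweighs the negative comment from the one-year user, so the first day's value lands well above the plain mean of 0.5.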

Analysis chart

body{text-align:center;}
#mainContainer{width:100%;}
#fileContainer{width:100%; text-align:center;}
#picContainer{width: 800px;height:600px;margin:0 auto;}

The folder list goes here
