Scraping the dota2 Tieba forum with Scrapy and analyzing the data

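I have long been curious which words people in the Tieba forum have used most over the years. So let's scrape the thread titles, extract the words, and see what the forum members are talking about.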

First, use Scrapy to scrape the titles of all threads in the forum.

scrapy startproject btspider

cd btspider

scrapy genspider -t basic btspiderx tieba.baidu.com

Edit the generated spider, btspiderx.py

# -*- coding: utf-8 -*-
import scrapy
from btspider.items import BtspiderItem


class BTSpider(scrapy.Spider):
    name = "btspider"
    allowed_domains = ["baidu.com"]

    # Build the listing-page URLs. The first page has no pn parameter;
    # every later page advances pn by 50.
    start_urls = []
    for x in range(91320):
        if x == 0:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8"
        else:
            url = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8&pn=" + str(x * 50)
        start_urls.append(url)

    def parse(self, response):
        # One matching div per thread on the listing page.
        for sel in response.xpath('//div[@class="col2_right j_threadlist_li_right "]'):
            item = BtspiderItem()
            item['title'] = sel.xpath('div/div/a/text()').extract()
            item['link'] = sel.xpath('div/div/a/@href').extract()
            item['time'] = sel.xpath('div/div/span[@class="threadlist_reply_date pull_right j_reply_data"]/text()').extract()
            yield item
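The URL list in the spider follows a simple rule: the first page has no pn parameter, and each later page advances pn by 50. Below is a minimal sketch of that rule as a standalone function. The name page_urls and its parameters are illustrative and not part of the project. It can be handy for checking a few pages before committing to all 91320.

```python
BASE_URL = "https://tieba.baidu.com/f?kw=dota2&ie=utf-8"


def page_urls(pages, page_size=50, base_url=BASE_URL):
    """Return the listing-page URLs for the first `pages` pages.

    Hypothetical helper that mirrors the URL-building loop in the spider.
    """
    urls = []
    for x in range(pages):
        if x == 0:
            # The first page is the bare forum URL.
            urls.append(base_url)
        else:
            # Later pages are addressed by thread offset.
            urls.append(base_url + "&pn=" + str(x * page_size))
    return urls


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
```

Assigning start_urls = page_urls(3) in the spider would limit a trial run to the first three pages.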

Edit items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BtspiderItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    time = scrapy.Field()

In practice, only the title is saved.

Edit pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs


class BtspiderPipeline(object):
    def __init__(self):
        # All titles go into a single UTF-8 text file named "info".
        self.file = codecs.open('info', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        titlex = dict(item)["title"]
        # extract() returns a list; write a line only when a title was found.
        if len(titlex) != 0:
            title = titlex[0]
            #linkx = dict(item)["link"]
            #if len(linkx) != 0:
            #    link = 'http://tieba.baidu.com' + linkx[0]
            #timex = dict(item)["time"]
            #if len(timex) != 0:
            #    time = timex[0].strip()
            line = title + '\n' #+ link + '\n' + time + '\n'
            self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on pipelines when the spider finishes.
        self.file.close()
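Because extract() returns a list that is empty when the XPath matches nothing, the pipeline has to guard against missing titles. The sketch below isolates that guard in a small function. The name title_line is hypothetical, and stripping surrounding whitespace is an added assumption that the original pipeline does not make.

```python
def title_line(values):
    """Turn the list returned by extract() into one output line, or None.

    Hypothetical helper, not part of the original pipeline.
    """
    if not values:
        # Nothing matched the XPath for this thread.
        return None
    title = values[0].strip()  # assumption: surrounding whitespace is noise
    if not title:
        return None
    return title + "\n"


if __name__ == "__main__":
    print(repr(title_line(["  some thread title  "])))
    print(repr(title_line([])))
```

Inside process_item, the file write would then happen only when title_line returns a string.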

Edit settings.py

BOT_NAME = 'btspider'
SPIDER_MODULES = ['btspider.spiders']
NEWSPIDER_MODULE = 'btspider.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'btspider.pipelines.BtspiderPipeline': 300,
}
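With tens of thousands of listing pages queued, it may be worth throttling the crawl. The fragment below is a sketch using standard Scrapy settings. The specific values are assumptions chosen for illustration, not taken from the original project.

```python
# settings.py (optional additions); the values are illustrative assumptions
DOWNLOAD_DELAY = 1                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adjust the delay to server latency
```

Lower concurrency makes the crawl slower but gentler on the site.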

Start the crawler

scrapy crawl btspider

All the titles are saved to a file named info.

Once the crawl has finished, we can analyze the contents of the info file.

There is an example on GitHub that works after a few tweaks.

git clone https://github.com/FantasRu/WordCloud.git

Modify main.py as follows:

# coding: utf-8
import codecs
from os import path
import numpy as np
# import matplotlib.pyplot as plt
# matplotlib.use('qt4agg')
from wordcloud import WordCloud, STOPWORDS
import jieba


class WordCloud_CN:
    '''
    use package wordcloud and jieba
    generating wordcloud for chinese character
    '''

    def __init__(self, stopwords_file, text_file):
        self.stopwords_file = stopwords_file
        self.text_file = text_file

    @property
    def get_stopwords(self):
        # Read the stopword list as unicode, one word per line, skipping blanks.
        self.stopwords = {}
        with codecs.open(self.stopwords_file, 'r', encoding='utf-8') as f:
            for line in f:
                word = line.strip()
                if word:
                    self.stopwords[word] = 1
        return self.stopwords

    @property
    def seg_text(self):
        with codecs.open(self.text_file, 'r', encoding='utf-8') as f:
            text = u' '.join(f.readlines())
        # Load the stopwords once instead of re-reading the file for every token.
        stopwords = self.get_stopwords
        seg_generator = jieba.cut(text)
        self.seg_list = [i for i in seg_generator if i not in stopwords]
        self.seg_list = [i for i in self.seg_list if i != u' ']
        self.seg_list = u' '.join(self.seg_list)
        return self.seg_list

    def show(self):
        # wordcloud = WordCloud(max_font_size=40, relative_scaling=.5)
        wordcloud = WordCloud(font_path=u'./static/simheittf/simhei.ttf',
                              background_color="black", margin=5,
                              width=1800, height=800)
        wordcloud = wordcloud.generate(self.seg_text)
        # plt.figure()
        # plt.imshow(wordcloud)
        # plt.axis("off")
        # plt.show()
        wordcloud.to_file("./demo/" + self.text_file.split('/')[-1] + '.jpg')


if __name__ == '__main__':
    stopwords_file = u'./static/stopwords.txt'
    text_file = u'./demo/info'
    generater = WordCloud_CN(stopwords_file, text_file)
    generater.show()
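To see the most frequent words as plain numbers rather than as an image, the segmented tokens can also be counted directly. This is a minimal sketch under stated assumptions: top_words is a hypothetical helper, tokens stands in for the output of jieba.cut, and the sample tokens are made up for illustration.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most common tokens, skipping stopwords and blank tokens.

    Hypothetical helper; `tokens` can be any iterable of strings.
    """
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sample; in the real script the tokens would come from jieba.cut(text).
    sample = ["dota2", " ", "的", "比赛", "dota2", "的", "更新", "比赛", "dota2"]
    print(top_words(sample, stopwords={"的"}, n=2))
```

The word cloud sizes words by the same kind of frequency, so this list is a quick way to sanity-check the image.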

Then run the analysis

python main.py

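Because the data set is fairly large, the analysis takes a long time. You can run it in the background on a cheap single-core cloud host and simply wait for the result.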

Below are the word clouds I generated for two popular game forums on Tieba.
