Scraping Tweets Directly from Twitter's Search Page – Part 2
Published January 11, 2015
In the previous post, we covered the theory of how we can search and extract tweets from Twitter without having to use their API.
First, let’s have a quick recap of what we learned in the previous post. We have a URL that we can use to search Twitter with:
https://twitter.com/i/search/timeline
This includes the following parameters:
Key | Value
---|---
q | URL encoded query string
f | Type of query (omit for top results, or realtime for all)
scroll_cursor | Allows pagination through results; if omitted, the first page is returned
We also know that Twitter returns the following JSON response:
{
    has_more_items: boolean,
    items_html: "...",
    is_scrolling_request: boolean,
    is_refresh_request: boolean,
    scroll_cursor: "...",
    refresh_cursor: "...",
    focused_refresh_interval: int
}
Finally, we know that we can extract the following information for each tweet:
Selector | Value
---|---
div.original-tweet[data-tweet-id] | The unique ID of the tweet
div.original-tweet[data-name] | The name of the author
div.original-tweet[data-user-id] | The user ID of the author
span._timestamp[data-time] | Timestamp of the post
span._timestamp[data-time-ms] | Timestamp of the post in ms
p.tweet-text | Text of Tweet
span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Retweets
span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Favourites
Ok, recap done, let's consider some pseudo code to get us started. As the example is going to be in Java, the pseudo code will take on Java syntax.
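Something along these lines, sketched in Java-style pseudo code (the method names — constructURL, executeSearch, saveTweets — are ones we will define ourselves shortly):

public void searchTwitter(String query, long rateDelay) {
    URL url = constructURL(query, null);
    TwitterResponse response;
    String scrollCursor = null;
    boolean continueSearch = true;
    while (continueSearch
            && (response = executeSearch(url)) != null
            && response.isHasMoreItems()
            && !response.getScrollCursor().equals(scrollCursor)) {
        // extract the tweets from items_html and save them
        continueSearch = saveTweets(response.getTweets());
        // build the next URL from the new scroll cursor
        scrollCursor = response.getScrollCursor();
        url = constructURL(query, scrollCursor);
        // pause so we aren't bombarding Twitter
        sleep(rateDelay);
    }
}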
Firstly, we define a function called searchTwitter, which takes a query string and a specified time to pause the thread between calls. We pass the query to a function that creates our search URL. Then, in a while loop, we execute the search to return a TwitterResponse object that represents the JSON Twitter returns. Provided the response is not null, it has more items, and we are not repeating the scroll cursor, we proceed to extract tweets from the items html, save them, and create our next search URL. We finally sleep the thread for however long we choose with rateDelay, so we are not bombarding Twitter with a stupid amount of requests that could be viewed as a very crap DDoS.
Now that we’ve got an idea of what algorithm we’re going to use, let’s start coding.
I'm going to use Gradle as the build system, as we are going to use some additional dependencies to make things easier. You can download and set it up on your machine if you want, but I've also added a Gradle wrapper (gradlew) to the repository so you can run it without downloading Gradle. All you'll need is to make sure that your JAVA_HOME environment variable is set up and pointing to wherever Java is located.
Let's take a look at the Gradle file.
apply plugin: 'java'
sourceCompatibility = 1.7
version = '1.0'
repositories {
    mavenCentral()
}

dependencies {
    compile 'org.apache.httpcomponents:httpclient:4.3.6'
    compile 'com.google.code.gson:gson:2.3'
    compile 'org.jsoup:jsoup:1.7.3'
    compile 'log4j:log4j:1.2.17'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}
As this is a Java project, we've applied the java plugin. This will generate the standard directory structure that we get with Gradle and Maven projects: src/main/java and src/test/java.
In addition, there are several dependencies I've included to help make the task a little easier. HTTPClient provides libraries that make it easier to construct URIs, GSON is a useful JSON processing library that will allow us to convert the response from Twitter into a Java object, and JSoup is an HTML parsing library that we can use to extract what we need from the items_html value that Twitter returns to us. Finally, I've included JUnit, although I won't go into unit testing with this example.
Let's start writing our code. Again, if you're not familiar with Gradle, the root for your packages should be in src/main/java. If the folders are not already there, you can generate them, although feel free to look at the example code if you're still unclear. We'll begin with a couple of bean classes.
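First, a simple bean to hold the data for a single tweet. The fields below mirror the selector table from the recap; the class and accessor names are my own choice, so treat this as a sketch rather than the definitive class:

public class Tweet {

    private String tweetId;
    private String userId;
    private String userName;
    private String text;
    private long timestamp; // milliseconds since the epoch
    private int retweets;
    private int favourites;

    public String getTweetId() { return tweetId; }
    public void setTweetId(String tweetId) { this.tweetId = tweetId; }
    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
    public String getUserName() { return userName; }
    public void setUserName(String userName) { this.userName = userName; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public int getRetweets() { return retweets; }
    public void setRetweets(int retweets) { this.retweets = retweets; }
    public int getFavourites() { return favourites; }
    public void setFavourites(int favourites) { this.favourites = favourites; }
}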
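Next, a bean representing the JSON response from the recap. One simple approach — and the one sketched here — is to name the fields exactly after the JSON keys, so Gson can populate them with no extra configuration (you could equally use camelCase fields with @SerializedName):

import java.util.ArrayList;
import java.util.List;

public class TwitterResponse {

    // Field names match the JSON keys exactly so Gson can map them directly
    private boolean has_more_items;
    private String items_html;
    private boolean is_scrolling_request;
    private boolean is_refresh_request;
    private String scroll_cursor;
    private String refresh_cursor;
    private long focused_refresh_interval;

    public boolean isHasMoreItems() { return has_more_items; }
    public String getItemsHtml() { return items_html; }
    public String getScrollCursor() { return scroll_cursor; }
    public String getRefreshCursor() { return refresh_cursor; }

    // Placeholder for now; we implement this properly later on
    public List<Tweet> getTweets() {
        return new ArrayList<Tweet>();
    }
}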
You’ll notice the additional method getTweets() in TwitterResponse. For now, just return an empty ArrayList, but we will revisit this later.
In addition to these bean classes, we also want to consider the edge cases where someone searches for an empty or null string, or where the query contains characters not allowed in a URL. To handle this, we will also create a small Exception class called InvalidQueryException.
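A minimal version might look like this:

public class InvalidQueryException extends Exception {

    public InvalidQueryException(String query) {
        super("Invalid query: " + query);
    }
}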
Next, we need to create a TwitterSearch class and its basic structure. An important thing to consider here is that we are interested in making the code reusable, so in the example I have made this class abstract, with an abstract method called saveTweets. The nice thing about this is that it decouples the saving logic from the extraction logic. In other words, it will allow you to implement your own save solution without having to rewrite any of the TwitterSearch code. Additionally, you might note that I've specified that the saveTweets method returns a boolean. This allows anyone extending this class to provide their own exit condition, for example once a certain number of tweets have been extracted. By returning false, we can indicate in our code to stop extracting tweets from Twitter.
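Here's the basic shape of the class — the non-abstract method bodies are stubs that we'll fill in below:

import java.net.URL;
import java.util.List;

public abstract class TwitterSearch {

    // Return false to signal that we should stop asking Twitter for more pages
    public abstract boolean saveTweets(List<Tweet> tweets);

    public void searchTwitter(String query, long rateDelay) throws InvalidQueryException {
        // filled in below
    }

    protected URL constructURL(String query, String scrollCursor) throws InvalidQueryException {
        return null; // filled in below
    }

    protected TwitterResponse executeSearch(URL url) {
        return null; // filled in below
    }
}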
Finally, let's also create a TwitterSearchImpl. This will contain a small implementation of TwitterSearch so we can test our code as we go along.
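A sketch of such an implementation — the "hello world" query and two-second delay in main are arbitrary choices for testing:

import java.util.Date;
import java.util.List;

public class TwitterSearchImpl extends TwitterSearch {

    private int counter = 0;

    @Override
    public boolean saveTweets(List<Tweet> tweets) {
        if (tweets != null) {
            for (Tweet tweet : tweets) {
                System.out.println(new Date(tweet.getTimestamp()) + ": " + tweet.getText());
                // Returning false stops the search once we hit 500 tweets
                if (++counter >= 500) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws InvalidQueryException {
        new TwitterSearchImpl().searchTwitter("hello world", 2000);
    }
}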
All this implementation does is print out each tweet's date and text, collecting up to a maximum of 500 tweets, at which point the program terminates.
Now that we have the skeleton of our project set up, let's start implementing some of the functionality. Considering our pseudo code from earlier, let's start with TwitterSearch:
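Here's one way to write it, following the pseudo code step for step:

public void searchTwitter(String query, long rateDelay) throws InvalidQueryException {
    TwitterResponse response;
    String scrollCursor = null;
    URL url = constructURL(query, scrollCursor);
    boolean continueSearch = true;

    while (continueSearch
            && (response = executeSearch(url)) != null
            && response.isHasMoreItems()
            && !response.getScrollCursor().equals(scrollCursor)) {
        continueSearch = saveTweets(response.getTweets());
        // Remember the cursor so we can tell when Twitter stops advancing
        scrollCursor = response.getScrollCursor();
        url = constructURL(query, scrollCursor);
        try {
            // Pause between calls so we don't hammer Twitter with requests
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}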
As you can probably tell, that is pretty much most of our main pseudo code implemented. Running it will have no effect, as we haven’t implemented any of the actual steps yet, but it is a good start.
Let's implement some of our other methods, starting with constructURL.
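A sketch using HTTPClient's URIBuilder. The constant names are my own, and I've opted for f=realtime here so we get everything rather than just top results (extra imports shown as comments):

// import java.net.MalformedURLException;
// import java.net.URISyntaxException;
// import org.apache.http.client.utils.URIBuilder;

public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
public final static String QUERY_PARAM = "q";
public final static String TYPE_PARAM = "f";
public final static String SCROLL_CURSOR_PARAM = "scroll_cursor";

protected URL constructURL(final String query, final String scrollCursor) throws InvalidQueryException {
    if (query == null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "realtime");
        // The first request has no cursor; only add it for subsequent pages
        if (scrollCursor != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, scrollCursor);
        }
        return uriBuilder.build().toURL();
    } catch (MalformedURLException | URISyntaxException e) {
        // Both can only be caused by a bad query string
        throw new InvalidQueryException(query);
    }
}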
First, we check whether the query is valid. If not, we throw that InvalidQueryException from earlier. Additionally, we may encounter a MalformedURLException or URISyntaxException, both caused by an invalid query string, so when caught we throw a new InvalidQueryException. Next, using a URIBuilder, we build our URL using some constants we specify as variables, along with the query and scroll_cursor value we pass in. With our initial query we will have a null scroll cursor, so we also check for that. Finally, we build the URI and return it as a URL, so we can use it to open up an InputStream later on.
Let's implement our executeSearch function. This is where we actually call Twitter and parse its response.
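A sketch of executeSearch — returning null on failure is exactly what our loop condition expects (extra imports shown as comments):

// import java.io.IOException;
// import java.io.InputStreamReader;
// import java.net.URLConnection;
// import com.google.gson.Gson;
// import com.google.gson.stream.JsonReader;

protected TwitterResponse executeSearch(final URL url) {
    try {
        URLConnection connection = url.openConnection();
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            // Gson maps the JSON fields straight onto our TwitterResponse bean
            return gson.fromJson(reader, TwitterResponse.class);
        }
    } catch (IOException e) {
        // No valid response from Twitter; null ends the search loop
        return null;
    }
}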
This is a fairly simple method. All we're doing is opening up a URLConnection for our Twitter query, then parsing that response using Gson as a TwitterResponse object, deserializing the JSON into a Java object that we can use. As we've already implemented the logic for the scroll cursor, if we were to run this now, rather than the program terminating after a few seconds, it would keep running until there is no longer a valid response from Twitter. However, we haven't quite finished yet, as we have yet to extract any information from the tweets.
The TwitterResponse object is currently holding all the Twitter data in its items_html variable, so what we now need to do is go back to TwitterResponse and add in some code that lets us extract that data. If you remember from earlier, we added a getTweets() method to the TwitterResponse object, but it's returning an empty list. We're going to fully implement that method so that when called, it builds up a list of tweets from the response's items_html.
To do this, we are going to be using JSoup, and we can even refer to some of those CSS queries that we noted earlier.
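Here's a sketch of the full getTweets() in TwitterResponse, built around the selectors from the recap table. Note that JSoup's attr() returns an empty string rather than throwing when an attribute is absent, so the explicit catch blocks in this sketch guard the numeric parsing (extra imports shown as comments):

// import org.jsoup.Jsoup;
// import org.jsoup.nodes.Document;
// import org.jsoup.nodes.Element;

public List<Tweet> getTweets() {
    final List<Tweet> tweets = new ArrayList<Tweet>();
    // Parse the HTML fragment Twitter returned into a queryable document
    Document doc = Jsoup.parse(items_html);
    // Each tweet in the stream sits inside its own li element
    for (Element li : doc.select("li")) {
        Element tweetDiv = li.select("div.original-tweet").first();
        if (tweetDiv == null) {
            continue;
        }
        // The tweet ID is the one value we insist on; skip the tweet without it
        String tweetId = tweetDiv.attr("data-tweet-id");
        if (tweetId.isEmpty()) {
            continue;
        }
        Tweet tweet = new Tweet();
        tweet.setTweetId(tweetId);
        tweet.setUserId(tweetDiv.attr("data-user-id"));
        tweet.setUserName(tweetDiv.attr("data-name"));
        tweet.setText(li.select("p.tweet-text").text());
        // Guard each numeric field individually, so one missing value
        // doesn't cost us the rest of the tweet
        try {
            tweet.setTimestamp(Long.parseLong(
                    li.select("span._timestamp").attr("data-time-ms")));
        } catch (NumberFormatException e) {
            // no timestamp; leave it at 0
        }
        try {
            tweet.setRetweets(Integer.parseInt(li.select(
                    "span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
                    .attr("data-tweet-stat-count")));
        } catch (NumberFormatException e) {
            // no retweet count; leave it at 0
        }
        try {
            tweet.setFavourites(Integer.parseInt(li.select(
                    "span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
                    .attr("data-tweet-stat-count")));
        } catch (NumberFormatException e) {
            // no favourite count; leave it at 0
        }
        tweets.add(tweet);
    }
    return tweets;
}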
Let's discuss what we're doing here. First, we create a JSoup document from the items_html variable. This allows us to select elements within the document using CSS selectors. Next, we go through each of the li elements that represent each tweet, and extract all the information we are interested in. As you can see, there are a number of catch statements in here, as we want to check against edge cases where particular data items might not be there (e.g. a user's real name), while at the same time not using an all-encompassing catch statement that would skip tweets just because they are missing a single piece of information. The only value that we require in order to save the tweet is the tweetId, as this allows us to fully extract information about the tweet later on if we so want. Obviously, you can modify this section to your heart's content to meet your own rules.
Finally, let's re-run our program. This is the final time, and you should now see tweets being extracted and printed out. That's it. Job done, finished!
Obviously, there are many ways this code can be improved. For example, a more generic error checking methodology could be implemented to check against missing attributes (or you could just use Groovy and its ?. operator). You could implement Runnable in the TwitterSearch class to allow multiple calls to Twitter with a ThreadPool (although, I stress, respect rate limits). You could even change TwitterResponse so it deserializes the tweets as a list on creation, rather than extracting them from items_html each time you access them.