Scraping Tweets Directly from Twitter's Search – Update

Published August 1, 2015

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script.

The first change is tiny. Originally, to get all tweets rather than just “top tweets”, we set the type parameter “f” to “realtime”. However, this value has now changed to simply “tweets”.
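For example, a search request against the timeline endpoint (the URL used later in constructURL; the query value here is illustrative) changes like this:

```
Old: https://twitter.com/i/search/timeline?q=example&f=realtime
New: https://twitter.com/i/search/timeline?q=example&f=tweets
```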

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we see a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike scroll_cursor, which existed in the response to be extracted, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (the second number set), we only need to extract it once, as we can keep it the same for all subsequent calls, just as Twitter does.

For the oldest tweet (the first number set), we need to extract the ID of the last tweet in our results on each call, to update our max_position value.
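Putting those two IDs together is just string concatenation. A minimal sketch (the helper name and class are mine, not part of the original scraper; the IDs are the ones from the example request above):

```java
public class MaxPositionExample {

    // Builds the value Twitter's infinite scroll expects:
    // "TWEET-{oldestTweetId}-{newestTweetId}"
    static String buildMaxPosition(String oldestTweetId, String newestTweetId) {
        return "TWEET-" + oldestTweetId + "-" + newestTweetId;
    }

    public static void main(String[] args) {
        // IDs taken from the example AJAX call above
        System.out.println(buildMaxPosition("399159003478908931", "606844263347945472"));
        // prints TWEET-399159003478908931-606844263347945472
    }
}
```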

So, let's take a look at some of the code I've changed:

String minTweet = null;
while ((response = executeSearch(url)) != null && continueSearch && !response.getTweets().isEmpty()) {
    if (minTweet == null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    String maxTweet = response.getTweets().get(response.getTweets().size() - 1).getId();
    if (!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}

...

public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";

public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if (query == null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch (MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have a value to begin with. On our first call, if minTweet is still null, we take the first tweet in the response and store its ID as minTweet.

Next, we need to get the maxTweet. As mentioned above, we get this by taking the last tweet in our results and returning its ID. To avoid repeating results, we make sure that minTweet does not equal maxTweet; if they differ, we construct our “max_position” value in the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

In constructURL, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;
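As a rough sketch of the result (assuming conventional getters; your constructor signature and setters may differ, and the field names simply mirror the keys in Twitter's JSON response):

```java
public class TwitterResponse {

    // Field names mirror the keys in Twitter's JSON response
    private boolean has_more_items;
    private String items_html;
    private String min_position;
    private String refresh_cursor;
    private long focused_refresh_interval;

    public TwitterResponse(boolean has_more_items, String items_html, String min_position,
                           String refresh_cursor, long focused_refresh_interval) {
        this.has_more_items = has_more_items;
        this.items_html = items_html;
        this.min_position = min_position;
        this.refresh_cursor = refresh_cursor;
        this.focused_refresh_interval = focused_refresh_interval;
    }

    public boolean hasMoreItems() { return has_more_items; }
    public String getItemsHtml() { return items_html; }
    public String getMinPosition() { return min_position; }
    public String getRefreshCursor() { return refresh_cursor; }
    public long getFocusedRefreshInterval() { return focused_refresh_interval; }
}
```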
