Scraping Tweets Directly from Twitter's Search – Update

Published August 1, 2015

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is something that Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script.

The first change is tiny. Originally, to get all tweets rather than just “top tweets”, we set the type parameter “f” to “realtime”. However, this value has now changed to simply “tweets”.
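For example, a search request against the timeline endpoint (the URL used later in constructURL; the query value here is illustrative) changes like this:

```
Old: https://twitter.com/i/search/timeline?q=example&f=realtime
New: https://twitter.com/i/search/timeline?q=example&f=tweets
```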

The second change is a bit trickier to counter, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we see a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike scroll_cursor, which existed in the response to be extracted, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (the second number set), we only need to extract it once, as we can keep it the same for all subsequent calls, just as Twitter does.

For the oldest tweet (the first number set), we need to extract the ID of the last tweet in our results on each call, to update our max_position value.
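Putting those two IDs together is just string concatenation. A minimal sketch (the helper name and class are mine, not part of the original scraper; the IDs are the ones from the example request above):

```java
public class MaxPositionExample {

    // Builds the value Twitter's infinite scroll expects:
    // "TWEET-{oldestTweetId}-{newestTweetId}"
    static String buildMaxPosition(String oldestTweetId, String newestTweetId) {
        return "TWEET-" + oldestTweetId + "-" + newestTweetId;
    }

    public static void main(String[] args) {
        // IDs taken from the example AJAX call above
        System.out.println(buildMaxPosition("399159003478908931", "606844263347945472"));
        // prints TWEET-399159003478908931-606844263347945472
    }
}
```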

So, let's take a look at some of the code I've changed:

String minTweet = null;
while ((response = executeSearch(url)) != null && continueSearch && !response.getTweets().isEmpty()) {
    if (minTweet == null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    String maxTweet = response.getTweets().get(response.getTweets().size() - 1).getId();
    if (!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}

...

public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";

public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if (query == null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch (MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have a value to begin with. On our first call, if minTweet is still null, we take the first tweet in the response and store its ID as minTweet.

Next, we need to get the maxTweet. As mentioned above, we get this by taking the last tweet in our results and returning its ID. To avoid repeating results, we make sure that minTweet does not equal maxTweet; if they differ, we construct our “max_position” value in the format “TWEET-{maxTweetId}-{minTweetId}”.

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

In constructURL, the TYPE_PARAM value has also been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the parameters that are returned by the JSON file.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;
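As a rough sketch of the result (assuming conventional getters; your constructor signature and setters may differ, and the field names simply mirror the keys in Twitter's JSON response):

```java
public class TwitterResponse {

    // Field names mirror the keys in Twitter's JSON response
    private boolean has_more_items;
    private String items_html;
    private String min_position;
    private String refresh_cursor;
    private long focused_refresh_interval;

    public TwitterResponse(boolean has_more_items, String items_html, String min_position,
                           String refresh_cursor, long focused_refresh_interval) {
        this.has_more_items = has_more_items;
        this.items_html = items_html;
        this.min_position = min_position;
        this.refresh_cursor = refresh_cursor;
        this.focused_refresh_interval = focused_refresh_interval;
    }

    public boolean hasMoreItems() { return has_more_items; }
    public String getItemsHtml() { return items_html; }
    public String getMinPosition() { return min_position; }
    public String getRefreshCursor() { return refresh_cursor; }
    public long getFocusedRefreshInterval() { return focused_refresh_interval; }
}
```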
