Paper summary: Filtering microblogging messages for Social TV; A Bootstrapping Approach to Identifying Relevant Tweets for Social TV

Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review.

Social Television is a general term for technology that supports communication and social interaction in either the context of watching television, or related to TV content.

Some of these systems allow users to read microblogging messages related to the TV program they are currently watching.

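So the problem discussed here is how to filter out the messages that are truly related to a TV show. The simplest method, and the one we have been using all along, is the following: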

Current Social TV applications search for these messages by issuing queries to social networks with the full title of the TV program. This naive approach can lead to low precision and recall.

A simple example shows why both precision and recall are low with this method.

The popular TV show House is an example that results in low precision.

House is an ambiguous word. Besides naming the TV show, it has many other uses in different contexts, such as White House, House of Representatives, building, home, etc. So searching directly for House inevitably gives low precision.

Continuing with our example for the show House, there are many messages which do not mention the title of the show but make references to users, hashtags, or even actors and characters related to the show. The problem of low recall is more severe for shows with long titles.
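As noted above, the recall problem is especially pronounced for shows with long titles: few people are willing to write the full title in a tweet, and they usually use an abbreviation instead.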

To summarize, the challenges in solving this problem are as follows:
Our task is to retrieve microblogging messages relevant to a given TV show with high precision. Filtering messages from microblogging websites poses several challenges, including:

1. Microblogging messages are short and often lack context. For instance, Twitter messages (tweets) are limited to 140 characters and often contain abbreviated expressions such as hashtags and short URLs.

2. Many social media messages lack proper grammatical structure. Also, users of social networks pay little attention to capitalization and punctuation. This makes it difficult to apply natural language processing technologies to parse the text.

3. Many social media websites offer access to their content through search APIs, but most have rate limits. In order to filter messages we first need to collect them by issuing queries to these services. For each show we require a set of queries which provides the best tradeoff between the need to cover as many messages about the show as possible, and the need to respect the API rate limits imposed by the social network. Such queries could include the title of the show and other related strings such as hashtags and usernames related to the show. Determining which keywords best describe a TV show can be a challenge.

4. In the last decade alone, television networks have aired more than a thousand new TV shows. Obtaining training data for every show would be prohibitively expensive. Furthermore, new shows are aired every six months.

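I had thought for a long time about how to solve this problem. I also considered building a classifier to decide whether a tweet is about a TV show, but I never worked out exactly how to do it. This paper proposes a method for building such a classifier.

Classification is a mature technique; the key lies in choosing the features and collecting the training set.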

We propose a bootstrapping method which is built upon 1) a small set of labeled data, 2) a large unlabeled dataset, and 3) some domain knowledge, to form a classifier that can generalize to an arbitrary number of TV shows.

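Because labeling a training set is time-consuming, only a small labeled dataset is needed here. Domain knowledge is used to choose the initial classification features, which is enough to train an initial classifier. The large unlabeled dataset is then used as a test set for the initial classifier; new features are discovered during this process, and the classifier is refined step by step into a usable improved classifier.

That is the general idea of the method. The experiments show that the improved classifier has much higher recall.

Personally, I think the value of this paper lies in its feature selection, so let's look at which features it chooses.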

Terms related to TV watching

General terms commonly associated with watching TV. These features are collected by hand and comprise the following three features:

tv_terms, general terms such as watching, episode, hdtv, netflix, etc.

network_terms, contains names of television networks such as cnn, bbc, pbs, etc.

season_episode,

Some users post messages which contain the season and episode number of the TV show they are currently watching.

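“S06E07”, “06x07” and even “6.7” are common ways of referring to the sixth season and the seventh episode of a particular TV show. So we use regular expressions to detect whether a message contains a season_episode reference.

For each of the features above, the feature is 1 when the tweet contains a matching term and 0 otherwise.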
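The three binary features above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the word lists contain only the examples quoted in this post, and the regular expression is my own guess at patterns covering “S06E07”, “06x07” and “6.7” (the last form will also match unrelated decimals such as prices).

```python
import re

# Hypothetical word lists: only the examples quoted in the post.
TV_TERMS = {"watching", "episode", "hdtv", "netflix"}
NETWORK_TERMS = {"cnn", "bbc", "pbs"}

# Assumed patterns for "S06E07", "06x07" and "6.7".
SEASON_EPISODE_RE = re.compile(
    r"\b(?:s\d{1,2}e\d{1,2}|\d{1,2}x\d{1,2}|\d{1,2}\.\d{1,2})\b",
    re.IGNORECASE,
)


def binary_term_features(tweet):
    """Return the three 0/1 features for a single tweet."""
    tokens = set(re.findall(r"[a-z0-9']+", tweet.lower()))
    return {
        "tv_terms": int(bool(tokens & TV_TERMS)),
        "network_terms": int(bool(tokens & NETWORK_TERMS)),
        "season_episode": int(bool(SEASON_EPISODE_RE.search(tweet))),
    }
```

For example, `binary_term_features("Watching House S06E07 tonight")` sets tv_terms and season_episode to 1 and network_terms to 0.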

General Positive Rules

rules_score:
The motivation behind the rules_score feature is the fact that many messages which discuss TV shows follow certain patterns.
For example:
<start> watching <show_name>
episode of <show_name>
<show_name> was awesome

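If we have a list of such rules, the feature is 1 when the tweet matches one of the rules and 0 otherwise.

The question is how to find these rules. We could of course discover them manually one by one, which would give fairly high accuracy, but it is far too inefficient.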

We developed an automated way to extract such general rules and compute their probability of occurrence.

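We start from a manually compiled list of ten unambiguous TV show titles, such as “Mythbusters”, “The Simpsons”, “Grey’s Anatomy”, etc. Unambiguous means the title has no other meaning: the word is certain to refer to a particular TV show, in contrast to an ambiguous title such as House.
We want to extract general rules from TV-related tweets, so we must make sure the tweets we collect really are about TV. A good way to do that is to collect them using unambiguous TV show titles, a method we have also used before.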

For each message which contained one of these titles, the algorithm replaced the title of TV shows, hashtags, references to episodes, etc. with general placeholders, then computed the occurrence of trigrams around the keywords.

This is the key step. Since we want general rules, we first have to mask everything that is specific to one particular show, and then count the occurrences of trigrams.
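A rough sketch of this masking-and-counting step is shown below. It is only an illustration under my own assumptions: the placeholder names other than `<show_name>` and `<start>`, the tokenizer, and the choice to keep only trigrams that contain `<show_name>` are not taken from the paper, and turning the counts into probabilities is left out.

```python
import re
from collections import Counter


def generalize(tweet, titles):
    """Mask show-specific strings with placeholders and tokenize."""
    text = tweet.lower()
    # Hashtags first, so a title inside a hashtag is not half-replaced.
    text = re.sub(r"#\w+", "<hashtag>", text)
    text = re.sub(r"\bs\d{1,2}e\d{1,2}\b", "<episode>", text)
    for title in titles:
        pattern = r"\b" + re.escape(title.lower()) + r"\b"
        text = re.sub(pattern, "<show_name>", text)
    return ["<start>"] + re.findall(r"<\w+>|[a-z0-9']+", text)


def count_rules(tweets, titles):
    """Count trigrams that contain the <show_name> placeholder."""
    counts = Counter()
    for tweet in tweets:
        tokens = generalize(tweet, titles)
        for i in range(len(tokens) - 2):
            trigram = tokens[i:i + 3]
            if "<show_name>" in trigram:
                counts[" ".join(trigram)] += 1
    return counts
```

Run on tweets collected with unambiguous titles, frequent trigrams such as `<start> watching <show_name>` rise to the top and can serve as candidate rules.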

Features related to show titles

Although many social media messages lack proper capitalization, when users do capitalize the titles of the shows this can be used as a feature.

title_case, which is set to 1 if the title of the show is capitalized, otherwise it has the value 0.

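titles_match: if any of the titles mentioned in the message is unambiguous, this feature is set to 1.

What is most valuable here is that the paper proposes a way to decide whether a title is unambiguous. We previously tried compiling our own stop-word statistics, but the results were not very good, especially for multi-word titles. The paper proposes using WordNet instead, which is a good idea.

We define an unambiguous title to be a title which has zero or one hits when searching for it in WordNet.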
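The definition above amounts to a one-line test once a sense-count lookup is available. The sketch below keeps the lookup as a parameter so that it runs without WordNet installed; `count_senses` and the stand-in table are hypothetical, and the numbers in the table are invented purely for illustration. With NLTK and its WordNet corpus installed, something like `len(wordnet.synsets(term))` could play the role of the lookup, though how multi-word titles should be queried is an open assumption.

```python
def is_unambiguous(title, count_senses):
    """True when the title has zero or one hits in the sense lookup."""
    return count_senses(title.lower()) <= 1


# Stand-in for a WordNet lookup; these counts are made up for illustration.
FAKE_SENSE_COUNTS = {"house": 12, "mythbusters": 0}


def fake_count_senses(term):
    """Hypothetical lookup: unknown terms are treated as having no hits."""
    return FAKE_SENSE_COUNTS.get(term, 0)
```

With the stand-in table, `is_unambiguous("Mythbusters", fake_count_senses)` is true and `is_unambiguous("House", fake_count_senses)` is false.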

Features based on domain knowledge crawled from online sources

One of our assumptions is that messages relevant to a show often contain names of actors, characters, or other keywords strongly related to the show.

cosine_characters, cosine_actors, and cosine_wiki: for each of these three features, we compute the cosine similarity between a new message and the information we crawled about the show (from TV.com and Wikipedia).
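As a reminder of what the cosine features compute, here is a minimal bag-of-words version. It assumes raw term counts and a simple tokenizer; the post does not say which term weighting the paper uses, so treat that choice as an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased term counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A feature such as cosine_actors would then be `cosine_similarity(bag_of_words(tweet), bag_of_words(actor_names))`, where `actor_names` is the crawled text for the show.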

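This method can greatly improve recall, but it is fairly troublesome to implement, and because of Twitter's access limits we cannot set too many terms for a single show, so we have never adopted it.

Those are the nine initial features. After running the initial classifier on the unlabeled dataset, the following additional features were discovered: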

pos_rules_score and neg_rules_score are natural extensions of the feature rules_score.

For instance, for the show House we can now learn positive rules such as episode of house, as well as negative rules such as in the house or the white house.

users_score and hashtags_score

Using messages labeled by Classifier #1, we can determine commonly occurring hashtags and users which often talk about a particular show. Furthermore, these features can also help us expand the set of queries for each show, thus improving the recall by searching for hashtags and users related to the show, in addition to the title.
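One way to picture hashtags_score is sketched below (users_score would work the same way with `@user` mentions). The scoring formula is my own assumption, since the post does not give one: each hashtag is scored by the fraction of its occurrences that Classifier #1 labeled as relevant, and a tweet takes the best score among its hashtags.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def hashtag_relevance(labeled_tweets):
    """labeled_tweets: (text, is_relevant) pairs labeled by the first classifier."""
    relevant, total = Counter(), Counter()
    for text, is_relevant in labeled_tweets:
        for tag in set(HASHTAG_RE.findall(text.lower())):
            total[tag] += 1
            if is_relevant:
                relevant[tag] += 1
    return {tag: relevant[tag] / total[tag] for tag in total}


def hashtags_score(tweet, relevance):
    """Best relevance among the tweet's hashtags, 0.0 if none are known."""
    tags = HASHTAG_RE.findall(tweet.lower())
    return max((relevance.get(tag, 0.0) for tag in tags), default=0.0)
```

The highest-scoring hashtags in the relevance table are also natural candidates for the extra search queries mentioned above.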

We had thought of this before as well but never implemented it; it can improve recall.

rush_period: this feature is based on the observation that users of social media websites often discuss a show while it is on air. When classifying a new message we check how many mentions of the show there were in the previous window of 10 minutes. If the count exceeds some threshold the feature is set to 1, otherwise 0.
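The rush_period check can be sketched as a count over a sorted list of mention timestamps. The 10-minute window comes from the text; the threshold value of 50 and the function signature are placeholders of mine, since the post does not state them.

```python
from bisect import bisect_left


def rush_period(mention_times, now, window=600, threshold=50):
    """Return 1 if the show had more than `threshold` mentions in the
    `window` seconds before `now`, else 0.

    mention_times: sorted timestamps (in seconds) of earlier mentions.
    The threshold default is a hypothetical value.
    """
    start = bisect_left(mention_times, now - window)
    end = bisect_left(mention_times, now)
    return int(end - start > threshold)
```

Binary search keeps each lookup cheap even when the list of earlier mentions is long.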

Reposted from: https://www.cnblogs.com/fxjwind/archive/2011/08/02/2125283.html
