Emotion, Event detection

  • Emotion Detection
    • Second order (derived) emotions:
    • Sentiment vs emotion:
    • Emotion detection
      • Problems in data collection
      • Hashtag-based data collection
      • Emoticons
    • Overall pipeline
  • Event Detection
    • Supplementary
    • Usefulness of event detection
    • Topic Detection Tracking (TDT)
      • Basic TDT Clustering Approach
    • Event Detection
      • Event
      • Current approaches
      • ENTITY-BASED Event detection
        • Entity
      • Evaluation of ENTITY-BASED Event detection

Emotion Detection

What triggers emotions?
A stimulus event, which can be:

  • external

    • natural phenomena
    • other people’s behavior
  • internal
    • neuroendocrine or physiological changes
    • memories

Universal emotion categories:
anger, disgust, fear, happiness, sadness and surprise.
These are also known as the basic (primary) emotions.
They can be reduced to 4 categories:

  • happiness, sadness, fear/surprise, and anger/disgust

Second order (derived) emotions:

Emotional states that are not so basic, like chagrin, irritation

Sentiment vs emotion:

  • Sentiments can be formed and retained for a longer time

  • Emotions last for a shorter time

  • Sentiments are target centred, hence directed

  • Emotions are not target centred

  • A text can have multiple emotions

Emotion detection

  • e.g. “7 dead in apartment building fire”
  • Anger 10%, disgust 1%, fear 30%, joy 0%, sadness 50%, surprise 5%

Problems in data collection

  • The ground-truth labels can be uncertain, incomplete or even wrong, owing to the annotators’ level of expertise or the difficulty of the task

Hashtag-based data collection

  • Gives direct access to the user’s intent
  • List the emotion hashtags for 28 affect words, or extend the list using WordNet synsets
  • Collect tweets that contain one or more hashtags from the defined list of emotion hashtags
  • Consider only tweets that have hashtags
  • Score each tweet based on these hashtags
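
The hashtag-based labelling above could be sketched roughly as follows. This is only an illustration: `EMOTION_HASHTAGS` is a tiny hypothetical lexicon standing in for the 28-word list, and the count-based score and the function name `label_by_hashtags` are assumptions, not the method from the original notes.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: hashtag (without '#') -> emotion label.
EMOTION_HASHTAGS = {
    "happy": "joy", "joy": "joy",
    "sad": "sadness", "heartbroken": "sadness",
    "angry": "anger", "furious": "anger",
    "scared": "fear",
}

def label_by_hashtags(tweet):
    """Return a Counter of emotion scores, or None if no emotion hashtag is present."""
    tags = [t.lower() for t in re.findall(r"#(\w+)", tweet)]
    scores = Counter(EMOTION_HASHTAGS[t] for t in tags if t in EMOTION_HASHTAGS)
    return scores or None  # tweets without emotion hashtags are discarded

print(label_by_hashtags("Lost my keys again #sad #angry #sad"))
```

Tweets for which the function returns `None` would simply be left out of the collected data set.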

Emoticons

Symbols that express mood, for example:

https://emojipedia.org/twitter/

  • Build a list of emoticons and their associations with eight emotions
  • Annotate a tweet with an emotion category if the corresponding emoticon appears at the end of the tweet
    Sometimes there are conflicts (e.g. happy + sad). In that case, compare the scores from the emotion word lexicon, the hashtag lexicon and the emoticon lexicon, and assign the tweet to the category with the highest score
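
A minimal sketch of the end-of-tweet emoticon rule with conflict resolution. The `EMOTICONS` table is hypothetical and shows only a few entries, and `lexicon_scores` is assumed to hold, for each emotion, the best score already computed from the word, hashtag and emoticon lexicons; how those scores are obtained is not specified in these notes.

```python
# Hypothetical emoticon lexicon (only a few of the eight emotions are shown).
EMOTICONS = {":)": "joy", ":D": "joy", ":(": "sadness", ":'(": "sadness", ">:(": "anger"}

def trailing_emoticons(tweet):
    """Collect emoticon tokens from the end of the tweet, stopping at the first other token."""
    found = []
    for token in reversed(tweet.split()):
        if token not in EMOTICONS:
            break
        found.append(token)
    return found

def label_by_emoticons(tweet, lexicon_scores=None):
    """Label a tweet by its trailing emoticons; on conflict, fall back to lexicon scores.

    lexicon_scores maps emotion -> score (assumed to be computed elsewhere).
    """
    emotions = {EMOTICONS[e] for e in trailing_emoticons(tweet)}
    if not emotions:
        return None                      # no emoticon at the end of the tweet
    if len(emotions) == 1:
        return next(iter(emotions))      # unambiguous label
    if not lexicon_scores:
        return None                      # conflict that cannot be resolved
    return max(emotions, key=lambda e: lexicon_scores.get(e, 0.0))

print(label_by_emoticons("mixed feelings :) :(", {"joy": 0.2, "sadness": 0.7}))  # sadness
```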

Overall pipeline

Event Detection

Supplementary

  • Tokenization, Lemmatization, Stopword removal

  • Named Entity Recognition
    i) Detect a named entity
    ii) Categorize the entity

    • Person
    • Organization
    • Time
    • Location
  • Part-of-Speech Tagging

  • Word sense disambiguation

  • Textual Entailment
    Extract a directional relation between text fragments
    Text: “If you help the needy, God will reward you” →

    • Giving money to a poor man has good consequences (entailed)
    • Giving money to a poor man has no consequences (contradicted)
    • Giving money to a poor man will make you a better person (neither entailed nor contradicted)
  • Automatic summarization
    Extract a readable summary from text (news, articles, documents…)

  • Sentiment Analysis
    Extract subjective polarity from documents: positive, negative, or neutral

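  • Vector-Space Representation of Documents
    Term-document matrix (TDM) or document-term matrix (DTM)
    Wikipedia
    https://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95

    “For the same texts, we get the following full inverted index, made up of pairs of a document number and the word's position within that document. Both document numbers and positions start from zero. So "banana": {(2, 3)} means that "banana" is in the third document (T_2), where it is the fourth word (position 3).”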

  • Inverted File / Index

An Inverted File (or Inverted Index) is a document-term matrix representation “inverted” so that rows become columns and columns become rows
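
To make the idea concrete, here is a small sketch of a full inverted index that stores zero-based (document, position) pairs. The three sample documents and the function name `build_inverted_index` are assumptions chosen for illustration, so that the entry for "banana" comes out as (2, 3).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) pairs, both zero-based."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

# Sample documents (assumed for illustration).
docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
print(index["banana"])  # [(2, 3)]
print(index["what"])    # [(0, 2), (1, 0)]
```

Looking up a term now returns only the documents that contain it, rather than requiring a scan of every document.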

  • TF-IDF
    Importance of a term to a document in a collection (corpus):

    • Term Frequency (TF): the number of times a term appears in a document.
    • Inverse Document Frequency (IDF): measures whether a term is common or rare across all documents.
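
A minimal sketch of TF-IDF weighting, assuming one common variant: raw counts for TF and log(N / df) for IDF, where N is the number of documents and df the number of documents containing the term. Other variants (smoothing, length normalisation) exist; the function name `tf_idf` and the sample texts are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw TF and log(N/df) IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

w = tf_idf(["fire in apartment building", "fire drill today", "apartment for rent"])
print(w[1])
```

A term that occurs in every document gets weight 0 under this variant, which reflects that it does not help tell documents apart.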

Usefulness of event detection

There is a delay between a new event appearing on Twitter and the same event being updated on Wikipedia

  • On average, Twitter is about 2 hours ahead of Wikipedia
  • Twitter’s real-time nature has allowed it to be used as a co-ordination tool by protestors and demonstrators
  • Event detection & stock movements
  • A tool that automatically detects, tracks and organizes these events would be valuable to journalists, finance (equities, forex, commodities, even cryptocurrencies), and security and intelligence services

Topic Detection Tracking (TDT)

  • Monitoring broadcast news
  • Newswire documents
  • Almost all systems used a version of online nearest-neighbor clustering

Basic TDT Clustering Approach

  • For each new article, compare it to every news article that has been seen before,
    using cosine similarity
  • If a similar article was found (i.e. they discuss the same event):
    • Add the new article to the same cluster as the most similar article
  • If no similar article was found (i.e. it’s a new event):
    • Create a new cluster and add the new article to it

Cosine Similarity
$$sim(A,B)=\cos(\theta)=\frac{A\cdot B}{\|A\|\,\|B\|}$$
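
The basic clustering approach and the cosine similarity formula can be put together in a short sketch. It assumes bag-of-words count vectors and a similarity threshold of 0.5; the threshold value and the names `cosine` and `online_cluster` are illustrative assumptions, not values taken from the TDT systems.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def online_cluster(articles, threshold=0.5):
    """Single-pass nearest-neighbour clustering; returns one cluster id per article."""
    seen = []      # vectors of the articles processed so far
    labels = []    # cluster id assigned to each processed article
    next_id = 0
    for text in articles:
        vec = Counter(text.lower().split())
        best_sim, best_idx = 0.0, None
        for i, other in enumerate(seen):      # compare with every earlier article
            sim = cosine(vec, other)
            if sim > best_sim:
                best_sim, best_idx = sim, i
        if best_idx is not None and best_sim >= threshold:
            labels.append(labels[best_idx])   # same event as the nearest neighbour
        else:
            labels.append(next_id)            # no similar article: new event
            next_id += 1
        seen.append(vec)
    return labels

print(online_cluster([
    "earthquake hits city centre",
    "strong earthquake hits city",
    "election results announced today",
]))  # [0, 0, 1]
```

Because each new article is compared with every earlier one, the cost grows with the number of articles seen, which is the slowness noted for this approach.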

Issues with the TDT approach

  • Clustering

    • Clustering is slow
    • Cluster quality is uneven: some clusters are poor, others are acceptable
    • Clusters keep growing
    • Results depend on the order of documents
  • Assumes all content is newsworthy

Event Detection

Event

An event is a significant thing that happens at some specific time and place.
Issues

  • Insignificant and mundane events

    • Not newsworthy
    • Fragmented events

How to solve
Entity-based approach (McMinn et al.)

  • Aggressive filtering of mundane events
  • A more structured approach to event detection
  • An event can contain several entities and topics
  • Reduces the likelihood that a single real-world event is detected as several separate events

Current approaches

  • Monitor events in real time
  • Locality Sensitive Hashing (LSH): scalable, real-time event detection, proposed by Petrovic et al.
    • Places similar documents into the same buckets of a hash table
    • Finds the nearest neighbor with high probability
    • Clustering can then be done in real time with a variance reduction technique
  • LSH can be thought of as replacing the inverted index shown in the pseudocode from earlier: it does the same job as the inverted index, but reduces the number of comparisons that need to be made much more efficiently
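
As a rough illustration of how LSH puts similar documents into the same bucket, the sketch below uses random hyperplanes, a common LSH family for cosine similarity. It is not claimed to be the exact scheme of Petrovic et al.; the class name `HyperplaneLSH`, the dense-vector input and the parameter values are all assumptions.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH for cosine similarity over dense vectors."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # Each hyperplane is a random Gaussian vector; it contributes one signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def signature(self, vec):
        """Bit tuple recording which side of each hyperplane the vector falls on."""
        return tuple(
            int(sum(p * v for p, v in zip(plane, vec)) >= 0.0) for plane in self.planes
        )

    def add(self, doc_id, vec):
        """Store doc_id in its bucket and return the earlier documents in that bucket."""
        key = self.signature(vec)
        candidates = list(self.buckets[key])
        self.buckets[key].append(doc_id)
        return candidates

lsh = HyperplaneLSH(dim=3)
lsh.add("a", [1.0, 2.0, 3.0])
print(lsh.add("b", [2.0, 4.0, 6.0]))  # ['a']: same direction, so same bucket
```

Only the candidates returned by `add` need an exact similarity comparison, instead of every document seen so far.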

ENTITY-BASED Event detection

Entity

A thing with distinct and independent existence.
An entity is any singular, identifiable and separate object, for example a particular person, organization or location.

Entities have been used before for

  • Sports event detection
  • Exploiting domain knowledge
Pipeline

  • Pre-processing

    • Remove noise and redundant tweets (filtering)
    • Filter out unwanted tweets such as advertisements
    • Parsing and tagging (Part-of-Speech (POS) tagging, Named Entity Recognition (NER))
    • Remove retweets
  • Clustering

    • The most similar tweets will always discuss the same entities
    • For each entity, the tweets that mention it are added to an inverted index, so each tweet is indexed under every named entity it contains
    • Tweets are clustered on a per-entity basis
  • Burst detection

    • Monitor entity frequency over windows of 5, 10, 20, 40, 80, 160 and 320 minutes; tweets older than 320 minutes are dropped, because “A ~6 hour old tweet isn’t much use in a breaking news situation.”
    • Use the three-sigma rule to detect positive outliers (bursts)
  • Event Creation

    • Once a burst is detected, stop updating the entity’s frequency statistics
    • Resume only when the entity’s frequency drops below the burst level (+1 and 1/2 the window length)
  • Cluster selection

    • Find clusters that represent a new topic or a change in topic around the entity, likely related to whatever caused the burst
    • Use centroid times (mean time of all documents in a cluster) to identify event clusters and add them to events
    • Filter out older clusters (likely to be noise or background topics) and smaller clusters (fewer than 10 tweets)
    • If no clusters can be found, the burst was probably caused by random noise or a background topic.
  • Event merging

    • Many events are about multiple entities, so we need to identify these links and combine the separate events into one
    • Only a 1-way relationship is needed: small events can link themselves to larger events
    • Split person names to improve effectiveness: “Barack Obama” → “barack” and “obama”
    • Check for possible merges after every tweet, and merge recursively
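
The three-sigma rule in the burst detection step can be sketched as below. It assumes that `history` holds the number of mentions of one entity in earlier windows of the same length, and it uses the population standard deviation; the function name and the sample counts are hypothetical.

```python
import statistics

def is_burst(history, current):
    """Three-sigma rule: flag `current` if it exceeds mean + 3 * stdev of past counts."""
    if len(history) < 2:
        return False  # not enough data to estimate a standard deviation
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return current > mean + 3 * sigma

past_counts = [4, 5, 6, 5, 4, 6]   # mentions of an entity in earlier windows
print(is_burst(past_counts, 20))   # True
print(is_burst(past_counts, 7))    # False
```

Only positive outliers are flagged, since a sudden drop in mentions is not a sign of a new event.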

Evaluation of ENTITY-BASED Event detection

  • Precision
    The fraction of retrieved documents that are relevant to the query
    $$precision=\frac{A}{B}$$
    where A is the number of relevant documents retrieved and B is the total number of documents retrieved

  • Recall
    The fraction of the relevant documents that are successfully retrieved
    $$Recall=\frac{A}{R}$$
    where R is the total number of relevant documents

  • F Measure
    Weighted harmonic mean of precision and recall (α = 0.5 gives the standard harmonic mean); here P is the precision and R is the recall
    $$f=\frac{1}{\alpha(1/P)+(1-\alpha)(1/R)}$$
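
A small worked sketch of the three measures, assuming the retrieved and relevant items are given as sets of ids. The function name `precision_recall_f` and the sample ids are illustrative.

```python
def precision_recall_f(retrieved, relevant, alpha=0.5):
    """Compute precision, recall and the weighted F measure from two collections of ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                     # relevant items that were retrieved
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0                 # avoid division by zero
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

p, r, f = precision_recall_f(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r, f)
```

In this example 2 of the 4 retrieved items are relevant (precision 0.5) and 2 of the 3 relevant items were retrieved (recall 2/3), giving F = 4/7 with α = 0.5.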

LSH: Locality Sensitive Hashing
CS: Cluster Summarization
