What Is Text Mining?

Marti Hearst

What is text mining? What are its potential applications and limitations?

Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.

Text mining is different from what we're familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else. The problem is pushing aside all the material that currently isn't relevant to your needs in order to find the relevant information.

In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.

Text mining is a variation on a field called data mining, which tries to find interesting patterns in large databases. A typical example in data mining is using consumer purchasing patterns to predict which products to place close together on shelves, or to offer coupons for, and so on. For example, if you buy a flashlight, you are likely to buy batteries along with it. A related application is automatic detection of fraud, such as in credit card usage. Analysts look across huge numbers of credit card records to find deviations from normal spending patterns. A classic example is the use of a credit card to buy a small amount of gasoline followed by an overseas plane flight. The claim is that the first purchase tests the card to be sure it is active.
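
The flashlight-and-batteries pattern can be illustrated with a minimal sketch that counts how often pairs of items appear in the same purchase. The `pair_counts` function and the toy baskets are assumptions made up for illustration; production systems use more elaborate association-rule methods over far larger data.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical purchase records, invented for this example.
baskets = [
    ["flashlight", "batteries"],
    ["flashlight", "batteries", "tape"],
    ["tape", "glue"],
]
counts = pair_counts(baskets)
print(counts.most_common(1))
```

Pairs with high counts are candidates for shelf placement or coupons; the input is a structured record of purchases, which is exactly what distinguishes this from text.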

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can "read" text and will not have such for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

However, there is a field called computational linguistics (also known as natural language processing) which is making a lot of progress in doing small subtasks in text analysis. For example, it is relatively easy to write a program to extract phrases from an article or book that, when shown to a human reader, seem to summarize its contents. (The most frequent words and phrases in this article, minus the really common words like "the" are: text mining, information, programs, and example, which is not a bad five-word summary of its contents.)
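
A minimal sketch of the frequent-word idea follows, assuming a tiny hand-picked stopword list and a simple letters-only tokenizer. Both are illustrative choices, as is the function name `top_terms`; this version counts single words only, not phrases.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "from", "as"}

def top_terms(text, k=5):
    """Return the k most frequent words in text, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

sample = (
    "Text mining extracts information from text. "
    "Programs for text mining treat text as data."
)
print(top_terms(sample, k=2))
```

On the sample string the two most frequent remaining words are "text" and "mining". Nothing here understands the text; the summary effect comes entirely from counting.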

There are programs that can, with reasonable accuracy, extract information from text with somewhat regularized structure. For example, programs that read in resumes and extract people's names, addresses, job skills, and so on, can reach accuracies in the high 80 percent range.

I don't consider this to be text mining; rather it falls into an area called information extraction. However, I am a bit of a purist when it comes to defining what text mining is. I distinguish what I call "real" text mining, which discovers new pieces of knowledge, from approaches that find overall trends in textual data.

An analogy I like to use comes from the realm of crime fighting. I think discovering new knowledge vs. showing trends is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.

People are using the output of information extraction programs to try to link together information in interesting ways. For example, one can extract all the names of people and companies that occur in news text surrounding the topic of wireless technology to try to infer who the players are in that field. There are a number of companies that are investigating this kind of application.

One problem with these approaches is that it is difficult to recognize which of the many relations that are shown are truly interesting. You'll immediately see who the big players are, but anyone who knows the business will already be aware of this. You'll also see many, many weak links between various players, hundreds or thousands of such links, and you can't tell which are the really interesting ones that you should pay attention to.

The most active, and I think promising, application area for text mining is in the biosciences. The best-known example is Don Swanson's work on hypothesizing causes of rare diseases by looking for indirect links in different subsets of the bioscience literature.
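
The notion of an indirect link can be sketched as follows: if one body of literature connects term A to term B, another connects B to C, and nothing connects A to C directly, then A and C form a candidate hypothesis. This is a simplified illustration and not a description of Swanson's actual procedure; the function `indirect_links` and the placeholder terms are hypothetical.

```python
def indirect_links(literature_a, literature_b):
    """Find (a, c) pairs joined through a shared intermediate term.

    literature_a: set of (a, b) pairs, as if extracted from one set of articles.
    literature_b: set of (b, c) pairs, as if extracted from another set.
    Returns a dict mapping each (a, c) pair with no direct link to the
    set of intermediate terms that connect it.
    """
    direct = literature_a | literature_b
    hypotheses = {}
    for a, b in literature_a:
        for b2, c in literature_b:
            if b == b2 and (a, c) not in direct:
                hypotheses.setdefault((a, c), set()).add(b)
    return hypotheses

# Placeholder terms, invented for this example.
lit_a = {("substance_x", "effect_y")}
lit_b = {("effect_y", "disease_z")}
print(indirect_links(lit_a, lit_b))
```

Each candidate pair is only a hypothesis; it still has to be tested by conventional experimental means.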

As another example, one of the big current questions in genomics is which proteins interact with which other proteins. There has been notable success in looking at which words co-occur in articles that discuss the proteins in order to predict such interactions. The key is to not look for direct mentions of pairs, but to look for articles that mention individual protein names, keep track of which other words occur in those articles, and then look for other articles containing the same sets of words. This very simple method can yield surprisingly good results, even though the meaning of the texts is not being discerned by the programs. Rather, the text is treated like a "bag of words".
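
A minimal sketch of the bag-of-words comparison follows. It builds a word profile for each protein from the articles that mention it, then scores two proteins by how much their profiles overlap. The Jaccard overlap measure, the function names, and the protein names `prota`, `protb`, and `protc` are all assumptions chosen for illustration; the text above does not specify how the word sets are compared.

```python
import re

def bag_of_words(text):
    """Reduce a text to its set of lowercase word tokens, ignoring order."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def protein_profile(protein, articles):
    """Collect the words of every article mentioning the protein.

    The protein name must be given in lowercase; it is excluded from
    its own profile.
    """
    profile = set()
    for article in articles:
        words = bag_of_words(article)
        if protein in words:
            profile |= words - {protein}
    return profile

def jaccard(a, b):
    """Overlap of two word sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical one-line "articles", invented for this example.
articles = [
    "prota binds membrane receptor during signaling",
    "protb binds membrane receptor in signaling cascade",
    "protc regulates dna repair",
]
pa = protein_profile("prota", articles)
pb = protein_profile("protb", articles)
pc = protein_profile("protc", articles)
print(jaccard(pa, pb), jaccard(pa, pc))
```

Here `prota` and `protb` never appear in the same article, yet their profiles overlap heavily, while `prota` and `protc` share no words. A high overlap would be read as a hint of possible interaction, not as evidence of one.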

To get further, though, we need more sophisticated language analysis. A number of us are working on statistical techniques that try to assign semantics, or meaning, to parts of the text. We break off pieces of the problem of analysis, targeted towards particular applications, rather than trying to "read" the articles as a whole. This goal is especially promising in the biosciences due to the nature of the text itself. In some ways it is easier to process automatically than ordinary text. It is less ambiguous and the processes it describes are somewhat mechanical, and so representable in a computer.

The fundamental limitations of text mining are first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If I tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, I would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Similarly, if I tried to write a program that processes published documents in order to guess what will happen to a bill in Washington DC, I would fail because most of the action still happens in negotiations behind closed doors.

Untangling Text Data Mining, Marti Hearst, ACL'99.
