“Automatic text classification (TC) technology can efficiently organize and categorize the dramatically increasing volume of text [1]; it thus eliminates a large amount of human effort [2] and has attracted wide attention in recent years [3,4].”

“The goal of the TC task is to categorize unlabeled texts into predefined classes according to their topics [5]; hence, an automatic text classifier can be built in a learning process from a set of pre-labeled texts [6,7].”

“Before applying classifiers, every term in a text must first be assigned a numerical value (weight) by an appropriate term weighting scheme; this step is called text representation [8,9]. The vector space model (VSM) is the most popular way to represent texts [10,11]; it typically treats a text as a set of terms, namely the bag-of-words (BoW) model [12,13].”

“VSM represents a text collection as a document-term matrix, in which each weight expresses the importance of a certain term tj in a certain document dk [14,15]; each row denotes one document vector, whereas each column corresponds to one of the distinct terms (i.e., the selected features).”
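To make the document-term matrix concrete, here is a minimal Python sketch that uses raw term counts as the weights; the toy corpus, whitespace tokenization, and variable names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

# Toy corpus (illustrative); in practice these are the training texts.
docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "stocks fell on weak earnings",
]

# The distinct terms form the columns of the document-term matrix.
vocab = sorted({term for doc in docs for term in doc.split()})

# Each row is one document vector; here the weight is the raw term count,
# which the weighting schemes discussed below replace with better values.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

for row in matrix:
    print(row)
```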

“Term weighting is critical to the TC task, as it has a direct and significant effect on text classification performance [16].”

“At present, term weighting approaches are generally grouped into unsupervised and supervised ones, according to whether they exploit the class information of the training texts [17,18]. Unsupervised term weighting (UTW) methods neglect class information, whereas supervised term weighting (STW) methods exploit the category information when calculating the weights. Among UTW schemes, term frequency (TF) and TF–IDF (term frequency–inverse document frequency) are commonly used. TF is one of the simplest weighting methods, but it is a purely local approach because it only considers how many times a term occurs within a text. To overcome this drawback, the inverse document frequency (IDF), which reflects how many texts a term appears in, was combined with TF to produce the TF–IDF scheme. Note that TF–IDF was primarily designed for information retrieval (IR) rather than for TC tasks [10,19].”
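The classical TF–IDF weight can be sketched as follows; the exact IDF variant (logarithm base, smoothing, normalization) differs across papers, so this is just one common form, and the data and names are illustrative assumptions.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "stocks fell on weak earnings".split(),
]

N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term)          # local factor: raw term frequency
    idf = math.log(N / df[term])  # global factor: inverse document frequency
    return tf * idf

# "cat" occurs in 2 of 3 documents, "mat" in only 1, so "mat" gets a
# larger IDF and hence a larger weight for the same term frequency.
print(tf_idf("cat", docs[0]), tf_idf("mat", docs[0]))
```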

“Different from the IR task, the TC task aims to discriminate between classes rather than between texts [20], so category factors should be taken into account when computing term weights. For that reason, TC can be regarded as a supervised learning task [9,17,21,22].”

“Most recent STW methods originate from feature selection schemes; they adopt the category information in several ways, which can be summarized as follows. First, TF-CHI2, TF-IG and TF-GR were proposed on the basis of feature selection measures, namely the chi-square statistic (CHI2), information gain (IG) and gain ratio (GR) [18]. Since then, various STW methods in the same spirit have been presented, for example the odds ratio (OR) weighting factor in TF-OR [14,23,24], the mutual information (MI) weighting factor in TF-MI [14,23], the probability-based (PB) weighting factor in TF-PB [24], and the correlation coefficient weighting factor in TF-CC [24].”
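As one representative of these feature-selection-based factors, the chi-square statistic for a term-class pair is computed from a 2x2 contingency table. The sketch below shows the common formulation; the counts and the way TF and CHI2 are combined are illustrative assumptions, not necessarily the exact definition used in [18].

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a term with respect to a class.

    A: documents of the class that contain the term
    B: documents of other classes that contain the term
    C: documents of the class that do not contain the term
    D: documents of other classes that do not contain the term
    """
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

def tf_chi2(tf, A, B, C, D):
    # TF-CHI2: the local TF factor scaled by the global chi-square factor.
    return tf * chi_square(A, B, C, D)

# A term occurring twice and concentrated in the target class scores high.
print(tf_chi2(2, A=40, B=5, C=10, D=45))
```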

“Apart from these schemes, a variety of STW schemes derived from TF–IDF have also been proposed. Initially, inspired by the IDF factor of TF–IDF, the inverse class frequency (ICF) was introduced; it captures the intuition that a key term of a specific class usually appears in only a few categories [25]. However, because the number of categories is generally quite small, a certain term may occasionally occur in multiple categories or sometimes even in all of them [25,26]. As a result, ICF fails to promote the importance of a term under such circumstances. To enhance a term’s distinguishing power, ICF has been incorporated into TF–IDF to generate the TF–IDF–ICF scheme [14,25].”
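Here is a minimal sketch of ICF and the combined TF–IDF–ICF weight, assuming the plain logarithmic form log(|C|/cf(t)); the actual formulations in [14,25] may add smoothing terms, and all numbers below are made up for illustration.

```python
import math

def icf(num_classes, class_freq):
    # Inverse class frequency: the IDF idea applied to classes;
    # class_freq is the number of classes in which the term occurs.
    return math.log(num_classes / class_freq)

def tf_idf_icf(tf, N, df, num_classes, class_freq):
    # TF-IDF-ICF: local TF times global IDF times class-level ICF.
    return tf * math.log(N / df) * icf(num_classes, class_freq)

# A term confined to 1 of 4 classes keeps a large weight ...
print(tf_idf_icf(tf=2, N=100, df=3, num_classes=4, class_freq=1))
# ... but if it spreads over all 4 classes, the ICF factor drops to 0,
# illustrating why ICF alone can fail when there are few categories.
print(tf_idf_icf(tf=2, N=100, df=3, num_classes=4, class_freq=4))
```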

References:
[1] H. Al-Mubaid, S.A. Umair, A new text categorization technique using distributional clustering and learning logic, IEEE Trans. Knowl. Data Eng. 18 (9) (2006) 1156–1165.
[2] Z. Erenel, H. Altınçay, Nonlinear transformation of term frequencies for term weighting in text categorization, Eng. Appl. Artif. Intell. 25 (7) (2012) 1505–1514.
[3] C.X. Shang, M. Li, S.Z. Feng, et al., Feature selection via maximizing global information gain for text classification, Knowl.-Based Syst. 54 (2013) 298–309.
[4] E.S. Tellez, D. Moctezuma, S. Miranda-Jiménez, et al., An automated text categorization framework based on hyperparameter optimization, Knowl.-Based Syst. 149 (2018) 110–123.
[5] F. Sebastiani, Machine learning in automated text categorization, Acm Comput. Surv. 34 (1) (2002) 1–47.
[6] B. Tang, S. Kay, H.B. He, Toward optimal feature selection in Naive Bayes for text categorization, IEEE Trans. Knowl. Data Eng. 28 (9) (2016) 2508–2521.
[7] M. Haddoud, A. Mokhtari, T. Lecroq, et al., Combining supervised term-weighting metrics for SVM text classification with extended term representation, Knowl. Inf. Syst. 49 (3) (2016) 909–931.
[8] Z.C. Li, Z.Y. Xiong, Y.F. Zhang, et al., Fast text categorization using concise semantic analysis, Pattern Recognit. Lett. 32 (3) (2011) 441–448.
[9] D.Q. Wang, H. Zhang, Inverse-category-frequency based supervised term weighting schemes for text categorization, J. Inf. Sci. Eng. 29 (2) (2013) 209–225.
[10] G. Salton, A. Wong, C.S. Yang, A vector space model for automatic indexing, Commun. ACM 18 (11) (1975) 613–620.
[11] M. Melucci, Vector-space model, in: L. Liu, M.T. Özsu (Eds.), Encyclopedia of Database Systems, Springer US, Boston, MA, 2009, pp. 3259–3263.
[12] Ş. Taşcı, T. Güngör, Comparison of text feature selection policies and using an adaptive framework, Expert Syst. Appl. 40 (12) (2013) 4871–4886.
[13] H.T. Nguyen, P.H. Duong, E. Cambria, Learning short-text semantic similarity with word embeddings and external knowledge sources, Knowl.-Based Syst. 182 (2019) 104842.
[14] F.J. Ren, M.G. Sohrab, Class-indexing-based term weighting for automatic text classification, Inform. Sci. 236 (1) (2013) 109–125.
[15] D.S. Guru, M. Suhil, L.N. Raju, et al., An alternative framework for univariate filter based feature selection for text categorization, Pattern Recognit. Lett. 103 (2018) 23–31.
[16] I. Alsmadi, G.K. Hoon, Term weighting scheme for short-text classification: Twitter corpuses, Neural Comput. Appl. 31 (8) (2019) 3819–3831.
[17] M. Lan, C.L. Tan, J. Su, et al., Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 721–735.
[18] F. Debole, F. Sebastiani, Supervised term weighting for automated text categorization, in: Proceedings of the ACM Symposium on Applied Computing, 2003, pp. 784–788.
[19] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1) (1972) 11–21.
[20] H.B. Wu, X.D. Gu, Y.W. Gu, Balancing between over-weighting and under-weighting in supervised term weighting, Inf. Process. Manage. 53 (2) (2017) 547–557.
[21] H.J. Escalante, M.A. García-Limón, A. Morales-Reyes, et al., Term-weighting learning via genetic programming for text classification, Knowl.-Based Syst. 83 (2015) 176–189.
[22] K.W. Chen, Z.P. Zhang, J. Long, et al., Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. 66 (33) (2016) 245–260.
[23] H. Altınçay, Z. Erenel, Analytical evaluation of term weighting schemes for text categorization, Pattern Recognit. Lett. 31 (11) (2010) 1310–1323.
[24] Y. Liu, H.T. Loh, A. Sun, Imbalanced text classification: A term weighting approach, Expert Syst. Appl. 36 (1) (2009) 690–701.
[25] V. Lertnattee, T. Theeramunkong, Analysis of inverse class frequency in centroid-based text classification, in: Proceedings of the 4th International Symposium on Communication and Information Technology, 2004, pp. 1171–1176.
[26] X.J. Quan, W.Y. Liu, B.T. Qiu, Term weighting schemes for question categorization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 1009–1021.
