Lecture 4 Text Classification
Contents
- Classification
- Text Classification Tasks
- Topic Classification
- Sentiment Analysis
- Native-Language Identification
- Natural Language Inference
- Building a Text Classifier
- Choosing a Classification Algorithm
- Naive Bayes
- Logistic Regression
- Support Vector Machines
- K-Nearest Neighbor
- Decision Tree
- Random Forests
- Neural Networks
- Hyperparameter Tuning
- Confusion Matrix
- Evaluation Metrics
Fundamentals of Classification
Classification
Input:
- A document d, often represented as a vector of features
- A fixed output set of classes C = {c1, c2, …, ck}: categorical, not continuous or ordinal
Output:
- A predicted class c ∈ C
Text Classification Tasks
Some common examples:
- Topic classification
- Sentiment analysis
- Native-language identification
- Natural language inference
- Automatic fact-checking
- Paraphrase detection
The input is not necessarily a long document; it can be as short as a tweet or a pair of sentences.
Topic Classification
Motivation: Library science, information retrieval
Classes: Topic categories, e.g. "jobs", "international news"
Features:
- Unigram bag-of-words, with stop-words removed
- Longer n-grams for phrases
Examples of corpora:
- Reuters news corpora, e.g. RCV1, or the Reuters corpus in NLTK
- PubMed abstracts
- Tweets with hashtags
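The two feature types listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the stop-word list, the `tokenize` helper, and the `bow_features` function are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def bow_features(text, max_n=2):
    """Count unigrams (stop-words removed) plus longer n-grams up to max_n."""
    tokens = tokenize(text)
    features = Counter(tok for tok in tokens if tok not in STOP_WORDS)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = bow_features("The central bank raised interest rates.")
```

Here the unigram "the" is dropped as a stop-word, while the bigram "interest rates" is kept as a single phrase feature.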
Sentiment Analysis
Motivation: Opinion mining, business analytics
Classes: Positive/Negative/(Neutral)
Features:
- N-grams
- Polarity lexicons
Examples of corpora:
- Movie review dataset in NLTK
- SemEval Twitter polarity datasets
Native-Language Identification
Motivation: Forensic linguistics, educational applications
Classes: First language of the author
Features:
- Word n-grams
- Syntactic patterns (POS, parse trees)
- Phonological features
Examples of corpora:
- TOEFL/IELTS essay corpora
Natural Language Inference
Also called textual entailment
Motivation: Language understanding
Classes: Entailment, contradiction, neutral
Features:
- Word overlap
- Length difference between the sentences
- N-grams
Examples of corpora:
- SNLI, MNLI
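As a rough sketch of the first two features, the function below computes word overlap and length difference for a premise/hypothesis pair. The choice of Jaccard overlap, the whitespace tokenization, and the name `nli_features` are assumptions for illustration; other overlap measures would serve equally well.

```python
def nli_features(premise, hypothesis):
    """Return simple pair features for a premise/hypothesis sentence pair."""
    p_tokens = premise.lower().split()
    h_tokens = hypothesis.lower().split()
    p_set, h_set = set(p_tokens), set(h_tokens)
    union = p_set | h_set
    # Jaccard overlap: shared word types divided by all word types.
    overlap = len(p_set & h_set) / len(union) if union else 0.0
    return {
        "word_overlap": overlap,
        "length_diff": len(p_tokens) - len(h_tokens),
    }

feats = nli_features("a man is playing a guitar", "a man is sleeping")
```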
Building a Text Classifier
- Identify a task of interest
- Collect an appropriate corpus
- Carry out annotation
- Select features
- Choose a machine learning algorithm
- Train the model and tune hyperparameters using held-out development data
- Repeat earlier steps as needed
- Train the final model
- Evaluate the model on held-out test data
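The steps above rely on keeping development and test data separate from the training data. One simple way to do that is a single shuffled split, sketched below. The 80/10/10 proportions and the function name `train_dev_test_split` are assumptions chosen for this example, not values given in the lecture.

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off disjoint test and development portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
```

The test portion is touched only once, for the final evaluation; all tuning decisions are made on the development portion.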
Algorithms for Classification
Choosing a Classification Algorithm
Bias vs. variance
- Bias: Assumptions made in the model
- Variance: Sensitivity to the training set
Underlying assumptions, e.g. independence
Complexity
Speed
Naive Bayes
Finds the class with the highest posterior probability under Bayes' rule
- The posterior is proportional to the probability of the class times the probability of the features given the class
Naively assumes the features are independent given the class
Pros:
- Fast to train and classify
- Robust, low-variance -> good for low-data situations
- Optimal classifier if the independence assumption is correct
- Extremely simple to implement
Cons:
- Independence assumption rarely holds
- Lower accuracy than similar methods in most situations
- Smoothing required for unseen class/feature combinations
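A minimal sketch of a multinomial Naive Bayes classifier with add-alpha smoothing follows. It assumes documents are already tokenized into word lists; the function names, the toy data, and the choice to skip words outside the training vocabulary at prediction time are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        # Smoothing gives unseen class/word pairs a small non-zero probability.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    def score(c):
        # Words outside the training vocabulary are skipped (an assumption).
        return log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(log_prior, key=score)

docs = [["great", "film"], ["great", "acting"],
        ["boring", "film"], ["boring", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)
prediction = predict(["great", "plot", "great"], *model)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.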
Logistic Regression
A classifier, despite being called regression
A linear model, but uses the softmax function to squash its outputs into valid probabilities
Training maximizes the probability of the training data, subject to regularization which encourages low or sparse weights
Pros:
- Unlike Naive Bayes, not confounded by diverse, correlated features, which gives better performance
Cons:
- Slow to train
- Feature scaling needed
- Requires a lot of data to work well in practice
- Choosing a regularization strategy is important since overfitting is a big problem
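To make the "linear model plus softmax" description concrete, here is a small sketch of the prediction step only (training is omitted). The weights and biases are hypothetical values picked for the example, not learned parameters.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """One linear score per class, squashed by softmax."""
    scores = [
        sum(w * x for w, x in zip(w_c, features)) + b_c
        for w_c, b_c in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical parameters: 3 classes, 2 features.
weights = [[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]]
biases = [0.0, 0.1, 0.0]
probs = predict_proba([2.0, 1.0], weights, biases)
```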
Support Vector Machines
Finds the hyperplane which separates the training data with maximum margin
Pros:
- Fast and accurate linear classifier
- Can do non-linearity with the kernel trick
- Works well with huge feature sets
Cons:
- Multiclass classification awkward
- Feature scaling needed
- Deals poorly with class imbalances
- Poor interpretability
K-Nearest Neighbor
Classify based on the majority class of the k nearest training examples in feature space
The definition of "nearest" can vary:
- Euclidean distance
- Cosine distance
Pros:
- Simple but surprisingly effective
- No training required
- Inherently multiclass
- Optimal classifier with infinite data
Cons:
- Have to select k
- Issues with imbalanced classes
- Often slow for finding the neighbors
- Features must be selected carefully
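The following sketch shows a brute-force k-nearest-neighbor classifier using cosine distance. The toy count vectors, the labels, and the convention of returning distance 1.0 for an all-zero vector are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here: zero vectors share no similarity
    return 1.0 - dot / (norm_u * norm_v)

def knn_predict(query, train_vectors, train_labels, k=3,
                distance=cosine_distance):
    """Majority class among the k training examples closest to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: distance(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train_vectors = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0], [1, 3, 0]]
train_labels = ["sport", "sport", "politics", "politics", "politics"]
prediction = knn_predict([4, 1, 0], train_vectors, train_labels, k=3)
```

There is no training step, but every prediction scans the whole training set, which is why finding neighbors is often slow.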
Decision Tree
Constructs a tree where nodes correspond to tests on individual features
Leaves are final class decisions
Built by greedy maximization of mutual information
Pros:
- Fast to build and test
- Feature scaling irrelevant
- Good for small feature sets
- Handles non-linearly-separable problems
Cons:
- In practice, not very interpretable
- Highly redundant sub-trees
- Not competitive for large feature sets
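The greedy criterion can be illustrated with a single split decision. The sketch below computes information gain, which equals the mutual information between a feature and the class label, and picks the best feature for one node. The toy rows and labels are invented for the example, and building the full tree recursively is left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Greedy step: index of the feature with the highest information gain."""
    n_features = len(rows[0])
    return max(
        range(n_features),
        key=lambda j: information_gain([r[j] for r in rows], labels),
    )

rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["spam", "spam", "ham", "ham"]
chosen = best_feature(rows, labels)
```

In this toy data the first feature separates the classes perfectly, so it is the one chosen for the root test.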
Random Forests
An ensemble classifier
Consists of decision trees trained on different subsets of the training data and feature space
Final class decision is a majority vote of the sub-classifiers
Pros:
- Usually more accurate and more robust than decision trees
- Great classifier for medium feature sets
- Training easily parallelized
Cons:
- Poor interpretability
- Slow with large feature sets
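The two ensemble ingredients described above, random subsets and majority voting, can be sketched without the tree learner itself. This is an illustration under assumptions: the helper names are hypothetical, each tree gets one feature subset for simplicity (many implementations instead resample features at every split), and training the individual trees is omitted.

```python
import random
from collections import Counter

def sample_subsets(n_examples, n_features, n_trees, features_per_tree, seed=0):
    """For each tree, draw a bootstrap sample of examples and a feature subset."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_trees):
        # Examples are drawn with replacement (bootstrap sampling).
        example_idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        # Features are drawn without replacement.
        feature_idx = sorted(rng.sample(range(n_features), features_per_tree))
        subsets.append((example_idx, feature_idx))
    return subsets

def majority_vote(predictions):
    """Final decision: the class predicted by the most sub-classifiers."""
    return Counter(predictions).most_common(1)[0][0]

subsets = sample_subsets(n_examples=10, n_features=6, n_trees=5,
                         features_per_tree=2)
decision = majority_vote(["sport", "politics", "sport", "sport", "politics"])
```

Because each tree depends only on its own subset, the trees can be trained independently, which is what makes training easy to parallelize.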
Neural Networks
An interconnected set of nodes, typically arranged in layers
Input layer (features), output layer (class probabilities), and one or more hidden layers
Each node computes a linear weighting of its inputs from the previous layer and passes the result through an activation function to nodes in the next layer
Pros:
- Extremely powerful, dominant method in NLP and computer vision
- Little feature engineering
Cons:
- Not an off-the-shelf classifier
- Many hyperparameters, difficult to optimize
- Slow to train
- Prone to overfitting
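A forward pass through a network with one hidden layer can be written directly from the description above. The ReLU activation, the layer sizes, and all weight values below are assumptions chosen for illustration; training by backpropagation is not shown.

```python
import math

def dense(inputs, weights, biases):
    """Linear weighting: one weighted sum (plus bias) per node in the layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """Activation function applied to each hidden node."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    """Turn output scores into class probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical parameters: 3 input features, 2 hidden nodes, 2 classes.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]
probs = forward([1.0, 2.0, 3.0], hidden_w, hidden_b, out_w, out_b)
```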
Hyperparameter Tuning
Dataset for tuning:
- Development set
- Not the training set or the test set
- k-fold cross-validation
Specific hyperparameters are classifier specific, but many hyperparameters relate to regularization
- Regularization hyperparameters penalize model complexity
- Used to prevent overfitting
For multiple hyperparameters, use grid search
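Both ideas, k-fold cross-validation and grid search, are sketched below in plain Python. The hyperparameter names `C` and `ngram_max` and the `toy_dev_score` function are hypothetical stand-ins; in practice the scoring function would train a model with the given settings and return its score on development data or averaged over the folds.

```python
from itertools import product

def k_fold_indices(n_examples, k):
    """Yield (train_idx, dev_idx); each example is held out exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, dev_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, dev_idx

def grid_search(grid, evaluate):
    """Try every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_dev_score(params):
    """Hypothetical stand-in for 'train with params, score on held-out data'."""
    return -(params["C"] - 1.0) ** 2 - (params["ngram_max"] - 2) ** 2

grid = {"C": [0.1, 1.0, 10.0], "ngram_max": [1, 2, 3]}
best_params, best_score = grid_search(grid, toy_dev_score)
splits = list(k_fold_indices(10, 5))
```

The number of combinations grows multiplicatively with each added hyperparameter, so grids are usually kept coarse.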
Evaluation
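Confusion Matrix
Rows give the actual class and columns give the class assigned by the classifier, with A treated as the positive class.
Actual class | Classified as A | Classified as B |
---|---|---|
A | True Positive | False Negative |
B | False Positive | True Negative |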
Evaluation Metrics
- Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
- Precision = True Positive / (True Positive + False Positive)
- Recall = True Positive / (True Positive + False Negative)
- F1-score = (2 * precision * recall) / (precision + recall)
- Macro-average: Average F-score across classes
- Micro-average: Calculate F-score using the sum of counts across classes
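The metrics above can be computed directly from per-class counts. In this sketch the counts come from an invented two-class example (10 items of class A, of which 8 are classified correctly, and 4 items of class B, of which 1 is classified correctly); the function names and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_micro_f1(counts):
    """counts maps each class to its (tp, fp, fn) counts."""
    # Macro-average: compute F1 per class, then average the scores.
    macro = sum(precision_recall_f1(*c)[2] for c in counts.values()) / len(counts)
    # Micro-average: sum the counts over classes, then compute F1 once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = precision_recall_f1(tp, fp, fn)[2]
    return macro, micro

# Invented example: (tp, fp, fn) for each class.
counts = {"A": (8, 3, 2), "B": (1, 2, 3)}
macro, micro = macro_micro_f1(counts)
```

Macro-averaging weights every class equally, so the poorly classified minority class B pulls the macro score well below the micro score.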