Introduction

Comparison with other approaches

Bag of words (BOW): does not take the order in which words appear into account.
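
To see concretely what "ignoring word order" means, here is a minimal sketch (assuming scikit-learn's CountVectorizer, which is not otherwise used in this post): two sentences made of the same words in a different order get exactly the same BOW vector.

    from sklearn.feature_extraction.text import CountVectorizer

    # Two sentences that differ only in word order.
    docs = ["the cat sits on the mat", "the mat sits on the cat"]

    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(docs).toarray()

    print(bow[0])                    # counts per vocabulary word, e.g. [1 1 1 1 2]
    print((bow[0] == bow[1]).all())  # True: word order is completely lost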

Latent Dirichlet Allocation (LDA): leans more toward extracting topics/keywords and the core ideas out of texts, but its parameters are very hard to tune and the quality of the model is hard to evaluate.

The foundation: word2vec

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically the two models are similar, except that CBOW predicts the target word (e.g. "mat") from the source context words ("the cat sits on the"), while skip-gram predicts the source context words from the target word. This inversion might seem like an arbitrary choice, but statistically it helps CBOW smooth over a lot of the distributional information (by treating an entire context as one observation), which is mostly useful for smaller datasets. Skip-gram, however, treats each context-target pair as a new observation, and it tends to do better when we use larger datasets.

CBOW: the continuous bag-of-words model creates a sliding window around the current word, in order to predict it from the "context", i.e. the surrounding words. Each word is represented as a feature vector, and after training these vectors become the word vectors.
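
As a quick illustration, gensim exposes both variants through the sg flag of its Word2Vec class. This is a minimal sketch, assuming gensim 4.x (where the dimensionality argument is called vector_size); the toy corpus and hyperparameters are made up.

    from gensim.models import Word2Vec

    # A toy corpus: each sentence is a list of tokens.
    sentences = [
        ["the", "cat", "sits", "on", "the", "mat"],
        ["the", "dog", "lies", "on", "the", "rug"],
    ]

    # sg=0 selects CBOW (predict the target word from its context) ...
    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
    # ... while sg=1 selects skip-gram (predict the context from the target word).
    skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(cbow.wv["cat"].shape)  # (50,): one embedding per vocabulary word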

How Doc2Vec is formed

word2vec + document-unique feature vector = when training the word vectors W, the document vector D is trained as well, and at the end of training it holds a numeric representation of the document.

The Distributed Memory version of Paragraph Vector (PV-DM) acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document.

Another algorithm, similar to skip-gram, may be used instead: the Distributed Bag of Words version of Paragraph Vector (PV-DBOW).
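
In gensim, choosing between the two algorithms is just the dm flag of the Doc2Vec class (see the parameter list below). A minimal sketch, again assuming gensim 4.x, with made-up documents and arbitrary hyperparameters:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document needs at least one tag; the tag indexes its learned vector.
    docs = [
        TaggedDocument(words=["the", "cat", "sits", "on", "the", "mat"], tags=["DOC_0"]),
        TaggedDocument(words=["the", "dog", "lies", "on", "the", "rug"], tags=["DOC_1"]),
    ]

    # dm=1 trains PV-DM: the document vector joins the context words to predict a word.
    pv_dm = Doc2Vec(docs, dm=1, vector_size=50, window=2, min_count=1, epochs=20)
    # dm=0 trains PV-DBOW: predict words sampled from the document, skip-gram style.
    pv_dbow = Doc2Vec(docs, dm=0, vector_size=50, min_count=1, epochs=20)

    print(pv_dm.dv["DOC_0"].shape)  # (50,): the trained document vector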

Parameters

From the gensim Doc2Vec documentation:
  • documents (iterable of list of TaggedDocument, optional) – Input corpus, can be simply a list of elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network. If you don’t supply documents (or corpus_file), the model is left uninitialized – use if you plan to initialize it in some other way.
  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of documents to get a performance boost. Only one of documents or corpus_file arguments needs to be passed (or none of them, in that case the model is left uninitialized). Documents’ tags are assigned automatically and are equal to the line number, as in TaggedLineDocument.
  • dm ({1,0}, optional) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
  • vector_size (int, optional) – Dimensionality of the feature vectors.
  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.
  • alpha (float, optional) – The initial learning rate.
  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
  • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.
  • min_count (int, optional) – Ignores all words with total frequency lower than this.
  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
  • epochs (int, optional) – Number of iterations (epochs) over the corpus.
  • hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
  • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
  • dm_mean ({1,0}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
  • dm_concat ({1,0}, optional) – If 1, use concatenation of context vectors rather than sum/average; note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
  • dm_tag_count (int, optional) – Expected constant number of document tags per document, when using dm_concat mode.
  • dbow_words ({1,0}, optional) – If set to 1, trains word-vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training; if 0, only trains doc-vectors (faster).
  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during the current method call and is not stored as part of the model.

    The input parameters are of the following types:

    • word (str) - the word we are examining
    • count (int) - the word’s frequency count in the corpus
    • min_count (int) - the minimum count threshold.
  • callbacks – List of callbacks that need to be executed/run at specific stages during training.
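
Putting a few of these together, here is a hedged sketch of a typical construction call, including a custom trim_rule with the (word, count, min_count) signature described above. The specific values are illustrative, not recommendations.

    from gensim.models.doc2vec import Doc2Vec
    from gensim import utils

    def my_trim_rule(word, count, min_count):
        # Always keep a word we care about; otherwise fall back to the
        # default behaviour (discard if count < min_count).
        if word == "doc2vec":
            return utils.RULE_KEEP
        return utils.RULE_DEFAULT

    model = Doc2Vec(
        dm=1,              # PV-DM
        vector_size=300,   # dimensionality of the document/word vectors
        window=5,
        min_count=2,
        sample=1e-5,       # downsample very frequent words
        hs=0,
        negative=5,        # negative sampling with 5 "noise words"
        workers=4,
        epochs=20,
        trim_rule=my_trim_rule,
    )
    # No documents were passed, so the model is left uninitialized here;
    # call build_vocab() and train() later (see the labeling example below).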

Pitfalls I ran into

When I first used doc2vec, I only labeled the test-set and validation-set data (those two being the splits that train_test_split produced from the already-annotated data). Problems then showed up when predicting on the real test set: doc2vec can currently only handle words it has already seen. Those text samples do not need class annotations, but the model must have seen them, which I personally consider a fairly big drawback of doc2vec. So the only option was to label the real test-set data as well. Note that by "label" here I do not mean classifying the data (not which class a text sample belongs to); the actual labeling code is below.

    from gensim.models import doc2vec

    def label_sentences(corpus, label_type):
        """
        Gensim's Doc2Vec implementation requires each document/paragraph to have a label
        associated with it. We do this by using the TaggedDocument method. The format will
        be like "TRAIN_i" where "i" is a dummy index of the complaint narrative.
        """
        labeled = []
        for i, v in enumerate(corpus):
            label = label_type + '_' + str(i)
            labeled.append(doc2vec.TaggedDocument(v.split(), [label]))
        return labeled
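
For completeness, this is roughly how I then used it. X_train/X_val/X_test are placeholders for the three text splits, the hyperparameters are illustrative, and on gensim versions before 4.0 you would write model.docvecs instead of model.dv.

    # Tag every split, including the real test set, so the model sees all the words.
    all_docs = (label_sentences(X_train, 'TRAIN')
                + label_sentences(X_val, 'VAL')
                + label_sentences(X_test, 'TEST'))

    model = doc2vec.Doc2Vec(dm=1, vector_size=300, min_count=2, workers=4, epochs=20)
    model.build_vocab(all_docs)
    model.train(all_docs, total_examples=model.corpus_count, epochs=model.epochs)

    # Retrieve the learned vector of a document by its tag.
    train_vec_0 = model.dv['TRAIN_0']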

Next, doc2vec produced a two-dimensional array from the text, and I attached an ordinary feed-forward neural network for prediction, but the results were not good. So I thought I would try feeding this two-dimensional array to a GRU/LSTM instead. The GRU errored out saying it expected dim=3 but got only dim=2, so after consulting some documents online I reshaped x (the text samples) and y (the class labels) to (shape[0], shape[1], 1) respectively, really just to make up the third dimension. The model did run, but I found the loss simply would not decrease.
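
Concretely, the dimension hack looked roughly like this (reconstructed from memory, with a random stand-in for the doc2vec output):

    import numpy as np

    # Stand-in for the (n_samples, vector_size) array of doc2vec document vectors.
    x = np.random.rand(100, 300)

    # Keras-style GRU/LSTM layers expect (batch, timesteps, features),
    # so pad a trailing axis to fake the third dimension.
    x3 = x.reshape(x.shape[0], x.shape[1], 1)
    print(x3.shape)  # (100, 300, 1)

    # Each embedding dimension is now treated as a "timestep" with one feature.
    # That is not a meaningful sequence, which is why the loss never decreased.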

In fact, the output of doc2vec should not be fed into a GRU/LSTM at all. These are recurrent neural networks that need sequential input, and once doc2vec has processed the text there is no sequence structure left: each document has become a single fixed-length vector.

References:

  • https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
