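1. Text representation: from one-hot to word2vec.
    1.1 Bag-of-words model: discrete, high-dimensional, sparse.
    1.2 Distributed representation: continuous, low-dimensional, dense. The principles of word2vec word vectors, and using them in practice to represent text.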


    • 1. Word vectors
      • 1.1 Traditional word vectors
        • 1.1.1 one-hot
        • 1.1.2 Bag-of-Words (BoW)
        • 1.1.3 TF-IDF
        • 1.1.4 Distributional Embeddings
      • 1.2 Neural Word Embeddings
        • 1.2.1 Word2Vec
          • a) Continuous bag-of-words (CBOW)
          • b) Continuous skip-gram
        • 1.2.2 GloVe
        • 1.2.3 FastText

1. Word vectors

See: Understanding word vectors

1.1 Traditional word vectors

1.1.1 one-hot

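One of the simplest kinds of word vector is the one-hot encoded vector.
For example, given the set of words {magic, dragon, king, queen}, each word is represented by a 4-dimensional vector with a single 1 at that word's index and 0 everywhere else. Taking the words in the order listed, magic = [1, 0, 0, 0] and dragon = [0, 1, 0, 0].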

For high-cardinality variables — those with many unique categories — the dimensionality of the transformed vector becomes unmanageable.
The mapping is completely uninformed: “similar” categories are not placed closer to each other in embedding space.
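
The one-hot scheme and the two drawbacks above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names `one_hot_encode` and `dot` are made up for this example, and the index order simply follows the order of the word list.

```python
def one_hot_encode(vocabulary):
    """Map each word to a vector with a single 1 at the word's index."""
    size = len(vocabulary)
    vectors = {}
    for index, word in enumerate(vocabulary):
        vector = [0] * size
        vector[index] = 1
        vectors[word] = vector
    return vectors


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 4-word vocabulary taken from the example in the text.
vocabulary = ["magic", "dragon", "king", "queen"]
one_hot = one_hot_encode(vocabulary)

for word in vocabulary:
    print(word, one_hot[word])

# The vector length equals the vocabulary size, so it grows with every new word.
print(len(one_hot["king"]))

# Any two distinct words are orthogonal: "king" is no closer to "queen"
# than it is to "dragon".
print(dot(one_hot["king"], one_hot["queen"]))
print(dot(one_hot["king"], one_hot["dragon"]))
```

Both dot products are 0, which is what "completely uninformed" means in practice: the encoding carries no notion of similarity between words.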

1.1.2 Bag-of-Words (BoW)

Bag-of-words represents a document by the counts of the words it contains, ignoring word order. Its limitation is that it produces extremely high-dimensional, sparse feature vectors. It is still useful as a baseline that takes only a few lines of code, particularly when the dataset is small.
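
As a rough illustration of how small such a baseline can be, here is a minimal bag-of-words sketch using only the standard library. The two toy documents, the whitespace tokenizer, and the helper names `build_vocabulary` and `bag_of_words` are assumptions made for this example; a real pipeline would normally rely on a library vectorizer and a proper tokenizer.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of lowercase, whitespace-separated tokens."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Toy corpus, invented for illustration.
documents = ["the king met the queen", "the dragon fears magic"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(vector, "<-", doc)
```

Each vector has one entry per vocabulary word, and most entries are 0 even in this tiny corpus; with a realistic vocabulary the vectors become very long and very sparse.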

1.1.3 TF-IDF

Term Frequency – Inverse Document Frequency
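
To make the name concrete, the sketch below computes TF-IDF weights for a toy corpus. It assumes one common variant, which is not necessarily the one these notes had in mind: term frequency is the count divided by the document length, and inverse document frequency is log(N / df) with no smoothing. The function name `tf_idf` and the three documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for illustration.
documents = [
    "the king met the queen",
    "the dragon fears magic",
    "the queen fears the dragon",
]
weights = tf_idf(documents)

for doc, doc_weights in zip(documents, weights):
    print(doc)
    for term, weight in sorted(doc_weights.items()):
        print(f"  {term}: {weight:.3f}")
```

Under this variant, a word that occurs in every document ("the") gets weight 0, while a word confined to a single document ("king") gets the highest weight in that document.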

Although these two techniques (BoW and TF-IDF) help solve many problems in NLP, they still do not capture the meaning of words.

1.1.4 Distributional Embeddings

1.2 Neural Word Embeddings

When building word embeddings, the aim is to learn dense vector representations of words that capture their meaning across the different contexts in which they appear in the documents.

1.2.1 Word2Vec

Word2Vec was developed by Tomas Mikolov and colleagues at Google in 2013.

a) Continuous bag-of-words (CBOW)


b) Continuous skip-gram


See the full example code in: tensorflow_word2vec_basic, tensorflow_word2vec

1.2.2 GloVe

1.2.3 FastText



