word2vec Python implementation: several ways to implement word2vec

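Preface

Attitude determines altitude! Make excellence a habit!

There is no problem in this world that one round of overtime cannot solve; if there is, work overtime twice! (Mao Qiang)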

word2vec

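The famous word2vec needs no introduction here, so I won't explain it again; if it is unfamiliar, look it up on Baidu or Google. Below are a few ways to implement it.

Preparing the corpus

[Figure: the corpus]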

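python-gensim

An extremely simple approach; the training itself can even be done in one line of code.

# Note: this uses the pre-4.0 gensim API (size, iter, model[...]).
# gensim 4.x renamed size to vector_size and iter to epochs, and moved
# word lookups and similarity queries to model.wv.
from gensim.models import word2vec

sentences = word2vec.Text8Corpus("C:/traindataw2v.txt")  # load the corpus

# Train the model. The default is CBOW (sg=0); pass sg=1 for skip-gram. Default window=5.
model = word2vec.Word2Vec(sentences, size=200)

# Get the word vector for "学习" (study)
print("学习:", model["学习"])

# Similarity / relatedness of two words
y1 = model.similarity("不错", "好")

# List of the words most related to a given word
y2 = model.most_similar("书", topn=20)  # the 20 most related words

# Find an analogy: "书" is to "不错" as "质量" is to ?
print("书-不错，质量-")
y3 = model.most_similar(['质量', '不错'], ['书'], topn=3)

# Find the word that does not belong with the others
y4 = model.doesnt_match("书 书籍 教材 很".split())

# Save the model for reuse
model.save("db.model")

# The matching way to load it
model = word2vec.Word2Vec.load("db.model")

That covers the gensim approach.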
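To make the similarity and analogy queries above less of a black box, here is a minimal sketch of the idea behind them: cosine similarity between vectors, and an analogy answered by adding and subtracting vectors. The vectors and the helper names (cosine, analogy, toy_vectors) are made up for illustration; gensim's own most_similar works on normalized vectors and differs in detail.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors, purely for illustration (not trained embeddings).
toy_vectors = {
    "book":    [1.0, 0.0, 0.0],
    "nice":    [1.0, 0.0, 1.0],
    "quality": [0.0, 1.0, 0.0],
    "high":    [0.0, 1.0, 1.0],
    "tree":    [1.0, 1.0, 0.0],
}

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)  # never return a query word
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(cosine(toy_vectors["book"], toy_vectors["nice"]))
# "book" is to "nice" as "quality" is to ?
print(analogy(toy_vectors, positive=["quality", "nice"], negative=["book"]))
```

With these toy vectors the analogy query ranks "high" first, because quality + nice - book lands exactly on its direction.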

Next, let's look at the parameters.

The default values are:

sentences=None

size=100

alpha=0.025

window=5

min_count=5

max_vocab_size=None

sample=1e-3

seed=1

workers=3

min_alpha=0.0001

sg=0

hs=0

negative=5

cbow_mean=1

hashfxn=hash

iter=5

null_word=0

trim_rule=None

sorted_vocab=1

batch_words=MAX_WORDS_IN_BATCH

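Surprised that there are so many parameters, most of which are rarely touched? The quality of a trained model depends heavily on its parameters, and the code exposes them for a reason. Let's go through what each one means.

sentences: the training sentences, one per line, as in the corpus figure above; individual sentences should not be too long.

sg: the training algorithm. 0 selects CBOW, 1 selects skip-gram.

size: the dimensionality of the trained word vectors.

window: the maximum distance between the current word and the predicted word within a sentence.

alpha: the initial learning rate, which controls the step size of gradient descent; it decays towards min_alpha as training progresses.

seed: seed for the random number generator; it affects the initialization of the word vectors.

min_count: vocabulary cutoff; words that occur fewer than min_count times are discarded.

max_vocab_size: limits RAM while the vocabulary is being built. If the number of distinct words exceeds this value, the least frequent ones are pruned. None means no limit.

sample: threshold for randomly downsampling high-frequency words (this is subsampling, not negative sampling). The default is 1e-3; the useful range is (0, 1e-5).

workers: the number of worker threads used for training; on a machine with more cores, more threads train faster.

hs: if 1, hierarchical softmax is used. Hierarchical softmax is an optimization of the output layer: instead of computing probabilities with a full softmax as in the original model, it computes them along a Huffman tree. If 0 (the default) and negative is non-zero, negative sampling is used.

negative: if greater than 0, negative sampling is used, and the value is the number of "noise words" drawn (usually between 5 and 20; the default is 5). If set to 0, negative sampling is not used.

cbow_mean: only relevant for CBOW. If 0, the sum of the context word vectors is used; if 1 (the default), their mean is used.

hashfxn: the hash function used to initialize the weights; defaults to Python's built-in hash function.

iter: the number of iterations over the corpus; the default is 5.

trim_rule: the vocabulary trimming rule, which specifies which words are kept and which are dropped. It can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. The rule is only applied while the vocabulary is built and is not stored as part of the model.

sorted_vocab: if 1 (the default), words are sorted by descending frequency before word indexes are assigned.

batch_words: the number of words in each batch passed to a worker thread; the default is 10000. Batches longer than this are truncated.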
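Two of the parameters above, min_count and sample, are easier to picture with numbers. The sketch below is an illustration under assumptions, not gensim's actual code: prune_by_min_count and keep_probability are hypothetical helpers, and the keep-probability formula is the one used by the original word2vec C tool as far as I know.

```python
import collections
import math

def prune_by_min_count(tokens, min_count):
    """Keep only words that occur at least min_count times."""
    counts = collections.Counter(tokens)
    return {word: c for word, c in counts.items() if c >= min_count}

def keep_probability(count, total, sample):
    """Probability of keeping one occurrence of a word during subsampling.

    count  -- how often the word occurs in the corpus
    total  -- total number of tokens in the corpus
    sample -- the subsampling threshold (the `sample` parameter)
    """
    threshold = sample * total
    prob = (math.sqrt(count / threshold) + 1) * (threshold / count)
    return min(1.0, prob)  # rare words are always kept

tokens = ["the"] * 90 + ["cat"] * 8 + ["sat"] * 2
print(prune_by_min_count(tokens, min_count=5))  # "sat" is dropped

# The more frequent a word, the lower its chance of being kept.
for word, c in collections.Counter(tokens).items():
    print(word, round(keep_probability(c, len(tokens), sample=1e-3), 4))
```

On such a tiny corpus every word gets downsampled; on a realistic corpus only the very frequent words (function words, punctuation) fall below a keep probability of 1.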

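python-tensorflow

The example on the official TensorFlow site works on n-gram-style windows of neighbouring words; the code followed below is its skip-gram model.

CBOW and skip-gram

Skip-gram predicts the context from a given input word, whereas CBOW predicts the input word from a given context.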
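The difference between the two is easiest to see in the training examples each one produces. A minimal sketch follows; the function names and the toy sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (input word, context word) pair per neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context words, target word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["this", "book", "is", "good"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

For the same sentence and window, skip-gram yields more (and smaller) training examples than CBOW, which is one reason it tends to be slower to train.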

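First, the data is the same as above.

Reading the data

import re

words = []
with open("c:/traindatav.txt", "r", encoding="utf-8") as f:
    for line in f:
        # Each line is expected to look like "label => word word word ...".
        # strip() removes the trailing newline so the last word stays clean.
        text = line.strip().split(" => ")
        if len(text) == 2:
            label = text[0].strip()  # parsed but not used below
            # Keep only tokens that start with Chinese characters and have at least two characters.
            listsentence = [w for w in text[1].split(" ") if re.match("[\u4e00-\u9fa5]+", w) and len(w) >= 2]
            words.extend(listsentence)

words holds the tokens, appended in the order in which they occur in the corpus.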
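A minimal sketch of the same filtering applied to a single line may help. The sample line is made up; the only thing assumed about the file is the "label => words" layout that the code above implies. It strips the trailing newline first so that the last word does not carry a "\n".

```python
import re

CHINESE_PREFIX = re.compile("[\u4e00-\u9fa5]+")

def extract_words(line):
    """Return the Chinese tokens (two or more characters) from one corpus line."""
    parts = line.strip().split(" => ")
    if len(parts) != 2:
        return []  # malformed line: no "label => text" structure
    return [w for w in parts[1].split(" ")
            if CHINESE_PREFIX.match(w) and len(w) >= 2]

sample_line = "1 => 这本 书 质量 不错 ok 很好\n"  # hypothetical example line
print(extract_words(sample_line))
print(extract_words("a line without the separator"))
```

The single-character token "书" and the non-Chinese token "ok" are filtered out, and a line without " => " contributes nothing.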

Building the dictionary

import collections

vocabulary_size = 10000

def build_dataset(words):
    # count[0] is the UNK bucket; the rest are the most frequent words.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Assign each word an index in order of decreasing frequency.
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    # Replace every word in the corpus with its index.
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)

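vocabulary_size sets how many words the dictionary holds; every other word is mapped to UNK.

Words are selected by frequency here. If you have a better idea you can refine this yourself, for example by trimming both ends: dropping words that are too frequent as well as words that are too rare. I chose 10000 words for training.

The contents of dictionary are shown in the figure below.

[Figure: dictionary]

It assigns an index to every word.

The contents of data are shown in the figure below.

[Figure: data]

Each number is the index of the corresponding word in words, looked up in the dictionary above.

count holds the word-frequency statistics, as shown in the figure below.

[Figure: count]

reverse_dictionary is the inverse of dictionary: it maps each index back to its word.

[Figure: reverse_dictionary]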
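Since the figures are not reproduced here, a toy run shows what the four structures contain. This is a compact sketch of the same logic, with two assumptions of mine: vocabulary_size is passed as an argument instead of being a global, and the word list is made up.

```python
import collections

def build_dataset(words, vocabulary_size):
    # Slot 0 is reserved for UNK; the remaining slots go to the most frequent words.
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # 0 is the UNK index
    count[0][1] = data.count(0)  # number of tokens mapped to UNK
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

toy_words = ["good", "quality", "good", "nice", "quality", "good", "shipping"]
data, count, dictionary, reverse_dictionary = build_dataset(toy_words, vocabulary_size=3)

print(dictionary)          # {'UNK': 0, 'good': 1, 'quality': 2}
print(data)                # [1, 2, 1, 0, 2, 1, 0]
print(count)               # [['UNK', 2], ('good', 3), ('quality', 2)]
print(reverse_dictionary)  # {0: 'UNK', 1: 'good', 2: 'quality'}
```

With a vocabulary of 3, only the two most frequent words get their own index; "nice" and "shipping" both collapse into UNK.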

Parameter declarations

import numpy as np

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64  # Number of negative examples to sample.

The batch generator for the skip-gram model

import random

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    # Fill the buffer with the first window.
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # Pick a context position that has not been used yet.
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        # Slide the window one word to the right (wrapping around at the end).
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

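Here batch = np.ndarray(shape=(batch_size), dtype=np.int32) creates a vector of 128 elements, and labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) creates a 128x1 matrix. buffer holds the indices of the words in the current context window, taken from data. data_index is a global cursor into words (that is, into the entries of data); it lets the context window slide smoothly from one call to the next.

[Figure: the buffer context window]

skip_window is the number of words taken from one side (left or right) of the current input word. num_skips is the number of distinct words picked from the whole window to serve as output words.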
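A small, NumPy-free sketch can show how the sliding buffer turns into (input, label) pairs. It is not the function above: generate_pairs is a hypothetical helper that returns a list of pairs, keeps its cursor local, and draws the context positions with random.sample instead of the rejection loop.

```python
import collections
import random

def generate_pairs(data, n_centers, num_skips, skip_window, seed=0):
    """Emit (center, context) index pairs from a sliding window over data."""
    assert num_skips <= 2 * skip_window
    rng = random.Random(seed)
    span = 2 * skip_window + 1  # [ skip_window center skip_window ]
    buffer = collections.deque(maxlen=span)
    index = 0
    for _ in range(span):  # fill the first window
        buffer.append(data[index])
        index = (index + 1) % len(data)
    pairs = []
    context_positions = [p for p in range(span) if p != skip_window]
    for _ in range(n_centers):
        # Choose num_skips distinct context positions for this center word.
        for pos in rng.sample(context_positions, num_skips):
            pairs.append((buffer[skip_window], buffer[pos]))
        # Slide the window by one word; the deque drops the oldest entry.
        buffer.append(data[index])
        index = (index + 1) % len(data)
    return pairs

toy_data = [10, 11, 12, 13, 14]  # made-up word indices
pairs = generate_pairs(toy_data, n_centers=3, num_skips=2, skip_window=1)
print(pairs)
```

With skip_window=1 and num_skips=2 every center word is paired with both of its neighbours; only the order within each center is random.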

Building the computation graph

# This is TensorFlow 1.x API (tf.placeholder, tf.Session); it does not run
# unchanged on TensorFlow 2.x.
import math
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    # tf.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=embed, labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Compute the cosine similarity between minibatch examples and all embeddings.
    # (keep_dims was renamed keepdims in later TensorFlow 1.x releases.)
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    # Add variable initializer.
    init = tf.global_variables_initializer()

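First the data placeholders are declared: train_inputs [128] and train_labels [128x1]. Then valid_dataset is declared; it holds a sample of relatively frequent words, mainly so that the words most similar to them can be computed as a check on the trained vectors.

embeddings is the [10000x128] word-vector matrix, and embed is the matrix of word vectors for the current training batch. nce_weights is the weight matrix of the NCE loss; tf.truncated_normal() draws its initial values from a truncated normal distribution. nce_biases is the bias term. loss is the loss function, that is, the objective. optimizer is stochastic gradient descent with a step size of 1.0, used to minimize loss. similarity uses embeddings to compute the similarity between the words in valid_dataset and every word in the vocabulary.

A rough diagram of the network is shown below. Understanding the principle is enough; the figure is borrowed.

[Figure: neural network diagram]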
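To make the role of the sampled noise words in the loss more concrete, here is a sketch of the plain negative-sampling objective for one training pair. This is a swapped-in simplification, not what tf.nn.nce_loss computes: NCE additionally corrects for the probabilities of the noise distribution and uses bias terms. All names and vectors are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center_vec, true_out_vec, noise_out_vecs):
    """Loss for one (input word, true context word) pair plus sampled noise words.

    The true pair is pushed towards a high score, each noise pair towards a low one.
    """
    loss = -math.log(sigmoid(dot(center_vec, true_out_vec)))
    for noise_vec in noise_out_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, noise_vec)))
    return loss

center = [1.0, 0.0]
# Well-trained case: the true context scores high, the noise word scores low.
good = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0]])
# Badly-trained case: the scores are the wrong way round.
bad = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0]])
print(good, bad)
```

The loss is small when the input vector agrees with the true context word and disagrees with the noise words, which is the behaviour gradient descent drives the embeddings towards.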

Running the training iterations

num_steps = 100001

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print("Initialized")

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

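The training above is quite simple. Each iteration calls generate_batch to produce batch_inputs and batch_labels, which are the data fed to the graph, and then runs one optimization step. Every 2000 steps the average loss is printed, and every 10000 steps the top_k = 8 most similar words are computed for each word in valid_dataset. At the end, final_embeddings holds the normalized word vectors.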
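The nearest-neighbour step in the loop, (-sim[i, :]).argsort()[1:top_k + 1], can be sketched without TensorFlow or NumPy: normalize the rows, take dot products, sort in descending order, and skip the first hit, which is the query word itself. The helper names and the toy embeddings are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def nearest_neighbors(embeddings, query_index, top_k):
    """Indices of the top_k rows most similar to embeddings[query_index]."""
    unit = [normalize(v) for v in embeddings]
    sims = [sum(a * b for a, b in zip(unit[query_index], row)) for row in unit]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    # order[0] is the query itself (similarity 1.0), so start at position 1.
    return order[1:top_k + 1]

toy_embeddings = [
    [1.0, 0.0],   # index 0: the query
    [0.9, 0.1],   # index 1: almost the same direction
    [0.0, 1.0],   # index 2: orthogonal
    [-1.0, 0.0],  # index 3: opposite direction
]
print(nearest_neighbors(toy_embeddings, query_index=0, top_k=2))
```

Skipping position 0 relies on the query being its own best match, which holds as long as no other row points in exactly the same direction.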

Finally, visualization

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        # Chinese labels only render if matplotlib is using a font with CJK glyphs.
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)

try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 500  # plot only the 500 most frequent words
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")

The visualization uses t-SNE, which I won't go into here; for details, see the article on dimensionality reduction (数据降维).

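The Spark implementation of word2vec

For Spark I'll go straight to the code. It is simple, and the official site has a detailed tutorial. Personally I find Spark's API about as user-friendly as an API can get. Spark is also heading towards deep learning and real-time streaming, which I think puts it on the mainstream path. I'm looking forward to a user-friendly deep learning API.

Enough talk; here is the code.

import java.io.{File, PrintWriter}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object WordToVec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordToVec")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val stopwords = Array("的","是","你","我","他","她","它","和","了","而","有","人","被","做","对","与") // stop words
    val input = sc.textFile("c:/traindataw2v.txt")
      .map(line => line.split(" "))
      .map(_.filter(_.matches("[\u4E00-\u9FA5]+")).toSeq) // keep Chinese tokens only
      .map(_.filter(!stopwords.contains(_)))
      .map(_.filter(_.length >= 2)) // tokens must have at least two characters
    val word2vec = new Word2Vec()
      .setMinCount(2) // only words occurring at least 2 times enter the vocabulary
      .setWindowSize(5) // context window size of 5
      .setVectorSize(50) // word vectors have 50 dimensions
      .setNumIterations(25) // 25 iterations
      .setNumPartitions(3) // 3 data partitions
      .setSeed(12345) // seed for the random number generator
    val model = word2vec.fit(input)
    // model.save(sc, "D:/word2vecTmal")
    // val model = Word2VecModel.load(sc,"D:/word2vecTmal")
    val word = model.getVectors.keySet // the vocabulary (not used below)
    val writer = new PrintWriter(new File("c:/resultw2v.txt"))
    model.getVectors.foreach(kv => {
      writer.write(kv._1 + " => " + kv._2.mkString(" ") + "\n")
    })
    writer.close()
    val synonyms = model.findSynonyms("很好", 5) // the 5 words most similar to "很好", printed below
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    sc.stop()
  }
}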

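Summary

My personal recommendation: to train word2vec on a single machine, use the first approach (gensim); on a cluster, or when the data set is large, use distributed training with Spark. Both give more reliable results than the official TensorFlow example, which is a direct consequence of how that example is implemented.

That's enough from me. Try it yourself; my word is not final, and practice is the best teacher. More posts on related algorithms will follow, so stay tuned: all substance, no filler.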
