NLP Word Vectors: Using word2vec to train and test word vectors on the 20-category news text dataset (finding the words most related to a given word)

Contents

Output

Design approach

Core code


Output

The 10 words in the training corpus most related to "morning":
[('afternoon', 0.8329864144325256), ('weekend', 0.7690818309783936), ('evening', 0.7469204068183899),
('saturday', 0.7191835045814514), ('night', 0.7091601490974426), ('friday', 0.6764787435531616),
('sunday', 0.6380082368850708), ('newspaper', 0.6365975737571716), ('summer', 0.6268560290336609),
('season', 0.6137701272964478)]

The 10 words in the training corpus most related to "email":
[('mail', 0.7432783842086792), ('contact', 0.6995242834091187), ('address', 0.6547545194625854),
('replies', 0.6502780318260193), ('mailed', 0.6334187388420105), ('request', 0.6262195110321045),
('sas', 0.6220622658729553), ('send', 0.6207413077354431), ('listserv', 0.617364227771759),
('compuserve', 0.5954489707946777)]
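Each tuple pairs a vocabulary word with its cosine similarity to the query word; lists like these come from a most_similar query on the trained vectors. A minimal sketch of the query step, assuming a trained gensim 3.x model bound to the name model:

# Query the 10 words most similar to a given word on the trained vectors (gensim 3.x API)
print(model.wv.most_similar('morning', topn=10))
print(model.wv.most_similar('email', topn=10))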

Design approach
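The general approach: load the 20 Newsgroups corpus, split each post into sentences of lowercase word tokens, train a gensim Word2Vec model on those sentences, then query the trained vectors with most_similar. A minimal sketch of that pipeline (the tokenization details and the hyperparameters size, window and min_count are illustrative assumptions, not necessarily the exact values behind the output above):

# Minimal end-to-end sketch (illustrative parameters, not necessarily the author's exact script)
import re
import nltk
from sklearn.datasets import fetch_20newsgroups
from gensim.models import word2vec

news = fetch_20newsgroups(subset='all')          # ~18k posts across 20 categories

def to_sentences(text):
    """Split one post into sentences, each a list of lowercase alphabetic tokens."""
    sentences = []
    for sent in nltk.sent_tokenize(text):        # requires the nltk 'punkt' data package
        tokens = re.sub('[^a-zA-Z]', ' ', sent.lower()).split()
        if tokens:
            sentences.append(tokens)
    return sentences

corpus = []
for post in news.data:
    corpus += to_sentences(post)

# Train CBOW word vectors on the tokenized corpus (gensim 3.x keyword names: size, iter)
model = word2vec.Word2Vec(corpus, size=300, window=5, min_count=20, workers=2, sample=1e-3)
model.init_sims(replace=True)                    # done training; keep only the normalized vectors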

Core code

class Word2Vec(BaseWordEmbeddingsModel):
    """Train, use and evaluate neural networks described in https://code.google.com/p/word2vec/.

    Once you're finished training a model (=no more updates, only querying), store and use only the
    :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `self.wv` to reduce memory.

    The model can be stored/loaded via its :meth:`~gensim.models.word2vec.Word2Vec.save` and
    :meth:`~gensim.models.word2vec.Word2Vec.load` methods. The trained word vectors can also be
    stored/loaded from a format compatible with the original word2vec implementation via
    `self.wv.save_word2vec_format` and :meth:`gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`.

    Attributes
    ----------
    wv : :class:`~gensim.models.keyedvectors.Word2VecKeyedVectors`
        The mapping between words and embeddings. After training, it can be used directly to query
        those embeddings in various ways.
    vocabulary : :class:`~gensim.models.word2vec.Word2VecVocab`
        The vocabulary (sometimes called Dictionary in gensim) of the model. Besides keeping track
        of all unique words, it provides extra functionality such as constructing a huffman tree
        (frequent words are closer to the root) or discarding extremely rare words.
    trainables : :class:`~gensim.models.word2vec.Word2VecTrainables`
        The inner shallow neural network used to train the embeddings. The semantics differ slightly
        between the two training modes (CBOW or SG), but you can think of it as a NN with a single
        projection and hidden layer trained on the corpus; the weights are then used as the
        embeddings (so the hidden layer size equals the number of features `self.size`).
    """

    def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                 sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash,
                 iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
                 batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
                 max_final_vocab=None):
        """
        Parameters
        ----------
        sentences : iterable of iterables, optional
            Can be simply a list of lists of tokens; for larger corpora, consider an iterable that
            streams the sentences directly from disk/network (see BrownCorpus, Text8Corpus or
            LineSentence in the word2vec module, and the tutorial on data streaming in Python,
            https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/).
            If not supplied, the model is left uninitialized.
        size : int, optional
            Dimensionality of the word vectors.
        window : int, optional
            Maximum distance between the current and predicted word within a sentence.
        min_count : int, optional
            Ignores all words with total frequency lower than this.
        workers : int, optional
            Number of worker threads (faster training on multicore machines).
        sg : {0, 1}, optional
            Training algorithm: 1 for skip-gram; otherwise CBOW.
        hs : {0, 1}, optional
            If 1, hierarchical softmax is used; if 0 and `negative` is non-zero, negative sampling.
        negative : int, optional
            If > 0, negative sampling is used and this is the number of "noise words" drawn
            (usually between 5-20); 0 disables negative sampling.
        ns_exponent : float, optional
            Exponent shaping the negative sampling distribution: 1.0 samples in proportion to word
            frequencies, 0.0 samples all words equally, negative values favour low-frequency words.
            The default 0.75 comes from the original word2vec paper; Caselles-Dupré, Lesaint &
            Royo-Letelier (https://arxiv.org/abs/1804.04212) suggest other values can work better
            for recommendation applications.
        cbow_mean : {0, 1}, optional
            0 to use the sum of the context word vectors, 1 to use the mean (CBOW only).
        alpha : float, optional
            Initial learning rate.
        min_alpha : float, optional
            Learning rate drops linearly to this value as training progresses.
        seed : int, optional
            Seed for the random number generator; initial vectors are seeded with a hash of
            word + `str(seed)`. Full reproducibility also requires workers=1 and, on Python 3,
            the PYTHONHASHSEED environment variable.
        max_vocab_size : int, optional
            RAM limit during vocabulary building; prune infrequent words above it (every 10 million
            word types need about 1GB of RAM). None for no limit.
        max_final_vocab : int, optional
            Target vocabulary size, reached by automatically picking a matching min_count. If the
            specified min_count is higher than the calculated one, the specified value is used.
        sample : float, optional
            Threshold for randomly downsampling higher-frequency words; useful range (0, 1e-5).
        hashfxn : function, optional
            Hash function used to randomly initialize weights, for training reproducibility.
        iter : int, optional
            Number of iterations (epochs) over the corpus.
        trim_rule : function, optional
            Vocabulary trimming rule: None (use min_count, see gensim.utils.keep_vocab_item) or a
            callable (word, count, min_count) returning RULE_DISCARD, RULE_KEEP or RULE_DEFAULT.
            Only used to prune the vocabulary during build_vocab(); not stored with the model.
        sorted_vocab : {0, 1}, optional
            If 1, sort the vocabulary by descending frequency before assigning word indexes.
        batch_words : int, optional
            Target size (in words) for batches passed to worker threads (and thus cython routines);
            texts longer than 10000 words are truncated to that maximum by the standard cython code.
        compute_loss : bool, optional
            If True, compute and store the loss, retrievable via get_latest_training_loss().
        callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional
            Callbacks executed at specific stages during training.

        Examples
        --------
        >>> from gensim.models import Word2Vec
        >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
        >>> model = Word2Vec(sentences, min_count=1)
        """
        self.max_final_vocab = max_final_vocab
        self.callbacks = callbacks
        self.load = call_on_class_only

        self.wv = Word2VecKeyedVectors(size)
        self.vocabulary = Word2VecVocab(
            max_vocab_size=max_vocab_size, min_count=min_count, sample=sample,
            sorted_vocab=bool(sorted_vocab), null_word=null_word,
            max_final_vocab=max_final_vocab, ns_exponent=ns_exponent)
        self.trainables = Word2VecTrainables(seed=seed, vector_size=size, hashfxn=hashfxn)

        super(Word2Vec, self).__init__(
            sentences=sentences, workers=workers, vector_size=size, epochs=iter,
            callbacks=callbacks, batch_words=batch_words, trim_rule=trim_rule, sg=sg,
            alpha=alpha, window=window, seed=seed, hs=hs, negative=negative,
            cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
            fast_version=FAST_VERSION)

    def _do_train_job(self, sentences, alpha, inits):
        """Train the model on a single batch of sentences.

        `sentences` is the corpus chunk for this batch, `alpha` the learning rate used in it, and
        `inits` each worker thread's private work memory. Returns a 2-tuple (effective word count
        after ignoring unknown words and sentence length trimming, total word count).
        """
        work, neu1 = inits
        tally = 0
        if self.sg:
            tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)
        else:
            tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
        return tally, self._raw_word_count(sentences)

    def _clear_post_train(self):
        """Remove all L2-normalized word vectors from the model."""
        self.wv.vectors_norm = None

    def _set_train_params(self, **kwargs):
        if 'compute_loss' in kwargs:
            self.compute_loss = kwargs['compute_loss']
        self.running_training_loss = 0

    def train(self, sentences, total_examples=None, total_words=None, epochs=None,
              start_alpha=None, end_alpha=None, word_count=0, queue_factor=2,
              report_delay=1.0, compute_loss=False, callbacks=()):
        """Update the model's neural weights from a sequence of sentences.

        Notes
        -----
        To support linear learning-rate decay from `alpha` to `min_alpha` and accurate
        progress-percentage logging, either `total_examples` (count of sentences) or `total_words`
        (count of raw words in sentences) MUST be provided. If `sentences` is the same corpus that
        was passed to build_vocab() earlier, you can simply use `total_examples=self.corpus_count`.

        Warnings
        --------
        An explicit `epochs` argument MUST be provided; in the common and recommended case where
        train() is only called once, you can set `epochs=self.iter`.

        Other parameters: `start_alpha`/`end_alpha` replace the constructor's `alpha`/`min_alpha`
        for this one call only (use only when managing the learning rate yourself across multiple
        train() calls, which is not recommended); `word_count` is the number of words already
        trained (0 for the usual case); `queue_factor` multiplies the job-queue size (workers *
        queue_factor); `report_delay` is the number of seconds between progress reports;
        `compute_loss` and `callbacks` behave as in the constructor.

        Examples
        --------
        >>> from gensim.models import Word2Vec
        >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
        >>>
        >>> model = Word2Vec(min_count=1)
        >>> model.build_vocab(sentences)  # prepare the model vocabulary
        >>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  # train word vectors
        (1, 30)
        """
        return super(Word2Vec, self).train(
            sentences, total_examples=total_examples, total_words=total_words, epochs=epochs,
            start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
            queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss,
            callbacks=callbacks)

    def score(self, sentences, total_sentences=int(1e6), chunksize=100, queue_factor=2, report_delay=1):
        """Score the log probability for a sequence of sentences.

        This does not change the fitted model in any way (see train() for that). Gensim has
        currently only implemented score for the hierarchical softmax scheme, so the model must
        have been trained with `hs=1` and `negative=0`. You should specify `total_sentences`:
        scoring more sentences than this causes problems, but setting it too high is inefficient.
        `chunksize` is the chunk size of jobs; `queue_factor` and `report_delay` behave as in
        train(). See Matt Taddy, "Document Classification by Inversion of Distributed Language
        Representations" (https://arxiv.org/pdf/1504.07295.pdf) and the gensim demo
        (https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb) for examples
        of how to use such scores in document classification.
        """
        if FAST_VERSION < 0:
            warnings.warn(
                "C extension compilation failed, scoring will be slow. "
                "Install a C compiler and reinstall gensim for fastness.")

        logger.info(
            "scoring sentences with %i workers on %i vocabulary and %i features, "
            "using sg=%s hs=%s sample=%s and negative=%s",
            self.workers, len(self.wv.vocab), self.trainables.layer1_size, self.sg,
            self.hs, self.vocabulary.sample, self.negative)

        if not self.wv.vocab:
            raise RuntimeError("you must first build vocabulary before scoring new data")
        if not self.hs:
            raise RuntimeError(
                "We have currently only implemented score for the hierarchical softmax scheme, "
                "so you need to have run word2vec with hs=1 and negative=0 for this to work.")

        def worker_loop():
            """Compute log probability for each sentence, lifting lists of sentences from the jobs queue."""
            work = zeros(1, dtype=REAL)  # for sg hs, we actually only need one memory loc (running sum)
            neu1 = matutils.zeros_aligned(self.trainables.layer1_size, dtype=REAL)
            while True:
                job = job_queue.get()
                if job is None:  # signal to finish
                    break
                ns = 0
                for sentence_id, sentence in job:
                    if sentence_id >= total_sentences:
                        break
                    if self.sg:
                        score = score_sentence_sg(self, sentence, work)
                    else:
                        score = score_sentence_cbow(self, sentence, work, neu1)
                    sentence_scores[sentence_id] = score
                    ns += 1
                progress_queue.put(ns)  # report progress

        start, next_report = default_timer(), 1.0
        # buffer ahead only a limited number of jobs.. this is the reason we can't simply use ThreadPool :(
        job_queue = Queue(maxsize=queue_factor * self.workers)
        progress_queue = Queue(maxsize=(queue_factor + 1) * self.workers)

        workers = [threading.Thread(target=worker_loop) for _ in xrange(self.workers)]
        for thread in workers:
            thread.daemon = True  # make interrupting the process with ctrl+c easier
            thread.start()

        sentence_count = 0
        sentence_scores = matutils.zeros_aligned(total_sentences, dtype=REAL)

        push_done = False
        done_jobs = 0
        jobs_source = enumerate(utils.grouper(enumerate(sentences), chunksize))

        # fill jobs queue with (id, sentence) job items
        while True:
            try:
                job_no, items = next(jobs_source)
                if (job_no - 1) * chunksize > total_sentences:
                    logger.warning(
                        "terminating after %i sentences (set higher total_sentences if you want more).",
                        total_sentences)
                    job_no -= 1
                    raise StopIteration()
                logger.debug("putting job #%i in the queue", job_no)
                job_queue.put(items)
            except StopIteration:
                logger.info("reached end of input; waiting to finish %i outstanding jobs", job_no - done_jobs + 1)
                for _ in xrange(self.workers):
                    job_queue.put(None)  # give the workers heads up that they can finish -- no more work!
                push_done = True
            try:
                while done_jobs < (job_no + 1) or not push_done:
                    ns = progress_queue.get(push_done)  # only block after all jobs pushed
                    sentence_count += ns
                    done_jobs += 1
                    elapsed = default_timer() - start
                    if elapsed >= next_report:
                        logger.info(
                            "PROGRESS: at %.2f%% sentences, %.0f sentences/s",
                            100.0 * sentence_count, sentence_count / elapsed)
                        next_report = elapsed + report_delay  # don't flood log, wait report_delay seconds
                else:
                    break  # loop ended by job count; really done
            except Empty:
                pass  # already out of loop; continue to next push

        elapsed = default_timer() - start
        self.clear_sims()
        logger.info(
            "scoring %i sentences took %.1fs, %.0f sentences/s",
            sentence_count, elapsed, sentence_count / elapsed)
        return sentence_scores[:sentence_count]

    def clear_sims(self):
        """Remove all L2-normalized word vectors from the model, to free up memory.

        You can recompute them later again using the init_sims() method.
        """
        self.wv.vectors_norm = None

    def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict'):
        """Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format,
        where it intersects with the current vocabulary.

        No words are added to the existing vocabulary, but intersecting words adopt the file's
        weights, and non-intersecting words are left alone. `lockf` is the lock-factor for imported
        vectors: the default 0.0 prevents further updating during subsequent training, 1.0 allows
        further training updates of merged vectors. `binary` indicates the binary word2vec C
        format; `encoding`/`unicode_errors` are passed to the `unicode` function (python2 only).
        """
        overlap_count = 0
        logger.info("loading projection weights from %s", fname)
        with utils.smart_open(fname) as fin:
            header = utils.to_unicode(fin.readline(), encoding=encoding)
            vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
            if not vector_size == self.wv.vector_size:
                # TOCONSIDER: maybe mismatched vectors still useful enough to merge (truncating/padding)?
                raise ValueError("incompatible vector size %d in file %s" % (vector_size, fname))
            if binary:
                binary_len = dtype(REAL).itemsize * vector_size
                for _ in xrange(vocab_size):
                    # mixed text and binary: read text first, then binary
                    word = []
                    while True:
                        ch = fin.read(1)
                        if ch == b' ':
                            break
                        if ch != b'\n':  # ignore newlines in front of words (some binary files have)
                            word.append(ch)
                    word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
                    weights = fromstring(fin.read(binary_len), dtype=REAL)
                    if word in self.wv.vocab:
                        overlap_count += 1
                        self.wv.vectors[self.wv.vocab[word].index] = weights
                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf  # lock-factor: 0.0=no changes
            else:
                for line_no, line in enumerate(fin):
                    parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
                    if len(parts) != vector_size + 1:
                        raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
                    word, weights = parts[0], [REAL(x) for x in parts[1:]]
                    if word in self.wv.vocab:
                        overlap_count += 1
                        self.wv.vectors[self.wv.vocab[word].index] = weights
                        self.trainables.vectors_lockf[self.wv.vocab[word].index] = lockf  # lock-factor: 0.0=no changes
        logger.info("merged %d vectors into %s matrix from %s", overlap_count, self.wv.vectors.shape, fname)

    @deprecated("Method will be removed in 4.0.0, use self.wv.__getitem__() instead")
    def __getitem__(self, words):
        """Deprecated. Use `self.wv.__getitem__` instead."""
        return self.wv.__getitem__(words)

    @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")
    def __contains__(self, word):
        """Deprecated. Use `self.wv.__contains__` instead."""
        return self.wv.__contains__(word)

    def predict_output_word(self, context_words_list, topn=10):
        """Get the probability distribution of the center word given context words.

        Returns a `topn`-length list of (word, probability) tuples.
        """
        if not self.negative:
            raise RuntimeError(
                "We have currently only implemented predict_output_word for the negative sampling scheme, "
                "so you need to have run word2vec with negative > 0 for this to work.")
        if not hasattr(self.wv, 'vectors') or not hasattr(self.trainables, 'syn1neg'):
            raise RuntimeError("Parameters required for predicting the output words not found.")

        word_vocabs = [self.wv.vocab[w] for w in context_words_list if w in self.wv.vocab]
        if not word_vocabs:
            warnings.warn("All the input context words are out-of-vocabulary for the current model.")
            return None

        word2_indices = [word.index for word in word_vocabs]
        l1 = np_sum(self.wv.vectors[word2_indices], axis=0)
        if word2_indices and self.cbow_mean:
            l1 /= len(word2_indices)

        # propagate hidden -> output and take softmax to get probabilities
        prob_values = exp(dot(l1, self.trainables.syn1neg.T))
        prob_values /= sum(prob_values)
        top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)
        # returning the most probable output words with their probabilities
        return [(self.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]

    def init_sims(self, replace=False):
        """Deprecated. Use `self.wv.init_sims` instead."""
        if replace and hasattr(self.trainables, 'syn1'):
            del self.trainables.syn1
        return self.wv.init_sims(replace)

    def reset_from(self, other_model):
        """Borrow shareable pre-built structures from `other_model` and reset hidden layer weights.

        Structures copied are the vocabulary, the index-to-word mapping, the cumulative frequency
        table (used for negative sampling) and the cached corpus length. Useful when testing
        multiple models on the same corpus in parallel.
        """
        self.wv.vocab = other_model.wv.vocab
        self.wv.index2word = other_model.wv.index2word
        self.vocabulary.cum_table = other_model.vocabulary.cum_table
        self.corpus_count = other_model.corpus_count
        self.trainables.reset_weights(self.hs, self.negative, self.wv)

    @staticmethod
    def log_accuracy(section):
        """Deprecated. Use `self.wv.log_accuracy` instead."""
        return Word2VecKeyedVectors.log_accuracy(section)

    @deprecated("Method will be removed in 4.0.0, use self.wv.evaluate_word_analogies() instead")
    def accuracy(self, questions, restrict_vocab=30000, most_similar=None, case_insensitive=True):
        """Deprecated. Use `self.wv.accuracy` instead."""
        most_similar = most_similar or Word2VecKeyedVectors.most_similar
        return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive)

    def __str__(self):
        """Human readable representation of the model's state, including the vocabulary size,
        vector size and learning rate."""
        return "%s(vocab=%s, size=%s, alpha=%s)" % (
            self.__class__.__name__, len(self.wv.index2word), self.wv.vector_size, self.alpha)

    def delete_temporary_training_data(self, replace_word_vectors_with_normalized=False):
        """Discard parameters that are used in training and scoring, to save memory.

        Use only if you're sure you're done training a model. If
        `replace_word_vectors_with_normalized` is True, forget the original (not normalized) word
        vectors and only keep the L2-normalized word vectors, to save even more memory.
        """
        if replace_word_vectors_with_normalized:
            self.init_sims(replace=True)
        self._minimize_model()

    def save(self, *args, **kwargs):
        """Save the model to the file path `fname`.

        The saved model can be loaded again using load(), which supports online training and
        getting vectors for vocabulary words.
        """
        # don't bother storing the cached normalized vectors, recalculable table
        kwargs['ignore'] = kwargs.get('ignore', ['vectors_norm', 'cum_table'])
        super(Word2Vec, self).save(*args, **kwargs)

    def get_latest_training_loss(self):
        """Get current value of the training loss."""
        return self.running_training_loss

    @deprecated("Method will be removed in 4.0.0, keep just_word_vectors = model.wv to retain just the KeyedVectors instance")
    def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_vectors_lockf=False):
        if save_syn1 and save_syn1neg and save_vectors_lockf:
            return
        if hasattr(self.trainables, 'syn1') and not save_syn1:
            del self.trainables.syn1
        if hasattr(self.trainables, 'syn1neg') and not save_syn1neg:
            del self.trainables.syn1neg
        if hasattr(self.trainables, 'vectors_lockf') and not save_vectors_lockf:
            del self.trainables.vectors_lockf
        self.model_trimmed_post_training = True

    @classmethod
    def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8',
                             unicode_errors='strict', limit=None, datatype=REAL):
        """Deprecated. Use :meth:`gensim.models.KeyedVectors.load_word2vec_format` instead."""
        raise DeprecationWarning("Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.")

    def save_word2vec_format(self, fname, fvocab=None, binary=False):
        """Deprecated. Use `model.wv.save_word2vec_format` instead."""
        raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")

    @classmethod
    def load(cls, *args, **kwargs):
        """Load a previously saved :class:`~gensim.models.word2vec.Word2Vec` model from `fname`.

        See also save().
        """
        try:
            model = super(Word2Vec, cls).load(*args, **kwargs)
            # for backward compatibility for `max_final_vocab` feature
            if not hasattr(model, 'max_final_vocab'):
                model.max_final_vocab = None
                model.vocabulary.max_final_vocab = None
            return model
        except AttributeError:
            logger.info('Model saved using code from earlier Gensim Version. Re-loading old model in a compatible way.')
            from gensim.models.deprecated.word2vec import load_old_word2vec
            return load_old_word2vec(*args, **kwargs)
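For completeness, a hedged usage sketch of a few of the methods documented above: save() and load() persist and restore the whole model, and predict_output_word() returns the most probable center words for a given context (only available when the model was trained with negative sampling, i.e. negative > 0). The file name and the context words here are illustrative only:

from gensim.models import word2vec

# Persist and reload the trained model (file name is illustrative)
model.save('20news_word2vec.model')
model = word2vec.Word2Vec.load('20news_word2vec.model')

# Most probable center words for a context; requires a model trained with negative > 0
print(model.predict_output_word(['email', 'address'], topn=5))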
