Downloading the corpus:

See: https://blog.csdn.net/weixin_35757704/article/details/115614112

1. Code for training the Word2vec model

For code that trains a word2vec model using only the gensim library, see: https://blog.csdn.net/weixin_35757704/article/details/115601271

2. Removing stop words

The main code is:

def get_stop_words(filepath) -> list:
    # The stop-word file holds a single comma-separated line
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.readline().strip().split(',')


def filter_stop_words(content: list) -> list:
    """Remove stop words."""
    clean_content = []
    # A set makes each membership test constant-time
    stop_words = set(get_stop_words('en_stopwords_line.txt'))
    for word in content:
        if word not in stop_words:
            clean_content.append(word)
    return clean_content

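The contents of the stop-word list en_stopwords_line.txt are given below. Copy them into a text file; note that the whole list is a single line of data: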

'd,'ll,'m,'re,'s,'t,'ve,ZT,ZZ,a,a's,able,about,above,abst,accordance,according,accordingly,across,act,actually,added,adj,adopted,affected,affecting,affects,after,afterwards,again,against,ah,ain't,all,allow,allows,almost,alone,along,already,also,although,always,am,among,amongst,an,and,announce,another,any,anybody,anyhow,anymore,anyone,anything,anyway,anyways,anywhere,apart,apparently,appear,appreciate,appropriate,approximately,are,area,areas,aren,aren't,arent,arise,around,as,aside,ask,asked,asking,asks,associated,at,auth,available,away,awfully,b,back,backed,backing,backs,be,became,because,become,becomes,becoming,been,before,beforehand,began,begin,beginning,beginnings,begins,behind,being,beings,believe,below,beside,besides,best,better,between,beyond,big,biol,both,brief,briefly,but,by,c,c'mon,c's,ca,came,can,can't,cannot,cant,case,cases,cause,causes,certain,certainly,changes,clear,clearly,co,com,come,comes,concerning,consequently,consider,considering,contain,containing,contains,corresponding,could,couldn't,couldnt,course,currently,d,date,definitely,describe,described,despite,did,didn't,differ,different,differently,discuss,do,does,doesn't,doing,don't,done,down,downed,downing,downs,downwards,due,during,e,each,early,ed,edu,effect,eg,eight,eighty,either,else,elsewhere,end,ended,ending,ends,enough,entirely,especially,et,et-al,etc,even,evenly,ever,every,everybody,everyone,everything,everywhere,ex,exactly,example,except,f,face,faces,fact,facts,far,felt,few,ff,fifth,find,finds,first,five,fix,followed,following,follows,for,former,formerly,forth,found,four,from,full,fully,further,furthered,furthering,furthermore,furthers,g,gave,general,generally,get,gets,getting,give,given,gives,giving,go,goes,going,gone,good,goods,got,gotten,great,greater,greatest,greetings,group,grouped,grouping,groups,h,had,hadn't,happens,hardly,has,hasn't,have,haven't,having,he,he's,hed,hello,help,hence,her,here,here's,hereafter,hereby,herein,heres,hereupon,hers,herself,hes,hi,hid,high,higher,highest,him,himself,his,hither,home,hopefully,how,howbeit,however,hundred,i,i'd,i'll,i'm,i've,id,ie,if,ignored,im,immediate,immediately,importance,important,in,inasmuch,inc,include,indeed,index,indicate,indicated,indicates,information,inner,insofar,instead,interest,interested,interesting,interests,into,invention,inward,is,isn't,it,it'd,it'll,it's,itd,its,itself,j,just,k,keep,keeps,kept,keys,kg,kind,km,knew,know,known,knows,l,large,largely,last,lately,later,latest,latter,latterly,least,less,lest,let,let's,lets,like,liked,likely,line,little,long,longer,longest,look,looking,looks,ltd,m,made,mainly,make,makes,making,man,many,may,maybe,me,mean,means,meantime,meanwhile,member,members,men,merely,mg,might,million,miss,ml,more,moreover,most,mostly,mr,mrs,much,mug,must,my,myself,n,n't,na,name,namely,nay,nd,near,nearly,necessarily,necessary,need,needed,needing,needs,neither,never,nevertheless,new,newer,newest,next,nine,ninety,no,nobody,non,none,nonetheless,noone,nor,normally,nos,not,noted,nothing,novel,now,nowhere,number,numbers,o,obtain,obtained,obviously,of,off,often,oh,ok,okay,old,older,oldest,omitted,on,once,one,ones,only,onto,open,opened,opening,opens,or,ord,order,ordered,ordering,orders,other,others,otherwise,ought,our,ours,ourselves,out,outside,over,overall,owing,own,p,page,pages,part,parted,particular,particularly,parting,parts,past,per,perhaps,place,placed,places,please,plus,point,pointed,pointing,points,poorly,possible,possibly,potentially,pp,predominantly,present,presented,presenting,presents,presumably,previously,primarily,probably,problem,problems,promptly,proud,provides,put,puts,q,que,quickly,quite,qv,r,ran,rather,rd,re,readily,really,reasonably,recent,recently,ref,refs,regarding,regardless,regards,related,relatively,research,respectively,resulted,resulting,results,right,room,rooms,run,s,said,same,saw,say,saying,says,sec,second,secondly,seconds,section,see,seeing,seem,seemed,seeming,seems,seen,sees,self,selves,sensible,sent,serious,seriously,seven,several,shall,she,she'll,shed,shes,should,shouldn't,show,showed,showing,shown,showns,shows,side,sides,significant,significantly,similar,similarly,since,six,slightly,small,smaller,smallest,so,some,somebody,somehow,someone,somethan,something,sometime,sometimes,somewhat,somewhere,soon,sorry,specifically,specified,specify,specifying,state,states,still,stop,strongly,sub,substantially,successfully,such,sufficiently,suggest,sup,sure,t,t's,take,taken,taking,tell,tends,th,than,thank,thanks,thanx,that,that'll,that's,that've,thats,the,their,theirs,them,themselves,then,thence,there,there'll,there's,there've,thereafter,thereby,thered,therefore,therein,thereof,therere,theres,thereto,thereupon,these,they,they'd,they'll,they're,they've,theyd,theyre,thing,things,think,thinks,third,this,thorough,thoroughly,those,thou,though,thoughh,thought,thoughts,thousand,three,throug,through,throughout,thru,thus,til,tip,to,today,together,too,took,toward,towards,tried,tries,truly,try,trying,ts,turn,turned,turning,turns,twice,two,u,un,under,unfortunately,unless,unlike,unlikely,until,unto,up,upon,ups,us,use,used,useful,usefully,usefulness,uses,using,usually,uucp,v,value,various,very,via,viz,vol,vols,vs,w,want,wanted,wanting,wants,was,wasn't,way,ways,we,we'd,we'll,we're,we've,wed,welcome,well,wells,went,were,weren't,what,what'll,what's,whatever,whats,when,whence,whenever,where,where's,whereafter,whereas,whereby,wherein,wheres,whereupon,wherever,whether,which,while,whim,whither,who,who'll,who's,whod,whoever,whole,whom,whomever,whos,whose,why,widely,will,willing,wish,with,within,without,won't,wonder,words,work,worked,working,works,world,would,wouldn't,www,x,y,year,years,yes,yet,you,you'd,you'll,you're,you've,youd,young,younger,youngest,your,youre,yours,yourself,yourselves,z,zero,zt,zz
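
The single-line, comma-separated format described above can be exercised with a small sketch. It assumes a tiny hypothetical stop-word file written to a temporary directory instead of the full list, and the names `load_stop_words`, `remove_stop_words`, and `demo` are illustrative only, not part of the original code. Loading the words into a set once, outside the filtering loop, avoids re-reading the file for every document.

```python
import os
import tempfile


def load_stop_words(filepath: str) -> frozenset:
    """Read a one-line, comma-separated stop-word file into a frozenset."""
    with open(filepath, "r", encoding="utf-8") as f:
        line = f.readline()
    # Strip the trailing newline and skip empty entries left by stray commas.
    return frozenset(w.strip() for w in line.split(",") if w.strip())


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stop_words]


def demo():
    # Hypothetical miniature list; the real file holds the full line above.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "en_stopwords_line.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("a,an,the,of,is\n")
        stop_words = load_stop_words(path)
    tokens = ["the", "history", "of", "paris", "is", "long"]
    return remove_stop_words(tokens, stop_words)


if __name__ == "__main__":
    print(demo())
```

Since only the first line of the file is read, a list that has been wrapped onto several lines would be silently truncated, which is why the file must stay on one line.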

3. Lemmatization

Lemmatize with NLTK:

from nltk.stem import WordNetLemmatizer


def word_lemmatize(all_content):
    """Lemmatization"""
    lemmatize = WordNetLemmatizer()
    for i, word in enumerate(all_content):
        # Lemmatize as a verb, then as a noun, then as an adjective
        word = lemmatize.lemmatize(word, pos='v')
        word = lemmatize.lemmatize(word, pos='n')
        all_content[i] = lemmatize.lemmatize(word, pos='a')
    return all_content

4. Multiprocessing

For multiprocessing, see: https://blog.csdn.net/weixin_35757704/article/details/115674954
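
As a rough illustration of the multiprocessing pattern, here is a minimal sketch using only the standard library: tasks go to a worker pool with `apply_async`, and a callback in the parent process appends each result to a file. The toy documents, the miniature `STOP_WORDS` set, and the names `preprocess`, `make_writer`, and `run` are assumptions made for this example, not the linked article's code.

```python
import multiprocessing
import os
import tempfile

# Hypothetical miniature stop-word set, standing in for en_stopwords_line.txt.
STOP_WORDS = frozenset({"a", "the", "of", "is"})


def preprocess(tokens):
    """Worker task: lowercase the tokens and drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def make_writer(path):
    """Build a callback that appends one space-joined line per result."""
    def write_line(tokens):
        with open(path, "a", encoding="utf-8") as f:
            f.write(" ".join(tokens) + "\n")
    return write_line


def run(documents, out_path, processes=2):
    """Preprocess documents in a worker pool; callbacks run in the parent."""
    writer = make_writer(out_path)
    with multiprocessing.Pool(processes) as pool:
        for doc in documents:
            pool.apply_async(preprocess, args=(doc,), callback=writer)
        # close() stops new submissions; join() blocks until all tasks and
        # their callbacks have finished, so the file is complete afterwards.
        pool.close()
        pool.join()


if __name__ == "__main__":
    docs = [["The", "History", "of", "Paris"], ["A", "River", "is", "Wide"]]
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "train_demo.txt")
        run(docs, out)
        with open(out, encoding="utf-8") as f:
            print(sorted(f.read().splitlines()))
```

Callbacks fire in completion order, so the lines in the output file need not follow the input order. Waiting on `close()` and `join()` before reading the file matters: without it, a later step could start while workers are still writing.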

Full code

import gensim
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec, word2vec
import multiprocessing


def get_stop_words(filepath) -> list:
    # The stop-word file holds a single comma-separated line
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.readline().strip().split(',')


def write_in_file(str_line):
    with open('train_wiki.txt', 'a+', encoding='utf-8') as f:
        line = " ".join(str_line) + "\n"
        f.write(line)


def filter_stop_words(content) -> list:
    """Remove stop words."""
    clean_content = []
    # A set makes each membership test constant-time
    stop_words = set(get_stop_words('en_stopwords_line.txt'))
    for word in content:
        if word not in stop_words:
            clean_content.append(word)
    return clean_content


def multiplication(content):
    content = lower_word(content)
    content = filter_stop_words(content)
    content = word_lemmatize(content)
    return content


def lower_word(all_content):
    for i, content in enumerate(all_content):
        all_content[i] = content.lower()  # convert the text to lowercase
    return all_content


def word_lemmatize(all_content):
    """Lemmatization"""
    lemmatize = WordNetLemmatizer()
    for i, word in enumerate(all_content):
        word = lemmatize.lemmatize(word, pos='v')
        word = lemmatize.lemmatize(word, pos='n')
        all_content[i] = lemmatize.lemmatize(word, pos='a')
    return all_content


def train_word2vec(words_file):
    # sentences can be built with BrownCorpus, Text8Corpus, or LineSentence
    # LineSentence streams the tokenized file from disk, so the corpus
    # does not have to fit in memory
    sentences = word2vec.LineSentence(words_file)
    model = Word2Vec(sentences, vector_size=350, window=5, sg=0, workers=multiprocessing.cpu_count())
    return model


if __name__ == '__main__':
    # Extract the text
    # This is the English corpus downloaded from the official Wikipedia site
    wiki = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for text in wiki.get_texts():  # process each article
        pool.apply_async(func=multiplication, args=(text,), callback=write_in_file)
    # Wait for every task to finish so train_wiki.txt is complete before training
    pool.close()
    pool.join()
    print('Begin Word2vec train')
    # Start training the word2vec model
    model = train_word2vec(words_file='train_wiki.txt')
    model.save('wiki_word2vec.model')
    print('WIKI WORD2VEC MODEL FINISH')
