Using Word2Vec in Spark
Spark provides Word2Vec in two packages:
- spark.mllib contains the original API built on top of RDDs.
- spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Word2VecExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Word2Vec example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Input data: Each row is a bag of words from a sentence or document.
    val documentDF = sqlContext.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
    val model = word2Vec.fit(documentDF)
    val result = model.transform(documentDF)
    result.select("result").take(3).foreach(println)
  }
}
```
```scala
package mllib

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Zhili on 2016/12/8.
 */
object TestWord2Vec {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("test word2vec")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Each line becomes a Seq[String] of tokens.
    val input = sc.textFile("G:\\BigData\\spark-1.6.1\\spark-1.6.1\\data/mllib/sample_lda_data.txt")
      .map(line => line.split(" ").toSeq)

    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)

    // Top-5 nearest words to "1" by cosine similarity.
    val synonyms = model.findSynonyms("1", 5)
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
  }
}
```
The same toy data with the spark.ml API, this time also inspecting the learned vectors via getVectors:

```scala
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.select("result").foreach(println)
result.foreach(println)
val vecs = model.getVectors
```
```
[heard,[-0.053943559527397156,0.14666683971881866,-0.002084704814478755]]
[are,[-0.1626942902803421,-0.14475099742412567,0.11403404921293259]]
[neat,[-0.040669217705726624,0.028109613806009293,-0.16291147470474243]]
[classes,[-0.14895521104335785,-0.04932709410786629,0.033132851123809814]]
[I,[-0.01939234882593155,-0.13069935142993927,0.142848938703537]]
[regression,[0.16618786752223969,0.06462828069925308,0.09238555282354355]]
[Logistic,[0.03707731142640114,0.057859670370817184,-0.02196161448955536]]
[Spark,[-0.12654900550842285,0.09880610555410385,-0.10377782583236694]]
[could,[0.15330205857753754,0.06014800816774368,0.07719922065734863]]
[use,[0.08296778798103333,0.0022815780248492956,-0.07951478660106659]]
[Hi,[-0.05663909390568733,0.009638422168791294,-0.033786069601774216]]
[models,[0.11944633722305298,0.13371166586875916,0.14418047666549683]]
[case,[0.14063765108585358,0.08095260709524155,0.15926401317119598]]
[about,[0.11595913767814636,0.10366207361221313,-0.06955810636281967]]
[Java,[0.122124083340168,-0.031705472618341446,-0.1425546556711197]]
[wish,[0.14893394708633423,-0.11224708706140518,-0.040021225810050964]]
```
`def findSynonyms(vector: Vector, num: Int): Array[(String, Double)]`
Find synonyms of the vector representation of a word, possibly including any words in the model vocabulary whose vector representation is the supplied vector.

`def findSynonyms(word: String, num: Int): Array[(String, Double)]`
Find synonyms of a word; the word itself is not included in the results.

`def getVectors: Map[String, Array[Float]]`
Returns a map of words to their vector representations.

`def save(sc: SparkContext, path: String): Unit`
Save this model to the given path.

`def transform(word: String): Vector`
Transforms a word to its vector representation.

Source: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/feature/Word2VecModel.html
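A minimal sketch tying these spark.mllib methods together. It assumes the `sc` and `input` (an `RDD[Seq[String]]`) values from the training example above; the query word `"1"` and the save path are illustrative choices, not requirements:

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val model = new Word2Vec().fit(input)

// transform: look up a single word's vector
// (throws IllegalStateException if the word is out of vocabulary).
val vec = model.transform("1")

// findSynonyms by vector: nearest vocabulary words to an arbitrary vector;
// unlike the String overload, this may return the query word itself.
model.findSynonyms(vec, 5).foreach { case (word, sim) => println(s"$word $sim") }

// Persist the model and reload it later.
model.save(sc, "./data/word2vec_mllib_model")
val reloaded = Word2VecModel.load(sc, "./data/word2vec_mllib_model")
```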
`def transform(dataset: Dataset[_]): DataFrame`
Transform a sentence column to a vector column to represent the whole sentence. The transform is performed by averaging all word vectors it contains.
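To make the averaging concrete, here is an illustrative (non-Spark) sketch of the same computation; `lookup` is a hypothetical stand-in for the model's learned word vectors, and out-of-vocabulary words simply contribute nothing to the sum:

```scala
// Average a document's word vectors into one sentence vector,
// dividing by the sentence length as the ml transform does.
def sentenceVector(words: Seq[String],
                   lookup: Map[String, Array[Double]],
                   size: Int): Array[Double] = {
  val sum = new Array[Double](size)
  for (w <- words; v <- lookup.get(w); i <- 0 until size) sum(i) += v(i)
  sum.map(_ / words.size)
}
```

A sentence whose words all have vectors thus gets exactly the mean of those vectors, which is what `transform` writes into the output column for each row.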
Training the spark.ml model on a text file and saving the learned word vectors as text:

```scala
val conf = new SparkConf().setMaster("local").setAppName("test word2vec")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val data_path = "G:\\BigData\\spark-1.6.1\\spark-1.6.1\\data/mllib/sample_lda_data.txt"
val input = sc.textFile(data_path).map(line => line.split(" ").toSeq)
val documentDF = sqlContext.createDataFrame(input.map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

// In spark.ml, getVectors is a DataFrame with "word" and "vector" columns.
val vecs = model.getVectors
vecs.foreach(println)
val out_path = "./data/word2vec"
vecs.rdd.saveAsTextFile(out_path)
```