Using Word2Vec in Spark

Spark provides Word2Vec in two packages:

org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

org.apache.spark.ml.feature.Word2Vec

The two are essentially the same; they differ only in the data type they operate on.

spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

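mllib works directly on RDDs, while ml works on DataFrames.

Below are examples of running word2vec with each of the two packages. Both are taken from the official examples; note the type of the input data in each.

Using ml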

// $example on$
import org.apache.spark.ml.feature.Word2Vec
// $example off$
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Word2VecExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Word2Vec example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    // Input data: Each row is a bag of words from a sentence or document.
    val documentDF = sqlContext.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
    val model = word2Vec.fit(documentDF)
    val result = model.transform(documentDF)
    result.select("result").take(3).foreach(println)
    // $example off$
  }
}

Here documentDF is a DataFrame.

Using mllib

package mllib

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
//import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Zhili on 2016/12/8.
  */
object TestWord2Vec {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("test word2vec ")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val input = sc.textFile("G:\\BigData\\spark-1.6.1\\spark-1.6.1\\data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)

    val synonyms = model.findSynonyms("1", 5)
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    // $example off$
  }
}

Here input is an RDD[Seq[String]].
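The findSynonyms call in the mllib example returns pairs of a word and its cosine similarity to the query word. As a rough illustration of that ranking, here is a minimal sketch in plain Scala that does not use Spark at all. The object name SynonymSketch, the helper cosine, and the toy vectors are made up for this illustration; this is not Spark's implementation.

```scala
object SynonymSketch {
  // Cosine similarity between two vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Hypothetical stand-in for findSynonyms(word, num): rank every other
  // word by cosine similarity to the query word and keep the top `num`.
  def findSynonyms(word: String, num: Int,
                   vectors: Map[String, Array[Double]]): Array[(String, Double)] = {
    val query = vectors(word) // throws NoSuchElementException for unknown words
    vectors.toArray
      .filter { case (w, _) => w != word }
      .map { case (w, v) => (w, cosine(query, v)) }
      .sortBy { case (_, similarity) => -similarity }
      .take(num)
  }

  def main(args: Array[String]): Unit = {
    // Toy 2-dimensional vectors, invented for the example.
    val vectors = Map(
      "king" -> Array(1.0, 0.0),
      "queen" -> Array(0.9, 0.1),
      "apple" -> Array(0.0, 1.0)
    )
    for ((synonym, cosineSimilarity) <- findSynonyms("king", 2, vectors)) {
      println(s"$synonym $cosineSimilarity")
    }
  }
}
```

With these toy vectors, "queen" is ranked ahead of "apple" because its vector points in almost the same direction as the vector for "king".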

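Also note that neither of the two examples above yields the words together with their word vectors.

The first example produces a vector representation of each sentence.

The second example produces the 5 words closest to the word "1".

To get every word together with its trained word vector, use the following code: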

val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.select("result").foreach(println)
result.foreach(println)

val vecs = model.getVectors

The key is the last statement: getVectors returns every word with its vector. The result looks like this:

[heard,[-0.053943559527397156,0.14666683971881866,-0.002084704814478755]]
[are,[-0.1626942902803421,-0.14475099742412567,0.11403404921293259]]
[neat,[-0.040669217705726624,0.028109613806009293,-0.16291147470474243]]
[classes,[-0.14895521104335785,-0.04932709410786629,0.033132851123809814]]
[I,[-0.01939234882593155,-0.13069935142993927,0.142848938703537]]
[regression,[0.16618786752223969,0.06462828069925308,0.09238555282354355]]
[Logistic,[0.03707731142640114,0.057859670370817184,-0.02196161448955536]]
[Spark,[-0.12654900550842285,0.09880610555410385,-0.10377782583236694]]
[could,[0.15330205857753754,0.06014800816774368,0.07719922065734863]]
[use,[0.08296778798103333,0.0022815780248492956,-0.07951478660106659]]
[Hi,[-0.05663909390568733,0.009638422168791294,-0.033786069601774216]]
[models,[0.11944633722305298,0.13371166586875916,0.14418047666549683]]
[case,[0.14063765108585358,0.08095260709524155,0.15926401317119598]]
[about,[0.11595913767814636,0.10366207361221313,-0.06955810636281967]]
[Java,[0.122124083340168,-0.031705472618341446,-0.1425546556711197]]
[wish,[0.14893394708633423,-0.11224708706140518,-0.040021225810050964]]

Finally, a look at the API.

These are the methods for retrieving results. They come from the mllib package; note the return types.

def findSynonyms(vector: Vector, num: Int): Array[(String, Double)]

Find synonyms of the vector representation of a word, possibly including any words in the model vocabulary whose vector representation is the supplied vector.

def findSynonyms(word: String, num: Int): Array[(String, Double)]

Find synonyms of a word; do not include the word itself in results.

def getVectors: Map[String, Array[Float]]

Returns a map of words to their vector representations.

def save(sc: SparkContext, path: String): Unit

Save this model to the given path.

def transform(word: String): Vector

Transforms a word to its vector representation

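Source: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/feature/Word2VecModel.html

All of these come from Word2VecModel.

The two methods of most interest are:

getVectors(): returns every word in the corpus together with its word vector

transform(): represents one line of the training corpus, that is, one sentence, as a vector

In both cases the vectors are dense vectors.

As for transform:

def transform(dataset: Dataset[_]): DataFrame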

Transform a sentence column to a vector column to represent the whole sentence. The transform is performed by averaging all word vectors it contains.

Source: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/Word2VecModel.h

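It simply averages the vectors of all the words in a sentence and uses the result as the sentence vector, which is the most naive representation.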
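The averaging idea can be sketched in a few lines of plain Scala without Spark. The object name SentenceVectorSketch, the helper averageVector, and the toy vectors are hypothetical. One assumption to note: this sketch skips words that have no vector and divides by the number of words it found, which is not necessarily how Spark treats out-of-vocabulary words.

```scala
object SentenceVectorSketch {
  // Average the vectors of the words in a sentence.
  // Assumption of this sketch: words without a vector are skipped, and an
  // empty or fully unknown sentence maps to the zero vector.
  def averageVector(sentence: Seq[String],
                    vectors: Map[String, Array[Double]],
                    size: Int): Array[Double] = {
    val known = sentence.flatMap(vectors.get)
    if (known.isEmpty) Array.fill(size)(0.0)
    else {
      val sum = Array.fill(size)(0.0)
      for (v <- known; i <- 0 until size) sum(i) += v(i)
      sum.map(_ / known.size)
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy 3-dimensional vectors, invented for the example.
    val vectors = Map(
      "hi" -> Array(1.0, 0.0, 2.0),
      "spark" -> Array(3.0, 4.0, 0.0)
    )
    val avg = averageVector(Seq("hi", "spark", "unknown"), vectors, 3)
    println(avg.mkString("[", ", ", "]")) // prints [2.0, 2.0, 1.0]
  }
}
```

Because every word contributes equally and word order is ignored, two sentences containing the same words in a different order get the same vector.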

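Note that this API comes from the ml package, so its return type is DataFrame, whereas the mllib transform above returns a Vector.

Supplement:

Read the file the usual mllib way, then run the algorithm through the DataFrame API: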

val conf = new SparkConf().setMaster("local").setAppName("test word2vec ")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val data_path = "G:\\BigData\\spark-1.6.1\\spark-1.6.1\\data/mllib/sample_lda_data.txt"
val input = sc.textFile(data_path).map(line => line.split(" ").toSeq)
val documentDF = sqlContext.createDataFrame(input.map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
val model = word2Vec.fit(documentDF)

val vecs = model.getVectors
vecs.foreach(println)

val out_path = "./data/word2vec"
vecs.rdd.saveAsTextFile(out_path)
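To reuse the saved vectors outside Spark, each line of the text output has to be parsed back into a word and its numbers. The sketch below assumes the saved lines have the same layout as the printed rows shown earlier, for example [heard,[-0.05,0.14,-0.002]]; that layout is an assumption about the text output, and the object name VectorLineParser is made up for this illustration.

```scala
import scala.util.Try

object VectorLineParser {
  // Matches lines such as "[heard,[-0.05,0.14,-0.002]]".
  // Assumption: the word itself does not contain the sequence ",[".
  private val LinePattern = """^\[(.+),\[(.*)\]\]$""".r

  // Returns the word and its vector, or None if the line does not fit.
  def parse(line: String): Option[(String, Array[Double])] = line.trim match {
    case LinePattern(word, numbers) =>
      val fields = numbers.split(",").map(_.trim).filter(_.nonEmpty)
      Try(fields.map(_.toDouble)).toOption.map(values => (word, values))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val line = "[heard,[-0.053943559527397156,0.14666683971881866,-0.002084704814478755]]"
    parse(line) match {
      case Some((word, vector)) => println(s"$word has ${vector.length} dimensions")
      case None => println("could not parse line")
    }
  }
}
```

Parsing text is only a fallback for quick inspection; saving the model or the DataFrame in a structured format avoids the round trip through strings.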
