Spark Word2vec vector mathematics

Anonymous (unverified), submitted 2019-12-03 01:48:02

Question:

I was looking at the Word2Vec example on the Spark site:

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("country name here", 40)

How do I do the interesting vector arithmetic, such as king - man + woman = queen? I can use model.getVectors, but I am not sure how to proceed from there.

Answer 1:

Here is an example in pyspark, which I guess is straightforward to port to Scala - the key is the use of model.transform.

First, we train the model as in the example:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext()
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))

k = 220         # vector dimensionality
word2vec = Word2Vec().setVectorSize(k)
model = word2vec.fit(inp)

k is the dimensionality of the word vectors. Higher values generally give better results (the default is 100) but need more memory; the highest value my machine could handle was 220. (EDIT: Typical values in the relevant publications are between 300 and 1000)

After we have trained the model, we can define a simple function as follows:

def getAnalogy(s, model):
    qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])
    res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"
    res = [x[0] for x in res]
    for k in range(0,3):
        if s[k] in res:
            res.remove(s[k])
    return res[0]
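
The sign handling in the function is easy to misread: it builds s[0] - s[1] - s[2] and then negates it, which should be the same as s[1] - s[0] + s[2] (for example paris - france + portugal). A minimal sketch with plain Python lists can confirm the algebra. The three vectors are made-up toy values standing in for the results of model.transform, and the helper names are hypothetical:

```python
# Toy 3-dimensional vectors, invented for illustration (not real embeddings).
a = [1.0, 2.0, 3.0]   # stands in for model.transform(s[0])
b = [0.5, 0.0, 4.0]   # stands in for model.transform(s[1])
c = [2.0, 1.0, 1.0]   # stands in for model.transform(s[2])

def sub(u, v):
    """Element-wise u - v."""
    return [x - y for x, y in zip(u, v)]

def add(u, v):
    """Element-wise u + v."""
    return [x + y for x, y in zip(u, v)]

def neg(u):
    """Element-wise -u."""
    return [-x for x in u]

# What getAnalogy computes: -(a - b - c)
negated = neg(sub(sub(a, b), c))

# The analogy written directly: b - a + c
direct = add(sub(b, a), c)

print(negated == direct)
```

Writing the query directly as b - a + c would avoid the extra negation; the behaviour should be the same either way.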

Now, here are some examples with countries and their capitals:

s = ('france', 'paris', 'portugal')
getAnalogy(s, model)
# u'lisbon'

s = ('china', 'beijing', 'russia')
getAnalogy(s, model)
# u'moscow'

s = ('spain', 'madrid', 'greece')
getAnalogy(s, model)
# u'athens'

s = ('germany', 'berlin', 'portugal')
getAnalogy(s, model)
# u'lisbon'

s = ('japan', 'tokyo', 'sweden')
getAnalogy(s, model)
# u'stockholm'

s = ('finland', 'helsinki', 'iran')
getAnalogy(s, model)
# u'tehran'

s = ('egypt', 'cairo', 'finland')
getAnalogy(s, model)
# u'helsinki'

The results are not always correct - I'll leave it to you to experiment, but they get better with more training data and increased vector dimensionality k.

The for loop in the function removes entries that belong to the input query itself, as I noticed that frequently the correct answer was the second one in the returned list, with the first usually being one of the input terms.
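
The filtering step can also be written as a small standalone helper. This is only a sketch: the function name first_non_input and the ranked list are hypothetical, and the list stands in for the words returned by findSynonyms in ranked order.

```python
def first_non_input(candidates, inputs):
    """Return the first candidate that is not one of the input words, or None."""
    excluded = set(inputs)
    for word in candidates:
        if word not in excluded:
            return word
    return None  # every candidate was an input term

# Hypothetical ranked list in which the top hit is one of the query words.
ranked = ['portugal', 'lisbon', 'porto', 'madrid', 'paris']
s = ('france', 'paris', 'portugal')

print(first_non_input(ranked, s))  # lisbon
```

Unlike res[0] in the original function, this returns None instead of raising an IndexError when every returned word is one of the inputs.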



Answer 2:

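import com.github.fommil.netlib.BLAS.{getInstance => blas} // assumption: netlib-java BLAS, as bundled with older Spark releases
import org.apache.spark.mllib.linalg.DenseVector

val w2v_map = sameModel.getVectors // this gives you a map {word: vec}

// Note: the key "women" is kept from the original run; "woman" was probably intended.
val (king, man, woman) = (w2v_map.get("king").get, w2v_map.get("man").get, w2v_map.get("women").get)

val n = king.length

// saxpy(n: Int, sa: Float, sx: Array[Float], incx: Int, sy: Array[Float], incy: Int) computes sy += sa * sx in place
blas.saxpy(n, -1, man, 1, king, 1)   // king = king - man
blas.saxpy(n, 1, woman, 1, king, 1)  // king = king - man + woman

val vec = new DenseVector(king.map(_.toDouble))

val most_similar_word_to_vector = sameModel.findSynonyms(vec, 10) // they have an api to get synonyms for word, and one for vector
// The source is cut off after the loop pattern; printing each pair matches the output shown below.
for ((synonym, cosineSimilarity) <- most_similar_word_to_vector) {
  println(s"$synonym $cosineSimilarity")
}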

The output is as follows:

women 0.628454885964967
philip 0.5539534290356802
henry 0.5520055707837214
vii 0.5455116413024774
elizabeth 0.5290994886254643
queen 0.5162519562606844
men 0.5133851770249461
wenceslaus 0.5127030522678778
viii 0.5104392579985102
eldest 0.510425791249559



Answer 3:

Here is the pseudo code. For the full implementation, read the documentation: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html

  1. w2v_map = model.getVectors() # this gives you a map {word: vec}
  2. my_vector = w2v_map.get('king') - w2v_map.get('man') + w2v_map.get('woman') # do vector algebra here
  3. most_similar_word_to_vector = model.findSynonyms(my_vector, 10) # there is an API to get synonyms for a word, and one for a vector

edit: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/feature/Word2VecModel.html#findSynonyms(org.apache.spark.mllib.linalg.Vector,%20int)
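
To make the three pseudo-code steps concrete without a Spark cluster, here is a minimal pure-Python sketch. It is not Spark's implementation: the dictionary replaces model.getVectors, the hand-written find_synonyms function is a hypothetical stand-in for model.findSynonyms that ranks by cosine similarity, the exclude parameter is an addition that mirrors the input filtering from Answer 1, and the 2-dimensional vectors are invented toy values.

```python
import math

# Toy 2-dimensional "embeddings", invented for illustration only.
# Axis 0 loosely means "royalty", axis 1 means male (+1) or female (-1).
w2v_map = {
    'king':  [1.0, 1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def find_synonyms(vector, num, exclude=()):
    """Hypothetical stand-in for model.findSynonyms(vector, num)."""
    scored = [(word, cosine(vector, vec))
              for word, vec in w2v_map.items()
              if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]

# Step 2 of the pseudo code: king - man + woman, element by element.
king, man, woman = w2v_map['king'], w2v_map['man'], w2v_map['woman']
my_vector = [k - m + w for k, m, w in zip(king, man, woman)]

# Step 3: rank the remaining vocabulary against the query vector.
result = find_synonyms(my_vector, 2, exclude=('king', 'man', 'woman'))
print(result[0][0])  # queen
```

With real embeddings the match is rarely this clean, which is why the earlier answers look at several top candidates rather than only the first.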


