Spark MLlib Word2Vec Error: The vocabulary size should be > 0

Question

I am trying to implement word vectorization using Spark's MLlib. I am following the example given here.

I have a bunch of sentences that I want to use as input to train the model, but I am not sure whether the model takes sentences or just takes all the words as one sequence of strings.

My input is as below:

scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...

But when I try to train my Word2Vec model on this input, it does not work.

scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040

scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.

Does Word2Vec not take sentences as input?

Answer 1:

Your input is correct. However, Word2Vec automatically removes every word that does not occur a minimum number of times across the whole corpus (all sentences combined). By default this minimum is 5. In your case, it is highly likely that no word occurs 5 or more times in the data you use.

To change the minimum required number of occurrences, use setMinCount(). For example, for a minimum count of 2:

val word2vec = new Word2Vec().setMinCount(2)
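
The effect of the threshold can be sketched in plain Scala, without Spark. This is only an illustration of the idea (count each word over all sentences, then keep the words that reach the minimum count), not Spark's actual implementation. The names MinCountSketch, wordCounts, vocabulary and toySentences are hypothetical, and the toy corpus is made up rather than taken from the question's data.

```scala
// Plain Scala sketch of a minCount-style vocabulary filter (no Spark required).
// All names below are hypothetical and exist only for this illustration.
object MinCountSketch {

  // Made-up toy corpus: each inner Seq is one tokenized sentence.
  val toySentences: Seq[Seq[String]] = Seq(
    Seq("big", "baller", "shoe"),
    Seq("quick", "fact", "from", "runner"),
    Seq("happy", "quick", "fact")
  )

  // Count every token across all sentences combined.
  def wordCounts(sentences: Seq[Seq[String]]): Map[String, Int] =
    sentences.flatten
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Keep only the words that occur at least minCount times.
  def vocabulary(sentences: Seq[Seq[String]], minCount: Int): Set[String] =
    wordCounts(sentences).collect { case (word, n) if n >= minCount => word }.toSet

  def main(args: Array[String]): Unit = {
    // With a threshold of 5, no word in the toy corpus qualifies.
    println(vocabulary(toySentences, 5))
    // With a threshold of 2, only "quick" and "fact" remain.
    println(vocabulary(toySentences, 2))
  }
}
```

Running the same kind of count on the real input before calling fit shows how many words would survive a given threshold. Note that a sequence holding one unsplit string counts as a single word, so it is worth checking that each sentence really is split into individual tokens.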

Source: https://stackoverflow.com/questions/48086226/spark-mlib-word2vec-error-the-vocabulary-size-should-be-0

Tags

scala

apache-spark

machine-learning

apache-spark-mllib

word2vec