Spark MLlib Word2Vec Error: The vocabulary size should be > 0

Submitted by 荒凉一梦 on 2019-12-11 00:18:39

Question


I am trying to implement word vectorization using Spark's MLlib. I am following the example given here.

I have a bunch of sentences that I want to use as input to train the model, but I am not sure whether the model takes sentences or just a single sequence of all the words as strings.

My input is as below:

scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...

However, when I try to train a Word2Vec model on this input, it fails:

scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040

scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.

Does Word2Vec not take sentences as input?


Answer 1:


Your input is correct. However, Word2Vec automatically drops every word that occurs fewer than a minimum number of times across the whole corpus (all sentences combined). This threshold, minCount, defaults to 5. In your case, it is highly likely that no word occurs 5 or more times in the data you use, so the vocabulary ends up empty and the requirement fails.
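
To see why the vocabulary can end up empty, the filtering step can be imitated without Spark. The following is a minimal sketch in plain Scala, not Spark's actual implementation: the object name MinCountCheck, its helper functions, and the sample sentences are made up for illustration. It counts each token across all sentences and keeps only those that reach the threshold.

```scala
object MinCountCheck {
  // Count how often each token occurs across all sentences combined.
  def tokenCounts(sentences: Seq[Seq[String]]): Map[String, Int] =
    sentences.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Tokens whose count is at least minCount, sorted for stable output.
  def surviving(sentences: Seq[Seq[String]], minCount: Int): List[String] =
    tokenCounts(sentences).collect { case (w, c) if c >= minCount => w }.toList.sorted

  def main(args: Array[String]): Unit = {
    // Hypothetical toy corpus, not the data from the question.
    val sentences = Seq(
      Seq("big", "baller", "shoe"),
      Seq("big", "shoe", "sale"),
      Seq("quick", "fact", "from", "runner")
    )
    println(surviving(sentences, 5)) // prints List()
    println(surviving(sentences, 2)) // prints List(big, shoe)
  }
}
```

With this toy data no token reaches 5 occurrences, which corresponds to the situation that triggers the error, while a threshold of 2 leaves two tokens. Printing the counts also shows what each token looks like, which is a quick way to confirm that the sequences hold individual words.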

To change the minimum required number of occurrences, use setMinCount(). For example, for a minimum count of 2:

val word2vec = new Word2Vec().setMinCount(2)
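
An end-to-end run in spark-shell might look like the sketch below. It assumes the shell's predefined SparkContext `sc`; the toy sentences, the variable names, and the vector size of 10 are illustrative choices and do not come from the question.

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// Assumption: running inside spark-shell, where `sc` already exists.
// Hypothetical toy corpus: one Seq[String] of tokens per sentence.
val sentences: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("big", "baller", "shoe"),
  Seq("big", "shoe", "sale"),
  Seq("quick", "fact", "from", "runner")
))

// Count token occurrences across all sentences combined.
val counts = sentences.flatMap(s => s).map(w => (w, 1L)).reduceByKey(_ + _)

// Check how many tokens would survive the threshold before training.
val minCount = 2
val survivors = counts.filter { case (_, c) => c >= minCount }.count()
println(s"tokens with count >= $minCount: $survivors")

// Train only if the vocabulary would be non-empty.
if (survivors > 0) {
  val model = new Word2Vec()
    .setMinCount(minCount)
    .setVectorSize(10) // small size chosen only for this toy example
    .fit(sentences)
  println(model.getVectors.keys.toSeq.sorted)
}
```

On a real corpus, the RDD from the question (`v`) would take the place of `sentences`.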


Source: https://stackoverflow.com/questions/48086226/spark-mlib-word2vec-error-the-vocabulary-size-should-be-0
