TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()

Submitted by 十年热恋 on 2019-12-11 15:26:32

Question


There is a dataframe like this:

  index  terms   
  1345  ['jays', 'place', 'great', 'subway']    
  1543  ['described', 'communicative', 'friendly']    
  9874  ['great', 'sarahs', 'apartament', 'back']    
  2456  ['great', 'sarahs', 'apartament', 'back']  

I am trying to create a dictionary from the corpus in comments['terms'], but I get an error:

from gensim import corpora, models
dictionary = corpora.Dictionary( comments['terms'] )

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Answer 1:


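Dictionary expects an iterable of documents, where each document is itself a list of string tokens. The terms for each index therefore need to be in their own sublist, with all sublists nested inside one outer list.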

theterms = [
    ['jays', 'place', 'great', 'subway'],
    ['described', 'communicative', 'friendly'],
    ['great', 'sarahs', 'apartament', 'back'],
    ['great', 'sarahs', 'apartament', 'back'],
]

dictionary = corpora.Dictionary(theterms)
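
One way to end up with a single string per document is a column that holds the text representation of each list, for example after a round trip through a CSV file. The question does not confirm that this is the cause, so the following is only a minimal sketch under that assumption. The helper name `parse_terms` and the sample `raw_terms` data are hypothetical and used for illustration only.

```python
import ast


def parse_terms(value):
    """Return a list of string tokens from a list or from its string form."""
    if isinstance(value, str):
        # Assumption: the string looks like "['jays', 'place']",
        # i.e. a Python list literal that was saved as text.
        value = ast.literal_eval(value)
    return [str(token) for token in value]


# Hypothetical sample rows: one stored as a string, one already a list.
raw_terms = [
    "['jays', 'place', 'great', 'subway']",
    ['described', 'communicative', 'friendly'],
]

# Every document is now a list of tokens nested in one outer list.
theterms = [parse_terms(value) for value in raw_terms]
```

The resulting theterms has the nested-list shape described in this answer and can be passed to corpora.Dictionary in the same way.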



Answer 2:


First convert the column to a plain list with comments['terms'].tolist(), then build the dictionary from that list; it should work. Any other preprocessing, such as stemming or stop-word removal, can be done before creating the dictionary.
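
The conversion could look roughly like the sketch below. The dataframe here is a hypothetical two-row stand-in for the question's `comments`, and the `bad_rows` check is an added assumption, not part of the original answer: it only flags rows that are still plain strings, which is the case the error message complains about.

```python
import pandas as pd

# Hypothetical stand-in for the question's `comments` dataframe.
comments = pd.DataFrame(
    {
        "terms": [
            ["jays", "place", "great", "subway"],
            ["described", "communicative", "friendly"],
        ]
    },
    index=[1345, 1543],
)

# Convert the pandas Series into a plain Python list of documents.
documents = comments["terms"].tolist()

# Positions of documents that are a single string instead of a token list.
bad_rows = [i for i, doc in enumerate(documents) if isinstance(doc, str)]

# If bad_rows is empty, the list can be handed to gensim:
# dictionary = corpora.Dictionary(documents)
```

If `bad_rows` is not empty, those rows need to be split or parsed into token lists before the dictionary is built.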



Source: https://stackoverflow.com/questions/44352552/typeerror-doc2bow-expects-an-array-of-unicode-tokens-on-input-not-a-single-str
