Python Gensim word2vec vocabulary key

Submitted by 陌路散爱 on 2019-12-24 07:57:42

Question


I want to build a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

# -*- encoding:utf-8 -*-
# !/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus,keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')

pprint.pprint(model.most_similar(u'너'))

Above is my source code. It seems to work well. However, there is a problem with the vocabulary keys. I want to build a Korean word2vec model, which uses unicode. For example, the word 사과, which means "apology" in English, is \xC0AC\xACFC in unicode. If I try to look up 사과 in the word2vec model, a KeyError occurs.
Instead of \xc0ac\xacfc being stored as a single key, \xc0ac and \xacfc are stored separately. What is the reason, and how can I solve it?


Answer 1:


Word2Vec requires text examples that are broken into word tokens, so each example should be a list of words. It appears you are simply providing strings to Word2Vec, so when it iterates over them, it sees only single characters as words.
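To make the effect concrete, here is a minimal sketch in plain Python (no gensim involved) of what iterating over a string produces compared with iterating over a list of tokens. The variable names are only illustrative, and the code points are the ones quoted in the question.

```python
# -*- coding: utf-8 -*-
# The word from the question, written with its two code points (U+C0AC, U+ACFC).
word = u'\uc0ac\uacfc'

# Iterating over a plain string yields one character at a time,
# which is why each syllable ends up as its own vocabulary entry.
chars = list(word)

# Wrapping the word in a list makes the whole word the unit of iteration.
tokens = list([word])

print(len(chars), len(tokens))
```

Here `chars` holds two single-character strings, while `tokens` holds the intact word.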

Does Korean use spaces to delimit words? If so, break your texts by spaces before handing the list-of-words as a text example to Word2Vec.

If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.
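The whitespace-splitting option can be sketched as follows. This is only an illustration under the assumption that each line of the input holds one sentence with space-separated words. `tokenize_lines` is a hypothetical helper, not part of gensim, and the sample lines are made up.

```python
# -*- coding: utf-8 -*-

def tokenize_lines(lines):
    """Turn an iterable of text lines into a list of token lists.

    Hypothetical helper: splits each line on whitespace and skips empty lines.
    """
    sentences = []
    for line in lines:
        tokens = line.split()  # split() with no argument splits on any whitespace
        if tokens:
            sentences.append(tokens)
    return sentences


# Made-up sample input standing in for the contents of parsed_data.txt.
raw_lines = [u'나는 사과를 먹는다', u'', u'너 사과']
corpus = tokenize_lines(raw_lines)

# Each element of corpus is now a list of words rather than a single string,
# so it can be passed to build_vocab() and train() in place of the raw strings.
for sentence in corpus:
    print(len(sentence))
```

With input shaped like this, the vocabulary keys are whole whitespace-delimited tokens instead of single characters.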



Source: https://stackoverflow.com/questions/43065843/python-gensim-word2vec-vocabulary-key
