UnicodeDecodeError when loading word2vec

Submitted by 夙愿已清 on 2020-01-24 15:11:04

Question


Full Description

I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can either train my own word vectors or use previously trained ones, such as Google's or Wikipedia's; those are available for English and aren't useful to me, since I am working with texts in Brazilian Portuguese. So I went hunting for pre-trained word vectors in Portuguese, which led me to Hirosan's List of Pretrained Word Embeddings, then to Kyubyong's WordVectors, and from there to Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.

Short Description

I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.

Downloads

  • Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
  • Polyglot's pre-trained word vectors for Portuguese.

Loading attempts

Kyubyong's WordVectors

First attempt: using Gensim, as suggested by Hirosan:

from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)

And the error returned:

[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The zip downloaded also contains other files but all of them return similar errors.

Polyglot

First attempt: following Al-Rfou's instructions:

import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))

And the error returned:

File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
    words, embeddings = pickle.load(open(polyglot_path, "rb"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)

Second attempt: using Polyglot's word embedding load function:

First, we have to install polyglot via pip:

pip install polyglot

Now we can import it:

from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)

And the error returned:

File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Extra Information

I am using Python 3 on macOS High Sierra.

Solutions

Kyubyong's WordVectors

As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is to call Word2Vec's native load function.

from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)

Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?
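One thing worth trying for the Polyglot file (an assumption on my part, not verified against this exact download): Polyglot's `.pkl` files appear to have been written by Python 2, and Python-2 pickles of 8-bit strings fail under Python 3's default ASCII decoding — which matches the `'ascii' codec can't decode byte 0xd4` error above. `pickle.load` accepts an `encoding` argument for exactly this case. A minimal sketch of the mechanism, using a handcrafted protocol-0 pickle in place of the real file:

```python
import pickle

# b"S'\xd4'\n." is the protocol-0 pickle of the Python-2 byte string
# '\xd4' -- the same byte the traceback above complains about.
py2_pickle = b"S'\\xd4'\n."

# The Python-3 default decoding for Python-2 strings is ASCII,
# so the load fails just like polyglot-pt.pkl did:
try:
    pickle.loads(py2_pickle)
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xd4 ...

# 'latin1' maps every byte to a code point, so nothing is lost:
print(pickle.loads(py2_pickle, encoding='latin1'))
```

Applied to the real file, that would be `words, embeddings = pickle.load(open(pol_path, 'rb'), encoding='latin1')` — again, a guess I have not tested against Polyglot's actual data.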


Answer 1:


For Kyubyong's pre-trained word2vec .bin file: it may have been saved using gensim's save function.

"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."

i.e., model = Word2Vec.load(fname)

Let me know if that works.
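One quick sanity check for that hypothesis (my own sketch, not from the mailing list): a file written by gensim's `save()` is a Python pickle, and protocol-2+ pickles begin with the PROTO opcode — byte 0x80, the exact byte the traceback complains about at position 0. A genuine C-format word2vec .bin instead begins with an ASCII header of the form `<vocab_size> <vector_dim>\n`:

```python
import pickle

def looks_like_pickle(first_bytes):
    """True if the leading byte is pickle's PROTO opcode (0x80)."""
    return first_bytes[:1] == b'\x80'

# A gensim model.save() file is a pickle, so it starts with 0x80 ...
assert looks_like_pickle(pickle.dumps({'any': 'object'}))

# ... while a C-format word2vec binary starts with an ASCII header.
assert not looks_like_pickle(b'71291 200\n<binary vectors follow>')
```

So peeking at the first byte of pt.bin (`open(kyu_path, 'rb').read(1)`) tells you which loader applies: `Word2Vec.load` if it is 0x80, `load_word2vec_format(..., binary=True)` otherwise.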

Reference: Gensim mailing list



Source: https://stackoverflow.com/questions/50573054/unicodedecodeerror-error-when-loading-word2vec
