'utf-8' decode error when loading a word2vec module

Submitted by 不羁岁月 on 2019-12-07 04:07:09

Question


I need to use a word2vec model that contains a large number of Chinese characters. The model was trained by my coworkers using Java and saved as a .bin file.

I installed gensim and tried to load the model, but the following error occurred:

In [1]: import gensim  

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I tried loading the model in both Python 2.7 and 3.5, and it failed the same way in each. How can I load the model in gensim? Thanks.


Answer 1:


The model was trained in Java on a corpus containing a large number of Chinese characters, and I could not determine the encoding of the original corpus. The error can be worked around as described in the gensim FAQ, by calling load_word2vec_format with a flag that ignores character decoding errors:

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')
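For context, gensim documents unicode_errors as a value that is passed along as the errors argument when each word's bytes are decoded. The sketch below uses only Python 3's built-in bytes.decode, not gensim, to show what that setting means for a word whose UTF-8 bytes are cut short. The sample word 词向量 is made up for illustration and does not come from the model in the question.

```python
# A three-character Chinese word encoded as UTF-8 (3 bytes per character).
word_bytes = "词向量".encode("utf-8")

# Simulate a damaged token by cutting off the last byte of the final character.
truncated = word_bytes[:-1]

try:
    truncated.decode("utf-8")  # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict :", exc.reason)

# "ignore" silently drops the bytes that cannot be decoded.
ignored = truncated.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD so the damage stays visible.
replaced = truncated.decode("utf-8", errors="replace")

print("ignore :", ignored)
print("replace:", replaced)
```

With 'ignore' the damaged word still loads, but under a shortened key, so a lookup for the original word would miss it.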

However, I do not know whether ignoring the decoding errors affects the loaded model.




Answer 2:


I tried the flag

unicode_errors='ignore'

but it did not solve the Unicode problem.

I found that the Unicode error appeared after I renamed the file from filename.bin.gz to filename.gz.

My solution was to extract the compressed file instead of renaming it.

I then loaded the extracted file with the flag above, and there was no Unicode error.

Note that I was using a Mac (Sierra) with Python 2.7.
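A minimal sketch of the extraction step using only Python's standard library. It checks whether a file starts with the gzip magic bytes and, if so, decompresses it to a plain file that can then be passed to the loader. The helper names is_gzip and extract_gzip and the file name vectors.bin.gz are made up for the demo.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def extract_gzip(src_path, dst_path):
    """Decompress src_path into dst_path and return dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Self-contained demo in a temporary directory with placeholder content.
workdir = tempfile.mkdtemp()
compressed = os.path.join(workdir, "vectors.bin.gz")
extracted = os.path.join(workdir, "vectors.bin")
original = b"2 3\n" + b"\x00" * 32

with gzip.open(compressed, "wb") as fh:
    fh.write(original)

was_compressed = is_gzip(compressed)
extract_gzip(compressed, extracted)
with open(extracted, "rb") as fh:
    roundtrip = fh.read()

print(was_compressed, is_gzip(extracted), roundtrip == original)
```

Checking the magic bytes avoids relying on the file extension, which is the part that renaming changes.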



Source: https://stackoverflow.com/questions/34427678/utf-8-decode-error-when-loading-a-word2vec-module
