Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Anonymous (unverified), submitted on 2019-12-03 08:44:33

Question:

I am trying to do the following Kaggle assignment. I am using the gensim package for word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
    Traceback (most recent call last):
      File "prog_w2v.py", line 7, in <module>
        models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
      File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
        header = utils.to_unicode(fin.readline())
      File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
        return unicode(text, encoding, errors=errors)
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but I was unable to solve the problem with it. My prog_w2v.py is below.

    import gensim
    import time

    start = time.time()
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
    end = time.time()
    print end-start, "   seconds"

I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.

Answer 1:

You are not loading the file correctly. Use load() instead of load_word2vec_format(). The latter reads files in the format written by the original C word2vec tool, for example its binary format. You trained the model in Python with gensim and did not save it in that format, so load_word2vec_format() cannot parse the file. The following should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt') 
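
Why byte 0x80 at position 0? A likely explanation, assuming the model was written with gensim's save() (which pickles the object): pickle streams from protocol 2 onward begin with the byte 0x80, and that byte can never start a UTF-8 sequence. The sketch below reproduces the same decode failure with the standard library only. It is written for Python 3, unlike the Python 2 script in the question, and the pickled dict is only a stand-in for a real model.

```python
import pickle

# Stand-in for a saved model; a plain dict is an assumption for illustration.
payload = pickle.dumps({"vector_size": 300}, protocol=2)

# Pickle protocol 2 and later start with the PROTO opcode, byte 0x80.
first_byte = payload[:1]
print(first_byte)  # b'\x80'

# load_word2vec_format() expects a text header line, so it tries to decode
# the first line as UTF-8. Doing the same to pickled bytes fails immediately.
try:
    payload.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# The message names byte 0x80 at position 0, as in the traceback above.
print(error_message)
```

If the first byte of your file is 0x80, that is a strong hint that it is a pickle and should be opened with load().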


Answer 2:

If you saved your model with save(), you must load it with load().

load_word2vec_format() is for models stored in the format of Google's original word2vec tool, not for models saved in gensim's own format.



Answer 3:

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin') 

Loading that file with load_word2vec_format() then causes this error. To make it work, use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin') 

The same kind of mismatch happens in the other direction, when you save the model with:

 model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False) 

and then try to load it with KeyedVectors.load(). In this situation, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
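
Since regenerating the model is slow, it can help to check which family a file belongs to before picking a loader. The following is a minimal sketch, not part of gensim: guess_gensim_loader is a hypothetical helper, and it relies on two assumptions, that save() output is a pickle starting with byte 0x80 and that word2vec-format files start with a "<vocab_size> <vector_size>" header line. The two sample files are tiny stand-ins written only for the demonstration.

```python
import os
import pickle
import tempfile


def guess_gensim_loader(path):
    """Hypothetical helper: guess which loader family matches a saved file."""
    with open(path, "rb") as fin:
        first_line = fin.readline()
    if first_line[:1] == b"\x80":
        # Pickle protocol 2+ marker: looks like the output of save().
        return "load"
    try:
        header = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        return "unknown"
    if len(header) == 2 and all(token.isdigit() for token in header):
        # "<vocab_size> <vector_size>" header: word2vec format.
        return "load_word2vec_format"
    return "unknown"


tmp_dir = tempfile.mkdtemp()

# Stand-in for a file written by save(): any pickle will do here.
pickled_path = os.path.join(tmp_dir, "model.bin")
with open(pickled_path, "wb") as fout:
    pickle.dump({"stand-in": "for a pickled model"}, fout, protocol=2)

# Stand-in for a file written by save_word2vec_format(..., binary=False).
text_path = os.path.join(tmp_dir, "vectors.txt")
with open(text_path, "w", encoding="utf-8") as fout:
    fout.write("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")

print(guess_gensim_loader(pickled_path))  # load
print(guess_gensim_loader(text_path))     # load_word2vec_format
```

The helper only tells the two families apart. It does not distinguish the text and binary variants of the word2vec format, so the binary= argument still has to match how the file was written.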

