Question:

I am working on the following Kaggle assignment and using the gensim package for word2vec. I can create the model and save it to disk, but when I try to load the file back I get the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
    Traceback (most recent call last):
      File "prog_w2v.py", line 7, in <module>
        models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
      File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
        header = utils.to_unicode(fin.readline())
      File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
        return unicode(text, encoding, errors=errors)
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but it did not help me solve the problem. My prog_w2v.py is as follows.

    import gensim
    import time

    start = time.time()
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
    end = time.time()
    print end-start,"   seconds"

I generate the model using the code here. The program takes about half an hour to build the model, so I cannot rerun it many times to debug it.

Answer 1:

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is for models trained with the original C code and saved in its word2vec format. Your model was trained in Python and was not saved in that format, so load_word2vec_format() cannot parse the file. You can simply use the following code and it should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
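The byte 0x80 at position 0 in the traceback is consistent with this explanation. As far as I know, gensim's save() writes the model with Python's pickle module, and a pickle stream written with protocol 2 or higher starts with the byte 0x80, which is not a valid first byte in UTF-8. The sketch below uses only the standard library and is written for Python 3; the dictionary is a made-up stand-in for a saved model, not a gensim object.

```python
import pickle

# Hypothetical stand-in for a saved model: any object pickled with
# protocol 2 or higher produces a stream that starts with byte 0x80.
payload = pickle.dumps({"vector_size": 300}, protocol=2)

first_byte = payload[:1]
print(first_byte)  # b'\x80'

# Treating those bytes as a UTF-8 text header fails at the first byte,
# which mirrors what happens when a text header is expected.
try:
    payload.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

On Python 3 the codec is reported as 'utf-8' rather than 'utf8', but the rest of the message matches the one in the question.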

Answer 2:

If you saved your model with save(), you must load it with load().

load_word2vec_format is for models in the format produced by Google's word2vec tool, not for models saved by gensim's save().

Answer 3:

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

then loading it with the load_word2vec_format method will cause this error. To make it work, use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same kind of mismatch happens in the other direction, when you save the model with:

model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

and then try to load it with the KeyedVectors.load method. In this situation, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
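If you are not sure how a file was saved, and retraining takes too long to experiment, you can look at the first bytes of the file before picking a loader. The sketch below is only a heuristic, and the function guess_saved_format and the file names are hypothetical, not part of gensim. It assumes that a file written by save() is a pickle starting with byte 0x80, and that a file written by save_word2vec_format() starts with a header line holding two integers (vocabulary size and vector size).

```python
import os
import pickle
import tempfile


def guess_saved_format(path):
    """Guess how a model file was written by inspecting its first line.

    Returns "pickle" (try load()), "word2vec" (try load_word2vec_format())
    or "unknown". This is a heuristic, not a gensim API.
    """
    with open(path, "rb") as fin:
        head = fin.readline(64)
    if head[:1] == b"\x80":
        return "pickle"
    parts = head.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec"
    return "unknown"


# Small demonstration with made-up files instead of real models.
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "model.pkl")
w2v_path = os.path.join(tmp_dir, "vectors.txt")

with open(pickle_path, "wb") as fout:
    pickle.dump([1, 2, 3], fout, protocol=2)

with open(w2v_path, "w") as fout:
    fout.write("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

print(guess_saved_format(pickle_path))  # pickle
print(guess_saved_format(w2v_path))     # word2vec
```

A result of "pickle" suggests load(), and "word2vec" suggests load_word2vec_format() with the matching binary flag. The check does not tell you whether a word2vec file is text or binary.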

Source: Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
