SpaCy: how to load Google news word2vec vectors?

面向向阳花 2020-12-25 13:48

I've tried several methods of loading the Google News word2vec vectors (https://code.google.com/archive/p/word2vec/):

en_nlp = spacy.load('en',vector=Fals


        
4 answers
  •  独厮守ぢ
    2020-12-25 14:17

    For spaCy 1.x, load the Google News vectors into gensim and convert them to a plain-text format in which each line of the .txt file holds a single vector (the word string followed by its values):

    from gensim.models import KeyedVectors
    # Load the binary Google News vectors, then write them back out as text.
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    # load_word2vec_format already returns a KeyedVectors object, so no .wv is needed.
    model.save_word2vec_format('googlenews.txt')
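
To make the text format concrete, here is a minimal sketch of a reader for it, using only the standard library. It assumes the usual word2vec text layout: a header line with the vocabulary size and vector dimension, then one word and its values per line. The function name `read_word2vec_text` and the tiny demo file are made up for illustration and are not part of gensim or spaCy.

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a word2vec text file into (vocab_size, dim, {word: vector})."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        # The first line is a header: "<vocab_size> <dimensions>".
        vocab_size, dim = (int(x) for x in handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny made-up file standing in for googlenews.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")

vocab_size, dim, vectors = read_word2vec_text(demo_path)
print(vocab_size, dim, sorted(vectors))
```

Loading the whole dictionary into memory is fine for a quick inspection of a small sample, but probably not for the full Google News file.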
    

    Remove the first line of the .txt (the header that holds the vocabulary size and vector dimension):

    tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
    

    Compress the txt as .bz2:

    bzip2 googlenews.txt
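
If `tail` and `bzip2` are not available (for example on Windows), the header removal and the compression can probably be done in one pass from Python. This is only a sketch using the standard `bz2` module; the helper name `strip_header_and_compress` and the demo file are assumptions for illustration.

```python
import bz2
import os
import tempfile


def strip_header_and_compress(src_path, dst_path):
    """Copy src_path to a bz2 file at dst_path, skipping the first line."""
    with open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        src.readline()  # drop the "<vocab_size> <dimensions>" header
        for line in src:  # stream line by line, since the real file is large
            dst.write(line)


# Tiny made-up file standing in for googlenews.txt.
work_dir = tempfile.mkdtemp()
txt_path = os.path.join(work_dir, "demo.txt")
bz2_path = txt_path + ".bz2"
with open(txt_path, "wb") as handle:
    handle.write(b"2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")

strip_header_and_compress(txt_path, bz2_path)
with bz2.open(bz2_path, "rb") as handle:
    compressed_lines = handle.read().splitlines()
print(compressed_lines)
```

Unlike the `bzip2` command, this sketch leaves the original .txt in place, so delete it yourself if disk space is tight.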
    

    Create a spaCy-compatible binary file:

    import spacy
    # Convert the compressed text vectors into spaCy 1.x's binary vector format.
    spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
    

    Move googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin in your Python environment.
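
The move can also be scripted. The sketch below assumes you pass in the `spacy/data` directory of your own environment; it does not try to locate it for you. The helper name `install_vectors` is hypothetical, and the demo runs against a temporary directory, not a real spaCy installation.

```python
import os
import shutil
import tempfile


def install_vectors(bin_path, spacy_data_dir, package="en_google-1.0.0"):
    """Move bin_path into <spacy_data_dir>/<package>/vocab/ and return the new path."""
    vocab_dir = os.path.join(spacy_data_dir, package, "vocab")
    os.makedirs(vocab_dir, exist_ok=True)  # create the package folders if missing
    target = os.path.join(vocab_dir, os.path.basename(bin_path))
    shutil.move(bin_path, target)
    return target


# Demo with placeholder data in a temporary directory.
demo_root = tempfile.mkdtemp()
demo_bin = os.path.join(demo_root, "googlenews.bin")
with open(demo_bin, "wb") as handle:
    handle.write(b"placeholder")

demo_data_dir = os.path.join(demo_root, "spacy", "data")
installed = install_vectors(demo_bin, demo_data_dir)
print(installed)
```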

    Then load the word vectors:

    import spacy
    nlp = spacy.load('en',vectors='en_google')
    

    or load them later:

    nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
    
