How to load sentences into Python gensim?

Submitted by 孤者浪人 on 2019-12-03 02:20:46

Word2Vec expects an iterable of tokenized sentences: a list of UTF-8 strings works, and you can also stream the data from disk.

Make sure the text is UTF-8, and split each sentence into tokens:

from gensim.models import word2vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
# Python 3 strings are already Unicode -- don't .encode() before splitting,
# or you will feed bytes tokens to the model. (In gensim 4+, `size` is
# renamed `vector_size`; with a toy corpus this small, min_count=5 will
# discard every word, so use min_count=1 for experiments.)
model = word2vec.Word2Vec([s.split() for s in sentences],
                          size=100, window=5, min_count=5, workers=4)
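The encode-then-split step matters: under Python 2 it produced UTF-8 byte strings, but under Python 3 it yields bytes tokens that will never match ordinary str lookups. A minimal sketch of the difference (pure Python, no gensim required):

```python
sentences = ["the quick brown fox jumps over the lazy dogs"]

# Plain .split() on a Python 3 str gives str tokens -- what gensim expects.
str_tokens = [s.split() for s in sentences]

# .encode('utf-8').split() gives bytes tokens instead.
bytes_tokens = [s.encode('utf-8').split() for s in sentences]

print(type(str_tokens[0][0]).__name__)    # -> str
print(type(bytes_tokens[0][0]).__name__)  # -> bytes
print(str_tokens[0][0] == bytes_tokens[0][0])  # -> False: 'the' != b'the'
```

If you later query the model with `model.wv['fox']`, a vocabulary built from bytes tokens would miss, which is why skipping the encode step on Python 3 is the safer default.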

As alKid pointed out, make sure it's UTF-8.

There are two additional things you might have to worry about:

  1. The input is too large to hold in memory, so you stream it from a file.
  2. Removing stop words from the sentences.

Instead of loading a big list into memory, you can define an iterable class like this:

import nltk, gensim

class FileToSent(object):
    """Stream tokenized, lower-cased sentences from a file, dropping stop words."""
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                yield [w for w in line.lower().split() if w not in self.stop]

And then,

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
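To see the streaming pattern end to end without NLTK or gensim installed, here is a self-contained sketch: the stop-word set is a small inline stand-in for `nltk.corpus.stopwords.words('english')`, and the corpus is a temporary file created just for the demo. The key point is that implementing `__iter__` lets gensim re-read the corpus for every training epoch without it ever being held in memory as one big list:

```python
import tempfile, os

# Inline stand-in for NLTK's English stop-word list (assumption for the demo).
STOP = {'the', 'a', 'over'}

class FileToSent(object):
    """Yield one tokenized, lower-cased, stop-word-filtered sentence per line."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                yield [w for w in line.lower().split() if w not in STOP]

# Write a tiny two-line corpus to a temp file, then stream it twice.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write("The quick brown fox\njumps over the lazy dogs\n")
    path = f.name

corpus = FileToSent(path)
first_pass = list(corpus)
second_pass = list(corpus)  # a fresh iterator each pass, unlike a generator
print(first_pass)  # -> [['quick', 'brown', 'fox'], ['jumps', 'lazy', 'dogs']]
os.remove(path)
```

Being restartable is why a class with `__iter__` is used rather than a plain generator function: Word2Vec iterates the corpus once to build its vocabulary and again for each training epoch.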