I am learning the Doc2Vec model from the gensim library and am using it as follows:
    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                    print(fname)
                    for item_no, sentence in enumerate(fin):
                        yield LabeledSentence(
                            [w for w in sentence.lower().split() if w in stopwords.words('english')],
                            [fname.split('.')[0].strip() + '_%s' % item_no])

    sentences = MyTaggedDocument(dirname)
    model = Doc2Vec(sentences, min_count=2, window=10, size=300, sample=1e-4, negative=5, workers=7)
The input dirname is a directory path that, for the sake of simplicity, contains only 2 files, each with more than 100 lines. I am getting the following exception.
Also, from the print statement I could see that the iterator went over the directory 6 times. Why is this so?
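To look at the repeated iteration separately from gensim, here is a minimal sketch that counts passes over a restartable iterable. `CountingCorpus` and `consume` are hypothetical names of mine, and the pattern in `consume` (one vocabulary scan followed by one pass per training epoch) is only my assumption about how a trainer might read the corpus, not something I have confirmed in gensim's source.

```python
class CountingCorpus:
    """Restartable iterable that records how many times it is iterated."""

    def __init__(self, lines):
        self.lines = lines
        self.passes = 0

    def __iter__(self):
        self.passes += 1  # a new full pass over the data has started
        for line in self.lines:
            yield line.lower().split()


def consume(corpus, epochs):
    """Hypothetical trainer: one vocabulary scan, then `epochs` further passes."""
    vocab = set()
    for tokens in corpus:      # first pass: collect the vocabulary
        vocab.update(tokens)
    for _ in range(epochs):    # one additional pass per epoch
        for tokens in corpus:
            pass               # a real trainer would update the model here
    return vocab


corpus = CountingCorpus(["first line here", "second line here"])
vocab = consume(corpus, epochs=5)
print(corpus.passes)  # 1 vocabulary scan + 5 epochs
```

Under that assumption, any consumer that re-reads the corpus once per epoch would call `__iter__` again each time, which would show up as repeated file names in my print output.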
Any kind of help would be appreciated.