Question
I want to train a word2vec model on a tokenized file of 400MB. I have been trying to run this Python code:
    import operator
    import gensim, logging, os
    from gensim.models import Word2Vec
    from gensim.models import *

    class Sentences(object):
        def __init__(self, filename):
            self.filename = filename

        def __iter__(self):
            for line in open(self.filename):
                yield line.split()

    def runTraining(input_file, output_file):
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        sentences = Sentences(input_file)
        model = gensim.models.Word2Vec(sentences, size=200)
        model.save(output_file)
When I call this function on my file, I get this:
2017-10-23 17:57:00,211 : INFO : collecting all words and their counts
2017-10-23 17:57:04,071 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences
2017-10-23 17:57:16,781 : INFO : Loading a fresh vocabulary
2017-10-23 17:57:18,873 : INFO : min_count=5 retains 290537 unique words (6% of original 4735816, drops 4445279)
2017-10-23 17:57:18,873 : INFO : min_count=5 leaves 42158450 word corpus (89% of original 47054017, drops 4895567)
2017-10-23 17:57:19,563 : INFO : deleting the raw counts dictionary of 4735816 items
2017-10-23 17:57:20,217 : INFO : sample=0.001 downsamples 34 most-common words
2017-10-23 17:57:20,217 : INFO : downsampling leaves estimated 35587188 word corpus (84.4% of prior 42158450)
2017-10-23 17:57:20,218 : INFO : estimated required memory for 290537 words and 200 dimensions: 610127700 bytes
2017-10-23 17:57:21,182 : INFO : resetting layer weights
2017-10-23 17:57:24,493 : INFO : training model with 3 workers on 290537 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-10-23 17:57:28,216 : INFO : PROGRESS: at 0.00% examples, 0 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:32,107 : INFO : PROGRESS: at 20.00% examples, 1314 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:36,071 : INFO : PROGRESS: at 40.00% examples, 1728 words/s, in_qsize 0, out_qsize 0
2017-10-23 17:57:41,059 : INFO : PROGRESS: at 60.00% examples, 1811 words/s, in_qsize 0, out_qsize 0
Killed
I know that word2vec needs a lot of memory, but I still think there is a problem here. As you can see, the estimated required memory for this model is about 600MB, while my computer has 16GB of RAM. Yet monitoring the process while the code runs shows that it occupies all of my memory and then gets killed.
As other posts advise, I have tried increasing min_count and decreasing size, but even with ridiculous values (min_count=50, size=10) the process stops at 60%.
I also tried to exempt the Python process from the OOM killer so that it doesn't get killed. When I do that, I get a MemoryError instead.
What is going on?
(I use a recent laptop with Ubuntu 17.04, 16GB of RAM and an Nvidia GTX 960M. I run Python 3.6 from Anaconda with gensim 3.0, but it doesn't do any better with gensim 2.3.)
Answer 1:
Your file is a single line, as indicated by the log output:
2017-10-23 17:57:16,116 : INFO : collected 4735816 word types from a corpus of 47054017 raw words and 1 sentences
It is doubtful that this is what you want; in particular the optimized cython code in gensim's Word2Vec can only handle sentences of 10,000 words before truncating them (and discarding the rest). So most of your data isn't being considered during training (even if it were to finish).
But the bigger problem is that your single 47-million-word line will come into memory as one gigantic string, then be split() into a 47-million-entry list-of-strings. So your attempt to use a memory-efficient iterator isn't helping any – the full file is being brought into memory, twice over, for a single 'iteration'.
I still wouldn't expect that to use the full 16GB of RAM, but perhaps correcting it will resolve the issue, or make whatever remaining issues more evident.
If your tokenized data doesn't have natural line breaks at or below the 10,000-token sentence length, you can look at how the example corpus class LineSentence, included in gensim to work with the text8 and text9 corpora (which also lack line breaks), limits each yielded sentence to 10,000 tokens:
https://github.com/RaRe-Technologies/gensim/blob/58b30d71358964f1fc887477c5dc1881b634094a/gensim/models/word2vec.py#L1620
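As a rough illustration of that approach (not code from the answer: the class name ChunkedSentences, the max_len parameter, and the file path are invented here, mirroring the 10,000-token cap LineSentence uses), an iterator that splits each over-long line into 10,000-token pieces could look like this. Note that, as explained above, each raw line is still read into memory in full before it is chunked:

    import gensim

    # Illustrative sketch only: ChunkedSentences and max_len are assumed
    # names, modeled on the chunking LineSentence applies to long lines.
    class ChunkedSentences(object):
        def __init__(self, filename, max_len=10000):
            self.filename = filename
            self.max_len = max_len

        def __iter__(self):
            for line in open(self.filename):   # each raw line is still read whole
                tokens = line.split()
                # yield consecutive chunks of at most max_len tokens
                for i in range(0, len(tokens), self.max_len):
                    yield tokens[i:i + self.max_len]

    sentences = ChunkedSentences('tokenized_corpus.txt')   # hypothetical path
    model = gensim.models.Word2Vec(sentences, size=200)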
(It may not be a contributing factor, but you may also want to use the with context manager to ensure the file you open() is promptly closed once the iterator is exhausted.)
Source: https://stackoverflow.com/questions/46899062/gensim-word2vec-uses-too-much-memory