How to train Word2vec on very large datasets?


There are a number of ways to create Word2Vec models at scale. As you pointed out, the candidate solutions are distributed (and/or multi-threaded) training or GPU-based training. This is not an exhaustive list, but hopefully it gives you some ideas on how to proceed.

Distributed / Multi-threading options:

  • Gensim uses Cython where it matters, and is equal to, or not much slower than, C implementations. Gensim's multi-threading works well, and using a machine with ample memory and a large number of cores significantly decreases vector-generation time. You may want to investigate Amazon EC2 16- or 32-core instances (see the gensim sketch after this list).
  • DeepDist can use gensim and Spark to distribute gensim workloads across a cluster. DeepDist also has some clever SGD optimizations that synchronize gradients across nodes. If you use multi-core machines as nodes, you can take advantage of both clustering and multi-threading (see the DeepDist sketch after this list).
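For the gensim option, a minimal multi-threaded training sketch; the corpus path, vector size, and worker count are illustrative, and newer gensim releases rename some parameters (e.g. size → vector_size, model.most_similar → model.wv.most_similar):

```python
import multiprocessing
from gensim.models.word2vec import Word2Vec, LineSentence

# One pre-tokenized sentence per line; streamed from disk, so the full corpus
# never has to fit in memory.
sentences = LineSentence('corpus.txt')

model = Word2Vec(
    sentences,
    size=300,                              # vector dimensionality
    window=5,
    min_count=5,
    workers=multiprocessing.cpu_count(),   # saturate all available cores
)
model.save('word2vec.model')
print(model.most_similar('king', topn=5))
```

For the DeepDist option, a sketch adapted from DeepDist's documented usage pattern; it assumes an older gensim API (the syn0/syn1 weight attributes), and the HDFS path is a placeholder:

```python
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext(appName='deepdist-word2vec')
corpus = sc.textFile('hdfs:///data/corpus.txt').map(lambda line: line.split(' '))

def gradient(model, sentences):      # runs on each Spark worker
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):          # runs on the driver; applies the deltas
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
```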

A number of Word2Vec GPU implementations exist. Given the large dataset size and limited GPU memory, you may have to consider a clustering strategy.

  • BIDMach is apparently very fast (documentation is lacking, however, and admittedly I've struggled to get it working).
  • DL4J has a Word2Vec implementation, but the team has yet to implement cuBLAS gemm, and it's relatively slow compared with CPUs.
  • Keras is a Python deep learning framework that utilizes Theano. While it does not implement word2vec per se, it does implement an embedding layer, which can be used to create and query word vectors (a sketch follows this list).
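To illustrate the Keras route, here is a rough skip-gram-style sketch built around the Embedding layer (Keras 2 functional API; the vocabulary size, dimensions, and randomly generated stand-in corpus are all illustrative, and this is an approximation rather than a drop-in word2vec replacement):

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Reshape, Dot, Dense
from keras.preprocessing.sequence import skipgrams

vocab_size, dim = 50000, 100

target = Input(shape=(1,), dtype='int32')
context = Input(shape=(1,), dtype='int32')
embedding = Embedding(vocab_size, dim, name='word_vectors')  # shared lookup table

t = Reshape((dim,))(embedding(target))
c = Reshape((dim,))(embedding(context))
score = Dot(axes=1)([t, c])                      # dot product of target/context vectors
output = Dense(1, activation='sigmoid')(score)   # real pair vs. negative-sampled pair

model = Model(inputs=[target, context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

# Stand-in for an integer-encoded corpus; skipgrams() generates positive and
# negatively sampled (target, context) pairs with 1/0 labels.
encoded_doc = np.random.randint(1, vocab_size, size=10000).tolist()
pairs, labels = skipgrams(encoded_doc, vocab_size, window_size=5)
pairs = np.array(pairs, dtype='int32')
model.fit([pairs[:, 0:1], pairs[:, 1:2]], np.array(labels), epochs=1, batch_size=256)

word_vectors = model.get_layer('word_vectors').get_weights()[0]  # query these directly
```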

There are a number of other CUDA implementations of Word2Vec, with varying degrees of maturity and support.

I believe the SparkML team has recently started work on a prototype cuBLAS-based Word2Vec implementation. You may want to investigate this.
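In the meantime, Spark MLlib already ships a CPU-based distributed Word2Vec you can try; a minimal PySpark sketch, with the HDFS path and parameters as placeholders:

```python
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName='word2vec-at-scale')

# RDD of tokenized sentences (one sentence per line, whitespace-tokenized)
corpus = sc.textFile('hdfs:///data/corpus.txt').map(lambda line: line.split(' '))

model = (Word2Vec()
         .setVectorSize(200)
         .setMinCount(5)
         .setNumPartitions(32)     # more partitions = faster, at some cost in accuracy
         .fit(corpus))

for word, similarity in model.findSynonyms('king', 5):
    print(word, similarity)
```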
