How to train Word2vec on very large datasets?


There are a number of ways to create Word2Vec models at scale. As you pointed out, the candidate solutions are distributed (and/or multi-threaded) training or GPU-based training. This is not an exhaustive list, but hopefully it gives you some ideas on how to proceed.

Distributed / Multi-threading options:

  • Gensim uses Cython where it matters, and is equal to, or not much slower than, C implementations. Gensim's multi-threading works well, and using a machine with ample memory and a large number of cores significantly decreases vector-generation time. You may want to investigate Amazon EC2 16- or 32-core instances (see the gensim sketch after this list).
  • DeepDist can use gensim and Spark to distribute gensim workloads across a cluster. DeepDist also has some clever SGD optimizations that synchronize gradients across nodes. If you use multi-core machines as nodes, you can take advantage of both clustering and multi-threading (see the DeepDist sketch after this list).
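For the gensim option, a minimal multi-threaded training sketch; the corpus path, vector size, and worker count are illustrative, and newer gensim releases rename some parameters (e.g. size → vector_size, model.most_similar → model.wv.most_similar):

```python
import multiprocessing
from gensim.models.word2vec import Word2Vec, LineSentence

# One pre-tokenized sentence per line; streamed from disk, so the full corpus
# never has to fit in memory.
sentences = LineSentence('corpus.txt')

model = Word2Vec(
    sentences,
    size=300,                              # vector dimensionality
    window=5,
    min_count=5,
    workers=multiprocessing.cpu_count(),   # saturate all available cores
)
model.save('word2vec.model')
print(model.most_similar('king', topn=5))
```

For the DeepDist option, a sketch adapted from DeepDist's documented usage pattern; it assumes an older gensim API (the syn0/syn1 weight attributes), and the HDFS path is a placeholder:

```python
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext(appName='deepdist-word2vec')
corpus = sc.textFile('hdfs:///data/corpus.txt').map(lambda line: line.split(' '))

def gradient(model, sentences):      # runs on each Spark worker
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):          # runs on the driver; applies the deltas
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
```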

A number of Word2Vec GPU implementations exist. Given the large dataset size and limited GPU memory, you may have to consider a clustering strategy.

  • BIDMach is apparently very fast (documentation is lacking, however, and admittedly I've struggled to get it working).
  • DL4J has a Word2Vec implementation, but the team has yet to implement cuBLAS gemm, and it's relatively slow compared with CPUs.
  • Keras is a Python deep learning framework that utilizes Theano. While it does not implement word2vec per se, it does implement an embedding layer, which can be used to create and query word vectors (a sketch follows this list).
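To illustrate the Keras route, here is a rough skip-gram-style sketch built around the Embedding layer (Keras 2 functional API; the vocabulary size, dimensions, and randomly generated stand-in corpus are all illustrative, and this is an approximation rather than a drop-in word2vec replacement):

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Reshape, Dot, Dense
from keras.preprocessing.sequence import skipgrams

vocab_size, dim = 50000, 100

target = Input(shape=(1,), dtype='int32')
context = Input(shape=(1,), dtype='int32')
embedding = Embedding(vocab_size, dim, name='word_vectors')  # shared lookup table

t = Reshape((dim,))(embedding(target))
c = Reshape((dim,))(embedding(context))
score = Dot(axes=1)([t, c])                      # dot product of target/context vectors
output = Dense(1, activation='sigmoid')(score)   # real pair vs. negative-sampled pair

model = Model(inputs=[target, context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

# Stand-in for an integer-encoded corpus; skipgrams() generates positive and
# negatively sampled (target, context) pairs with 1/0 labels.
encoded_doc = np.random.randint(1, vocab_size, size=10000).tolist()
pairs, labels = skipgrams(encoded_doc, vocab_size, window_size=5)
pairs = np.array(pairs, dtype='int32')
model.fit([pairs[:, 0:1], pairs[:, 1:2]], np.array(labels), epochs=1, batch_size=256)

word_vectors = model.get_layer('word_vectors').get_weights()[0]  # query these directly
```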

There are a number of other CUDA implementations of Word2Vec, with varying degrees of maturity and support.

I believe the SparkML team has recently started work on a prototype cuBLAS-based Word2Vec implementation. You may want to investigate this.
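In the meantime, Spark MLlib already ships a CPU-based distributed Word2Vec you can try; a minimal PySpark sketch, with the HDFS path and parameters as placeholders:

```python
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName='word2vec-at-scale')

# RDD of tokenized sentences (one sentence per line, whitespace-tokenized)
corpus = sc.textFile('hdfs:///data/corpus.txt').map(lambda line: line.split(' '))

model = (Word2Vec()
         .setVectorSize(200)
         .setMinCount(5)
         .setNumPartitions(32)     # more partitions = faster, at some cost in accuracy
         .fit(corpus))

for word, similarity in model.findSynonyms('king', 5):
    print(word, similarity)
```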
