How to get cython and gensim to work with pyspark


Question


I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with Cython: when I train a doc2vec model, it only ever trains with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may have made the mistake of installing gensim before Cython, so I corrected that by forcing a reinstall of gensim via pip. It had no effect; still just one worker.
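The reinstall went roughly like this (a sketch, not my exact commands; the flags force pip to rebuild gensim instead of reusing a cached wheel):

    pip install cython
    # rebuild gensim so it can pick up the Cython extensions this time
    pip install --force-reinstall --no-cache-dir gensim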

The machine is set up as a Spark master and I interface with Spark via pyspark. It works something like this: pyspark uses Jupyter, and Jupyter uses Python 3.5, so I get a Jupyter interface to my cluster. I have no idea whether this is the reason I can't get gensim to work with Cython. I don't execute any gensim code on the cluster; it is just more convenient to fire up Jupyter for the gensim work as well.
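The wiring is roughly the usual pyspark/Jupyter setup (a sketch of the standard environment variables, not my exact configuration):

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
    # `pyspark` now launches a Jupyter notebook backed by the Spark driver
    pyspark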


Answer 1:


After digging deeper and trying things like loading the whole corpus into memory and executing gensim in a different environment, all with no effect, it seems the problem lies with gensim itself: the code is only partially parallelized, so the workers cannot fully utilize the CPU. See the issues on GitHub (link).




Answer 2:


You probably did this already, but could you please check that you are using the parallel Cythonized version by asserting gensim.models.doc2vec.FAST_VERSION > -1?
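A minimal sketch of that check (FAST_VERSION is part of gensim's pre-4.0 API):

    import gensim.models.doc2vec

    # FAST_VERSION is -1 when gensim fell back to its slow pure-Python paths;
    # 0 or higher means the compiled Cython routines were loaded.
    assert gensim.models.doc2vec.FAST_VERSION > -1, \
        "gensim is running without its Cython extensions"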

The gensim doc2vec code is parallelized, but unfortunately the I/O code outside of gensim isn't. For example, in the GitHub issue you linked, parallelization is indeed achieved after the corpus is loaded into RAM with doclist = [doc for doc in documents]. A sketch of that pattern follows.
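To illustrate (raw_texts below is a hypothetical stand-in for whatever iterable the documents are streamed from; it is not from the original thread):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    raw_texts = ["first document text", "second document text"]  # hypothetical corpus

    # Materialize the tagged documents fully in RAM up front, so the
    # unparallelized I/O does not starve the training workers.
    documents = (TaggedDocument(words=text.split(), tags=[i])
                 for i, text in enumerate(raw_texts))
    doclist = [doc for doc in documents]

    # With the corpus in memory, all `workers` threads can stay busy.
    model = Doc2Vec(doclist, workers=4)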



Source: https://stackoverflow.com/questions/42039964/how-to-get-cython-and-gensim-to-work-with-pyspark
