Spark and Python: trying to parse Wikipedia using gensim
Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think I should be able to parse basically any input with sc.textFile() and then process it with custom functions, either my own or from some library. Now I am specifically trying to parse the Wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and I would now like to use gensim's built-in functions for parsing Wikipedia pages, inspired by this question: List
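For illustration, a minimal sketch of what that combination could look like, not the poster's actual code: it assumes the article texts have already been extracted into plain-text files with one article per line (the input path and the parse_article helper are hypothetical), since the raw dump is a single bz2-compressed XML file that sc.textFile() cannot split by article. gensim's filter_wiki and tokenize utilities are real and do the markup stripping.

```python
from pyspark import SparkContext
from gensim.corpora.wikicorpus import filter_wiki
from gensim.utils import tokenize

sc = SparkContext(appName="WikiParse")

# Hypothetical input: pre-extracted article texts, one article per line.
raw_articles = sc.textFile("hdfs:///data/wiki/articles-*.txt")

def parse_article(raw_text):
    """Strip wiki markup with gensim and return a list of lowercase tokens."""
    plain = filter_wiki(raw_text)              # remove wiki markup
    return list(tokenize(plain, lowercase=True))

tokens = raw_articles.map(parse_article)
print(tokens.take(1))
```

The gensim calls run per record inside the map, so the library has to be importable on every worker, which matches the setup described above.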