Sparse vector RDD in pyspark
I have been implementing the TF-IDF method described here with Python/PySpark, using the feature module from MLlib: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

I have a training set of 150 text documents and a test set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both the training and the test set, i.e. a bag-of-words representation, called tfidf_train and tfidf_test respectively. The IDF is shared between the two and is based solely on the training data. My question
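For context, a minimal sketch of the setup described above, following the pattern from the linked MLlib guide: the IDF model is fit on the training TF vectors only and then applied to both sets. The file paths, RDD names, and tokenization by whitespace are assumptions for illustration, not the asker's actual code.

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-example")

# Hypothetical inputs: RDDs of token lists, one per document
train_docs = sc.textFile("train_docs/*.txt").map(lambda line: line.split(" "))
test_docs = sc.textFile("test_docs/*.txt").map(lambda line: line.split(" "))

# Hash each document into a term-frequency SparseVector
hashingTF = HashingTF()
tf_train = hashingTF.transform(train_docs)
tf_test = hashingTF.transform(test_docs)

# Fit the IDF on the training TF vectors only, then apply it to both sets,
# so the test set shares the training IDF as described above
tf_train.cache()
idf = IDF().fit(tf_train)
tfidf_train = idf.transform(tf_train)   # RDD of SparseVectors
tfidf_test = idf.transform(tf_test)     # RDD of SparseVectors
```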