How to combine n-grams into one vocabulary in Spark?
Question: Is there a built-in Spark feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary? Setting n=2 in NGram followed by CountVectorizer produces a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

Answer 1: You can train separate NGram and CountVectorizer models and merge the resulting vectors using VectorAssembler:

```python
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
```