countvectorizer

dimension mismatch error in CountVectorizer MultinomialNB

有些话、适合烂在心里 submitted on 2019-12-02 04:17:23
Question: Before posting this, I read more than 15 similar topics on this board thoroughly. Each gave somewhat different recommendations, but none of them solved my problem. I split my spam email text data (originally in CSV format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. I then applied MultinomialNB() to learn from the training set and

dimension mismatch error in CountVectorizer MultinomialNB

对着背影说爱祢 submitted on 2019-12-01 22:53:46
Before posting this, I read more than 15 similar topics on this board thoroughly. Each gave somewhat different recommendations, but none of them solved my problem. I split my spam email text data (originally in CSV format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. I then applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified): from sklearn.feature_extraction.text import
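The excerpt is cut off before the asker's code, so the actual cause of the error here is not confirmed. For illustration only, below is a minimal sketch of the workflow described (split, CountVectorizer, MultinomialNB) in which the train and test matrices end up with the same number of columns. A commonly reported cause of a dimension mismatch in this setup is calling 'fit_transform' on the test set as well, which builds a second vocabulary; the sketch instead fits the vocabulary on the training text only and calls 'transform' on the test text. The toy texts, labels, and variable names are made up for the example and are not from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the spam CSV (hypothetical data, 1 = spam, 0 = not spam).
texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward today",
    "cheap loans win cash",
    "meeting moved to monday",
    "see you at lunch today",
    "project report attached",
    "can we talk about the report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, before any vectorizing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only.
X_train = vectorizer.fit_transform(X_train_text)
# Reuse that vocabulary for the test text; words unseen in training are ignored.
X_test = vectorizer.transform(X_test_text)

clf = MultinomialNB()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Both matrices have one column per word in the training vocabulary.
print(X_train.shape, X_test.shape)
```

Because the same fitted vectorizer produces both matrices, the classifier sees the same feature count at fit time and at predict time.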

Scala Spark - split vector column into separate columns in a Spark DataFrame

爷,独闯天下 submitted on 2019-11-29 11:53:42
I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, that is, they all have the same length. I also have a list of column names, Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the vector. I want to go from

some_columns... | Features
...             | [0,1,0,..., 0]

to

some_columns... | f1 | f2 | f3 | ... | fn
...             | 0  | 1  | 0  | ... | 0

What is the best way to achieve this? One way I thought of is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but it requires spark context to use