Question
I am trying to do text classification using the bag-of-words model from NLP.
- Did pre-processing of the train data using NLTK (punctuation and stop-word removal, lower-casing, stemming, etc.).
- Created tf-idf matrix for train.
- Did pre-processing of test.
- Created tf-idf matrix for test data.
- Train and test data have different bags of words, so the number of features differs, which means we cannot directly use a classification algorithm like k-NN.
- I merged the train and test data together and created the tf-idf matrix on the combined corpus. This solved the vocabulary mismatch above, but the resulting matrix was too large to process.
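The mismatch described above is easy to reproduce: fitting a separate vectorizer on train and test learns two different vocabularies, hence two different feature counts. A minimal sketch with made-up documents, assuming scikit-learn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "dogs chase cats"]
test_docs = ["a bird flew over the fence"]

# Each fit_transform learns its own vocabulary from the corpus it sees,
# so the two matrices end up with different numbers of columns.
Xtrain = TfidfVectorizer().fit_transform(train_docs)
Xtest = TfidfVectorizer().fit_transform(test_docs)

print(Xtrain.shape[1], Xtest.shape[1])  # different vocabulary sizes
```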
Here are my questions:
- Is there a way to create the same bag of words for both train and test?
- If there is not, and my approach of combining train and test is correct, should I use a dimensionality-reduction algorithm like LDA?
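(For context on the second question: in scikit-learn, dimensionality reduction on a sparse tf-idf matrix is commonly done with TruncatedSVD, a.k.a. latent semantic analysis, since it accepts sparse input directly. A minimal sketch with a placeholder corpus:)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats", "birds fly south"]
X = TfidfVectorizer().fit_transform(docs)  # sparse, one column per term

# Project the tf-idf features down to 2 components without densifying first.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array: 3 rows x 2 components
```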
Answer 1:
You can use scikit-learn's CountVectorizer to create vectors for the words in each document, train a classifier of your choice on them, and then use that classifier on your test data.
For the training set, you can fit the vectorizer as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

LabeledWords = pd.DataFrame(columns=['word', 'label'])
# append returns a new frame, so the result must be reassigned
LabeledWords = LabeledWords.append({'word': 'Church', 'label': 'Religion'}, ignore_index=True)
vectorizer = CountVectorizer()
Xtrain = vectorizer.fit_transform(LabeledWords['word']).toarray()
yTrain = LabeledWords['label'].values  # labels stay as strings; they are not vectorized
You can then train the classifier of your choice on the vectorized data:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
clf = forest.fit(Xtrain, yTrain)
To test your data, transform the test words with the same fitted vectorizer (transform, not fit_transform, so the vocabulary matches the training features):
test_featuresX, test_featuresY = [], []
for each_word, label in Preprocessed_list:
    test_featuresX.append(vectorizer.transform([each_word]).toarray()[0])
    test_featuresY.append(label)
clf.score(test_featuresX, test_featuresY)
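To answer the original question directly: fitting the vectorizer once on the training text and then calling transform (not fit_transform) on the test text reuses the same vocabulary, so both matrices have identical columns and even k-NN works. A minimal end-to-end sketch with made-up documents and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["church sermon prayer", "football match goal", "prayer and faith"]
train_labels = ["Religion", "Sport", "Religion"]
test_docs = ["sermon on faith", "goal in the match"]

vectorizer = TfidfVectorizer()
Xtrain = vectorizer.fit_transform(train_docs)  # learns the vocabulary
Xtest = vectorizer.transform(test_docs)        # reuses it: same columns

# Both matrices now share a feature space, so any classifier applies.
knn = KNeighborsClassifier(n_neighbors=1).fit(Xtrain, train_labels)
print(knn.predict(Xtest))
```

Terms in the test set that never appeared in training are simply dropped by transform, which is the standard behavior for bag-of-words models.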
Source: https://stackoverflow.com/questions/44978782/how-to-standardize-the-bag-of-words-for-train-and-test