Question
I am trying to do text classification using the bag-of-words model from NLP.
- Did pre-processing of the train data using NLTK (punctuation and stop-word removal, lower-casing, stemming, etc.).
- Created tf-idf matrix for train.
- Did pre-processing of test.
- Created tf-idf matrix for test data.
- Train and test data have different bags of words, so the number of features differs, which means we cannot directly use a classification algorithm like k-NN.
- I merged the train and test data together and created the tf-idf matrix on the combined corpus. This solved the vocabulary mismatch above, but the resulting matrix was too large to process.
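The mismatch described above is easy to reproduce: fitting a separate vectorizer on train and test learns two different vocabularies, hence two different feature counts. A minimal sketch with made-up documents, assuming scikit-learn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "dogs chase cats"]
test_docs = ["a bird flew over the fence"]

# Each fit_transform learns its own vocabulary from the corpus it sees,
# so the two matrices end up with different numbers of columns.
Xtrain = TfidfVectorizer().fit_transform(train_docs)
Xtest = TfidfVectorizer().fit_transform(test_docs)

print(Xtrain.shape[1], Xtest.shape[1])  # different vocabulary sizes
```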
Here are my questions:
- Is there a way to create the same bag of words for both train and test?
- If there is not, and my approach of combining train and test is correct, should I use a dimensionality-reduction algorithm like LDA?
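(For context on the second question: in scikit-learn, dimensionality reduction on a sparse tf-idf matrix is commonly done with TruncatedSVD, a.k.a. latent semantic analysis, since it accepts sparse input directly. A minimal sketch with a placeholder corpus:)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats", "birds fly south"]
X = TfidfVectorizer().fit_transform(docs)  # sparse, one column per term

# Project the tf-idf features down to 2 components without densifying first.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array: 3 rows x 2 components
```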
Answer 1:
You can use scikit-learn's CountVectorizer to create vectors for the words in each document, train a classifier of your choice on them, and then use that classifier on your test data.
For the training set, you can fit the vectorizer as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

LabeledWords = pd.DataFrame(columns=['word', 'label'])
# append returns a new frame, so the result must be reassigned
LabeledWords = LabeledWords.append({'word': 'Church', 'label': 'Religion'}, ignore_index=True)
vectorizer = CountVectorizer()
Xtrain = vectorizer.fit_transform(LabeledWords['word']).toarray()
yTrain = LabeledWords['label'].values  # labels stay as strings; they are not vectorized
You can then train the classifier of your choice on the vectorized data:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
clf = forest.fit(Xtrain, yTrain)
To test your data, transform the test words with the same fitted vectorizer (transform, not fit_transform, so the vocabulary matches the training features):
test_featuresX, test_featuresY = [], []
for each_word, label in Preprocessed_list:
    test_featuresX.append(vectorizer.transform([each_word]).toarray()[0])
    test_featuresY.append(label)
clf.score(test_featuresX, test_featuresY)
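To answer the original question directly: fitting the vectorizer once on the training text and then calling transform (not fit_transform) on the test text reuses the same vocabulary, so both matrices have identical columns and even k-NN works. A minimal end-to-end sketch with made-up documents and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["church sermon prayer", "football match goal", "prayer and faith"]
train_labels = ["Religion", "Sport", "Religion"]
test_docs = ["sermon on faith", "goal in the match"]

vectorizer = TfidfVectorizer()
Xtrain = vectorizer.fit_transform(train_docs)  # learns the vocabulary
Xtest = vectorizer.transform(test_docs)        # reuses it: same columns

# Both matrices now share a feature space, so any classifier applies.
knn = KNeighborsClassifier(n_neighbors=1).fit(Xtrain, train_labels)
print(knn.predict(Xtest))
```

Terms in the test set that never appeared in training are simply dropped by transform, which is the standard behavior for bag-of-words models.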
Source: https://stackoverflow.com/questions/44978782/how-to-standardize-the-bag-of-words-for-train-and-test