How to standardize the bag of words for train and test?

Submitted by 此生再无相见时 on 2019-12-24 18:57:00

Question


I am trying to classify based on the bag-of-words model from NLP.

  1. Pre-processed the training data with NLTK (punctuation and stop-word removal, lower-casing, stemming, etc.).
  2. Created a tf-idf matrix for the training data.
  3. Pre-processed the test data.
  4. Created a tf-idf matrix for the test data.
  5. The train and test data have different bags of words, so the number of features differs and a classification algorithm such as kNN cannot be applied.
  6. I merged the train and test data and built a single tf-idf matrix. This solved the mismatch, but the resulting matrix was too large to process.

Here are my questions:

  1. Is there a way to create exactly the same bag of words for train and test?
  2. If not, and my approach of merging train and test is correct, should I apply a dimensionality reduction algorithm such as LDA?

Answer 1:


You can use scikit-learn's CountVectorizer to build vectors from the words in your training documents, train a classifier of your choice on those vectors, and then use the same fitted vectorizer and the classifier on your test data.
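As a minimal sketch of why this keeps the features aligned, the toy example below fits a CountVectorizer on training text only and then reuses it on test text. The documents are made up for illustration and are assumed to be already pre-processed; words that appear only in the test set are simply ignored, so both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, assumed to be already pre-processed
train_docs = ["church pray sunday", "goal match team"]
test_docs = ["team pray stadium"]  # "stadium" never appears in the training data

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from train
X_test = vectorizer.transform(test_docs)        # reuses it; unseen words are dropped

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_train.shape, X_test.shape)
```

The key point is calling `fit_transform` once on the training data and only `transform` on the test data.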

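For the training set, fit the vectorizer and build the feature matrix as follows:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # One row per labeled example; add more rows as needed
    LabeledWords = pd.DataFrame([{'word': 'Church', 'label': 'Religion'}])

    vectorizer = CountVectorizer()

    # fit_transform learns the vocabulary from the training text only;
    # the labels are used as-is and are not vectorized
    Xtrain = vectorizer.fit_transform(LabeledWords['word']).toarray()
    yTrain = LabeledWords['label']

You can then train the classifier of your choice on these features, for example:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=100)
    clf = forest.fit(Xtrain, yTrain)

To test, transform the test data with the same fitted vectorizer, so that it has exactly the training columns:

    # Preprocessed_list holds (text, label) pairs for the test set
    test_words = [each_word for each_word, label in Preprocessed_list]
    test_featuresY = [label for each_word, label in Preprocessed_list]
    # transform (not fit_transform) reuses the training vocabulary
    test_featuresX = vectorizer.transform(test_words).toarray()
    clf.score(test_featuresX, test_featuresY)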
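Since the question uses tf-idf and kNN rather than raw counts and a random forest, here is a rough sketch of the same idea with scikit-learn's TfidfVectorizer and KNeighborsClassifier wrapped in a pipeline. The documents, labels, and the choice of `n_neighbors=1` are illustrative assumptions, not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical pre-processed data; replace with your own train/test split
train_docs = ["church pray sunday", "pray bible church",
              "goal match team", "team score goal"]
train_labels = ["Religion", "Religion", "Sport", "Sport"]
test_docs = ["church sunday choir", "match team referee"]
test_labels = ["Religion", "Sport"]

# The pipeline fits tf-idf on the training documents only; when predicting or
# scoring it calls transform, so test rows get the training vocabulary's columns.
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(train_docs, train_labels)

predictions = model.predict(test_docs)
accuracy = model.score(test_docs, test_labels)
print(predictions, accuracy)
```

With this setup there is no need to merge train and test before vectorizing. If the training matrix alone is still too large, TfidfVectorizer options such as `max_features` or `min_df` can cap the vocabulary size.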


Source: https://stackoverflow.com/questions/44978782/how-to-standardize-the-bag-of-words-for-train-and-test
