What are the steps needed to use the Mahout Naive Bayes Classifier Algorithm?

Posted by 孤街浪徒 on 2019-12-04 16:40:30

To test your classifier, first make sure your training set is labeled, or has been divided into chunks according to the features you used when collecting the data. I am unsure how you have organized your data, but you need to split the data set so that examples with similar features are grouped together.
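As a rough illustration of that grouping step, here is a minimal sketch that splits a labeled file into one chunk per label. The input layout ("label<TAB>text", one example per line), the file names, and the sample rows are all assumptions made for this example; they are not a format that Mahout itself prescribes.

```shell
# Hypothetical input: one example per line, "label<TAB>text".
workdir=$(mktemp -d)
printf 'sports\tteam wins final match\n'   >  "$workdir/labeled.tsv"
printf 'politics\tsenate passes budget\n'  >> "$workdir/labeled.tsv"
printf 'sports\tcoach names new captain\n' >> "$workdir/labeled.tsv"

# Write each line to a file named after its label, so that every
# chunk holds the examples of exactly one class.
mkdir -p "$workdir/chunks"
awk -F'\t' -v out="$workdir/chunks" '{ print > (out "/" $1 ".txt") }' "$workdir/labeled.tsv"

# List the chunks that were created.
ls "$workdir/chunks"
```

The resulting chunks would still need to be converted to whatever input format your Mahout version expects and copied into HDFS before training.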

Once you have created your splits based on your criteria, check that your input data was created correctly. You can verify the files using:

hadoop fs -ls filename

Train your classifier using:

$MAHOUT_HOME/bin/mahout trainclassifier -i input_file -o output_model

Test the classifier using:

$MAHOUT_HOME/bin/mahout testclassifier -m output_model -d input_file 

NOTE: During data collection, make sure you assign weights to the data values that need them, if any exist. The data should also be cleaned to normalize errors introduced by the experimental setup or by data collection. You can use a multiplicative scatter correction technique to correct your data set.

First, create a file called training-categories.txt that contains the categories for your classifier. You can use a simple text editor to do this.
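If you would rather create the file from the command line, something like the following should work. The category names are placeholders, and the one-category-per-line layout is an assumption on my part, so check it against what your extraction tool expects.

```shell
# Category names below are placeholders; replace them with your own.
# One category per line is an assumed layout, not a confirmed one.
cat > training-categories.txt <<'EOF'
sports
politics
technology
EOF

# Print the number of lines to confirm the file was written.
wc -l < training-categories.txt
```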

Now that we have a list of categories we’re interested in, run the ExtractTrainingData class using the category list.

$TT_HOME/bin/tt extractTrainingData \
--dir ./index \
--categories ./training-categories.txt \
--output ./category-bayes-data \
--category-fields categoryFacet,source \
--text-fields title,description \
--tv

This command will read documents and search for matching categories in the category and source fields. When one of the categories listed in training-categories.txt is found in one of these documents, the terms will be extracted from term vectors stored in the title and description fields. These terms will be written to a file in the category-bayes-data directory. There will be a single file for each category. Each is a plain text file that can be viewed with any text editor or display utility.

The category name appears in the first column, while the terms that appear in the document are contained in the second column. The Mahout Bayes classifiers expect the input fields to be stemmed, so you will see this reflected in the test data. The --tv argument to the extractTrainingData command causes the stemmed terms from each document’s term vector to be used.

When the ExtractTrainingData class has completed its run, it will output a count of the documents found in each category.
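To cross-check those counts against the files on disk, you could tally the first column yourself. This is only a sketch: it builds a tiny stand-in directory so it can run on its own, and it assumes that the columns are tab-separated and that each line holds one document. The file names and sample rows are made up.

```shell
# Build a tiny stand-in for the category-bayes-data directory.
# File names, tab separator and one-document-per-line are assumptions.
data_dir=$(mktemp -d)
printf 'sports\tteam win final match\nsports\tcoach name captain\n' > "$data_dir/sports"
printf 'politics\tsenat pass budget\n' > "$data_dir/politics"

# Take the first column of every file and count how often each
# category occurs, printing "category<TAB>count".
counts=$(cat "$data_dir"/* | cut -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }')
printf '%s\n' "$counts"
```

Against real output you would point data_dir at ./category-bayes-data instead of the temporary directory.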
