Information Gain Calculation for a text file?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-04 05:27:36

Question


I'm working on "text categorization using information gain, PCA, and a genetic algorithm". After preprocessing the documents (stemming, stopword removal, TF-IDF), I'm confused about how to proceed with the information gain part.

My output file contains each word and its TF-IDF value,

in the form WORD - TFIDF VALUE:

together(word) - 0.235(tfidf value)

come(word) - 0.2548(tfidf value)

When using Weka for information gain ("InfoGainAttributeEval.java"), it requires the .arff file format as input.

Is there any way to convert a text file into .arff format, or any way to perform information gain other than Weka?

Is there any other open-source tool for calculating information gain for documents?


Answer 1:


I found my answer: we have to generate an .arff file.

In the .arff file:

The header starts with a @RELATION line that names the dataset, followed by one @ATTRIBUTE line for every word present in the whole document collection after preprocessing. Each word is of type real, because a TF-IDF value is a real number. The last attribute is the document category.

The @data section contains the TF-IDF values calculated during preprocessing, one row per document. For example, the first row contains the TF-IDF value of every word for the first document (0.0 for a word that does not occur in it), with the document category in the last column.

@RELATION filename
@ATTRIBUTE word1 real
@ATTRIBUTE word2 real
@ATTRIBUTE word3 real
% ... and so on, one @ATTRIBUTE line per word
@ATTRIBUTE class {cacm,cisi,cran,med}

@data
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.55454479562,0.1619617,0.579562,0.5542,cisi
0.5545479562,0.27,0.554544479562,0.4479562,cisi
0.0,0.2396113617,0.44479562,0.2,cran
0.5545479562,0.27,0.554544479562,0.4479562,cran
0.5545177444479562,0.26196113617,0.0,0.0,med
0.5545479562,0.27,0.554544479562,0.4479562,med
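
A file in this layout can be produced with a short script. The following is only a minimal Python sketch, not part of Weka: the function name build_arff, the relation name "docs", and the class labels assigned to the two sample documents are made up for illustration. It assumes each document is already available as a dictionary mapping word to TF-IDF value, and that the words are plain tokens that need no ARFF quoting and do not collide with the attribute name "class".

```python
def build_arff(relation, documents, classes):
    """Return ARFF text for a list of (tfidf_by_word, label) pairs."""
    # Vocabulary: every word seen in any document, in sorted order.
    vocabulary = sorted({word for tfidf, _ in documents for word in tfidf})

    lines = [f"@RELATION {relation}"]
    for word in vocabulary:
        # Assumption: words are simple tokens that need no quoting.
        lines.append(f"@ATTRIBUTE {word} real")
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines.append("")
    lines.append("@data")

    for tfidf, label in documents:
        # A word missing from a document gets the value 0.0.
        values = [str(tfidf.get(word, 0.0)) for word in vocabulary]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


# Hypothetical input: two documents with made-up class labels.
docs = [
    ({"together": 0.235, "come": 0.2548}, "cacm"),
    ({"come": 0.1}, "cisi"),
]
arff_text = build_arff("docs", docs, ["cacm", "cisi", "cran", "med"])
print(arff_text)
```

Writing arff_text to a file with an .arff extension gives a file with the structure described above.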

After you generate this file, you can give it as input to InfoGainAttributeEval.java. This worked for me.
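
As for computing information gain without Weka, the quantity itself is small enough to code by hand: IG(class; word) = H(class) - H(class | word). The sketch below is an illustration under a simplifying assumption, not a reimplementation of InfoGainAttributeEval: it turns each TF-IDF column into a binary "word present / word absent" feature using a threshold, whereas Weka discretizes numeric attributes in its own way, so the scores will generally not match Weka's. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold=0.0):
    """Information gain of the binary feature (value > threshold).

    Assumption: a word counts as 'present' in a document when its
    TF-IDF value is above the threshold.
    """
    present = [lab for v, lab in zip(values, labels) if v > threshold]
    absent = [lab for v, lab in zip(values, labels) if v <= threshold]
    total = len(labels)
    # H(class | word): entropy of each part, weighted by its share.
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


# Hypothetical data: one TF-IDF column per word, one label per document.
labels = ["cacm", "cacm", "cisi", "cisi"]
word_a = [0.5, 0.3, 0.0, 0.0]   # occurs only in cacm documents
word_b = [0.2, 0.0, 0.2, 0.0]   # occurs equally in both classes

gain_a = information_gain(word_a, labels)
gain_b = information_gain(word_b, labels)
print(gain_a, gain_b)
```

Here word_a separates the two classes perfectly and gets the maximum gain for two balanced classes (1 bit), while word_b carries no information about the class and gets a gain of 0. Ranking all words by this score and keeping the top ones is the usual way to use information gain for feature selection.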



Source: https://stackoverflow.com/questions/21063206/information-gain-calculation-for-a-text-file
