Run the perceptron algorithm on a hash map feature vector: Java

轮回少年 2021-01-17 07:52

I have the following code. It reads many files from a directory into a hash map; this is my feature vector. It's somewhat naive in the sense that it does no st

2 answers

青春惊慌失措 (original poster)

2021-01-17 08:19
Let's establish some vocabulary up front (I guess you are using the 20 Newsgroups dataset):
- "Class label": what you're trying to predict; in your binary case this is "atheism" vs. the rest
- "Feature vector": what you feed into your classifier
- "Document": a single e-mail from the dataset
- "Token": a fragment of a document, usually a unigram, bigram, or trigram
- "Dictionary": the set of "allowed" words for your vector
The vectorization algorithm for bag of words usually follows these steps:
1. Go over all the documents (across all class labels) and collect all the distinct tokens; this is your dictionary, and its size is the dimensionality of your feature vector
2. Go over all the documents again and for each one:
  1. Create a new feature vector with the dimensionality of your dictionary (e.g. 200, for 200 entries in that dictionary)
  2. Go over all the tokens in that document and, for each token, store its word count (within this document) at that token's dimension of the feature vector
3. You now have a list of feature vectors that you can feed into your algorithm
Example:
```
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
```
Dictionary is:
```
["I", "am", "awesome", "great"]
```
So the documents as a vector would look like:
```
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
```
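The two passes above could be sketched in Java roughly as follows. This is only a minimal illustration, not the asker's code: the class name `BagOfWords` and the methods `buildDictionary` and `vectorize` are made up for this example, documents are assumed to be already tokenized into lists of strings, and `List.of` assumes Java 9 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named here only for illustration.
class BagOfWords {

    // Pass 1: give every distinct token an index, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> documents) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> document : documents) {
            for (String token : document) {
                // The next free index equals the current dictionary size.
                dictionary.putIfAbsent(token, dictionary.size());
            }
        }
        return dictionary;
    }

    // Pass 2: count the tokens of one document into a dense vector.
    static int[] vectorize(List<String> document, Map<String, Integer> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String token : document) {
            Integer index = dictionary.get(token);
            if (index != null) { // tokens missing from the dictionary are skipped
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> documents = List.of(
                List.of("I", "am", "awesome"),
                List.of("I", "am", "great", "great"));
        Map<String, Integer> dictionary = buildDictionary(documents);
        for (List<String> document : documents) {
            // Prints [1, 1, 1, 0] and then [1, 1, 0, 2]
            System.out.println(Arrays.toString(vectorize(document, dictionary)));
        }
    }
}
```

A `LinkedHashMap` is used so the dictionary keeps the order in which tokens were first seen, which makes the output match the example vectors. For a large vocabulary a sparse representation would likely be more practical than a dense `int[]`.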
And with that you can do all kinds of fancy math stuff and feed this into your perceptron.
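As a rough sketch of that last step, here is a plain binary perceptron trained on count vectors like the ones above. The class name `Perceptron`, its methods, and the labels (+1 for Document 1, -1 for Document 2) are assumptions made purely for illustration; in the 20 Newsgroups setting +1 would stand for "atheism" and -1 for the rest.

```java
// Hypothetical class, named here only for illustration.
class Perceptron {
    final double[] weights;
    double bias;

    Perceptron(int dimensions) {
        weights = new double[dimensions]; // all weights start at zero
    }

    // Returns +1 if the weighted sum plus bias is positive, otherwise -1.
    int predict(int[] x) {
        double sum = bias;
        for (int i = 0; i < x.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum > 0 ? 1 : -1;
    }

    // Classic perceptron rule: on each mistake, move the weights toward the
    // true label. Returns the number of mistakes made in the last epoch run.
    int train(int[][] xs, int[] ys, int maxEpochs) {
        int mistakes = 0;
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            mistakes = 0;
            for (int n = 0; n < xs.length; n++) {
                if (predict(xs[n]) != ys[n]) {
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] += ys[n] * xs[n][i];
                    }
                    bias += ys[n];
                    mistakes++;
                }
            }
            if (mistakes == 0) {
                break; // every training example is classified correctly
            }
        }
        return mistakes;
    }

    public static void main(String[] args) {
        int[][] xs = { {1, 1, 1, 0}, {1, 1, 0, 2} };
        int[] ys = { 1, -1 }; // made-up labels for the two example documents
        Perceptron perceptron = new Perceptron(4);
        perceptron.train(xs, ys, 10);
        System.out.println(perceptron.predict(xs[0]) + " " + perceptron.predict(xs[1]));
    }
}
```

The epoch cap matters because the perceptron only converges when the classes are linearly separable; on real text data it is common to stop after a fixed number of passes.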