I have the following code, which reads many files from a directory into a hash map; this is my feature vector. It's somewhat naive in the sense that it does no st
Let's establish some vocabulary up front (I guess you are using the 20-newsgroup dataset):
The vectorization algorithm for bag of words usually goes through these steps:
Example:
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
Dictionary is:
["I", "am", "awesome", "great"]
So the documents as vectors, with one count per dictionary word in dictionary order, would look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
With the documents in this form you can do the usual vector math on them and feed them into your perceptron.
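The example above can be reproduced with a short sketch. This is only an illustration, not your original code: it assumes the documents are already tokenized into lists of strings, and the names `build_dictionary` and `vectorize` are made up for this answer. Python is used for brevity; the same idea carries over to a hash map in any language.

```python
def build_dictionary(documents):
    """Map each distinct token to a column index, in order of first appearance."""
    index = {}
    for doc in documents:
        for token in doc:
            if token not in index:
                index[token] = len(index)
    return index


def vectorize(doc, index):
    """Count how often each dictionary token occurs in one document."""
    vector = [0] * len(index)
    for token in doc:
        position = index.get(token)
        if position is not None:  # tokens missing from the dictionary are skipped
            vector[position] += 1
    return vector


# The two documents from the example, already tokenized (an assumption).
documents = [
    ["I", "am", "awesome"],
    ["I", "am", "great", "great"],
]

index = build_dictionary(documents)
vectors = [vectorize(doc, index) for doc in documents]

print(list(index))  # ['I', 'am', 'awesome', 'great']
print(vectors)      # [[1, 1, 1, 0], [1, 1, 0, 2]]
```

Building the dictionary in a first pass over all documents keeps every vector the same length, which is what a perceptron expects as input. Skipping unknown tokens in `vectorize` is one possible choice for documents seen after the dictionary was built.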