How to calculate a term-document matrix?

那年仲夏 submitted on 2019-12-05 16:16:16

The output of CountVectorizer().fit_transform() is a sparse matrix, which means it stores only the non-zero elements of the matrix. When you call print(X), only those non-zero entries are displayed, as you can see in the image.

As for how the calculation is done, you can have a look at the official documentation here.

In its default configuration, CountVectorizer lowercases the given documents or raw text, tokenizes them (keeping only terms that have 2 or more characters), and counts the word occurrences.
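To make the tokenization rule concrete, here is a minimal sketch using only the standard library. It assumes the defaults lowercase=True and token_pattern=r"(?u)\b\w\w+\b"; the tokenize helper and the sample sentence are made up for illustration and are not part of scikit-learn.

```python
import re

# Assumed to mirror CountVectorizer's defaults: lowercase=True and
# token_pattern=r"(?u)\b\w\w+\b" (runs of two or more word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: lowercase the text, keep tokens of length >= 2."""
    return TOKEN_PATTERN.findall(text.lower())


# Illustrative sentence, not taken from the question.
tokens = tokenize("How to format my hard disk? I need a fix.")
print(tokens)
```

Single-character words such as "I" and "a" do not match the pattern, so they never reach the vocabulary.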

Basically, the steps are as follows:

  • Step 1 - Collect all distinct terms from all the documents passed to fit().

    For your data, they are [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']. This list is available from vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later).

  • Step 2 - In transform(), count how many times each term learned in fit() occurs in each document, and output the counts in the term-frequency matrix.

    In your case, you are supplying both documents to transform() (fit_transform() is a shorthand for fit() and then transform()). So, the result is

            disk  format  hard  how  my  problems  to
    First      1       1     1    1   1         0   1
    Sec        0       1     1    0   0         1   0

You can get the above result by calling X.toarray().
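The two steps above can be sketched in plain Python, without scikit-learn. This is only an approximation of what the library does, not its actual implementation. The two documents are hypothetical, chosen so that the counts match the matrix shown above; the tokenize helper is also an assumption.

```python
import re
from collections import Counter

# Hypothetical documents, chosen so the counts match the matrix above.
docs = ["How to format my hard disk", "Hard format problems"]


def tokenize(text):
    # Approximates the default analyzer: lowercase, keep tokens of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Step 1 (fit): collect every distinct term and sort it alphabetically.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

# Step 2 (transform): count each vocabulary term in each document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

A term that appears in a document passed to transform() but was never seen in fit() has no column, so it is simply ignored.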

In the image of the print(X) output you posted, the first column is the (row, column) index into the term-frequency matrix and the second is the frequency of that term.

<0,0> means first row, first column, i.e., the frequency of the term "disk" (first term in our tokens) in the first document = 1

<0,2> means first row, third column, i.e., the frequency of the term "hard" (third term in our tokens) in the first document = 1

<0,5> means first row, sixth column, i.e., the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. Since it is 0, it is not stored and therefore not displayed in your image.
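To check this reading of print(X) yourself, something like the following should work, assuming scikit-learn is installed. The two documents are hypothetical, chosen to reproduce the matrix in this answer. The term list is rebuilt from vocabulary_ so the snippet does not depend on which get_feature_names variant your version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, chosen to reproduce the matrix in this answer.
docs = ["How to format my hard disk", "Hard format problems"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix

# vocabulary_ maps each term to its column index; order the terms by column.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Walk the stored entries in coordinate form: (row, column) -> count.
coo = X.tocoo()
stored = {(int(r), int(c)): int(v) for r, c, v in zip(coo.row, coo.col, coo.data)}
for (row, col), count in sorted(stored.items()):
    print((row, col), terms[col], count)

# Zero entries such as (0, 5) are not stored at all.
print((0, 5) in stored)

# The dense view fills the zeros back in.
print(X.toarray())
```

The entries are sorted before printing because the order in which the sparse matrix stores them is not necessarily by column.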
