In count vectorizer which axis to use?

Submitted by 放荡痞女 on 2020-03-25 05:52:10

Question


I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so the sentences act as the documents. I apply 'l2' normalization after creating the document-term matrix.

The term counts are important to me because I will use SVD on this matrix for summarization in later steps.

My question is which axis is appropriate for 'l2' normalization. From my research, I understand:

  • Axis=1 : Gives the importance of a word within a sentence (each sentence row is normalized across its columns).
  • Axis=0 : Gives the importance of a word within the whole document (each word column is normalized across the rows).
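The two options can be compared on a toy matrix. The sketch below is plain Python with made-up counts, not the asker's data; the helper name `l2_normalize` is hypothetical. It follows the axis convention of scikit-learn's `sklearn.preprocessing.normalize(X, norm='l2', axis=...)`, where `axis=1` rescales each row (sentence) and `axis=0` rescales each column (word).

```python
import math

# Toy sentence-by-word count matrix (hypothetical data):
# rows are sentences, columns are words.
counts = [
    [3, 4, 0],
    [0, 0, 5],
]

def l2_normalize(matrix, axis):
    """Scale rows (axis=1) or columns (axis=0) to unit Euclidean length."""
    if axis == 1:
        vectors = [list(row) for row in matrix]
    elif axis == 0:
        # Transpose so that each column becomes a vector to normalize.
        vectors = [list(col) for col in zip(*matrix)]
    else:
        raise ValueError("axis must be 0 or 1")

    scaled = []
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec))
        # Leave all-zero vectors as zeros instead of dividing by zero.
        scaled.append([v / norm if norm else 0.0 for v in vec])

    if axis == 0:
        # Transpose back to the sentences x words layout.
        scaled = [list(row) for row in zip(*scaled)]
    return scaled

by_sentence = l2_normalize(counts, axis=1)  # each sentence has length 1
by_word = l2_normalize(counts, axis=0)      # each word column has length 1

print(by_sentence)
print(by_word)
```

With `axis=1` every sentence vector ends up with unit length, so sentences of different lengths become comparable. With `axis=0` every word column ends up with unit length, so each value reflects how much of that word's total weight falls in a given sentence.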

Even knowing the theory, I cannot decide which alternative to choose, because the choice will greatly affect my summarization results. Please suggest which axis to use, along with the reasoning behind it.


Answer 1:


By L2 normalization, do you mean division by the total count? If you normalize along axis=0, then the value x_{i,j} is the probability of word j across all sentences i (division by the global count of that word). This value depends on the length of the sentence: a longer sentence can repeat some words over and over, so it contributes a large share of the global word count and gets a much higher probability for that word. If you normalize along axis=1, then you are asking whether sentences have the same composition of words, because you normalize by the length of each sentence.
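The length effect described above can be seen with a small example. This is a minimal sketch with made-up counts that uses division by the total count, as the answer describes; strictly speaking that is sum-based (L1) scaling, whereas L2 would divide by the square root of the sum of squares. The variable names are illustrative only.

```python
# Hypothetical counts: rows are sentences, columns are the words "cat" and "dog".
counts = [
    [1, 1],    # short sentence
    [10, 1],   # long sentence that repeats "cat"
]

# axis=0: divide each count by that word's total over all sentences.
word_totals = [sum(col) for col in zip(*counts)]
share_of_word = [
    [c / t if t else 0.0 for c, t in zip(row, word_totals)]
    for row in counts
]

# axis=1: divide each count by the length of its own sentence.
share_of_sentence = [
    [c / sum(row) if sum(row) else 0.0 for c in row]
    for row in counts
]

print(share_of_word)      # the long sentence holds 10/11 of all "cat" counts
print(share_of_sentence)  # each row sums to 1, regardless of sentence length
```

Along axis=0 the long sentence dominates the "cat" column simply because it is long. Along axis=1 each sentence is described by its own word proportions, so the short sentence is not penalized for its length.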



Source: https://stackoverflow.com/questions/60793533/in-count-vectorizer-which-axis-to-use
