How is term frequency calculated in TfidfVectorizer?

耶瑟儿~ 2020-12-03 20:32

I have searched a lot to understand this, but I have not been able to. I understand that by default TfidfVectorizer will apply l2 normalization on term frequency. This a

1 answer
  • 2020-12-03 20:56

    OK, now let's go through the documentation I gave in the comments step by step:

    Documents:

    `ખુબ વખાણ કરે છે
     ખુબ વધારે છે`
    
    1. Get all unique terms (features): ['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
    2. Calculate the frequency of each term in each document:

      a. Each term in document1 [ખુબ વખાણ કરે છે] occurs once, and વધારે does not occur.

      b. So the term frequency vector (ordered by the sorted features) is [1 1 1 1 0]

      c. Applying steps a and b to document2, we get [0 1 1 0 1]

      d. So the final term-frequency matrix is [[1 1 1 1 0], [0 1 1 0 1]]

      Note: This is the term frequency you want.
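
Steps 1 and 2 can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: it assumes whitespace tokenization (the tokenizer used in the question is not shown here), and the names `docs`, `vocabulary`, and `tf` are chosen for this example.

```python
# Hypothetical sketch: whitespace tokenization stands in for the real tokenizer.
docs = ["ખુબ વખાણ કરે છે", "ખુબ વધારે છે"]

tokenized = [doc.split() for doc in docs]

# Step 1: unique terms, sorted to fix the column order.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Step 2: raw count of each vocabulary term in each document.
tf = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
print(tf)  # [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]]
```

Each row of `tf` is one document and each column is one sorted feature, which is the layout used in the remaining steps.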

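    3. Now find the IDF (this is computed per feature, not per document):

      idf(term) = log(number of documents / number of documents with this term) + 1

      Here log is the natural logarithm. The 1 added outside the logarithm ensures that terms occurring in every document (where the log is 0) are not ignored entirely. This is the formula used when smooth_idf=False. With smooth_idf=True, which is the default, 1 is also added to both the numerator and the denominator inside the logarithm to prevent zero divisions. The values below follow the unsmoothed formula.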

      idf('કરે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
      
      idf('ખુબ') = log(2/2)+1 = 0 + 1 = 1
      
      idf('છે.') = log(2/2)+1 = 0 + 1 = 1
      
      idf('વખાણ') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
      
      idf('વધારે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
      

      Note: This corresponds to the data you showed in the question.
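
The IDF values above can be reproduced with a short sketch. The helper `idf` and its `smooth` flag are hypothetical names for this example. The smoothed branch follows the formula scikit-learn documents for smooth_idf=True, log((1 + n) / (1 + df)) + 1, while the unsmoothed branch is the one that yields the numbers in this walkthrough.

```python
import math

# Term-frequency matrix from step 2 (rows: documents, columns: sorted features).
tf = [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]]

def idf(tf_matrix, smooth=False):
    """Inverse document frequency per feature (hypothetical helper)."""
    n_docs = len(tf_matrix)
    result = []
    for column in zip(*tf_matrix):
        df = sum(1 for count in column if count > 0)  # documents containing the term
        if smooth:
            # Smoothed variant: acts as if one extra document contained every term.
            result.append(math.log((1 + n_docs) / (1 + df)) + 1)
        else:
            result.append(math.log(n_docs / df) + 1)
    return result

unsmoothed = idf(tf)
smoothed = idf(tf, smooth=True)
print([round(v, 5) for v in unsmoothed])  # [1.69315, 1.0, 1.0, 1.69315, 1.69315]
print([round(v, 5) for v in smoothed])
```

Comparing the two printed lists shows how much the choice of smooth_idf shifts the weight of terms that occur in only one document.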

    4. Now calculate TF-IDF (this is again calculated per document, following the sorted feature order):

      a. For document1:

       For 'કરે', tf-idf = tf(કરે) x idf(કરે) = 1 x 1.69314 = 1.69314

       For 'ખુબ', tf-idf = tf(ખુબ) x idf(ખુબ) = 1 x 1 = 1

       For 'છે.', tf-idf = tf(છે.) x idf(છે.) = 1 x 1 = 1

       For 'વખાણ', tf-idf = tf(વખાણ) x idf(વખાણ) = 1 x 1.69314 = 1.69314

       For 'વધારે', tf-idf = tf(વધારે) x idf(વધારે) = 0 x 1.69314 = 0

      So for document1, the final tf-idf vector is [1.69314 1 1 1.69314 0]

      b. Now normalization is done (l2, Euclidean):

      divisor = sqrt(sqr(1.69314)+sqr(1)+sqr(1)+sqr(1.69314)+sqr(0))
               = sqrt(2.8667230596 + 1 + 1 + 2.8667230596 + 0)
               = sqrt(7.7334461192)
               = 2.7809074272977876...

      Dividing each element of the tf-idf array by the divisor, we get:

      [0.6088445 0.3595948 0.3595948 0.6088445 0]

      Note: This is the tf-idf of the first document you posted in the question.

      c. Repeating steps a and b for document2, we get:

      [ 0. 0.453294 0.453294 0. 0.767494]
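
Step 4 for both documents can be checked with a small sketch. It assumes the unsmoothed IDF values from step 3 and implements the l2 normalization by hand; `l2_normalize` is a hypothetical helper name, not a scikit-learn function.

```python
import math

# Term frequencies from step 2 and unsmoothed IDF values from step 3.
tf = [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]]
idf = [math.log(2 / 1) + 1, 1.0, 1.0, math.log(2 / 1) + 1, math.log(2 / 1) + 1]

def l2_normalize(vector):
    """Divide a vector by its Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

# Multiply tf by idf element-wise, then normalize each document row.
tfidf = [l2_normalize([t * i for t, i in zip(row, idf)]) for row in tf]

for row in tfidf:
    print([round(v, 4) for v in row])
# [0.6088, 0.3596, 0.3596, 0.6088, 0.0]
# [0.0, 0.4533, 0.4533, 0.0, 0.7675]
```

Full-precision arithmetic differs from the hand calculation from about the sixth decimal place, because the walkthrough truncates the IDF to 1.69314.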

    Update: About sublinear_tf = True OR False

    Your original term frequency vector is [[1 1 1 1 0], [0 1 1 0 1]] and you are correct in your understanding that using sublinear_tf = True will change the term frequency vector.

    new_tf = 1 + log(tf)
    

    The above line applies only to the non-zero elements of the term-frequency vector, because log(0) is undefined.

    And all your non-zero entries are 1. log(1) is 0, so 1 + log(1) = 1 + 0 = 1.

    You see that the values remain unchanged for elements with value 1. So your new_tf = [[1 1 1 1 0], [0 1 1 0 1]] = tf(original).

    So sublinear_tf is applied to your term frequencies, but since every non-zero count is 1, the values stay the same.

    Hence all subsequent calculations, and the output, are the same whether you use sublinear_tf=True or sublinear_tf=False.

    If you change your documents so that the term-frequency vector contains elements other than 1 and 0, you will see differences when using sublinear_tf.
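
The effect of the sublinear scaling can be seen in a minimal sketch. The function `sublinear` is a hypothetical helper that applies 1 + log(tf) to non-zero counts only, and the row `[3, 1, 0]` is made-up data used to show a case where the values do change.

```python
import math

def sublinear(tf_row):
    # Replace each non-zero count with 1 + ln(count); zeros stay zero.
    return [1 + math.log(count) if count > 0 else 0.0 for count in tf_row]

print(sublinear([1, 1, 1, 1, 0]))  # [1.0, 1.0, 1.0, 1.0, 0.0]
print(sublinear([3, 1, 0]))        # first entry becomes 1 + ln(3), about 2.0986
```

A count of 1 maps to itself, so only terms repeated within a document are affected.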

    Hope your doubts are cleared now.
