Keep the word frequency and inverse for one type of documents

Submitted by 匆匆过客 on 2021-01-29 20:47:08

Question


Code example to keep the term and inverse frequency:

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
    unnest_tokens(word, text) %>%
    count(book, word, sort = TRUE)

total_words <- book_words %>% 
    group_by(book) %>% 
    summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

book_words <- book_words %>%
    bind_tf_idf(word, book, n)

book_words %>%
    select(-total) %>%
    arrange(desc(tf_idf))

My problem is that this example uses multiple books.

I have a different data structure:

dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))

In dataset1, every row is one unique document. I would like to get the same term frequency and inverse document frequency output, but I don't know how to produce it from this structure. How can I start?

An alternative option, starting from a term frequency calculation like this:

library(quanteda)
myDfm <- dataset1$text %>%
    corpus() %>%                    
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_ngrams(n = 1:2) %>%
    dfm()

How can I get the same result as tidytext, that is, a tf-idf score for every word, using the quanteda package?

What I tried:

number_of_docs <- nrow(myDfm)
term_in_docs <- colSums(myDfm > 0)
idf <- log2(number_of_docs / term_in_docs)

# Compute TF

tf <- as.vector(myDfm)

# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(myDfm)
sort(tf_idf, decreasing = T)[1:5]

Is this the right way to get the tf-idf for every word with quanteda?

The output I want is the word, its term frequency, and its tf-idf value.


Answer 1:


If I understand the question correctly, you want to get a tf-idf per word across your three different documents - in other words, an output data.frame that is unique by word.

The problem is that you cannot do this with tf-idf, because tf-idf multiplies the term frequency by the inverse document frequency (idf): the log of the number of documents divided by the number of documents that contain the term. When you combine the three documents into one, every term occurs in that single combined document, so its document frequency is 1, equal to the number of documents. The idf is then log(1/1) = 0, and the tf-idf for every word of the combined document is zero. I've shown this below.
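
Written out, the argument above looks like this. The notation is my own shorthand (N is the number of documents, df(t) the number of documents containing term t), and the log base does not affect the conclusion:

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},
\qquad
N = 1,\ \mathrm{df}(t) = 1
\;\Rightarrow\;
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{1}{1} = 0 .
\]
```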

tf-idf differs for the same word from one document to the next. That's why the tidytext example shows each word by book, not once for the whole corpus.
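
For the tidytext route, the same per-document idea can be sketched by using the row identifier in place of book. This is a minimal sketch, not a tested solution for the real data: docs is a hypothetical three-row stand-in for dataset1 with shortened text, and it assumes anumber uniquely identifies each document.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for dataset1: one row per document
docs <- data.frame(
  anumber = c(1, 2, 3),
  text = c("lorem ipsum is dummy text",
           "lorem ipsum has survived",
           "lorem ipsum is not random text"),
  stringsAsFactors = FALSE
)

doc_words <- docs %>%
  unnest_tokens(word, text) %>%          # one row per token
  count(anumber, word, sort = TRUE) %>%  # term counts per document
  bind_tf_idf(word, anumber, n) %>%      # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

print(doc_words)
```

Note that bind_tf_idf() uses a proportional term frequency and the natural log, so its numbers will not match the defaults of quanteda's dfm_tfidf(), which uses raw counts and log base 10.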

Here's how to do this in quanteda by document:

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.1

myDfm <- dataset1 %>%
  corpus(docid_field = "anumber") %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()

myDfm %>%
  dfm_tfidf() %>%
  convert(to = "data.frame") %>%
  tidyr::gather(key = "word", value = "tf_idf", -doc_id) %>%
  tibble::as_tibble()
## # A tibble: 744 x 3
##    doc_id word   tf_idf
##    <chr>  <chr>   <dbl>
##  1 1      lorem   0    
##  2 2      lorem   0    
##  3 3      lorem   0    
##  4 1      ipsum   0    
##  5 2      ipsum   0    
##  6 3      ipsum   0    
##  7 1      is      0.176
##  8 2      is      0    
##  9 3      is      0.176
## 10 1      simply  0.176
## # … with 734 more rows
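
As for the manual attempt in the question: as.vector() flattens the whole document-by-term matrix into one long vector, so the idf values no longer line up with their terms. A minimal base R sketch of the by-hand calculation follows. The counts matrix is a hypothetical stand-in for what as.matrix(myDfm) would return, and the raw counts with log base 10 are chosen to mirror what I understand to be the dfm_tfidf() defaults.

```r
# Hypothetical count matrix: rows are documents, columns are terms
counts <- matrix(
  c(2, 1, 1,
    1, 0, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("1", "2", "3"),
                  features = c("lorem", "is", "printing"))
)

n_docs   <- nrow(counts)             # number of documents
doc_freq <- colSums(counts > 0)      # documents containing each term
idf      <- log10(n_docs / doc_freq) # inverse document frequency per term

# Multiply every column (term) by its own idf, keeping the matrix shape
tf_idf <- sweep(counts, MARGIN = 2, STATS = idf, FUN = "*")

print(round(tf_idf, 3))
```

Here "lorem" occurs in all three documents and scores zero everywhere, while "is" occurs in two of three and scores log10(3/2), about 0.176, in the documents that contain it.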

If you use dfm_group() to combine all documents, then you can see that the tf-idf is zero for all words.

myDfm %>%
  dfm_group(groups = rep(1, ndoc(myDfm))) %>%
  dfm_tfidf() %>%
  convert(to = "data.frame") %>%
  dplyr::select(-doc_id) %>%
  tidyr::gather(key = "word", value = "tf_idf") %>%
  tibble::as_tibble()
## # A tibble: 247 x 2
##    word     tf_idf
##    <chr>     <dbl>
##  1 lorem         0
##  2 ipsum         0
##  3 is            0
##  4 simply        0
##  5 dummy         0
##  6 text          0
##  7 of            0
##  8 the           0
##  9 printing      0
## 10 and           0
## # … with 237 more rows


Source: https://stackoverflow.com/questions/63449628/keep-the-word-frequency-and-inverse-for-one-type-of-documents
