Create a term frequency matrix using 2 columns from a csv file, in R?

Submitted by 谁说我不能喝 on 2019-12-06 12:31:41

Question


I'm new to R. I'm mining data stored in a CSV file: report summaries in one column, the report date in another column, and the reporting agency in a third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and written them to a new CSV file.

How can I create a term frequency matrix with years as rows and terms as columns, so that I can find the most frequent terms and do some clustering?

Basically, I need a term frequency matrix of terms against year.

Input data: (csv)
**Year**    **Summary** (around 300 words each)    
1945             <text>
1985             <text>
2011             <text>

Desired output: (Term frequency matrix)

       term1     term2    term3  term4 .......
1945     3         5        7       8 .....
1985     1         2        0       7  .....
2011      .            .   .    

Any help would be greatly appreciated.

Answer 1:


In the future please provide a minimal working example.

This uses qdap rather than tm, as it fits your data type better:

library(qdap)
# create a fake data set from qdap's built-in DATA (please do this yourself in the future)
dat <- data.frame(year=1945:(1945+10), summary=DATA$state)

##    year                               summary
## 1  1945         Computer is fun. Not too fun.
## 2  1946               No it's not, it's dumb.
## 3  1947                    What should we do?
## 4  1948                  You liar, it stinks!
## 5  1949               I am telling the truth!
## 6  1950                How can we be certain?
## 7  1951                      There is no way.
## 8  1952                       I distrust you.
## 9  1953           What are you talking about?
## 10 1954         Shall we move on?  Good then.
## 11 1955 I'm hungry.  Let's eat.  You already?

Now to create the word frequency matrix (similar to a term document matrix):

t(with(dat, wfm(summary, year)))

##      about already am are be ... you
## 1945     0       0  0   0  0       0
## 1946     0       0  0   0  0       0
## 1947     0       0  0   0  0       0
## 1948     0       0  0   0  0       1
## 1949     0       0  1   0  0       0
## 1950     0       0  0   0  1       0
## 1951     0       0  0   0  0       0
## 1952     0       0  0   0  0       1
## 1953     1       0  0   1  0       1
## 1954     0       0  0   0  0       0
## 1955     0       1  0   0  0       1
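
If it helps to see what a year-by-term count matrix amounts to, here is a minimal base R sketch. It is not how qdap implements `wfm`; the `reports` data frame, its `Year` and `Summary` column names, the commented-out file name, and the `tokenize` helper are all assumptions made for illustration, and the tokenization is deliberately crude (no stop-word removal or stemming).

```r
# Hypothetical stand-in for the filtered CSV: one row per report.
reports <- data.frame(
  Year = c(1945, 1945, 1985, 2011),
  Summary = c("Fraud was alleged.", "The fraud audit found fraud.",
              "Audit of agency funds.", "No fraud found in funds."),
  stringsAsFactors = FALSE
)
# With a real file this would be something like (file name is an assumption):
# reports <- read.csv("fraud_reports.csv", stringsAsFactors = FALSE)

# Hypothetical helper: lower-case the text, replace anything that is not a
# letter, apostrophe or space with a space, then split on whitespace.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)
  words <- unlist(strsplit(text, "\\s+"))
  words[nzchar(words)]
}

# Build one (year, term) pair per word occurrence, then cross-tabulate.
tokens <- lapply(reports$Summary, tokenize)
long <- data.frame(
  year = rep(reports$Year, lengths(tokens)),
  term = unlist(tokens),
  stringsAsFactors = FALSE
)

# Years as rows, terms as columns; reports from the same year are pooled.
tf <- unclass(table(long$year, long$term))
tf[, c("fraud", "audit", "funds")]
```

Because the rows are keyed by year, several reports from the same year are counted together, which is the same grouping that passing `year` as the second argument to `wfm` gives you.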

Or, as of qdap version 1.1.0, you can create a true DocumentTermMatrix:

with(dat, dtm(summary, year))

## > with(dat, dtm(summary, year))
## A document-term matrix (11 documents, 41 terms)
## 
## Non-/sparse entries: 51/400
## Sparsity           : 89%
## Maximal term length: 8 
## Weighting          : term frequency (tf)
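
Since the question also asks about top terms and clustering, here is a rough base R sketch of that next step on a plain year-by-term matrix (for example the result of `t(with(dat, wfm(summary, year)))` coerced with `as.matrix`). The counts and term names below are made up for illustration, and the choice of proportions, Euclidean distance and average linkage is just one reasonable option, not a recommendation from qdap or tm.

```r
# Hypothetical year-by-term counts (invented numbers, invented terms).
tf <- matrix(
  c(5, 1, 0, 4,
    2, 6, 2, 0,
    8, 2, 1, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1945", "1985", "2011"),
                  c("fraud", "audit", "claim", "waste"))
)

# Most frequent terms across all years.
top_terms <- head(sort(colSums(tf), decreasing = TRUE), 2)

# Turn counts into within-year proportions so that years with more text
# do not dominate the distances (each row is divided by its own total).
tf_prop <- tf / rowSums(tf)

# Hierarchical clustering of years on Euclidean distance.
year_clusters <- hclust(dist(tf_prop), method = "average")
groups <- cutree(year_clusters, k = 2)

top_terms
groups
```

`plot(year_clusters)` draws the dendrogram if you want to eyeball which years group together. To cluster terms instead of years, transpose the matrix before calling `dist`.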


Source: https://stackoverflow.com/questions/16677292/create-a-term-frequency-matrix-using-2-columns-from-a-csv-file-in-r
