My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier.
You have a few choices. @TylerRinker commented about qdap, which is certainly a way to go.
Alternatively (or additionally), you could also benefit from a healthy dose of parallelism. There's a nice CRAN page detailing HPC resources in R. It's a bit dated, though, and the multicore package's functionality is now contained within parallel.
You can scale up your text mining using the multicore apply functions of the parallel package or with cluster computing (also supported by that package, as well as by snowfall and biopara).
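For instance, here's a minimal sketch (mine, not taken from those packages' documentation) of chunked preprocessing with parallel::mclapply. The docs vector is a stand-in for your own 4M rows, and note that mclapply falls back to serial execution on Windows:

library(tm)
library(parallel)

# Clean one chunk of raw text and hand back plain character strings
clean_chunk <- function(chunk) {
  corp <- VCorpus(VectorSource(chunk))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, content_transformer(removeNumbers))
  corp <- tm_map(corp, content_transformer(removePunctuation))
  sapply(corp, content)
}

docs       <- rep(c("Let the big dogs hunt", "No holds barred"), 1000) # stand-in data
n_cores    <- max(1, detectCores() - 1)
chunk_size <- ceiling(length(docs) / n_cores)
chunks     <- split(docs, ceiling(seq_along(docs) / chunk_size))
cleaned    <- unlist(mclapply(chunks, clean_chunk, mc.cores = n_cores))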
Another way to go is to employ a MapReduce approach. A nice presentation on combining tm and MapReduce for big data is available here. While that presentation is a few years old, all of the information is still current, valid and relevant. The same authors have a newer academic article on the topic, which focuses on the tm.plugin.dc plugin. To get around having a VectorSource instead of a DirSource, you can use coercion:
data("crude")
as.DistributedCorpus(crude)
If none of those solutions fit your taste, or if you're just feeling adventurous, you might also see how well your GPU can tackle the problem. There's a lot of variation in how well GPUs perform relative to CPUs, and this may be a good use case for them. If you'd like to give it a try, you can use gputools or the other GPU packages mentioned on the CRAN HPC Task View.
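For a quick taste, here's a hedged sketch, assuming gputools installs cleanly and a CUDA-capable card is present:

library(gputools)

# Same result as a %*% t(a), but the multiplication runs on the GPU
a <- matrix(runif(1e6), nrow = 1000)
b <- gpuMatMult(a, t(a))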
Example:
install.packages("tm.plugin.dc") # one-time install
library(tm)
library(tm.plugin.dc)

# Build a DistributedCorpus from a character vector and apply basic cleaning
GetDCorpus <- function(textVector) {
  doc.corpus <- as.DistributedCorpus(VCorpus(VectorSource(textVector)))
  doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
  doc.corpus <- tm_map(doc.corpus, content_transformer(removeNumbers))
  doc.corpus <- tm_map(doc.corpus, content_transformer(removePunctuation))
  # doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")) # won't accept this for some reason...
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = FALSE)
dcorp <- GetDCorpus(data[, 1])
tdm <- TermDocumentMatrix(dcorp)
inspect(tdm)
Output:
> inspect(tdm)
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 7
Weighting          : term frequency (tf)
         Docs
Terms     1 2 3
  barred  0 1 0
  big     1 0 0
  child   0 0 1
  dogs    1 0 0
  holds   0 1 0
  honor   0 0 1
  hunt    1 0 0
  let     1 0 0
  student 0 0 1
  the     1 0 0
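From here, handing the matrix off to a Bayesian classifier is short work. A hedged sketch using e1071::naiveBayes (the class labels are invented purely for illustration):

library(e1071)

# Documents as rows, terms as columns
m <- t(as.matrix(tdm))
# Binarize to presence/absence factors so naiveBayes builds count tables
# rather than fitting Gaussians to the raw term counts
x <- as.data.frame(m > 0)
x[] <- lapply(x, factor, levels = c(FALSE, TRUE))
y <- factor(c("a", "b", "a")) # hypothetical class labels
model <- naiveBayes(x, y, laplace = 1) # Laplace smoothing for unseen terms
predict(model, x)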