Print first line of one element of Corpus in R using tm package

Submitted by 我的梦境 on 2019-12-23 02:52:44

Question


How do you print a small sample, or just the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to check the results as I apply each cleaning step, so printing just the first line or first few lines of a corpus would be ideal.

# Load Libraries
library(tm)

# Read in Corpus
corp <- SimpleCorpus( DirSource( 
    "C:/TextDocument"))

# Remove punctuation (tm_map applies the transformation to every document)
corp <- tm_map(corp, removePunctuation,
               preserve_intra_word_contractions = TRUE,
               preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways:

# Print first line of first element of corpus
corp[[1]][[1]] 

# Print first line using 'content' element of corpus
corp[[1]]$content[[1]]

Both of these result in very long run times without the desired output.

The crude corpus in the tm package can be used for example purposes.

data("crude")

Answer 1:


strwrap does this job nicely: it reformats the text into paragraphs, breaking lines at word boundaries (see ?strwrap). You can then use the head function to see the first 6 lines.

 head(strwrap(corp))
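
To look at a single document rather than the whole corpus, one option is to pull out that document's text first and only then trim it. The sketch below uses the crude corpus mentioned in the question; the names doc_text, preview and small are just illustrative. It assumes each document is stored as one long string, in which case there is no separate "first line" to index, and taking the first few hundred characters with substr is likely cheaper on very large documents than wrapping the full text.

```r
library(tm)

# Example corpus shipped with tm
data("crude")

# Text of the first document only, as a character vector
doc_text <- content(crude[[1]])

# Peek at the first 200 characters without reformatting the whole text
preview <- substr(doc_text, 1, 200)
preview

# First line as stored in the document (split on embedded newlines)
head(unlist(strsplit(doc_text, "\n", fixed = TRUE)), 1)

# First three lines after re-wrapping to 60 characters
head(strwrap(doc_text, width = 60), 3)

# To test a cleaning step, apply it to a small subset of the corpus first
small <- tm_map(crude[1:2], removePunctuation,
                preserve_intra_word_dashes = TRUE)
head(strwrap(content(small[[1]]), width = 60), 3)
```

The same pattern should carry over to a corpus read with DirSource: replace crude with corp and check a handful of documents before running the transformation over everything.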


Source: https://stackoverflow.com/questions/49951708/print-first-line-of-one-element-of-corpus-in-r-using-tm-package
