How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to check the results as I apply each cleaning procedure. Printing just the first line, or the first few lines, of a corpus would be ideal.
# Load Libraries
library(tm)
# Read in Corpus
corp <- SimpleCorpus(DirSource("C:/TextDocument"))
# Remove punctuation (transformations are applied to a corpus via tm_map)
corp <- tm_map(corp, removePunctuation,
               preserve_intra_word_contractions = TRUE,
               preserve_intra_word_dashes = TRUE)
I have tried accessing the corpus several ways:
# Print first line of first element of corpus
corp[[1]][[1]]
# Print first line using 'content' element of corpus
corp[[1]]$content[[1]]
Both of these result in very long run times without the desired output.
The crude corpus in the tm package can be used for example purposes.
data("crude")
strwrap does this job nicely, since it prints your paragraphs formatted by breaking lines at word boundaries (see ?strwrap). Then you can use the head function to see the first 6 lines.
head(strwrap(corp))
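A minimal sketch of the same idea, assuming the tm package is installed. The sample text and the names docs, small_corp, sample_text and cleaned are hypothetical, made up for illustration. It pulls the text out with content() first, so only a short character string is wrapped and printed, and it tries the cleaning step on that small sample before touching the whole corpus.

```r
library(tm)

# Tiny in-memory stand-in for a large corpus (hypothetical sample text).
docs <- c("Hello, world! It's a well-known test, nothing more.",
          "Second document; more punctuation here...")
small_corp <- SimpleCorpus(VectorSource(docs))

# For a SimpleCorpus, content() returns a character vector with one
# element per document. Take only the start of the first document.
sample_text <- substr(content(small_corp)[1], 1, 40)
print(sample_text)

# Try the cleaning step on the small sample only.
cleaned <- removePunctuation(sample_text,
                             preserve_intra_word_contractions = TRUE,
                             preserve_intra_word_dashes = TRUE)
print(cleaned)

# The same preview on the crude example corpus shipped with tm:
# wrap the first document at word boundaries and keep three lines.
data("crude")
first_lines <- head(strwrap(content(crude[[1]]), width = 70), 3)
print(first_lines)
```

Changing the second argument of head, or the end position in substr, controls how much text is printed.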
Source: https://stackoverflow.com/questions/49951708/print-first-line-of-one-element-of-corpus-in-r-using-tm-package