I have a number of PDF documents, which I have read into a corpus with the tm
library. How can one break the corpus into sentences?
It can
I implemented the following code to solve the same problem using the tokenizers
package.
# Iterate over a list or vector of strings (textList) and split each string
# into sentences. tokenize_sentences() detects sentence boundaries such as
# those after periods or question marks.
sentences = purrr::map(.x = textList, function(x) {
  return(tokenizers::tokenize_sentences(x))
})
# The code above will return a list of character vectors so unlist
# to give you a character vector of all the sentences
sentences = unlist(sentences)
# Create a corpus from the sentences (VCorpus and VectorSource are from tm)
corpus = tm::VCorpus(tm::VectorSource(sentences))
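For completeness, here is a minimal end-to-end sketch of how the snippet above might be fed from an existing tm corpus. It rests on a few assumptions: the two strings in `docs` are made-up stand-ins for text already extracted from the PDFs, the names `docs`, `textList`, and `sentenceCorpus` are only illustrative, and each document is a plain text document whose content is a character vector. It also passes the whole vector to `tokenize_sentences()` in one call rather than mapping over it, which should give the same sentences.

```r
library(tm)

# Hypothetical stand-in for text that was already read from PDFs into a corpus
docs <- c("This is the first sentence. Here is a second one?",
          "Another document. It has two sentences.")
corpus <- VCorpus(VectorSource(docs))

# Collapse each document's content into a single string per document
textList <- vapply(corpus,
                   function(doc) paste(content(doc), collapse = " "),
                   character(1))

# tokenize_sentences() takes a character vector and returns a list with one
# character vector of sentences per input string, so flatten it with unlist()
sentences <- unlist(tokenizers::tokenize_sentences(textList), use.names = FALSE)

# Build a new corpus in which every document is a single sentence
sentenceCorpus <- VCorpus(VectorSource(sentences))
```

Note that the new corpus no longer records which original document each sentence came from. If that matters, one option is to keep the list returned by `tokenize_sentences()` and skip the `unlist()` step.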