R break corpus into sentences


面向向阳花

I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences?
It can


7 answers

予麋鹿

2020-12-14 04:24

I implemented the following code to solve the same problem using the tokenizers package.

library(tm)

# textList is assumed to be a list or character vector of strings,
# one per document. Iterate over it and split each string at sentence
# boundaries (periods, question marks, and so on)
sentences <- purrr::map(.x = textList, function(x) {
  tokenizers::tokenize_sentences(x)
})

# The code above returns a nested list of character vectors, so unlist
# to get a single character vector of all the sentences
sentences <- unlist(sentences)

# Create a corpus from the sentences
corpus <- VCorpus(VectorSource(sentences))
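
Since the question starts from a tm corpus rather than a plain list of strings, here is a minimal sketch of how the text might be pulled out of the corpus first. The names `docs`, `pdfCorpus`, and `sentenceCorpus` are made up for illustration, and the two sample strings only stand in for text read from PDFs; it also assumes each document is a plain text document, so treat it as one possible approach rather than the definitive one.

```r
library(tm)

# Hypothetical stand-in for a corpus that was read from PDF files
docs <- c("This is the first sentence. Here is a second one!",
          "Does this work? It should.")
pdfCorpus <- VCorpus(VectorSource(docs))

# Collapse each document's content into a single string
textList <- vapply(content(pdfCorpus),
                   function(doc) paste(as.character(doc), collapse = " "),
                   character(1))

# tokenize_sentences() accepts a character vector and returns a list
# with one character vector of sentences per input document
sentences <- unlist(tokenizers::tokenize_sentences(textList),
                    use.names = FALSE)

# Build a new corpus in which every document is one sentence
sentenceCorpus <- VCorpus(VectorSource(sentences))
```

Because `tokenize_sentences()` already works on a whole character vector, the explicit `purrr::map()` loop is optional here.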
