R break corpus into sentences

面向向阳花 2020-12-14 03:59
  1. I have a number of PDF documents, which I have read into a corpus with the tm package. How can I break the corpus into sentences?

  2. It can

7 Answers
  • 2020-12-14 04:24

    I implemented the following code to solve the same problem using the tokenizers package.

    library(tm)  # provides VCorpus() and VectorSource()

    # textList is a list or character vector of document strings.
    # Split each string into sentences at sentence boundaries.
    sentences <- purrr::map(textList, function(x) {
      tokenizers::tokenize_sentences(x)
    })

    # The result is a nested list of character vectors, so unlist
    # to get a single character vector of all the sentences
    sentences <- unlist(sentences)

    # Create a corpus from the sentences
    corpus <- VCorpus(VectorSource(sentences))
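
    The code above starts from textList, while the question starts from an existing tm corpus. A minimal sketch of bridging the two, assuming each document holds plain text: docs here is a hypothetical two-document stand-in for the corpus read from PDFs, and sentenceCorpus is just an illustrative name.

```r
library(tm)
library(tokenizers)

# Hypothetical stand-in for the corpus that was read from the PDF files
docs <- VCorpus(VectorSource(c(
  "This is the first sentence. Is this the second one? Yes it is.",
  "Another document starts here. It has two sentences."
)))

# Pull the text out of each document and collapse it into a single string,
# since a document's content may be split across several lines
textList <- vapply(
  docs,
  function(d) paste(content(d), collapse = " "),
  character(1)
)

# tokenize_sentences() accepts a character vector and returns a list with
# one character vector of sentences per input string
sentences <- unlist(tokenize_sentences(textList), use.names = FALSE)

# Rebuild a corpus in which every document is one sentence
sentenceCorpus <- VCorpus(VectorSource(sentences))
```

    Since tokenize_sentences() is already vectorised, the purrr::map() step is optional in this variant. Note that document-level metadata is not carried over to the sentence corpus.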
    