Extracting text data from PDF files

[愿得一人] 2020-12-02 11:24

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

7 answers
  •  陌清茗
    陌清茗 (original poster)
    2020-12-02 11:47

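    A solution driven entirely from R, using the tm package, could be (note that readPDF delegates the actual extraction to a PDF engine such as pdftotext, which must be available):

    library('tm')

    # Path to the PDF to read
    pdf_file <- 'namefile.pdf'

    # Build a PDF reader; "-layout" is a pdftotext option that keeps the
    # physical layout of the page (it applies when the pdftotext-based engine is used)
    Rpdf <- readPDF(control = list(text = "-layout"))

    # Read the file into a one-document corpus
    corpus <- VCorpus(URISource(pdf_file),
                      readerControl = list(reader = Rpdf))

    # Extract the text of the first (and only) document
    corpus.array <- content(corpus[[1]])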
    

    The result, corpus.array, is a character vector holding the lines of the PDF.
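
Once the lines are in a character vector, a common next step is to tidy them up. The following is only a minimal sketch in base R: `clean_pdf_lines` is a hypothetical helper (not part of tm), and `sample_lines` is made-up data standing in for `corpus.array`. It assumes the extracted text may contain form feeds at page breaks, as pdftotext output typically does.

```r
# Hypothetical helper (not part of tm): tidy lines extracted from a PDF.
clean_pdf_lines <- function(lines) {
  lines <- gsub("\f", "", lines, fixed = TRUE)  # drop form-feed characters (page breaks)
  lines <- trimws(lines)                        # strip leading/trailing whitespace
  lines[nzchar(lines)]                          # keep only non-empty lines
}

# Made-up sample standing in for corpus.array
sample_lines <- c("  Title of the report  ", "", "\fPage two starts here", "   ")
cleaned <- clean_pdf_lines(sample_lines)
print(cleaned)
```

Only the edges of each line are trimmed, so any column spacing preserved by "-layout" inside a line is left untouched.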
