Recognize PDF table using R

既然无缘 2020-11-28 11:27

I'm trying to extract data from tables inside some PDF reports.

I've seen some examples using pdftools and similar packages, and I was successful in getting the

2 answers

甜味超标 (original poster)

2020-11-28 11:45
I would love to know the answer to this as well. In my experience, though, you need regular expressions to get the data into the format you want. The following is an example:
```
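library(pdftools)

# pdf_text() returns one character string per page
dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
# Join all pages into a single string so a match can span page breaks
dat <- paste0(dat, collapse = " ")
# Match from the header text "Berufsfeuerwehr ... Straße" through the value 02366.39258
pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
# Turn line breaks into two spaces so they also act as field separators
extract <- gsub('\n', "  ", extract)
# Split into fields on runs of two or more whitespace characters
strsplit(extract, "\\s{2,}")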
```
From here you can loop over the resulting fields to build the table you want. But as you can see in the link, the PDF contains more than just a table, which is why the regular expression is needed to isolate the table first.
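As a rough sketch of that last step, the fields can be reshaped into a data frame. This version differs slightly from the snippet above: it splits on line breaks first so that row boundaries are kept, instead of flattening everything into one vector. The sample text, the column names and the assumed column count of 3 are hypothetical placeholders and not taken from the actual PDF, so the real report will likely need extra cleanup.

```r
# Hypothetical sample standing in for the text matched in the PDF.
# The real report's layout, headers and column count may differ.
extract <- paste(
  "Name  Street  Phone",
  "City A  Example Str. 1  0000.00001",
  "City B  Example Str. 2  0000.00002",
  sep = "\n"
)

# Split into lines first, then split each line on runs of 2+ whitespace characters.
lines <- strsplit(extract, "\n")[[1]]
rows <- strsplit(lines, "\\s{2,}")

# Keep only rows with the expected number of fields (assumed to be 3 here).
n_cols <- 3
rows <- rows[lengths(rows) == n_cols]

# Treat the first row as the header and the remaining rows as data.
tbl <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
names(tbl) <- rows[[1]]
print(tbl)
```

Filtering on the number of fields is a crude way to drop lines that are not table rows. Rows with empty cells would also be dropped by this filter, so it may need adjusting for a real document.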