pdf-scraping | 易学教程

search for specific text in large pdf by format, after decrypting it

Question: I've gotten to the point where I can locate, decrypt, open, and count the pages of my large PDF file. Now I simply want to grab the text below (it is on each page, at line 6). I'm wondering whether I should keep trying to gather it by line number (which I tried, but it errored out saying not indexed at 0), or try to regex backwards from the zipcode format. Text format needed from each page in the large PDF. I want to scrape it and put it into two variables (i.e. MemberName1 = ;
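Both approaches mentioned in the question can be sketched with the standard library alone. This is a minimal sketch, assuming the text of each page has already been extracted into a string and that the wanted line ends in a US-style ZIP code; the helper names `line_by_number` and `line_ending_in_zip` and the sample page are hypothetical, invented for illustration.

```python
import re

# Assumption: the target line ends in a US-style ZIP code
# (5 digits, optionally followed by a -4 suffix).
ZIP_AT_END = re.compile(r"\b\d{5}(?:-\d{4})?\s*$")


def line_by_number(page_text, line_number):
    """Return the line at a 1-based position, or None if the page is too short."""
    lines = page_text.splitlines()
    index = line_number - 1  # Python lists are 0-based, so "line 6" is index 5
    if 0 <= index < len(lines):
        return lines[index].strip()
    return None


def line_ending_in_zip(page_text):
    """Return the first line that ends in something shaped like a ZIP code."""
    for line in page_text.splitlines():
        if ZIP_AT_END.search(line):
            return line.strip()
    return None


# Hypothetical page text, not taken from the original question.
sample_page = "\n".join([
    "Header", "Report", "", "Section", "Date",
    "JANE DOE 123 MAIN ST SPRINGFIELD IL 62704",
    "Body text",
])

by_position = line_by_number(sample_page, 6)
by_pattern = line_ending_in_zip(sample_page)
```

Looking up by position is simpler but breaks if any page has an extra or missing line above the target; matching on the ZIP pattern tolerates that at the cost of assuming the pattern appears only where expected.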

Programmatically replace text in PDF

Question: I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version. It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs, so techniques like OCR are not needed. Also, it would be nice if the font and other text attributes were kept. Which libraries would you recommend for extracting the text to an easy-to-edit format (such as CSV) and putting the new text back in again? Answer 1:

Scraping large pdf tables which span across multiple pages

Question: I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resulting text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès": TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012 COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N Alt Camp VY Nulles 7,5 5,5 10,9 12,3

tm readPDF: Error in file(con, "r") : cannot open the connection

Question: I have tried the example code recommended in the tm::readPDF documentation: library(tm) if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) { uri <- system.file(file.path("doc", "tm.pdf"), package = "tm") pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = uri), language = "en", id = "id1") pdf[1:13] } But I get the following error (which occurs after calling the function returned by readPDF): Error in file(con, "r") : cannot open the connection In addition: Warning message

Python module for converting PDF to text [closed]

Question: Which are the best Python modules to convert PDF files into text? Answer 1: Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format. The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text. A Python 3 version is available under: https:/
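The tag-stripping step described in the answer can be sketched without PDFMiner itself. This is a rough illustration, assuming the tagged output has already been read into a string; the sample markup and the names `TextOnly` and `strip_tags` are invented here and are not PDFMiner's actual output or API. The standard-library `html.parser` is used because it tolerates markup that is not strictly well-formed.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the character data between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of a tagged string with every tag removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Hypothetical tagged text, made up for illustration only.
sample = '<Document><P MCID="0">Hello, </P><P MCID="1">world</P></Document>'
plain = strip_tags(sample)
```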

Working on tables in pdf using python

Question: I am working on a PDF file that contains a number of tables. Using the table names given in the PDF, I want to fetch the data from each table using Python. I have worked on HTML and XML parsing but never with PDF. Can anyone tell me how to fetch tables from a PDF using Python? Answer 1: I think that you need a Python parser library. The most famous is PDFMiner. According to the documentation: PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related

How to unlock a “secured” (read-protected) PDF in Python?

Question: In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp) PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0> When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this

Reading data from PDF files into R

Question: Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF? Or should I leave that to a command-line tool? The reports were made in Excel and then converted to PDF, so they have a regular structure, but many blank "cells". Answer 1: Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to

Scraping large pdf tables which span across multiple pages

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resulting text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès": TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012 COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9 Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8
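One small piece of this problem, separating the text label from the decimal-comma values in a single row, might look like the sketch below. It assumes one table row per line of pdftotext output; the function name `split_row` is hypothetical. It does not solve the missing-value problem: a row with a blank cell simply yields fewer values, and assigning them to the right months would need the column positions from the -layout output.

```python
import re

# Numbers in the sample use a decimal comma, e.g. "7,5" or "10,9".
NUMBER = re.compile(r"-?\d+,\d+")


def split_row(line):
    """Split a row into (label, values), converting decimal commas to floats."""
    first = NUMBER.search(line)
    if first is None:
        # No numeric cell found: treat the whole line as a label.
        return line.strip(), []
    label = line[:first.start()].strip()
    values = [
        float(token.replace(",", "."))
        for token in NUMBER.findall(line[first.start():])
    ]
    return label, values


# Row copied from the sample in the question.
row = "Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9"
label, values = split_row(row)
```

Counting `len(values)` per row gives a quick way to flag rows such as the "Solsonès" ones, where fewer values than months are present.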

How to unlock a “secured” (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp) PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0> When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this link however, I read that there is a multitude of services which can disable this read-protection