pdf-scraping

search for specific text in large pdf by format, after decrypting it

梦想的初衷 posted on 2020-01-06 05:50:07
Question: I've gotten to the point where I can locate, decrypt, open, and count the number of pages in my large PDF file. Now I simply want to grab the text below (it is on each page, at line 6). I'm wondering whether I should keep trying to gather it by line number (which I tried, but it errored out saying not indexed at 0), or try to regex backwards from the zip code format. Text format needed from each page in large PDF. Want to scrape and put into two variables. ( i.e. MemberName1 = ;
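
Both ideas from the question (pick line 6, or work backwards from the zip code) can be sketched with the standard library once each page's text is available as a string. This is only a sketch under assumptions: the sample page, the US ZIP pattern, and the helper names `line_at` and `lines_before_zip` are invented for illustration, since the question's actual text format is cut off.

```python
import re

# Assumed US ZIP format (12345 or 12345-6789) at the end of a line.
ZIP_AT_END = re.compile(r"\b\d{5}(?:-\d{4})?\s*$")


def line_at(page_text, line_number):
    """Return the 1-based line_number-th line of page_text, or None if the page is shorter."""
    lines = page_text.splitlines()
    index = line_number - 1  # Python lists are 0-indexed, so "line 6" is index 5
    if 0 <= index < len(lines):
        return lines[index].strip()
    return None


def lines_before_zip(page_text, count=1):
    """Return up to `count` non-empty lines directly above the first line ending in a ZIP code."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if ZIP_AT_END.search(ln):
            return lines[max(0, i - count):i]
    return []


# Made-up page text; a real page would come from whatever PDF library is in use.
sample_page = "\n".join([
    "Statement",
    "Page 1 of 300",
    "Account summary",
    "Prepared for:",
    "Member",
    "JANE Q DOE",
    "123 MAIN ST",
    "SPRINGFIELD, IL 62704",
])

member_name = line_at(sample_page, 6)
above_zip = lines_before_zip(sample_page, count=2)
```

The line-number approach is simpler but breaks as soon as one page has an extra line; anchoring on the zip code tolerates that, at the cost of assuming every page ends its address block with a ZIP.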

Programmatically replace text in PDF

拈花ヽ惹草 posted on 2019-12-21 05:14:24
Question: I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version. It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs, so techniques like OCR are not needed. Also, it would be nice if the font and other text attributes were kept. Which libraries would you recommend for extracting the text to an easy-to-edit format (such as CSV) and putting the new text back in again? Answer 1:

Scraping large pdf tables which span across multiple pages

我与影子孤独终老i posted on 2019-12-20 10:46:36
Question: I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":

    TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N
    Alt Camp VY Nulles 7,5 5,5 10,9 12,3

tm readPDF: Error in file(con, “r”) : cannot open the connection

给你一囗甜甜゛ posted on 2019-12-18 09:35:02
Question: I have tried the example code recommended in the tm::readPDF documentation:

    library(tm)
    if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
        uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
        pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = uri),
                                                     language = "en",
                                                     id = "id1")
        pdf[1:13]
    }

But I get the following error (which occurs after calling the function returned by readPDF):

    Error in file(con, "r") : cannot open the connection
    In addition: Warning message

Python module for converting PDF to text [closed]

雨燕双飞 posted on 2019-12-16 19:56:28
Question: Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow. Closed 4 years ago. Which are the best Python modules to convert PDF files into text? Answer 1: Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format. The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text. A Python 3 version is available under: https:/
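
The "strip out the XML tags" step mentioned in the answer needs nothing beyond the standard library. Below is a minimal sketch; the helper name `strip_tags` and the sample markup are made up for illustration and are not PDFMiner's actual output format.

```python
import html
import re

# Anything between "<" and ">" is treated as a tag.
TAG = re.compile(r"<[^>]+>")


def strip_tags(tagged_text):
    """Remove XML/HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG.sub(" ", tagged_text)   # a space keeps neighbouring words apart
    decoded = html.unescape(without_tags)      # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()


# Invented sample; real tagged output will look different.
sample = "<P><Span>Quarterly report</Span> <Span>Sales &amp; costs</Span></P>"
plain = strip_tags(sample)
```

Replacing each tag with a space is a deliberate choice: it avoids gluing adjacent words together, but it would also split a word that happens to span two tags, so a real XML parser may be the safer option for messy input.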

Working on tables in pdf using python

♀尐吖头ヾ posted on 2019-12-08 22:19:33
Question: I am working on a PDF file that contains a number of tables. I want to fetch the data from a table by the table name given in the PDF, using Python. I have worked on HTML and XML parsing but never with PDF. Can anyone tell me how to fetch tables from a PDF using Python? Answer 1: I think that you need a Python parser library. The best known is PDFMiner. According to the documentation: PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related

How to unlock a “secured” (read-protected) PDF in Python?

依然范特西╮ posted on 2019-12-03 05:09:16
Question: In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying:

      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this

Reading data from PDF files into R

你离开我真会死。 posted on 2019-12-03 00:46:00
Question: Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF? Or should I leave that to a command line tool? The reports were made in Excel and then converted to PDF, so they have regular structure, but many blank "cells". Answer 1: Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to

Scraping large pdf tables which span across multiple pages

∥☆過路亽.° posted on 2019-12-02 23:06:31
I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":

    TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012
    COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N
    Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9
    Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8
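
One way to cope with missing values in this kind of output is to keep each number's character position and match it against the header of the same page, instead of splitting on whitespace. The sketch below assumes that values are roughly right-aligned under their month headers, which is not guaranteed for real pdftotext -layout output; the helper names are hypothetical, and the "Solsonès" row uses an invented station code and invented values.

```python
import re

# A decimal written with a comma, such as "7,5" or "-0,3".
NUMBER = re.compile(r"-?\d+,\d+")


def parse_row(line):
    """Split a row into its text label and a list of (end_offset, value) pairs."""
    matches = list(NUMBER.finditer(line))
    if not matches:
        return " ".join(line.split()), []
    label = " ".join(line[:matches[0].start()].split())
    values = [(m.end(), float(m.group().replace(",", "."))) for m in matches]
    return label, values


def align_to_header(header, names, values):
    """Assign each value to the header name whose right edge is nearest.

    Columns that receive no value stay None, which marks a missing value.
    """
    edges = {}
    for name in names:
        # Word boundaries stop "MAR" from matching inside "COMARCA".
        edges[name] = re.search(r"\b" + re.escape(name) + r"\b", header).end()
    row = {name: None for name in names}
    for end, value in values:
        nearest = min(names, key=lambda name: abs(edges[name] - end))
        row[nearest] = value
    return row


# Synthetic fixed-width lines standing in for one page of layout output.
months = ["GEN", "FEB", "MAR"]
header = f"{'COMARCA':<10}{'CODI i NOM EMA':<18}{'GEN':>6}{'FEB':>6}{'MAR':>6}"
full = f"{'Alt Camp':<10}{'VY Nulles':<18}{'7,5':>6}{'5,5':>6}{'10,9':>6}"
gappy = f"{'Solsonès':<10}{'ZZ Invented':<18}{'':>6}{'3,1':>6}{'8,2':>6}"

full_label, full_values = parse_row(full)
gappy_row = align_to_header(header, months, parse_row(gappy)[1])
```

Because the layout differs from page to page, the header offsets would have to be recomputed for every page before aligning that page's rows.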

How to unlock a “secured” (read-protected) PDF in Python?

倾然丶 夕夏残阳落幕 posted on 2019-12-02 20:52:23
In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying:

      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
        raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this link however, I read that there's a multitude of services which can disable this read-protection
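
A local alternative to an online service is to rewrite the file without its restrictions before handing it to pdfminer. The sketch below assumes the qpdf command-line tool is installed, which the question does not mention; the function names are hypothetical. An empty password is only expected to work when the file opens without a password and merely carries usage restrictions.

```python
import shutil
import subprocess


def qpdf_decrypt_command(src, dst, password=""):
    """Build the argument list for `qpdf --decrypt`, adding --password only when one is given."""
    command = ["qpdf"]
    if password:
        command.append(f"--password={password}")
    command += ["--decrypt", src, dst]
    return command


def decrypt_pdf(src, dst, password=""):
    """Run qpdf if it is on PATH; return True on success, False if qpdf is unavailable."""
    if shutil.which("qpdf") is None:
        return False
    # check=True raises CalledProcessError if qpdf reports a failure.
    subprocess.run(qpdf_decrypt_command(src, dst, password), check=True)
    return True
```

Calling `decrypt_pdf("secured.pdf", "unlocked.pdf")` (both file names are placeholders) would then produce a copy that text extraction tools can read.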