pdf-parsing

How to detect table start in iTextSharp?

六月ゝ 毕业季﹏ posted on 2019-12-04 18:35:24
I am trying to convert a PDF to a CSV file. The PDF has data in tabular format with the first row as the header. I have reached the point where I can extract text from a cell, compare the baselines of text in the table, and detect a new line, but I need to compare table borders to detect the start of the table. I do not know how to detect and compare lines in a PDF. Can anyone help me? Thanks!
Chris Haas: As you've seen (hopefully), PDFs have no concept of tables, just text placed at specific locations and lines drawn around them. There is no internal relationship between the text and the lines. This is very important
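Because the text carries only positions, one common workaround is to rebuild rows from the baselines and then look for the header row, rather than comparing drawn borders. The sketch below is a minimal illustration of that idea, not the answer's method: it assumes the fragments have already been extracted as hypothetical (x, y, text) tuples, that y grows upward as in PDF user space, and that the header cells are known in advance. The names group_rows and find_table_start are made up for this example.

```python
def group_rows(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into rows of text, top to bottom.

    Fragments whose baseline (y) is within `tolerance` of the first
    fragment in the current row are treated as the same row.
    """
    rows = []
    # Larger y is higher on the page (assumption), so sort by descending y.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order the cells in each row from left to right and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def find_table_start(rows, header):
    """Return the index of the first row equal to `header`, or None."""
    for index, row in enumerate(rows):
        if row == header:
            return index
    return None


# Hypothetical fragments: a title line followed by a two-column table.
fragments = [
    (50, 700, "Report"),
    (50, 650, "Name"),
    (150, 650.5, "Qty"),
    (50, 630, "Apple"),
    (150, 630, "3"),
]
rows = group_rows(fragments)
start = find_table_start(rows, ["Name", "Qty"])
```

Everything from rows[start] onward can then be written out as CSV. The tolerance value is a guess and would need tuning per document.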

Parsing PDF files in Hadoop Map Reduce

Deadly posted on 2019-12-04 07:27:54
I have to parse PDF files that are stored in HDFS from a MapReduce program in Hadoop. I get each PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. To implement this InputFormat I went through this link. How can these input splits be parsed and converted into text format?
Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class you override the getRecordReader() method. Now each PDF will be received as an individual input split. Then these

PDFTextStripper parsing with wrong encoding

混江龙づ霸主 posted on 2019-12-04 06:18:20
Question
PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
result contains something like
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat SeLe EE rev</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div style="page-break-before:always; page-break-after:always"><div><p>&#0;&#1;&#2;&#3;&#4;&#5;&#6;&#7;&#...
instead of <

Looking for recommendation on how to convert PDF into structured format

你离开我真会死。 posted on 2019-12-03 06:55:07
I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up for auction. I'm wondering if the community has any thoughts on how I can approach parsing this PDF into a structured format, either for insertion into a database or to create a spreadsheet of the properties. Here's an image of what each page represents: And here's a page that lists some properties: I'm comfortable with Python and Ruby so I don't have any issues

What is this (cid:51) in the output of pdf2txt?

≯℡__Kan透↙ posted on 2019-12-01 16:09:45
I'm trying to extract the text from a PDF file, and I need its position, width, height, and font. I have tried many tools, but the most useful and complete solution looks to be PDFMiner, and in this case, more exactly, pdf2txt.py. I have followed the docs and the examples and tried to extract the text "Learn More" from my PDF using this command:
pdf2txt.py -Y normal -t xml -o buttons.xml buttons.pdf
And the output buttons.xml looks like this:
<?xml version="1.0" encoding="utf-8" ?> <pages> <page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0"> <textbox id="0" bbox="164.979,213.240,247.680,235.944">
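As far as I know, PDFMiner writes a token of the form (cid:N) where it has a character ID but no Unicode mapping for it. A small post-processing step can at least locate those tokens in the extracted text, and substitute characters if you have worked out the mapping for that font yourself. The sketch below is only that: find_cids, replace_cids, and the mapping shown are hypothetical and not part of PDFMiner.

```python
import re

# Matches tokens such as "(cid:51)" and captures the numeric character ID.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the character IDs of all (cid:N) tokens, in order of appearance."""
    return [int(number) for number in CID_TOKEN.findall(text)]


def replace_cids(text, cid_to_char):
    """Replace each (cid:N) token that has an entry in `cid_to_char`.

    Tokens without an entry are left untouched so nothing is lost silently.
    """
    def substitute(match):
        cid = int(match.group(1))
        return cid_to_char.get(cid, match.group(0))

    return CID_TOKEN.sub(substitute, text)


# Hypothetical mapping, worked out by hand for one specific font.
sample = "(cid:51)(cid:72) ok (cid:9)"
ids = find_cids(sample)
cleaned = replace_cids(sample, {51: "L", 72: "e"})
```

The mapping is specific to each embedded font, so a table built for one document will usually not carry over to another.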

Difference between iTextSharp 4.1.6 and 5.x versions

巧了我就是萌 posted on 2019-11-28 11:28:16
We are developing a PDF parser to be used with our system. The requirement is that we store all the information in any PDF document and are able to reproduce the document as it was (with minimal changes from the original). We did some googling and found iTextSharp to be the best fit for our purpose. We are developing our project in .NET. As you might have guessed from the title, I am looking for a comparison of specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license. The 5.x versions are AGPL. We

Parsing a PDF with no /Root object using PDFMiner

﹥>﹥吖頭↗ posted on 2019-11-28 10:01:04
I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of them. The IPython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331 break
    332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334 if self.catalog.get('Type') is not LITERAL_CATALOG:
    335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but
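When sorting a large batch into files worth retrying and files that are not PDFs at all, a cheap byte-level check can run before the parser does. The sketch below is a rough heuristic under stated assumptions, not a validator and not how PDFMiner decides: it looks for the %PDF- header near the start, the %%EOF marker near the end, and a /Root key anywhere in the raw bytes. The function name pdf_sanity_report and the 1024-byte windows are choices made for this example.

```python
def pdf_sanity_report(data: bytes) -> dict:
    """Run cheap byte-level checks on the raw contents of a suspected PDF.

    Passing every check does not prove the file is valid, and failing one
    does not prove it is broken; it only helps triage a large batch.
    """
    head = data[:1024]   # the header normally sits at the very start
    tail = data[-1024:]  # the end-of-file marker normally sits at the very end
    return {
        "has_header": b"%PDF-" in head,
        "has_eof_marker": b"%%EOF" in tail,
        "has_root_key": b"/Root" in data,
    }


# Hypothetical inputs: a tiny PDF-like byte string and an HTML error page.
looks_like_pdf = (
    b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)
not_a_pdf = b"<html>not a pdf</html>"

good_report = pdf_sanity_report(looks_like_pdf)
bad_report = pdf_sanity_report(not_a_pdf)
```

For real files, read the bytes with open(path, "rb").read() and log any file whose report contains a False value before handing the rest to the parser.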

Extracting table contents from a collection of PDF files [closed]

江枫思渺然 posted on 2019-11-28 03:12:46
I have a stack of PDFs, potentially hundreds or thousands. They are not all formatted the same, but any of them may have one or more tables with interesting information that I would like to collect into a separate database. Of course, I know I have to write something to do this. Perl is an option for me, or perhaps Java. I don't really care what language so long as it's free (or cheap with a free trial period to ensure it suits my purposes). I'm looking at CAM::Parse (using Strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I guess I do have a
