Parsing a PDF with no /Root object using PDFMiner

日久生厌 2020-12-16 17:19

I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a

5 answers
  • 2020-12-16 17:34

    Interesting problem. I did some research.

    This is the function that parses the PDF (from PDFMiner's source code):

    def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return
    

    If the problem were with EOF, a different exception would be raised. Here is another function from the source:

    def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return
    

    From the wiki (PDF specs): A PDF file consists primarily of objects, of which there are eight types:

    Boolean values, representing true or false
    Numbers
    Strings
    Names
    Arrays, ordered collections of objects
    Dictionaries, collections of objects indexed by Names
    Streams, usually containing large amounts of data
    The null object
    

    Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
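    To see whether a failing file even contains the pieces described above, a quick byte-level scan can help before stepping into the parser. The following is a minimal sketch, not part of PDFMiner: `inspect_pdf_bytes` and `SAMPLE` are hypothetical names, and the regular expressions are a rough heuristic rather than a real PDF parser.

```python
import re

def inspect_pdf_bytes(data):
    """Return a few quick structural facts about raw PDF bytes (heuristic only)."""
    return {
        # The header is expected near the start of the file.
        "has_header": b"%PDF-" in data[:1024],
        # The end-of-file marker is expected near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # One 'startxref <offset>' per (incremental) save of the document.
        "startxref_count": len(re.findall(rb"startxref\s+\d+", data)),
        # Indirect references of the form '/Root <objnum> <gen> R'.
        "root_refs": re.findall(rb"/Root\s+(\d+)\s+(\d+)\s+R", data),
    }

# A tiny hand-written stand-in for a PDF, used only to exercise the function.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

print(inspect_pdf_bytes(SAMPLE))
# A truncated copy loses the trailer, so no /Root reference is found.
print(inspect_pdf_bytes(SAMPLE[:60]))
```

    For a real file, read it with open(path, 'rb').read() and pass the bytes in. An empty "root_refs" list or a missing end-of-file marker suggests the file is truncated or was not read as raw bytes.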

    I think the problem is that your "damaged PDF" has several 'root elements' on the page.

    Possible solution:

    You can download the sources and add print statements wherever the xref objects are retrieved and wherever the parser tries to parse them. That should let you trace the full sequence of steps leading up to the error.

    PS: I think it is some kind of bug in the product.

  • 2020-12-16 17:39

    I have had this same problem on Ubuntu. I have a very simple solution: just print the PDF file as a PDF. If you are on Ubuntu:

    1. Open the PDF file using the (Ubuntu) document viewer.

    2. Go to File.

    3. Go to Print.

    4. Choose to print to a file and check the "pdf" option.

    If you want to automate the process, use a script that prints all your PDF files automatically. A Linux script like this also works:

    for f in *.pdfx
    do
    lowriter --headless --convert-to pdf "$f"
    done
    

    Note that I gave the original (problematic) PDF files the extension .pdfx.

  • 2020-12-16 17:39

    An answer above is right. This error appears only on Windows, and the workaround is to use fp = open(path, 'rb') instead of with open(path, 'rb').

  • 2020-12-16 17:45

    The solution with slate is to open the file in 'rb' (read binary) mode.

    Because slate depends on PDFMiner and I had the same problem, this should solve your problem.

    import slate
    # Raw string, so the backslashes in the Windows path are not read as escapes.
    fp = open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb')
    doc = slate.PDF(fp)
    print(doc)
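    Why binary mode matters can be shown without PDFMiner at all. The xref table stores byte offsets, and text mode may rewrite line endings, so every offset after a CRLF shifts. The sketch below assumes Python 3, where text mode translates newlines on every platform (the answers here describe Python 2 on Windows, which behaves similarly); `payload` is a hypothetical stand-in for a PDF, not a valid document.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: binary content that contains CRLF pairs.
payload = b"%PDF-1.4\r\n1 0 obj\r\n<< >>\r\nendobj\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: the bytes arrive exactly as they are stored on disk.
    with open(path, "rb") as fp:
        raw = fp.read()
    # Text mode: newline translation turns each CRLF into a single LF.
    with open(path, "r", encoding="latin-1") as fp:
        text = fp.read()
finally:
    os.remove(path)

binary_offset = raw.index(b"endobj")
text_offset = text.index("endobj")
print("offset of 'endobj' in binary mode:", binary_offset)
print("offset of 'endobj' in text mode:  ", text_offset)
```

    The two offsets differ by one for each CRLF that precedes the keyword, which is enough to make the parser look for the xref table and trailer in the wrong place.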
    
  • 2020-12-16 18:00

    I got this error as well and kept trying fp = open('example','rb')

    However, I still got the error the OP described. What I found is that I had a bug in my code where the PDF was still held open by another function.
    So make sure you don't have the PDF open elsewhere in your program as well.
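    One way to rule out a lingering file handle is to read the whole file once, close it immediately, and hand the parser an in-memory stream instead. This is a minimal sketch: `read_pdf_into_memory` is a hypothetical helper, and it is an assumption that your parser accepts any seekable binary file-like object in place of a real file.

```python
import io
import os
import tempfile

def read_pdf_into_memory(path):
    """Read the file in binary mode, close it, and return an in-memory stream."""
    with open(path, "rb") as fp:
        data = fp.read()
    # The on-disk handle is closed here; the returned stream is independent of it.
    return io.BytesIO(data)

# Demo with a throwaway file (hypothetical content, not a valid PDF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n%%EOF\n")
    demo_path = tmp.name

stream = read_pdf_into_memory(demo_path)
os.remove(demo_path)  # succeeds because nothing holds the file open any more
header = stream.read(8)
stream.seek(0)  # rewind so a parser would start from the beginning
print(header)
```

    The trade-off is memory: the whole file is held in RAM, which is usually fine for ordinary documents but worth reconsidering for very large PDFs.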
