Question
Is it possible to extract text from a PDF file based on specific formatting (font, font size, font color, etc.)? I would prefer Perl, Python, or *nix command-line utilities. My goal is to extract all headlines from a PDF file so that I get a nice index of the articles contained in a single PDF.
Answer 1:
You can get the text together with its font, font size, and position (but not color, as far as I checked) from Ghostscript's txtwrite device (try the -dTextFormat=0 or -dTextFormat=1 option), as well as from mudraw (MuPDF) with the -tt option. Then parse the XML-like output with, e.g., Perl.
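Since Python was mentioned as acceptable, here is a minimal sketch of the parsing step. It assumes the XML-like output consists of `<span ... font="..." size="...">` elements wrapping `<char ... c="X"/>` elements; that layout is an assumption, so check it against what your Ghostscript or MuPDF version actually prints. The names `extract_spans` and `headlines` and the size threshold are hypothetical, chosen only for illustration. Regular expressions are used instead of an XML parser because the output is not guaranteed to be well-formed XML.

```python
import re
from html import unescape

# Output to parse could be produced with something like:
#   gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -dTextFormat=0 \
#      -sOutputFile=out.txt input.pdf
# The element and attribute names below are assumed, not guaranteed.
SPAN_RE = re.compile(r'<span\b([^>]*)>(.*?)</span>', re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
CHAR_RE = re.compile(r'<char\b[^>]*?\bc="([^"]*)"')


def extract_spans(markup):
    """Return a list of (font, size, text) tuples, one per <span>."""
    spans = []
    for attrs_text, body in SPAN_RE.findall(markup):
        attrs = dict(ATTR_RE.findall(attrs_text))
        # Join the per-character 'c' attributes and decode entities.
        text = unescape("".join(CHAR_RE.findall(body)))
        try:
            size = float(attrs.get("size", "0"))
        except ValueError:
            size = 0.0
        spans.append((attrs.get("font", ""), size, text))
    return spans


def headlines(markup, min_size):
    """Keep the text of spans whose font size is at least min_size."""
    return [
        text.strip()
        for _font, size, text in extract_spans(markup)
        if size >= min_size and text.strip()
    ]


if __name__ == "__main__":
    # Hand-written sample in the assumed format, not real tool output.
    sample = (
        '<page>\n'
        '<span bbox="72 720 120 720" font="Helvetica-Bold" size="24.0000">\n'
        '<char bbox="72 720 89 720" c="N"/>\n'
        '<char bbox="89 720 104 720" c="e"/>\n'
        '<char bbox="104 720 120 720" c="w"/>\n'
        '</span>\n'
        '<span bbox="72 700 84 700" font="Times-Roman" size="10.0000">\n'
        '<char bbox="72 700 78 700" c="a"/>\n'
        '<char bbox="78 700 84 700" c="b"/>\n'
        '</span>\n'
        '</page>\n'
    )
    print(headlines(sample, 18.0))
```

To filter by font name instead of size, compare the first element of each tuple returned by `extract_spans`. Consecutive spans that share a font and size may need to be merged to rebuild a headline that the tool split into several pieces.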
Source: https://stackoverflow.com/questions/19386711/extract-text-from-pdf-in-respect-to-formatting-font-size-type-etc