Extract text from PDF with respect to formatting (font size, type, etc.)
Question

Is it possible to extract text from a PDF file with respect to a specific font, font size, font color, etc.? I prefer Perl, Python, or *nix command-line utilities. My goal is to extract all headlines from a PDF file so that I have a nice index of the articles contained in a single PDF.

Answer 1:

Text together with font, font size, and position (but not color, as far as I checked) can be obtained from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from MuPDF's mudraw with the -tt option. Then parse the XML-like output with e.g.
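
One possible way to do that parsing is sketched below in Python. It assumes the txtwrite output with -dTextFormat=0 consists of span elements carrying font and size attributes, each wrapping char elements with the character in a c attribute; that layout, the SAMPLE text, and the helper names parse_spans and headlines are assumptions made for illustration, so check them against what your Ghostscript version actually prints. Regular expressions are used instead of an XML parser because the output is only XML-like and may not be well-formed.

```python
import html
import re

# Hypothetical sample imitating txtwrite -dTextFormat=0 output.
# The real attribute names and layout may differ between versions.
SAMPLE = '''<page>
<span bbox="72 700 120 700" font="Helvetica-Bold" size="18.0000">
<char bbox="72 700 84 700" c="N"/>
<char bbox="84 700 96 700" c="e"/>
<char bbox="96 700 108 700" c="w"/>
<char bbox="108 700 120 700" c="s"/>
</span>
<span bbox="72 680 84 680" font="Times-Roman" size="10.0000">
<char bbox="72 680 78 680" c="o"/>
<char bbox="78 680 84 680" c="k"/>
</span>
</page>
'''

SPAN_RE = re.compile(r'<span\b([^>]*)>(.*?)</span>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
CHAR_RE = re.compile(r'<char\b[^>]*\bc="([^"]*)"')


def parse_spans(text):
    """Return a list of (font, size, text) tuples, one per span."""
    spans = []
    for attrs, body in SPAN_RE.findall(text):
        attr = dict(ATTR_RE.findall(attrs))
        # Characters may be entity-escaped (e.g. &amp;), so unescape them.
        chars = "".join(html.unescape(c) for c in CHAR_RE.findall(body))
        spans.append((attr.get("font", ""), float(attr.get("size", "0")), chars))
    return spans


def headlines(spans, min_size):
    """Keep the text of spans whose font size is at least min_size."""
    return [t.strip() for _, size, t in spans if size >= min_size and t.strip()]


if __name__ == "__main__":
    for line in headlines(parse_spans(SAMPLE), min_size=14.0):
        print(line)
```

To run this on a real document, replace SAMPLE with the contents of the file written by Ghostscript, for example one produced with something like gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dTextFormat=0 -sOutputFile=out.txt input.pdf (the file names here are placeholders). The size threshold of 14.0 is arbitrary; filtering on the font name instead may work better if headlines share the body text size but use a bold face.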