Question
There seem to be a lot of questions about extracting a title from a PDF (using its metadata). However, the large majority of PDFs do not seem to have a title in their metadata. I found this out when using pyPdf (http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html).
Is there any way to actually retrieve the in-text title from a PDF? I tried exporting to a text file and then searching, but there is no consistent formatting. Is there any way to export the PDF to a document that keeps its formatting, then check for a font size >= 14?
Answer 1:
This is a very good question. Applications that create PDFs don't seem to do anything useful with the available metadata fields.
Take pdflatex as an example: even when one sets the \title{...} and \author{...} in the preamble, this information is not reflected in the metadata. After a quick search, the solution appears to be to introduce a block in the preamble which is read only by pdflatex [1]:
\pdfinfo
{
/Title{...}
/Author{...}
...
}
...which is then placed in the relevant metadata fields of the PDF. It is strange that this is necessary, though.
I cannot speak for word processors like Word or Writer. One presumes such metadata fields have to be set manually by the user.
Perhaps a heuristic approach is your only option if the PDFs are not generated by you. [2] seems to do something similar to what you want, but I guess it depends on how well published the PDFs are: this tool seems to be oriented toward scientific papers.
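To make the heuristic idea concrete, here is a minimal sketch of the font-size check suggested in the question. It assumes you have already pulled text runs with their font sizes out of the PDF using some layout-aware extractor; that step is not shown. The `TextSpan` type, the `guess_title` function, and the 14 pt default are all hypothetical names and values chosen for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextSpan:
    """One run of text with its font size (hypothetical input format)."""
    text: str
    font_size: float  # in points
    page: int         # 0-based page index


def guess_title(spans: List[TextSpan],
                min_size: float = 14.0,
                max_page: int = 0) -> Optional[str]:
    """Guess the title as the largest-font text on the first page(s).

    Only spans on pages <= max_page with font_size >= min_size are
    considered. Spans sharing the largest size are joined in input
    order, since a title often wraps across several lines.
    """
    candidates = [
        s for s in spans
        if s.page <= max_page and s.font_size >= min_size and s.text.strip()
    ]
    if not candidates:
        return None

    largest = max(s.font_size for s in candidates)
    # Small tolerance, because extracted sizes are often not exact.
    parts = [s.text.strip() for s in candidates
             if abs(s.font_size - largest) < 0.1]
    return " ".join(parts)


if __name__ == "__main__":
    example = [
        TextSpan("Journal of Examples", 9.0, 0),
        TextSpan("Extracting Titles", 18.0, 0),
        TextSpan("from PDF Files", 18.0, 0),
        TextSpan("Abstract", 12.0, 0),
    ]
    print(guess_title(example))
```

This will misfire on documents where a journal banner or a drop cap is set larger than the title, so treat the result as a guess to be checked rather than a reliable answer.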
I hope that is at least some help.
[1] http://wlug.org.nz/PdfLatexNotes
[2] http://www.molspaces.com/d_cb2bib-metadata.php
Source: https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf