Extracting the actual in-text title from a PDF

Submitted by 二次信任 on 2019-12-08 06:40:53

Question


There seem to be a lot of questions about extracting a title from a PDF using its metadata. However, the large majority of PDFs do not seem to have a title in their metadata. I found this out when using pyPdf (http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html).

Is there any way to actually retrieve the in-text title from a PDF? I tried exporting to a text file and then searching, but there is no consistent formatting. Is there a way to export the PDF to a document that keeps its formatting, and then check for a font size >= 14?


Answer 1:


This is a very good question. Applications that create PDFs don't seem to do anything useful with the available metadata fields.

Take pdflatex as an example: even when one sets the \title{...} and \author{...} in the preamble, this information is not reflected in the metadata. After a quick search, the solution appears to be to introduce a block in the preamble which is read only by pdflatex [1]:

\pdfinfo
{
  /Title (...)
  /Author (...)
  ...
}

...which is then placed in the relevant metadata fields of the PDF. It is strange that this is necessary, though.

I cannot speak for word processors like Word or Writer. One presumes such metadata fields have to be set manually by the user.

If you did not generate the PDFs yourself, a heuristic approach may be the only option. [2] seems to do something similar to what you want, but I guess it depends on how well published the PDFs are; this tool seems to be oriented toward scientific papers.
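
A font-size heuristic of the kind the question asks about could be sketched in Python roughly as follows. This is only an illustration, not what the tool in [2] does. It assumes the pdfminer.six library (not mentioned in the original thread) for reading glyph sizes, and it assumes that the title is the first run of lines set in the largest font on page 1. The names guess_title, first_page_lines and sample_lines are made up for this example.

```python
from typing import Iterable, List, Tuple

# (text of one line, its dominant font size in points)
Line = Tuple[str, float]


def guess_title(lines: Iterable[Line], min_size: float = 14.0) -> str:
    """Return the first run of lines set in the largest font, or "" if
    no line reaches min_size. Purely a heuristic (assumption: the title
    is the biggest text on the page)."""
    cleaned = [(text.strip(), size) for text, size in lines if text.strip()]
    if not cleaned:
        return ""
    largest = max(size for _, size in cleaned)
    if largest < min_size:
        return ""
    parts: List[str] = []
    for text, size in cleaned:
        if abs(size - largest) < 0.5:
            parts.append(text)   # a title may wrap over several lines
        elif parts:
            break                # stop at the first line after the run
    return " ".join(parts)


def first_page_lines(pdf_path: str) -> List[Line]:
    """Collect (text, size) for each text line on page 1 with pdfminer.six.
    LTChar.size only approximates the font size."""
    # Imported here so guess_title() works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    result: List[Line] = []
    for page in extract_pages(pdf_path, maxpages=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    # Use the most common glyph size as the line's size.
                    dominant = max(set(sizes), key=sizes.count)
                    result.append((line.get_text(), dominant))
    return result


# Hand-made sample data standing in for a real first page.
sample_lines = [
    ("Journal of Examples", 9.0),
    ("Extracting Titles", 18.0),
    ("from PDF Files", 18.0),
    ("A. Author", 11.0),
]
print(guess_title(sample_lines))  # prints: Extracting Titles from PDF Files
```

For a real file one would call guess_title(first_page_lines("paper.pdf")), where "paper.pdf" is a placeholder path. The 14 pt threshold comes from the question. Documents whose journal banner or logo text is larger than the title will defeat this heuristic.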

I hope that is at least some help.

[1] http://wlug.org.nz/PdfLatexNotes
[2] http://www.molspaces.com/d_cb2bib-metadata.php



Source: https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
