See https://github.com/WolfgangFahl/pdfindexer
for a Java solution that uses PDFBox and Apache Lucene. It extracts the text of the PDF files page by page,
indexes these text pages, and creates an HTML index file that links to the matching pages in the source PDFs by using the corresponding PDF open parameter.
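
To illustrate only the last step (the HTML index that links into the PDFs), here is a minimal sketch using nothing but the Java standard library. It is not code from pdfindexer: the class `PdfIndexSketch`, the `Hit` type and the file names are made up for illustration, and I am assuming the open parameter in question is `#page=N` (the page parameter from Adobe's PDF open parameters, which PDF viewers may or may not honor). The search hits are hard-coded here; in the real tool they would come from the Lucene index built over the PDFBox text.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PdfIndexSketch {

    /** Hypothetical hit: a PDF file and the 1-based page where a keyword was found. */
    static final class Hit {
        final String pdfFile;
        final int page;

        Hit(String pdfFile, int page) {
            this.pdfFile = pdfFile;
            this.page = page;
        }
    }

    /** Builds a link with the assumed "page" open parameter, e.g. manual.pdf#page=3. */
    static String pageLink(String pdfFile, int page) {
        return pdfFile + "#page=" + page;
    }

    /**
     * Renders keyword -> hits as an HTML list with one link per hit.
     * No HTML escaping is done; keywords and file names are assumed to be plain text.
     */
    static String toHtml(Map<String, List<Hit>> index) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, List<Hit>> entry : index.entrySet()) {
            html.append("  <li>").append(entry.getKey()).append(":");
            for (Hit hit : entry.getValue()) {
                html.append(" <a href=\"")
                    .append(pageLink(hit.pdfFile, hit.page))
                    .append("\">")
                    .append(hit.pdfFile).append(" p. ").append(hit.page)
                    .append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Hard-coded sample hits standing in for real search results.
        Map<String, List<Hit>> index = new LinkedHashMap<>();
        index.put("Lucene", Arrays.asList(new Hit("manual.pdf", 3), new Hit("guide.pdf", 12)));
        System.out.print(toHtml(index));
    }
}
```

Opening the generated file in a browser and clicking a link should jump to the given page, provided the viewer supports the page parameter.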