We receive a large amount of data from our clients as PDF files in varying formats [layout-wise]. These files are typically report output and are typically properly annotated [
pdftohtml -xml
although pdftoipe seems to give more detailed output!
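
For what it's worth, here is a rough Python sketch of how the `pdftohtml -xml` output could be turned into rows of text. It assumes the XML looks the way poppler's pdftohtml normally writes it (`<page>` elements containing `<text>` elements with integer `top` and `left` attributes). The function names are made up for illustration, and grouping fragments by identical `top` value is a simplification that real report layouts may well break.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict


def pdf_to_xml(pdf_path):
    """Run `pdftohtml -xml` and return the raw XML bytes.

    -i skips image extraction, -stdout writes the result to stdout
    instead of creating files next to the PDF.
    """
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout


def extract_fragments(xml_data):
    """Return (page, top, left, text) tuples for every non-empty <text> element."""
    root = ET.fromstring(xml_data)
    fragments = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            # itertext() also picks up text nested inside <b>, <i> or <a>.
            text = "".join(el.itertext()).strip()
            if text:
                fragments.append(
                    (page_no, int(el.get("top")), int(el.get("left")), text)
                )
    return fragments


def group_into_rows(fragments):
    """Group fragments sharing the same page and `top` value into rows,
    ordered top to bottom, with cells ordered left to right.

    Assumption: fragments on one visual line have exactly the same `top`.
    """
    rows = defaultdict(list)
    for page_no, top, left, text in fragments:
        rows[(page_no, top)].append((left, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]
```

Something like `group_into_rows(extract_fragments(pdf_to_xml("report.pdf")))` (with `report.pdf` as a placeholder path) would then give a list of rows to match against whatever layout a given client uses. In practice a small tolerance on `top` is probably needed, since fragments on the same line do not always share an identical value.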