I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI t
To properly extract formatted text a library/utility should:
I am not really an expert in the products you mentioned in your question, so take the following conclusions with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details at a deeper level, and they may not be as well tested at maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later, and fixing those will help it excel at text extraction.
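To illustrate what "maintaining graphics state" involves, here is a minimal Python sketch, not taken from any of the tools discussed. It assumes the content stream has already been tokenized into (operator, operands) pairs. `q`, `Q` and `cm` are real PDF operators (save state, restore state, concatenate a matrix onto the current transformation matrix). `mark` is a hypothetical stand-in for "something is drawn at this point", and the function names are made up for this example. Real text positioning also involves the text matrix and font metrics, which are left out here.

```python
# Sketch only: tracks the current transformation matrix (CTM) across q/Q/cm.
# A PDF matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f).

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm: m is applied first, then the existing CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform(ctm, x, y):
    """Map a point from user space through the CTM."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def mark_positions(ops):
    """Walk (operator, operands) pairs and collect transformed 'mark' points.

    'mark' is a hypothetical operator used only in this sketch.
    """
    stack = []          # saved graphics states (only the CTM in this sketch)
    ctm = IDENTITY
    positions = []
    for op, args in ops:
        if op == "q":           # save graphics state
            stack.append(ctm)
        elif op == "Q":         # restore graphics state
            if stack:
                ctm = stack.pop()
        elif op == "cm":        # concatenate matrix onto the CTM
            ctm = concat(tuple(args), ctm)
        elif op == "mark":      # hypothetical: record where a point lands
            positions.append(transform(ctm, *args))
    return positions


sample_ops = [
    ("q", []),
    ("cm", [1, 0, 0, 1, 100, 200]),   # translate by (100, 200)
    ("mark", [0, 0]),
    ("q", []),
    ("cm", [2, 0, 0, 2, 0, 0]),       # scale by 2 inside the translation
    ("mark", [10, 10]),
    ("Q", []),
    ("mark", [10, 10]),
    ("Q", []),
    ("mark", [10, 10]),
]
print(mark_positions(sample_ops))
```

A tool that forgets to restore the state on `Q`, or concatenates matrices in the wrong order, would report the same point at a different position. That is the kind of error a renderer notices immediately and a pure text extractor might not.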