I'm trying to get all words and their location coordinates from a PDF file. I've succeeded using the Acrobat API on .NET. Now, I'm trying to get the same res
You'll want to use the com.itextpdf.text.pdf.parser package classes. They track the current transformation, color, font, and so forth.
Sadly, these classes weren't covered in the new book, so you're left with the JavaDoc, and mentally converting it all from Java to C#, which isn't much of a stretch.
So you'll want to plug a LocationTextExtractionStrategy into a PdfTextExtractor.
This will give you the strings and their locations as they are presented in the PDF. It will be up to you to interpret those strings as words (and paragraphs if need be, ouch).
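To make the "interpret that as words" part concrete, here is a rough sketch in plain Java (easy enough to port to C#). It is not iText code: `Chunk`, `toWords`, and `gapTolerance` are names I made up for illustration, and it assumes you have already collected each string with its start/end x and baseline y, that the chunks arrive in reading order, and that no chunk contains an inner space. The idea is simply to glue neighbouring chunks together when they sit on the same baseline with next to no gap between them.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGrouper {

    /** Hypothetical holder for one extracted run of text; not an iText class. */
    public static final class Chunk {
        public final String text;
        public final float startX, endX, baselineY;

        public Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    /**
     * Joins consecutive chunks into words. Two chunks belong to the same word when
     * they share a baseline (within half a unit) and the horizontal gap between
     * them is at most gapTolerance. Whitespace-only chunks always end a word.
     */
    public static List<Chunk> toWords(List<Chunk> chunks, float gapTolerance) {
        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                // An explicit space closes whatever word we were building.
                if (current != null) words.add(current);
                current = null;
                continue;
            }
            boolean sameWord = current != null
                    && Math.abs(c.baselineY - current.baselineY) < 0.5f
                    && Math.abs(c.startX - current.endX) <= gapTolerance;
            if (sameWord) {
                // Extend the word: keep its start, take the new chunk's end.
                current = new Chunk(current.text + c.text, current.startX, c.endX, current.baselineY);
            } else {
                if (current != null) words.add(current);
                current = c;
            }
        }
        if (current != null) words.add(current);
        return words;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Hel", 10f, 25f, 700f));   // e.g. drawn in the regular font
        chunks.add(new Chunk("lo", 25f, 34f, 700f));    // e.g. drawn in bold, same word
        chunks.add(new Chunk("world", 40f, 65f, 700f)); // 6-unit gap, so a new word
        for (Chunk w : toWords(chunks, 1.0f)) {
            System.out.println(w.text + " @ (" + w.startX + ", " + w.baselineY + ")");
        }
    }
}
```

A sensible `gapTolerance` depends on the font size and the producer of the PDF, so expect to tune it (a fraction of the width of a space character is a common starting point).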
Keep in mind that PDF doesn't know anything about text layout. Every character can be placed individually. If someone were so inclined (and they'd have to be a few tacos short of a combo platter to do so) they could draw all the 'a's on a given page, then all the 'b's, and so forth.
More realistically, someone might draw all the text on the page that uses FontA, then everything for FontB, and so on. This can produce more efficient content streams. Remember that italic and bold (and bold italic) are all separate fonts. If someone marks only part of a word as bold (or whatever), then that logical word has to be split across at least two drawing commands.
But lots of folks just write out their text into PDF in logical order... which is Very Handy for folks who are stuck trying to parse it, but you Must Not Expect It, because you will invariably run into some oddball PDF that doesn't.
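If you do end up holding chunks in drawing order rather than reading order, one defensive move is to sort them by position before doing anything else. The sketch below is plain Java with a made-up `Chunk` type (again, not an iText class). It assumes the default PDF coordinate system, where the origin is at the bottom-left and a larger y is higher on the page, and it buckets baselines by rounding to the nearest unit, which is a simplification that will not cope with rotated text or superscripts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrder {

    /** Hypothetical holder for one extracted run of text; not an iText class. */
    public static final class Chunk {
        public final String text;
        public final float x, baselineY;

        public Chunk(String text, float x, float baselineY) {
            this.text = text;
            this.x = x;
            this.baselineY = baselineY;
        }
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right. The input list is not modified. */
    public static List<Chunk> inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -Math.round(c.baselineY)) // higher baseline first
                .thenComparingDouble(c -> c.x));                      // then left to right
        return sorted;
    }

    public static void main(String[] args) {
        // Drawing order: everything in the regular font first, then the bold word.
        List<Chunk> drawn = new ArrayList<>();
        drawn.add(new Chunk("The", 10f, 700f));
        drawn.add(new Chunk("fox", 70f, 700f));
        drawn.add(new Chunk("jumps", 10f, 686f));
        drawn.add(new Chunk("quick", 32f, 700f)); // bold, drawn last

        StringBuilder line = new StringBuilder();
        for (Chunk c : inReadingOrder(drawn)) {
            if (line.length() > 0) line.append(' ');
            line.append(c.text);
        }
        System.out.println(line); // prints "The quick fox jumps"
    }
}
```

Sorting first and grouping into words afterwards means the grouping step never has to care how the producer chose to order its drawing commands.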