Split TextChunk into words

Question

I've found this example, which splits a PDF document into TextChunks.

Is there either

a) a method to split each TextChunk further into words/characters and still be able to find its location?

b) a method to parse a PDF into words/characters instead of chunks and find the location?

Answer 1:

Is there a method to split each TextChunk further into words/characters and still be able to find its location?

You cannot split these TextChunk objects any further. The TextChunk class is merely a helper that carries very little information; cf. its constructor arguments String str, Vector startLocation, Vector endLocation, float charSpaceWidth. In particular, it holds neither the individual character widths nor the text size and font from which those widths could be derived.

But you can of course change the method RenderText (in which the incoming more complete TextRenderInfo instances are reduced to TextChunk instances):

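public virtual void RenderText(TextRenderInfo renderInfo) {
  // The baseline runs from the start of the first glyph to the end of the last glyph.
  LineSegment segment = renderInfo.GetBaseline();
  // Only the text, the two end points and the space width survive in the TextChunk.
  TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
  locationalResult.Add(location);
}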

In particular you can first split the TextRenderInfo instance using its GetCharacterRenderInfos() method into single character TextRenderInfo instances, loop through these and create individual TextChunk instances for each of them.
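The per-character variant described above might look roughly like the following. This is only a sketch: it is meant to replace RenderText inside your own copy of the extraction strategy, and it assumes the TextChunk class and the locationalResult list from the example in the question are available there unchanged.

```csharp
// Sketch only: drop-in replacement for RenderText in your own copy of the strategy.
// TextChunk and locationalResult are assumed to exist as in the question's example.
public virtual void RenderText(TextRenderInfo renderInfo) {
  // Split the incoming chunk into one TextRenderInfo per character.
  foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos()) {
    // Each character has its own baseline, and therefore its own start and end point.
    LineSegment segment = charInfo.GetBaseline();
    TextChunk location = new TextChunk(charInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), charInfo.GetSingleSpaceWidth());
    locationalResult.Add(location);
  }
}
```

With this change every entry in locationalResult describes a single character, so any code that later joins chunks into lines will see many more, much shorter chunks.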

If you do not see that method in the repository you are looking at, that is probably because iTextSharp has already switched to the new SourceForge versioning infrastructure. In that case, you should switch to the current iTextSharp repository.

Is there a method to parse a PDF into words/characters instead of chunks and find the location?

Of course you can implement IRenderListener to create an extraction strategy which does exactly what you need. You can find some discussions of that topic on Stack Overflow for iText and iTextSharp, e.g. "ITextSharp Find coordinates of specific text in PDF", "Get the exact Stringposition in PDF", "Retrieve the respective coordinates of all words on the page with itextsharp", and others.
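As a rough illustration of such a listener, the sketch below collects words together with the baseline start of their first character and the baseline end of their last character. The names WordLocation and WordLocationListener are hypothetical and not part of iTextSharp. The sketch also assumes that words are separated by actual whitespace characters in the content stream, which many PDFs do not guarantee; a more robust version would additionally compare the gap between consecutive characters with the space width.

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical holder for one word and the baseline span it covers.
public class WordLocation {
  public string Text;
  public Vector Start; // baseline start of the first character
  public Vector End;   // baseline end of the last character
}

// Hypothetical listener, not part of iTextSharp.
// Assumption: words are delimited by whitespace characters in the content stream.
public class WordLocationListener : IRenderListener {
  public readonly List<WordLocation> Words = new List<WordLocation>();
  private readonly StringBuilder current = new StringBuilder();
  private Vector start;
  private Vector end;

  public void BeginTextBlock() { }

  // A text block ends here, so close any word that is still open.
  public void EndTextBlock() { Flush(); }

  public void RenderImage(ImageRenderInfo renderInfo) { }

  public void RenderText(TextRenderInfo renderInfo) {
    foreach (TextRenderInfo ch in renderInfo.GetCharacterRenderInfos()) {
      string text = ch.GetText();
      // Whitespace closes the current word and is not stored itself.
      if (string.IsNullOrWhiteSpace(text)) {
        Flush();
        continue;
      }
      LineSegment baseline = ch.GetBaseline();
      if (current.Length == 0) {
        start = baseline.GetStartPoint();
      }
      end = baseline.GetEndPoint();
      current.Append(text);
    }
  }

  // Stores the word collected so far, if any, and resets the buffer.
  private void Flush() {
    if (current.Length == 0) {
      return;
    }
    Words.Add(new WordLocation { Text = current.ToString(), Start = start, End = end });
    current.Clear();
  }
}
```

A listener like this could be run over a page with PdfReaderContentParser.ProcessContent(pageNumber, listener), after which Words holds the collected entries in content-stream order, which is not necessarily reading order.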

Source: https://stackoverflow.com/questions/15076076/split-textchunk-into-words

Tags

itextsharp