Split TextChunk into words

Submitted by 送分小仙女 on 2019-11-28 12:41:27

Question


I've found this example, which splits a PDF document into TextChunks.

Is there either

a) a method to split each TextChunk further into words/characters and still be able to find its location?

or

b) a method to parse a PDF into words/characters instead of chunks and find the location?


Answer 1:


Is there a method to split each TextChunk further into words/characters and still be able to find its location?

You cannot split these TextChunk objects further, because the TextChunk class is merely a helper class transporting very little information; cf. its constructor arguments String str, Vector startLocation, Vector endLocation, float charSpaceWidth. In particular, it carries no information on the individual character widths, nor on the text size and font from which those widths could be derived.

But you can of course change the method RenderText (in which the incoming more complete TextRenderInfo instances are reduced to TextChunk instances):

public virtual void RenderText(TextRenderInfo renderInfo) {
  LineSegment segment = renderInfo.GetBaseline();
  TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
  locationalResult.Add(location);        
}

In particular, you can first split the TextRenderInfo instance into single-character TextRenderInfo instances using its GetCharacterRenderInfos() method, then loop through these and create an individual TextChunk instance for each of them.

You probably don't see that method in the repository where you are looking as iTextSharp has already switched to the new SourceForge versioning infrastructure. Thus, you should switch to the current iTextSharp repository.
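The per-character split via GetCharacterRenderInfos() could be sketched roughly as follows. This is only a sketch, assuming iTextSharp 5.x in a version that already has that method. CharacterChunk and CharacterLocationListener are hypothetical names introduced here for illustration; CharacterChunk merely stands in for the TextChunk helper of the example.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical stand-in for the example's TextChunk helper:
// one entry per character instead of one per text-showing operation.
public class CharacterChunk
{
    public string Text;
    public Vector Start;      // start point of the character's baseline
    public Vector End;        // end point of the character's baseline
    public float SpaceWidth;  // width of a single space in the current font
}

// Hypothetical listener that records every character with its location.
public class CharacterLocationListener : IRenderListener
{
    public readonly List<CharacterChunk> Chunks = new List<CharacterChunk>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Split the incoming render info into single-character render infos
        // and reduce each of them to a chunk of its own.
        foreach (TextRenderInfo charInfo in renderInfo.GetCharacterRenderInfos())
        {
            LineSegment segment = charInfo.GetBaseline();
            Chunks.Add(new CharacterChunk
            {
                Text = charInfo.GetText(),
                Start = segment.GetStartPoint(),
                End = segment.GetEndPoint(),
                SpaceWidth = charInfo.GetSingleSpaceWidth()
            });
        }
    }
}
```

An instance of such a listener could then be handed to PdfReaderContentParser.ProcessContent(pageNumber, listener); afterwards Chunks should hold one entry per character, in the order the characters were drawn, which is not necessarily reading order.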

Is there a method to parse a PDF into words/characters instead of chunks and find the location?

Of course you can implement IRenderListener to create an extraction strategy which does exactly what you need. You can find some discussions of that topic on Stack Overflow for iText and iTextSharp, e.g. "ITextSharp Find coordinates of specific text in PDF", "Get the exact Stringposition in PDF", "Retrieve the respective coordinates of all words on the page with itextsharp", and others.
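Once a listener delivers individual characters with their positions, grouping them into words is plain geometry. The following is a minimal sketch of one possible grouping rule, not something taken from iTextSharp: CharBox, Word, WordGrouper, and the maxGap parameter are all hypothetical. It assumes the characters arrive in reading order on horizontal baselines, and it starts a new word at a whitespace character, at a baseline change, or at a horizontal gap larger than maxGap.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical per-character record: text plus baseline start/end x and baseline y.
public class CharBox
{
    public string Text;
    public float StartX, EndX, Y;

    public CharBox(string text, float startX, float endX, float y)
    {
        Text = text; StartX = startX; EndX = endX; Y = y;
    }
}

// Hypothetical result record: a word and the horizontal extent of its baseline.
public class Word
{
    public string Text;
    public float StartX, EndX, Y;

    public Word(string text, float startX, float endX, float y)
    {
        Text = text; StartX = startX; EndX = endX; Y = y;
    }
}

public static class WordGrouper
{
    // Assumed tolerance for treating two characters as lying on the same baseline.
    private const float BaselineTolerance = 0.01f;

    public static List<Word> Group(IEnumerable<CharBox> chars, float maxGap)
    {
        var words = new List<Word>();
        var text = new StringBuilder();
        float startX = 0, endX = 0, y = 0;

        foreach (CharBox c in chars)
        {
            bool isSpace = string.IsNullOrWhiteSpace(c.Text);

            // Close the current word on whitespace, a baseline change, or a wide gap.
            bool breaks = text.Length > 0 &&
                (isSpace
                 || Math.Abs(c.Y - y) > BaselineTolerance
                 || c.StartX - endX > maxGap);
            if (breaks)
            {
                words.Add(new Word(text.ToString(), startX, endX, y));
                text.Clear();
            }

            if (isSpace) continue;  // whitespace never becomes part of a word

            if (text.Length == 0)
            {
                startX = c.StartX;
                y = c.Y;
            }
            text.Append(c.Text);
            endX = c.EndX;
        }

        // Flush the last word, if any.
        if (text.Length > 0)
        {
            words.Add(new Word(text.ToString(), startX, endX, y));
        }
        return words;
    }
}
```

A reasonable starting value for maxGap might be a fraction of the single space width reported for the text, but that is a tuning choice; rotated text or characters drawn out of reading order would need sorting or a different rule first.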



Source: https://stackoverflow.com/questions/15076076/split-textchunk-into-words
