Text extraction from a PDF using iText7. How to improve its performance?

没有蜡笔的小新 2021-01-03 16:53

Currently, I use this code to extract text from a Rectangle (area).

public static class ReaderExtensions
{
    public static string ExtractText(this PdfPage          

1 Answer
  • 2021-01-03 17:04

    As already mentioned in a comment, I was surprised to see that the iText 7 LocationTextExtractionStrategy no longer contains anything akin to the iText 5 LocationTextExtractionStrategy method GetResultantText(TextChunkFilter). That method would have allowed you to parse the page once and then extract text from arbitrary page areas out of the box.

    But it is possible to bring that feature back. One option would be to add it to a copy of LocationTextExtractionStrategy, but that would make for a rather long answer. So I used another option: I use the existing LocationTextExtractionStrategy and, merely for the duration of the GetResultantText call, manipulate the strategy's underlying list of text chunks. Instead of supporting a generic TextChunkFilter interface, I restricted filtering to the criterion at hand: filtering by rectangular area.

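    using System;
    using System.Collections.Generic;
    using System.Reflection;
    using iText.Kernel.Geom;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    public static class ReaderExtensions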
    {
        public static string[] ExtractText(this PdfPage page, params Rectangle[] rects)
        {
            var textEventListener = new LocationTextExtractionStrategy();
            PdfTextExtractor.GetTextFromPage(page, textEventListener);
            string[] result = new string[rects.Length];
            for (int i = 0; i < result.Length; i++)
            {
                result[i] = textEventListener.GetResultantText(rects[i]);
            }
            return result;
        }
    
        public static String GetResultantText(this LocationTextExtractionStrategy strategy, Rectangle rect)
        {
            IList<TextChunk> locationalResult = (IList<TextChunk>)locationalResultField.GetValue(strategy);
            List<TextChunk> nonMatching = new List<TextChunk>();
            foreach (TextChunk chunk in locationalResult)
            {
                ITextChunkLocation location = chunk.GetLocation();
                Vector start = location.GetStartLocation();
                Vector end = location.GetEndLocation();
                if (!rect.IntersectsLine(start.Get(Vector.I1), start.Get(Vector.I2), end.Get(Vector.I1), end.Get(Vector.I2)))
                {
                    nonMatching.Add(chunk);
                }
            }
            nonMatching.ForEach(c => locationalResult.Remove(c));
            try
            {
                return strategy.GetResultantText();
            }
            finally
            {
                nonMatching.ForEach(c => locationalResult.Add(c));
            }
        }
    
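        // Reflection handle for the strategy's private chunk list. The field name
        // "locationalResult" is an iText 7 implementation detail and may change between versions.
        private static readonly FieldInfo locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);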
    }
    

    The central piece is the GetResultantText extension method. It takes a LocationTextExtractionStrategy that already contains the information from a page, restricts this information to the text chunks whose start-to-end line intersects a given rectangle, extracts the text, and then restores the strategy to its previous state. This requires some reflection; I hope that is OK for you.
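
    For illustration, a call to the ExtractText extension above might look like the following minimal sketch. The file name "input.pdf", the page number, and the rectangle coordinates are placeholders assumed for this example, not values from the question. The point is that the page is parsed only once, no matter how many rectangles are passed in.

    ```csharp
    using System;
    using iText.Kernel.Geom;
    using iText.Kernel.Pdf;

    public static class ExtractDemo
    {
        public static void Main(string[] args)
        {
            // Placeholder path; replace with a real document.
            PdfDocument pdf = new PdfDocument(new PdfReader("input.pdf"));
            try
            {
                PdfPage page = pdf.GetPage(1);

                // Rectangle(x, y, width, height) in PDF user space units.
                // These coordinates are made up for the example.
                Rectangle header = new Rectangle(36, 750, 523, 60);
                Rectangle body = new Rectangle(36, 100, 523, 640);

                // One parse of the page, one result string per rectangle,
                // returned in the same order as the arguments.
                string[] texts = page.ExtractText(header, body);
                Console.WriteLine(texts[0]);
                Console.WriteLine(texts[1]);
            }
            finally
            {
                pdf.Close();
            }
        }
    }
    ```

    Because the strategy is restored after each GetResultantText call, the rectangles may overlap without affecting each other's results.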
