How to find text in a PDF image?

误落风尘 2021-02-11 01:27

I am developing a C# application in which I am converting a PDF document to an image and then rendering that image in a custom viewer.

I've come across a bit of a bric

2 answers
  •  不知归路
    2021-02-11 01:48

    You can run Tesseract OCR in console mode to recognize the text in the image.

    I don't know of an SDK that does this for PDF directly.

    But if you want the coordinates and value of every word, you can use the simple code below (thanks to nguyenq for the hOCR hint):

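    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;

    public void Recognize(Bitmap bitmap)
    {
        bitmap.Save("temp.png", ImageFormat.Png);
        // Run the Tesseract CLI with the "hocr" config so the output includes word bounding boxes.
        var startInfo = new ProcessStartInfo("tesseract.exe", "temp.png temp hocr");
        startInfo.WindowStyle = ProcessWindowStyle.Hidden;
        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }

        // Depending on the Tesseract version, the output file is temp.html or temp.hocr.
        var outputPath = File.Exists("temp.hocr") ? "temp.hocr" : "temp.html";
        var words = GetWords(File.ReadAllText(outputPath));

        // Further actions with words
    }

    public Dictionary<Rectangle, string> GetWords(string tesseractHtml)
    {
        var xml = XDocument.Parse(tesseractHtml);

        var rectsWords = new Dictionary<Rectangle, string>();

        // Match on the local name, since the hOCR output may declare the XHTML namespace.
        // Word spans are tagged "ocr_word" or "ocrx_word", depending on the Tesseract version.
        var ocrWords = xml.Descendants()
            .Where(element => element.Name.LocalName == "span")
            .Where(element => (string)element.Attribute("class") == "ocr_word"
                           || (string)element.Attribute("class") == "ocrx_word")
            .ToList();
        foreach (var ocrWord in ocrWords)
        {
            // The title looks like "bbox left top right bottom", optionally followed by "; ..." properties.
            var strs = ((string)ocrWord.Attribute("title") ?? "").Split(';')[0].Split(' ');
            if (strs.Length < 5) continue;
            int left = int.Parse(strs[1]);
            int top = int.Parse(strs[2]);
            int width = int.Parse(strs[3]) - left + 1;
            int height = int.Parse(strs[4]) - top + 1;
            rectsWords.Add(new Rectangle(left, top, width, height), ocrWord.Value);
        }

        return rectsWords;
    }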
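
Once the words and their rectangles are available, locating a given piece of text on the image comes down to a lookup over that dictionary. The following is only a minimal sketch of one way to do it: `WordSearch` and `FindWord` are hypothetical names that are not part of the answer above, and the choice to ignore case and trailing punctuation is an assumption about what counts as a match.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

public static class WordSearch
{
    // Hypothetical helper: returns the bounding box of every recognized word
    // equal to `query`, ignoring case and leading/trailing punctuation.
    public static List<Rectangle> FindWord(Dictionary<Rectangle, string> rectsWords, string query)
    {
        return rectsWords
            .Where(pair => string.Equals(
                pair.Value.Trim().Trim('.', ',', ';', ':', '!', '?'),
                query,
                StringComparison.OrdinalIgnoreCase))
            .Select(pair => pair.Key)
            .ToList();
    }
}
```

The returned rectangles are in the pixel coordinates of the image that was passed to Tesseract, so a viewer that scales the image would need to scale them by the same factor before highlighting.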
    
