Read specific value based on label name from PDF in C#

梦谈多话 2021-01-19 11:07

I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file to get its text. From that text I want to read the specific value that belongs to a specific label name.

3 Answers
  •  甜味超标
    2021-01-19 11:51

    I once helped a friend extract a similar value from a PDF invoice generated by Excel. For this answer I created an Excel invoice, printed it as a PDF file, and zipped both for download so you can test with them.

    Next, I use a free, open-source library called PDFClown. Here is the NuGet package for it.

    So far so good. What I did is scan the whole PDF document (an invoice can be one page or several pages) and add each piece of text content to a list of strings.

    The next step is to find the index of the text that marks the invoice value, which I will call the tag or label. The invoice number itself could sit anywhere in the list, for example at the 10th element; in our case it is at index 1.

    Since I do not have your PDF file, I improvised and used a unique tag called "INVOICE" (any other name would do). The invoice number comes right after the invoice tag, so I find the index of the "INVOICE" tag and add 1 to it. This picks up the invoice text 0005 and returns it as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form you need.
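
    The lookup described above can be tried without any PDF at all. The following is only a minimal sketch: `LabelLookup` and `GetValueAfterLabel` are hypothetical names, not part of PDFClown, and the trimming and case-insensitive comparison are assumptions about how messy the extracted text may be.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class LabelLookup
    {
        // Returns the item that directly follows the first item equal to `label`,
        // or null when the label is missing or is the last item in the list.
        public static string GetValueAfterLabel(IReadOnlyList<string> items, string label)
        {
            for (var i = 0; i < items.Count - 1; i++)
            {
                // Trim and ignore case: an assumption, since extracted text
                // may carry stray whitespace or different casing.
                if (string.Equals(items[i]?.Trim(), label, StringComparison.OrdinalIgnoreCase))
                    return items[i + 1]?.Trim();
            }
            return null;
        }
    }
    ```

    For a list such as `{ "INVOICE", "0005", "PAYMENT DUE BY:", "4/19/2019" }`, calling `LabelLookup.GetValueAfterLabel(items, "INVOICE")` gives back the text `0005`, which can then be passed to `int.TryParse`.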

    You will need to adjust it a bit to make it fit your PDF file 100%.
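
    One adjustment that may be needed: in some PDFs the label and its value come out as a single piece of text, for example "INVOICE: 0005", instead of two separate list elements. That layout is an assumption and is not what my test file produces. A rough sketch for that case, with the hypothetical names `InlineLabel` and `GetInlineValue`:

    ```csharp
    using System.Text.RegularExpressions;

    public static class InlineLabel
    {
        // Reads the value from a chunk that starts with the label, optionally
        // followed by ':' or '#'. Returns null when the chunk does not match.
        public static string GetInlineValue(string chunk, string label)
        {
            var pattern = "^\\s*" + Regex.Escape(label) + "\\s*[:#]?\\s*(.+?)\\s*$";
            var match = Regex.Match(chunk, pattern, RegexOptions.IgnoreCase);
            return match.Success ? match.Groups[1].Value : null;
        }
    }
    ```

    A chunk that contains only the label gives null here, so the caller can fall back to reading the next element in the list.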

    Here are my test files, Excel and PDF, zipped. Download them for your test.

    Here is the code:

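    using System;
    using System.Collections.Generic;
    // PDF Clown namespaces for File, ContentScanner, ShowText, Text and ContainerObject.
    using org.pdfclown.documents.contents;
    using org.pdfclown.documents.contents.objects;
    using org.pdfclown.files;
    
    public class InvoiceTextExtraction
    {
        private List<string> _contentList;
    
        public void GetValueFromPdf()
        {
            _contentList = new List<string>();
            CreatePdfContent(@"C:\temp\Invoice1.pdf");
    
            // The invoice number is the element right after the "INVOICE" tag.
            var tagIndex = _contentList.FindIndex(e => e == "INVOICE");
            if (tagIndex < 0 || tagIndex + 1 >= _contentList.Count)
            {
                Console.WriteLine("INVOICE tag not found");
                return;
            }
    
            // "0005" parses to 5; value stays 0 if the text is not a number.
            int.TryParse(_contentList[tagIndex + 1], out var value);
            Console.WriteLine(value);
        }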
    
    
        public void CreatePdfContent(string filePath)
        {
            using (var file = new File(filePath))
            {
                var document = file.Document;
    
                foreach (var page in document.Pages)
                {
                    Extract(new ContentScanner(page));
                }
            }
        }
    
        // Walks the page content recursively and collects every decoded text chunk.
        private void Extract(ContentScanner level)
        {
            if (level == null)
                return;
    
            while (level.MoveNext())
            {
                var content = level.Current;
                switch (content)
                {
                    case ShowText text:
                    {
                        var font = level.State.Font;
                        _contentList.Add(font.Decode(text.Text));
                        break;
                    }
                    case Text _:
                    case ContainerObject _:
                        Extract(level.ChildLevel);
                        break;
                }
            }
        }
    }
    

    Text extracted from the PDF file. The scan returns the following elements:

    INVOICE
    0005
    
    PAYMENT DUE BY:
    4/19/2019
    .etc
    .
    .
    .
    Tax
    USD TOTAL
    171857
    18 september 2019
    

    and here is the result

    5
    

    The code is inspired by this link.
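
    Other labels in the extracted list can be read the same way. As a hedged sketch, this reads the date that follows "PAYMENT DUE BY:" in the list above. `DueDateReader` and `GetDueDate` are hypothetical names, and the month/day/year format is an assumption based on the sample value 4/19/2019.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    public static class DueDateReader
    {
        // Returns the date that follows the "PAYMENT DUE BY:" label,
        // or null when the label is missing or the text is not a date.
        public static DateTime? GetDueDate(List<string> items)
        {
            var index = items.FindIndex(e => e == "PAYMENT DUE BY:");
            if (index < 0 || index + 1 >= items.Count)
                return null;

            // Assumes month/day/year, as in the sample "4/19/2019".
            return DateTime.TryParseExact(items[index + 1].Trim(), "M/d/yyyy",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out var date)
                ? date
                : (DateTime?)null;
        }
    }
    ```

    If your invoices use another date format, change the format string to match it.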
