Read specific value based on label name from PDF in C#

梦谈多话 2021-01-19 11:07

I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file to get its text. From that text, I want to read the specific value that belongs to a specific label name.

3 answers

甜味超标 (original poster)

2021-01-19 11:51
I once helped a friend extract a similar value from a PDF invoice generated from Excel. For this answer I created an Excel invoice, printed it to a PDF file, and zipped both for download so you can test with them.

Next, I use a free, open-source library called PDFClown. Here is the NuGet package for it.

What I did is scan the whole PDF document (an invoice can be one page or several) and add each piece of text content to a list of strings.

The next step is to find the index of the element that identifies the invoice value, which I will call the tag or label. The invoice number could sit at, say, the 10th element of the list; in our case it is at index 1.

Since I do not have your PDF file, I improvised and added a unique tag named "INVOICE" (any other name would do). The invoice number comes right after the INVOICE tag, so I find the index of the "INVOICE" tag and add 1 to it. This picks the invoice text, 0005 in this case, and returns it as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form you need.
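The "find the tag, take the next element" step can be pulled out into a small reusable helper. This is only a sketch: `LabelLookup` and `TryGetValueAfterLabel` are hypothetical names of mine, not part of PDFClown, and it assumes the label and its value arrive as separate, adjacent elements of the list.

```csharp
using System;
using System.Collections.Generic;

public static class LabelLookup
{
    // Hypothetical helper: finds the first element equal to `label`
    // (ignoring case and surrounding whitespace) and returns the element
    // that follows it. Returns false when the label is missing or is the
    // last element, so the caller never reads a wrong index.
    public static bool TryGetValueAfterLabel(
        IReadOnlyList<string> chunks, string label, out string value)
    {
        for (var i = 0; i < chunks.Count - 1; i++)
        {
            if (string.Equals(chunks[i].Trim(), label,
                    StringComparison.OrdinalIgnoreCase))
            {
                value = chunks[i + 1].Trim();
                return true;
            }
        }

        value = null;
        return false;
    }
}
```

With the scanned list from the invoice, `LabelLookup.TryGetValueAfterLabel(list, "INVOICE", out var raw)` would give `raw == "0005"`, which `int.TryParse` then turns into 5.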

You will need to play with it a bit to fit it 100% to your PDF file.
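One adjustment you may run into: depending on how the PDF was produced, the label and its value can end up in the same text element (for example "INVOICE: 0005") instead of two adjacent ones. I have not seen your file, so this is an assumption; the sketch below handles both layouts. `InlineLabelLookup` and `FindValue` are hypothetical names, not library API.

```csharp
using System;
using System.Collections.Generic;

public static class InlineLabelLookup
{
    // Hypothetical helper: returns the value for `label`, whether it sits
    // in the same element ("INVOICE: 0005") or in the next one
    // ("INVOICE", "0005"). Returns null when nothing is found.
    // Caveat: a prefix match also hits longer labels such as
    // "INVOICE DATE", so pick labels that are unique in your document.
    public static string FindValue(IReadOnlyList<string> chunks, string label)
    {
        for (var i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i].Trim();
            if (!chunk.StartsWith(label, StringComparison.OrdinalIgnoreCase))
                continue;

            // Text left in the same element after the label and separators.
            var rest = chunk.Substring(label.Length).TrimStart(':', ' ', '\t');
            if (rest.Length > 0)
                return rest;

            // Otherwise fall back to the following element, if there is one.
            if (i + 1 < chunks.Count)
                return chunks[i + 1].Trim();
        }

        return null;
    }
}
```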

Here are my test files, Excel and PDF, zipped. Download them for your testing.

Here is the code:
```
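using System;
using System.Collections.Generic;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using File = org.pdfclown.files.File; // keeps File unambiguous if System.IO is imported too

public class InvoiceTextExtraction
{
    // Text elements in the order the content scanner encounters them.
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");

        // The value is the element right after the "INVOICE" label.
        var labelIndex = _contentList.FindIndex(e => e == "INVOICE");
        if (labelIndex < 0 || labelIndex + 1 >= _contentList.Count)
        {
            Console.WriteLine("Label not found");
            return;
        }

        int.TryParse(_contentList[labelIndex + 1], out var value);
        Console.WriteLine(value);
    }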


    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;

            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}
```
This is the text extracted from the PDF file. The scan returns the following elements:
```
INVOICE
0005

PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019
```
And here is the result:
```
5
```
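The same lookup works for the other labels in the scanned list. As an illustration, here is a sketch that reads the date after "PAYMENT DUE BY:". It assumes the date is written month/day/year, as "4/19/2019" in my test invoice suggests; `DueDateLookup` and `FindDueDate` are hypothetical names, and your PDF may use a different format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class DueDateLookup
{
    // Hypothetical helper: returns the date that follows the
    // "PAYMENT DUE BY:" label, or null if the label is missing
    // or the next element is not a month/day/year date.
    public static DateTime? FindDueDate(List<string> chunks)
    {
        var index = chunks.FindIndex(e => e.Trim() == "PAYMENT DUE BY:");
        if (index < 0 || index + 1 >= chunks.Count)
            return null;

        // Invariant culture so the result does not depend on the
        // regional settings of the machine running the code.
        return DateTime.TryParseExact(
            chunks[index + 1].Trim(), "M/d/yyyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out var date)
            ? date
            : (DateTime?)null;
    }
}
```

Parsing with an explicit format is a deliberate choice here: a value such as 4/5/2019 would otherwise be read differently on machines with day-first regional settings.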
The code is inspired by this link.