Reading PDF file?

Question

This will be my first time reading a PDF.

I was searching around and found some options for doing that with C#, and chose to use iTextSharp.

So far I've done just the basics, like reading the file and getting the content, without issues.

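// Requires: using System.Text; using iTextSharp.text.pdf;
PdfReader reader = new PdfReader(iPDF.Text);
// Page numbers are 1-based; only page 2 is read here.
if (reader.NumberOfPages >= 2)
{
    // GetPageContent returns the page's raw content stream, not plain text.
    iResult.Text = Encoding.UTF8.GetString(reader.GetPageContent(2));
}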

As you can see, this is very basic code that just reads the 2nd page of the PDF into a text file, but I've noticed a lot of code in the text file and I am a bit lost on how to parse only the data I need.

What I was wondering is whether there is a pattern or something that would help me get only that part of the PDF. Looking at the plain text file, it seems there are things that define the beginning/end of lines, colors, etc.

Some of the extracted data:

1 0 0 1 0 612 cm 0 0 0 rg
0 0 0 RG
28.35 -28.35 735.3 -526.95 re
W
n
0 0 0.502 sc
28.35 -65.5 735.3 -12.75 re
f
28.35 -543.9 735.3 -11.4 re
f
q
92.25 -28.35 560.9 -18 re
W
n
1 1 1 sc
92.25 -28.35 560.9 -18 re
f
BT
1 0 0 1 95.25 -39.1 Tm
0 0 0 sc
/i 10.75 Tf
(Name - Live) T

NOTE: the above is only part of the initial data from page 2, to illustrate what I meant earlier.

Is that data in some sort of tabular layout, and how could I extract only that?

Answer 1:

Try using PdfTextExtractor (in the iTextSharp.text.pdf.parser namespace), as it gives you more human-readable text out of the PDF than the raw content stream:

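// Requires: using System.Text; using iTextSharp.text.pdf.parser;
var sb = new StringBuilder();
for (int page = 2; page <= reader.NumberOfPages; page++)
{
    // Use a fresh strategy per page so text from earlier pages is not carried over.
    var strategy = new SimpleTextExtractionStrategy();
    // Append each page; assigning iResult.Text inside the loop would keep only the last page.
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
}
iResult.Text = sb.ToString();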
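
As for the "pattern" asked about in the question: the raw dump is the page's content stream, a sequence of operands followed by operators. As far as I know from the PDF specification, re draws a rectangle, f fills it, W and n set up a clipping path, q saves the graphics state, cm changes the coordinate matrix, rg/RG/sc set colors, and text lives between BT and ET, where Tf selects the font, Tm positions the text, and Tj shows a literal string written in parentheses. Below is a minimal sketch of picking out those strings, assuming an uncompressed stream, literal (non-hex) strings and a simple font encoding. The names ContentStreamText and ExtractShownStrings and the sample fragment are hypothetical, made up for illustration; they are not part of iTextSharp and not taken from the question's PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamText
{
    // Matches a literal string operand followed by the Tj (show text) operator,
    // e.g. "(Hello) Tj". Backslash escapes such as \( and \) are allowed inside.
    // NOT handled: nested unescaped parentheses, hex strings <...>, TJ arrays,
    // octal escapes, and fonts with custom encodings.
    private static readonly Regex ShowText =
        new Regex(@"\((?<s>(?:\\.|[^\\()])*)\)\s*Tj\b", RegexOptions.Compiled);

    // Hypothetical helper: returns every string shown with Tj, in stream order.
    public static List<string> ExtractShownStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            // Undo the backslash escapes for parentheses and the backslash itself.
            string s = Regex.Replace(m.Groups["s"].Value, @"\\([()\\])", "$1");
            result.Add(s);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical content stream fragment, invented for this example.
        string sample =
            "BT\n1 0 0 1 95.25 -39.1 Tm\n/F1 10.75 Tf\n(Name \\(Live\\)) Tj\nET\n" +
            "0 0 0.502 rg\n28.35 -65.5 735.3 -12.75 re\nf";

        foreach (string s in ExtractShownStrings(sample))
        {
            Console.WriteLine(s); // prints: Name (Live)
        }
    }
}
```

This only illustrates why the dump looks the way it does. Real PDFs often use hex strings, TJ arrays with kerning adjustments, or fonts with custom encodings, none of which this sketch handles, so for actual extraction PdfTextExtractor remains the better choice.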

Source: https://stackoverflow.com/questions/12471395/reading-pdf-file

Tags

pdf

.net-4.0

itextsharp

extract