Question
This is my first time reading a PDF.
I searched around, found some options for doing that in C#, and chose iTextSharp.
So far I've done just the basics, such as reading the file and getting the content, without issues.
PdfReader reader = new PdfReader(iPDF.Text);
for (int x = 2; x <= reader.NumberOfPages; x++)
{
    iResult.Text = Encoding.UTF8.GetString(reader.GetPageContent(x));
    break;
}
As you can see, it is very basic code that just reads the 2nd page of the PDF into a text file, but I've noticed a lot of code in the text file and I am a bit lost on how to parse only the data I need.
What I was wondering is whether there is a pattern or something that will help me get only that part of the PDF. Looking at the plain text file, it seems there are things that define the begin/end of lines, colors, etc.
Some of the extracted data:
1 0 0 1 0 612 cm 0 0 0 rg
0 0 0 RG
28.35 -28.35 735.3 -526.95 re
W
n
0 0 0.502 sc
28.35 -65.5 735.3 -12.75 re
f
28.35 -543.9 735.3 -11.4 re
f
q
92.25 -28.35 560.9 -18 re
W
n
1 1 1 sc
92.25 -28.35 560.9 -18 re
f
BT
1 0 0 1 95.25 -39.1 Tm
0 0 0 sc
/i 10.75 Tf
(Name - Live) T
NOTE: the above is only part of the initial data from page 2, to illustrate what I meant.
Is that data in some sort of tabular format, and how could I extract only that?
Answer 1:
Try using PdfTextExtractor, as it will give you more human-readable text out of the PDF:
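// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
var sb = new StringBuilder();
// Starts at page 2, as in the question.
for (int page = 2; page <= reader.NumberOfPages; page++)
{
    // Use a fresh strategy per page so text from earlier pages is not carried over.
    var strategy = new SimpleTextExtractionStrategy();
    // Append instead of assigning, otherwise only the last page would be kept.
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
}
iResult.Text = sb.ToString();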
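On the raw data in the question: what GetPageContent returns is the page's content stream, a sequence of operands followed by an operator. In your dump, cm changes the coordinate transform, rg/RG/sc set colors, re adds a rectangle to the current path, W and n clip to it, f fills it, q saves the graphics state, BT opens a text object, Tm positions the text, Tf selects the font and size, and a string in parentheses followed by Tj is the text that gets shown (the last line of your dump looks cut off, and the sketch below assumes it was Tj). If you need positions along with the text, for example to rebuild table columns, something along these lines could pick out the Tm/Tj pairs. ContentStreamText and TextRun are hypothetical helper names, not part of iTextSharp, and the sketch assumes literal (...) strings shown with Tj, positioning via Tm only, and a font whose bytes map directly to characters. Real PDFs often use TJ arrays, hex strings, or custom encodings, which is why PdfTextExtractor is usually the safer route.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper type (not part of iTextSharp): one shown string and its position.
public sealed class TextRun
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }

    public TextRun(double x, double y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

// Hypothetical helper (not part of iTextSharp): a rough content stream scanner.
public static class ContentStreamText
{
    // A token is either a literal string "( ... )" with backslash escapes,
    // or any run of characters without whitespace or parentheses.
    private static readonly Regex Token =
        new Regex(@"\((?:\\.|[^\\()])*\)|[^\s()]+");

    public static List<TextRun> Extract(string content)
    {
        var runs = new List<TextRun>();
        var operands = new List<string>();
        double x = 0, y = 0;

        foreach (Match m in Token.Matches(content))
        {
            string tok = m.Value;
            if (IsOperand(tok))
            {
                operands.Add(tok);
                continue;
            }

            // tok is an operator; operands holds the arguments that preceded it.
            if (tok == "Tm" && operands.Count == 6)
            {
                // "a b c d e f Tm": e and f are the x/y translation of the text matrix.
                x = double.Parse(operands[4], CultureInfo.InvariantCulture);
                y = double.Parse(operands[5], CultureInfo.InvariantCulture);
            }
            else if (tok == "Tj" && operands.Count == 1 && operands[0][0] == '(')
            {
                // Strip the surrounding parentheses and undo \( \) \\ escapes.
                string body = operands[0].Substring(1, operands[0].Length - 2);
                runs.Add(new TextRun(x, y, Unescape(body)));
            }

            operands.Clear();
        }

        return runs;
    }

    // Operands are literal strings, names such as /i, or numbers; anything else is an operator.
    private static bool IsOperand(string tok)
    {
        return tok[0] == '(' || tok[0] == '/' ||
               double.TryParse(tok, NumberStyles.Float, CultureInfo.InvariantCulture, out _);
    }

    private static string Unescape(string s)
    {
        return Regex.Replace(s, @"\\([()\\])", "$1");
    }
}
```

You would call it as ContentStreamText.Extract(Encoding.UTF8.GetString(reader.GetPageContent(2))). Grouping the resulting runs by Y and sorting each group by X is one way to rebuild rows and columns, though how well that works depends on how the PDF was generated.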
Source: https://stackoverflow.com/questions/12471395/reading-pdf-file