How can I get text formatting with iTextSharp?

Submitted by 可紊 on 2019-11-25 19:34:50
Chris Haas

Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handles some of the basic tokens. Unfortunately, it doesn't handle color information, but according to @Mark Storer it might not be too hard to implement yourself.

BEGIN EDIT

I started work on implementing color information. See my blog post here for more details. (Sorry for the bad formatting, heading off to dinner now.)

END EDIT

The code below combines several questions and answers here, including this one to get the font height (although it's not exact), as well as another one (that for the life of me I can't seem to find anymore) that shows how to detect faux bold.

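PostscriptFontName returns some additional characters in front of the font name, such as the NJNSWD+ in the sample output below. This comes from font subsetting: when only a subset of a font is embedded, the font name in the PDF is prefixed with a tag of six uppercase letters followed by a plus sign.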
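If the prefix gets in the way (for example, when the name is used as a CSS font-family), it can be stripped before use. The following is only a minimal sketch in Java; the class and method names (FontNames, stripSubsetPrefix) are made up for illustration and are not part of iText. It assumes the prefix always has the form of six uppercase letters followed by a plus sign.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or iTextSharp.
final class FontNames {
    // Assumed prefix shape: exactly six uppercase ASCII letters, then '+'.
    private static final Pattern SUBSET_PREFIX = Pattern.compile("^[A-Z]{6}\\+");

    private FontNames() { }

    /** Returns the font name without a leading subset tag, or the name unchanged if there is none. */
    static String stripSubsetPrefix(String postscriptFontName) {
        if (postscriptFontName == null) {
            return null;
        }
        return SUBSET_PREFIX.matcher(postscriptFontName).replaceFirst("");
    }
}
```

With the sample font from this answer, FontNames.stripSubsetPrefix("NJNSWD+Papyrus-Regular") gives Papyrus-Regular. Names without a matching prefix are returned as they are.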

Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.

Screenshot of sample PDF

Sample text extracted as HTML

    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Hello </span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">w</span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201">o</span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">rl</span>
    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">d </span>
    <br />
    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Test </span>

Code

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Windows.Forms;
    using iTextSharp.text.pdf.parser;
    using iTextSharp.text.pdf;

    namespace WindowsFormsApplication2
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }

            private void Form1_Load(object sender, EventArgs e)
            {
                PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
                TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
                string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
                Console.WriteLine(F);

                this.Close();
            }

            public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
            {
                //HTML buffer
                private StringBuilder result = new StringBuilder();

                //Store last used properties
                private Vector lastBaseLine;
                private string lastFont;
                private float lastFontSize;

                //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
                private enum TextRenderMode
                {
                    FillText = 0,
                    StrokeText = 1,
                    FillThenStrokeText = 2,
                    Invisible = 3,
                    FillTextAndAddToPathForClipping = 4,
                    StrokeTextAndAddToPathForClipping = 5,
                    FillThenStrokeTextAndAddToPathForClipping = 6,
                    AddTextToPaddForClipping = 7
                }

                public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
                {
                    string curFont = renderInfo.GetFont().PostscriptFontName;
                    //Check if faux bold is used
                    if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                    {
                        curFont += "-Bold";
                    }

                    //This code assumes that if the baseline changes then we're on a newline
                    Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                    Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                    iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                    Single curFontSize = rect.Height;

                    //See if something has changed, either the baseline, the font or the font size
                    if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                    {
                        //if we've put down at least one span tag close it
                        if ((this.lastBaseLine != null))
                        {
                            this.result.AppendLine("</span>");
                        }
                        //If the baseline has changed then insert a line break
                        if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                        {
                            this.result.AppendLine("<br />");
                        }
                        //Create an HTML tag with appropriate styles
                        this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
                    }

                    //Append the current text
                    this.result.Append(renderInfo.GetText());

                    //Set currently used properties
                    this.lastBaseLine = curBaseline;
                    this.lastFontSize = curFontSize;
                    this.lastFont = curFont;
                }

                public string GetResultantText()
                {
                    //If we wrote anything then we'll always have a missing closing tag so close it here
                    if (result.Length > 0)
                    {
                        result.Append("</span>");
                    }
                    return result.ToString();
                }

                //Not needed
                public void BeginTextBlock() { }
                public void EndTextBlock() { }
                public void RenderImage(ImageRenderInfo renderInfo) { }
            }
        }
    }
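One thing to watch in the strategy above: baselines are compared with exact floating-point equality, so two chunks whose y-coordinates differ by a tiny amount would be treated as separate lines. If that turns out to be a problem, a tolerance-based comparison is one possible alternative. The sketch below is in Java and is not part of the original answer; the class name Baselines and the default tolerance of 0.5 are assumptions chosen for illustration, and the right tolerance depends on the documents being processed.

```java
// Hypothetical helper, not part of iText or iTextSharp.
final class Baselines {
    // Assumed tolerance in PDF user-space units; tune it for your documents.
    static final float DEFAULT_TOLERANCE = 0.5f;

    private Baselines() { }

    /** Returns true when two baseline y-coordinates are close enough to count as the same line. */
    static boolean sameLine(float y1, float y2, float tolerance) {
        return Math.abs(y1 - y2) <= tolerance;
    }

    /** Same check using the assumed default tolerance. */
    static boolean sameLine(float y1, float y2) {
        return sameLine(y1, y2, DEFAULT_TOLERANCE);
    }
}
```

In the strategy, the y-coordinate comparison of the current and last baseline would then become a call like !Baselines.sameLine(currentY, lastY). Note that a tolerance that is too large would also merge slightly raised or lowered text into the same line.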

I converted @Chris's code to Java, in case anyone is looking for it:

    import com.itextpdf.text.Rectangle;
    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class TextWithFontExtractionStategy implements TextExtractionStrategy {
        //HTML buffer
        private StringBuilder result = new StringBuilder();

        //Store last used properties
        private Vector lastBaseLine;
        private String lastFont;
        private float lastFontSize;

        //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
        private enum TextRenderMode {
            FillText(0),
            StrokeText(1),
            FillThenStrokeText(2),
            Invisible(3),
            FillTextAndAddToPathForClipping(4),
            StrokeTextAndAddToPathForClipping(5),
            FillThenStrokeTextAndAddToPathForClipping(6),
            AddTextToPaddForClipping(7);

            private int value;

            TextRenderMode(int value) {
                this.value = value;
            }

            public int getValue() {
                return value;
            }
        }

        public void renderText(TextRenderInfo renderInfo)
        {
            String curFont = renderInfo.getFont().getPostscriptFontName();
            //Check if faux bold is used
            if ((renderInfo.getTextRenderMode() == TextRenderMode.FillThenStrokeText.getValue()))
            {
                curFont += "-Bold";
            }

            //This code assumes that if the baseline changes then we're on a newline
            Vector curBaseline = renderInfo.getBaseline().getStartPoint();
            Vector topRight = renderInfo.getAscentLine().getEndPoint();
            Rectangle rect = new Rectangle(curBaseline.get(Vector.I1), curBaseline.get(Vector.I2), topRight.get(Vector.I1), topRight.get(Vector.I2));
            float curFontSize = rect.getHeight();

            //See if something has changed, either the baseline, the font or the font size
            //Strings are compared with equals(): != would compare object references in Java
            if ((this.lastBaseLine == null) || (curBaseline.get(Vector.I2) != lastBaseLine.get(Vector.I2)) || (curFontSize != lastFontSize) || (!curFont.equals(lastFont)))
            {
                //if we've put down at least one span tag close it
                if ((this.lastBaseLine != null))
                {
                    this.result.append("</span>").append("\n");
                }
                //If the baseline has changed then insert a line break
                if ((this.lastBaseLine != null) && curBaseline.get(Vector.I2) != lastBaseLine.get(Vector.I2))
                {
                    this.result.append("<br />").append("\n");
                }
                //Create an HTML tag with appropriate styles
                //(String.format uses %s placeholders; the C# style braces must not be kept around them)
                this.result.append(String.format("<span style=\"font-family:%s;font-size:%s\">", curFont, curFontSize));
            }

            //Append the current text
            //Note: unlike the C# version, this adds a space after every chunk; drop the " " if chunks split mid-word
            this.result.append(renderInfo.getText() + " ");

            //Set currently used properties
            this.lastBaseLine = curBaseline;
            this.lastFontSize = curFontSize;
            this.lastFont = curFont;
        }

        public String getResultantText()
        {
            //If we wrote anything then we'll always have a missing closing tag so close it here
            if (result.length() > 0)
            {
                result.append("</span>");
            }
            return result.toString();
        }

        //Not needed
        public void beginTextBlock() { }
        public void endTextBlock() { }
        public void renderImage(ImageRenderInfo renderInfo) { }
    }
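Both versions write the extracted text into the HTML as-is, so a PDF containing characters such as < or & would produce broken markup. Escaping the text before appending it is one way around that. The following is only a sketch with made-up names (HtmlSpans, escapeHtml, openSpan); it is not part of iText and was not in the original answers.

```java
// Hypothetical helper, not part of iText.
final class HtmlSpans {
    private HtmlSpans() { }

    /** Escapes the characters that are significant in HTML text and in double-quoted attribute values. */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the opening span tag in the same shape as the strategy's output, with the font name escaped. */
    static String openSpan(String font, float fontSize) {
        return String.format("<span style=\"font-family:%s;font-size:%s\">", escapeHtml(font), fontSize);
    }
}
```

In the strategy, the text would then be appended as HtmlSpans.escapeHtml(renderInfo.getText()), and the span tag could be built with HtmlSpans.openSpan(curFont, curFontSize).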