How to replace some tags and remove others with line break in j2ee

Posted by 落爺英雄遲暮 on 2019-12-13 00:06:34

Question


The main problem was to get the content of an HTML file and remove all tags.
I have read these questions before:

1,2,3

After reading all of them I decided to use jsoup, and it really helped me. I also figured out how to keep line breaks and replace <p> tags with line breaks.
Now my problem is that I have an HTML file with an <H1> tag that contains the title of the whole content. I want to keep the title followed by a line break, but with jsoup the first paragraph comes immediately after the title without any line break. Can anyone help me, please?
The HTML code I have:

<DIV class="story-headline">
<H1 class="story-title">NFL 2014 predictions</H1>
</DIV>
<H3 class="story-deck">Our picks for playoff teams, surprises, Super Bowl</H3>
<P class="small lighttext">
<SPAN class="delimited">Posted: Sep 02, 2014 1:30 PM ET</SPAN>
<SPAN>Last Updated: Sep 04, 2014 10:27 AM ET</SPAN>
</P>

and the output is:

NFL 2014 predictionsOur picks for playoff teams, surprises, Super Bowl

Posted: Sep 02, 2014 1:30 PM ETLast Updated: Sep 04, 2014 10:27 AM ET  

I want it to be:

NFL 2014 predictions  
Our picks for playoff teams, surprises, Super Bowl  
Posted: Sep 02, 2014 1:30 PM ET  
Last Updated: Sep 04, 2014 10:27 AM ET 

Answer 1:


You should configure the OutputSettings of the target Document to turn off pretty-printing, so try the following:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

public class HtmlWithLineBreaks
{

  public String getCleanHtml(Document document)
  {
    // Disable pretty-printing so the html() call preserves line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Whitelist.none() strips every tag and keeps only the text nodes
    return Jsoup.clean(document.html(),
        "",
        Whitelist.none(),
        new Document.OutputSettings().prettyPrint(false));
  }

  public static void main(String... args)
  {
    File input = new File("/path/to/some/input.html"); // Replace with the path to your own HTML file
    Document document;
    try
    {
      document = Jsoup.parse(input, "UTF-8");
      String printOut = new HtmlWithLineBreaks().getCleanHtml(document);
      System.out.println(printOut);
    } catch (IOException e)
    {
      e.printStackTrace();
    }
  }

}

Optionally, if you are not satisfied with that output, you can insert a custom line break at the end of the <div> that wraps your <h1>:

public String getCleanHtml(Document document)
{
  document.outputSettings(new Document.OutputSettings().prettyPrint(false));
  document.select("h1").parents().select("div").append("\n"); // Append a line break inside each <div> found via the <h1>'s ancestors.
  return Jsoup.clean(document.html(),
      "",
      Whitelist.none(),
      new Document.OutputSettings().prettyPrint(false));
}
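
As a side illustration of the underlying idea (some closing tags become line breaks, every other tag is simply removed), here is a minimal sketch that does not use jsoup at all. It swaps in plain regular expressions from the Java standard library, which is not the approach of the answer above and is only reasonable for simple, well-formed fragments like the one in the question; it does not decode entities such as &amp;amp; and will misbehave on scripts, comments, or attributes containing ">". The class name TagsToLineBreaks, the method toPlainText, and the choice of which tags count as line-breaking are all assumptions made for this example.

```java
import java.util.regex.Pattern;

public class TagsToLineBreaks {

    // Assumption for this sketch: closing tags of headings, <p> and <span> end a line.
    private static final Pattern LINE_BREAKING_CLOSE_TAGS =
            Pattern.compile("(?i)</(h[1-6]|p|span)\\s*>");

    // Any tag that is still present afterwards is dropped without a replacement.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toPlainText(String html) {
        // Step 1: turn the selected closing tags into line breaks.
        String marked = LINE_BREAKING_CLOSE_TAGS.matcher(html).replaceAll("\n");
        // Step 2: remove all remaining tags.
        String stripped = ANY_TAG.matcher(marked).replaceAll("");
        // Step 3: trim each line and skip the empty ones left behind by removed tags.
        StringBuilder out = new StringBuilder();
        for (String line : stripped.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.append(trimmed).append('\n');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<DIV class=\"story-headline\">",
                "<H1 class=\"story-title\">NFL 2014 predictions</H1>",
                "</DIV>",
                "<H3 class=\"story-deck\">Our picks for playoff teams, surprises, Super Bowl</H3>",
                "<P class=\"small lighttext\">",
                "<SPAN class=\"delimited\">Posted: Sep 02, 2014 1:30 PM ET</SPAN>",
                "<SPAN>Last Updated: Sep 04, 2014 10:27 AM ET</SPAN>",
                "</P>");
        System.out.println(toPlainText(html));
    }
}
```

Run against the HTML from the question, this prints the title, the deck, and the two date lines on four separate lines. For real-world pages a proper parser such as jsoup remains the safer choice.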


Source: https://stackoverflow.com/questions/25674345/how-to-replace-some-tags-and-remove-others-with-line-break-in-j2ee
