Convert HTML to plain text in Java


I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be displayed not only in the case of <


                      
6 Answers

        
                         				            
            
           
            
                              
                
              
              
                
臣服心动 (2021-02-20 09:08)
              
            
            
                                                                       
jsoup is not compatible with FreeMarker (or any other custom, non-HTML) tags. I consider the following the cleanest solution for converting HTML to plain text:

http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726 
My code: 

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

                                                                        
                                                        
            
            
              
                
梦毁少年i (2021-02-20 09:17)
              
            
            
                                                                       
I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
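
A minimal sketch of the SAX approach, assuming the input is already well-formed XHTML (for example after a JTidy pass, which is not shown here). The class name `SaxTextExtractor`, the method `toPlainText` and the choice of `br`, `p` and `div` as line-breaking tags are illustrative assumptions, not part of the original answer.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects character data and starts a new line
// whenever one of the chosen tags opens.
final class SaxTextExtractor {
    // Assumption: these are the tags that should produce a line break.
    private static final Set<String> LINE_BREAK_TAGS =
            new HashSet<>(Arrays.asList("br", "p", "div"));

    static String toPlainText(String xhtml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Skip the newline at the very start of the output.
                boolean breaks = LINE_BREAK_TAGS.contains(qName.toLowerCase(Locale.ROOT));
                if (breaks && out.length() > 0) {
                    out.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks; append them in order.
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
        return out.toString();
    }
}
```

For example, `SaxTextExtractor.toPlainText("<div>one<br/>two</div>")` returns `one`, a newline, then `two`. Whitespace between tags in the source is copied through unchanged, and named entities such as `&nbsp;` will only resolve if the parser can see a DTD that declares them.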
                                                                        
                                                        
            
            
              
                
遇见更好的自我 (2021-02-20 09:18)
              
            
            
                                                                       
You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.

Hope it is helpful.
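
One way the XSLT idea could look with the JDK's built-in `javax.xml.transform` API is sketched below. It assumes well-formed input whose elements are in no namespace; the stylesheet, the class name `XsltTextExtractor` and the selection of `br`, `p` and `div` are illustrative assumptions rather than anything taken from the answer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: runs a tiny text-output stylesheet over the input.
final class XsltTextExtractor {
    // Text nodes are copied by XSLT's built-in rules; the templates below
    // only add a newline at <br> and after each <p> or <div>.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
        + "<xsl:template match='p|div'>"
        + "<xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String toPlainText(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(result));
            return result.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Input could not be transformed", e);
        }
    }
}
```

Calling `XsltTextExtractor.toPlainText("<html><body><p>Hello</p><div>World<br/>again</div></body></html>")` should put Hello, World and again on separate lines. Documents that declare the XHTML namespace would need namespace-qualified match patterns, which this sketch leaves out.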
                                                                        
                                                        
            
            
              
                
温柔的废话 (2021-02-20 09:23)
              
            
            
                                                                       
Have your parser append text content and newlines to a StringBuilder.

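import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    // True once text has been written since the last newline, so that
    // consecutive block tags do not produce runs of empty lines.
    private boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        // trim() strips the edges of every text chunk, so words separated
        // only by an inline tag boundary may run together.
        sb.append(new String(data).trim());
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Start a new line before each tag of interest.
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        // Tags without a closing tag, such as <br>, arrive here.
        handleStartTag(t, a, pos);
    }
};
// parse() throws IOException. The last argument is ignoreCharSet: true skips
// charset handling for <meta> tags, which is not needed for a String input.
new ParserDelegator().parse(new StringReader(html), parserCallback, true);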

                                                                        
                                                        
            
            
              
                
日久生厌 (2021-02-20 09:23)
              
            
            
                                                                       
Building on your example, with a hint from the "html to plain text?" message:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TestJsoup
{
  public void simpleParse()
  {
    try
    {
      Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
      // Trick for better formatting: wrapping the body in <pre> is meant
      // to make text() keep the whitespace found in the source.
      doc.body().wrap("<pre></pre>");
      String text = doc.text();
      // Convert non-breaking spaces (from &nbsp; entities) to plain spaces
      text = text.replace('\u00A0', ' ');
      System.out.print(text);
    }
    catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  public static void main(String[] args)
  {
    TestJsoup tjs = new TestJsoup();
    tjs.simpleParse();
  }
}

                                                                        
                                                        
            
            
              
                
醉酒成梦 (2021-02-20 09:30)
              
            
            
                                                                       
I would guess you could use the ParserCallback.

You would need to add code to support the tags that require special handling. There are three callbacks:

handleStartTag
handleEndTag
handleSimpleTag

They should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
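
A rough sketch of how the three callbacks might work together, using the Swing HTML parser that ships with the JDK. The class name `CallbackTextExtractor`, the helper `newLineIfNeeded` and the decision to treat only `P`, `DIV` and `BR` specially are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on HTMLEditorKit.ParserCallback.
final class CallbackTextExtractor {
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Assumption: only <p> and <div> count as blocks here.
            private boolean isBlock(HTML.Tag t) {
                return t == HTML.Tag.P || t == HTML.Tag.DIV;
            }

            // Ends the current line unless the buffer is empty or
            // already ends with a newline.
            private void newLineIfNeeded() {
                if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                    out.append('\n');
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (isBlock(t)) {
                    newLineIfNeeded();
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (isBlock(t)) {
                    newLineIfNeeded();
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // <br> always forces a break, even on an empty line.
                if (t == HTML.Tag.BR) {
                    out.append('\n');
                }
            }
        };
        try {
            // Third argument true: ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

A call such as `CallbackTextExtractor.toPlainText("<p>Hello</p><div>World<br>again</div>")` is expected to place each of the three words on its own line. Other tags (headings, list items, table rows) can be added to `isBlock` in the same way.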
                                                                        
                                                        
            
            
              
                