Python convert html to text and mimic formatting


挽巷 2020-12-31 13:56

I'm learning BeautifulSoup, and found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

One

4 answers

长情又很酷 (original poster) 2020-12-31 14:30

Python's built-in html.parser module (named HTMLParser in Python 2) can easily be extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.

Because it is so simple, you can't navigate around the HTML tree like you could with Beautiful Soup (e.g. sibling, child, parent nodes), but for a simple case like yours it should be enough.
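
To make the event model concrete, here is a minimal sketch that only records which callbacks fire and in what order. The names EventLogger and log_events are made up for this illustration; the three handle_* methods are the standard HTMLParser hooks. Note that the parser hands you a flat stream of events, so any notion of "where am I in the tree" is something you have to track yourself.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record each parser callback in the order it fires (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    # Hypothetical convenience wrapper: parse one document and return its events.
    logger = EventLogger()
    logger.feed(html)
    logger.close()  # flush any text still buffered by the parser
    return logger.events


events = log_events("<ul><li>One</li><li>Two</li></ul>")
for kind, value in events:
    print(kind, value)
```

For this input the log starts with ("start", "ul"), ("start", "li"), ("data", "One"), ("end", "li"), which is exactly the sequence a formatter can react to.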

html.parser homepage

In your case you could use it like this, adding the appropriate formatting whenever a start tag or end tag of a specific type is encountered:

from html.parser import HTMLParser
from os import linesep

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # The strict argument was removed in Python 3.5; parsing is always lenient now.
        super().__init__()
    def feed(self, in_html):
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output
    def handle_data(self, data):
        self.output += data.strip()
    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote' :
            self.output += linesep + linesep + '\t'
    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep

parser = MyHTMLParser()
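# Markup assumed from the handlers above: list items for Example 1, a blockquote for Example 2.
content = "<ul><li>One</li><li>Two</li></ul>"
print(linesep + "Example 1:")
print(parser.feed(content))
content = "<p>Some text<blockquote>More magnificent text here</blockquote>Final text</p>"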
print(linesep + "Example 2:")
print(parser.feed(content))
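
If you also need nested lists to keep their indentation, one possible extension is to count how many list tags are currently open and indent each bullet accordingly. This is only a sketch under my own assumptions: NestedListParser and html_to_text are hypothetical names, two spaces per level is an arbitrary choice, and text that follows a nested list inside the same item is simply appended to the last line.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Render <ul>/<ol> items as '* ' bullets, indented by nesting depth (sketch)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of list tags currently open
        self.lines = []  # output, one entry per line

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            # Start a new bullet line; an <li> outside any list gets no indent.
            self.lines.append("  " * max(0, self.depth - 1) + "* ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.lines:
            self.lines[-1] += text  # continue the current line
        else:
            self.lines.append(text)


def html_to_text(html):
    # Hypothetical helper: parse one document and join the collected lines.
    parser = NestedListParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(html_to_text("<ul><li>One<ul><li>Nested</li></ul></li><li>Two</li></ul>"))
```

Joining with "\n" instead of os.linesep is deliberate here: print() in text mode already translates newlines for the platform.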

    
             
                                                        
            
            
              
                