Python convert html to text and mimic formatting

挽巷 2020-12-31 13:56

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

One


      
      
        
4 answers

情深已故 (original poster) 2020-12-31 14:31

When using samaspin's solution, if the input contains non-English Unicode characters, the parser stops working and just returns an empty string. Initialising the parser on each loop iteration ensures that even if a parser object gets corrupted, subsequent parses do not return an empty string. This version also adds handling of the <br> tag to samaspin's solution.
To convert further HTML tags rather than simply drop them, add them to handle_starttag together with their expected output:

    from html.parser import HTMLParser
    from os import linesep


    class MyHTMLParser(HTMLParser):
        """
        Strip HTML tags while preserving the formatting: whitespace,
        newlines, line breaks, etc. are converted from HTML tags to
        their plain-text counterparts.
        """

        def __init__(self):
            super().__init__()
            self.output = ""

        def feed(self, in_html):
            # Reset the buffer so each call returns only its own text.
            self.output = ""
            super().feed(in_html)
            return self.output

        def handle_data(self, data):
            self.output += data.strip()

        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                self.output += linesep + '* '
            elif tag == 'blockquote':
                self.output += linesep + linesep + '\t'
            elif tag == 'br':
                self.output += linesep + '\n'

        def handle_endtag(self, tag):
            if tag == 'blockquote':
                self.output += linesep + linesep


    # Create a new parser for each document that is converted.
    parser = MyHTMLParser()
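
The two ideas in this answer (a fresh parser per input, and adding more tags with their expected output) could be sketched roughly as follows. This is only an illustration, not the answerer's code: the names FormattingParser, START, END and html_to_text are hypothetical, and the 'p' entry is an assumed example of an extra tag. The other entries mirror the mappings in the class above.

```python
from html.parser import HTMLParser
from os import linesep


class FormattingParser(HTMLParser):
    """Hypothetical table-driven variant of the parser in the answer."""

    # Text emitted when a tag opens. 'p' is an illustrative addition;
    # the other entries copy the answer's handle_starttag branches.
    START = {
        'li': linesep + '* ',
        'blockquote': linesep + linesep + '\t',
        'br': linesep + '\n',
        'p': linesep + linesep,
    }
    # Text emitted when a tag closes.
    END = {'blockquote': linesep + linesep}

    def __init__(self):
        super().__init__()
        self.output = ''

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        # Tags without an entry contribute nothing to the output.
        self.output += self.START.get(tag, '')

    def handle_endtag(self, tag):
        self.output += self.END.get(tag, '')


def html_to_text(in_html):
    """Convert one HTML fragment using a brand-new parser instance."""
    parser = FormattingParser()
    parser.feed(in_html)
    parser.close()  # flush any text the parser is still buffering
    return parser.output


# A new parser is built for every document, so a failure on one
# input cannot leak state into the next conversion.
documents = ['<ul><li>One</li><li>Two</li></ul>', '<p>Caf\u00e9</p>']
for doc in documents:
    print(repr(html_to_text(doc)))
```

Supporting another tag then only requires a new dictionary entry, and the conversion loop never reuses a parser object.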

    
             
                                                        
            
            
              
                