Extract part of a regex match

北海茫月 2020-11-22 13:01

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNOR


      
      
        
9 answers

谎友^ (OP) 2020-11-22 13:56

I'd think this should suffice:

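import re
# Capture only the text between the title tags; group(1) holds the captured part.
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
match = pattern.search(text)
title = match.group(1) if match else None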


... assuming that your text (HTML) is in a variable named "text."

This also assumes that no other HTML tags can legally be embedded inside an HTML TITLE tag, and that there is no way to legally embed any other < character within such a container/block.
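To make the "extract part of the match" step concrete, here is a minimal sketch. It assumes the pattern wraps a single capturing group in literal <title> tags; the names TITLE_RE, extract_title and sample are illustrative and not from the original post. re.MULTILINE is left out because it only changes how ^ and $ behave, and this pattern uses neither.

```python
import re

# Assumed pattern: literal <title> tags around one capturing group.
TITLE_RE = re.compile(r'<title>([^<]*)</title>', re.IGNORECASE)

def extract_title(text):
    """Return the text captured between the title tags, or None if nothing matches."""
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # group(0) is the whole match, tags included; group(1) is only the captured part.
    return match.group(1)

sample = '<html><head><TITLE>Example page</TITLE></head></html>'
print(extract_title(sample))  # Example page

# The assumptions above break as soon as the opening tag carries attributes.
print(extract_title('<title lang="en">Example page</title>'))  # None
```

The second call shows why the caveats matter: a perfectly legal title element is missed because the pattern only knows one spelling of the opening tag.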

However ...

Don't use regular expressions for HTML parsing in Python.  Use an HTML parser!  (Unless you're going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)
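As one possible illustration of the parser route, the sketch below uses html.parser.HTMLParser from the Python 3 standard library. The class TitleParser and the helper parse_title are hypothetical names introduced here, and the code only collects the text of the first title element; treat it as a starting point under those assumptions rather than a complete solution.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled as well.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return ''.join(self._parts) if self._done else None

def parse_title(html_text):
    """Feed the whole document to the parser and return the title text or None."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title

page = '<html><head><TITLE lang="en">Example page</TITLE></head></html>'
print(parse_title(page))  # Example page
```

Unlike a literal-tag regex, this handles attributes on the opening tag and mixed-case tag names without extra work.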

If you're handling "real world" tag soup HTML (which frequently fails any SGML/XML validator), then use the BeautifulSoup package.  It isn't in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML.  It also has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
    
             
                                                        
            
            
              
                