Regex for links in html text

旧巷少年郎 2020-12-16 04:42

I hope this question is not an RTFM one. I am trying to write a Python script that extracts the links (the <a> tags) from a standard HTML webpage.

8 answers
  • 2020-12-16 05:22

    Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.

  • 2020-12-16 05:29

    Shouldn't a link be a well-defined regex? This is a rather theoretical question, and I second PEZ's answer:

    I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

    As far as I know, any HTML tag may contain any number of nested tags. For example:

    <a href="http://stackoverflow.com">stackoverflow</a>
    <a href="http://stackoverflow.com"><i>stackoverflow</i></a>
    <a href="http://stackoverflow.com"><b><i>stackoverflow</i></b></a>
    ...
    

    Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

    BE
    BBEE
    BBBEEE
    ...
    BBBBBBBBBBEEEEEEEEEE
    ...
    

    where B marks the beginning of a tag and E its end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that: in order to count, an automaton needs at least a stack. Referring to PEZ's answer, HTML is described by a context-free grammar; it is not a regular language.
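A tiny sketch of what that counting requires (the helper name is illustrative, not from the original): a depth counter, which is exactly the state a finite automaton cannot keep.

```python
# Hedged sketch: pairing each B with its E requires a counter,
# which a finite automaton lacks.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "B":
            depth += 1
        elif ch == "E":
            depth -= 1
            if depth < 0:       # an E with no matching B
                return False
    return depth == 0           # every B closed by an E

print(balanced("BBBEEE"))  # True
print(balanced("BBEEE"))   # False
```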

  • 2020-12-16 05:32

    As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

    import urllib.request                # Python 3 replacement for urllib2
    from bs4 import BeautifulSoup        # BeautifulSoup 4: pip install beautifulsoup4

    html = urllib.request.urlopen("http://www.google.com").read()
    soup = BeautifulSoup(html, "html.parser")
    all_links = soup.find_all("a")
    

    As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

    If you are certain you will be working with standard XHTML, you can use (much) faster XML parsers such as expat.
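For well-formed XHTML, a minimal expat sketch using the stdlib xml.parsers.expat module (the handler and variable names here are illustrative):

```python
import xml.parsers.expat

# Collect href attributes from <a> elements via expat's event callbacks.
links = []

def start_element(name, attrs):
    if name == "a" and "href" in attrs:
        links.append(attrs["href"])

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse('<html><body><a href="http://stackoverflow.com">SO</a></body></html>', True)
print(links)  # ['http://stackoverflow.com']
```

Note that expat raises an error on malformed input, so this only works when the XHTML really is well-formed.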

    Regex, for the reasons above (the parser must maintain state, and a regex can't do that), will never be a general solution.

  • 2020-12-16 05:32

    Answering your two subquestions:

    1. I've sometimes subclassed SGMLParser (part of the Python 2 standard library; it was removed in Python 3) and must say it's straightforward.
    2. I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.
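The same subclassing approach works in Python 3 with the stdlib html.parser.HTMLParser, the successor to SGMLParser; a minimal sketch (the class name is illustrative):

```python
from html.parser import HTMLParser

# Subclass HTMLParser and hook the start-tag callback to collect hrefs.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<p><a href="http://stackoverflow.com">SO</a></p>')
print(parser.links)  # ['http://stackoverflow.com']
```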
  • 2020-12-16 05:38

    In response to question #2 (shouldn't a link be a well-defined regular expression?), the answer is ... no.

    An HTML link structure is recursive, much like parentheses and braces in programming languages: there must be an equal number of start and end constructs, and the "link" expression can be nested within itself.

    To properly match a "link" expression, a regex would have to count the start and end tags. Regular expressions are equivalent to finite automata, and by definition a finite automaton cannot "count" nested constructs within a pattern (it has no stack). A grammar is required to describe a recursive structure such as this. This inability to "count" is why programming languages are described with grammars rather than with regular expressions.

    So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a good many links with a high degree of accuracy, but they will never be perfect.
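The failure mode is easy to demonstrate on nested identical tags: a non-greedy regex pairs the first opener with the first closer, not with its own (hypothetical snippet):

```python
import re

# Nested identical tags: the regex pairs the first <b> with the
# first </b>, splitting the outer element incorrectly.
html = "<b>outer <b>inner</b> tail</b>"
match = re.search(r"<b>(.*?)</b>", html)
print(match.group(1))  # 'outer <b>inner' -- the wrong pairing
```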

    I wrote a blog article about this problem recently: Regular Expression Limitations.

  • 2020-12-16 05:40

    It depends a bit on how the HTML is produced. If it's somewhat controlled, you can get away with matching the anchor tags directly:

    re.findall(r'''<a\s+.*?href=['"](.*?)['"].*?(?:</a|/)>''', html, re.I)
    
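Under that "somewhat controlled" assumption, a self-contained example of the approach (the sample markup is illustrative):

```python
import re

# Works only on controlled markup: single-level anchors, quoted hrefs.
html = ('<a href="http://stackoverflow.com">SO</a> '
        "<a href='http://example.com'>Ex</a>")
links = re.findall(r'''<a\s+.*?href=['"](.*?)['"].*?(?:</a|/)>''', html, re.I)
print(links)  # ['http://stackoverflow.com', 'http://example.com']
```

It breaks on nested tags, unquoted attributes, or hrefs split across lines, as the other answers explain.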