Regex for links in HTML text

旧巷少年郎 2020-12-16 04:42
I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the tags). I hav

      
      
        
8 answers

不知归路 (original poster) 2020-12-16 05:38

In response to question #2 (shouldn't a link be a well-defined regular expression?), the answer is ... no.

An HTML link structure is recursive, much like parentheses and braces in programming languages. There must be an equal number of start and end constructs, and the "link" expression can be nested within itself.

To properly match a "link" expression, a regex would have to count the start and end tags. Regular expressions are equivalent in power to finite automata, and a finite automaton cannot "count" constructs within a pattern: it has only a fixed number of states, so it cannot track an arbitrary nesting depth. A grammar is required to describe a recursive data structure such as this. The inability of a regex to "count" is why programming languages are described with grammars rather than regular expressions.
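
To make the counting argument concrete, here is a minimal Python sketch. It uses parentheses rather than tags to keep things short. The pattern `FLAT_GROUP` and the helper `is_balanced` are hypothetical names made up for this illustration. Python's built-in `re` module has no recursion construct, so the pattern can only describe one flat, non-nested group.

```python
import re

# Hypothetical pattern for illustration: one parenthesized group
# that contains no further parentheses.
FLAT_GROUP = re.compile(r"\([^()]*\)")


def is_balanced(text: str) -> bool:
    """Check nesting with an explicit counter, the piece a finite automaton lacks."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0


nested = "(a(b)c)"

# The pattern cannot describe the outer group, so the search
# settles on the innermost flat one.
innermost = FLAT_GROUP.search(nested).group(0)

# The counter tracks depth explicitly and handles any nesting level.
balanced = is_balanced(nested)
```

The counter in `is_balanced` is unbounded state, which is what moves the check beyond what a finite automaton can do. A real HTML parser does the same kind of bookkeeping with a stack of open tags.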

So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a good deal of "link"s with a high degree of accuracy, but they will never be perfect.
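
As a rough illustration of "good but never perfect", the sketch below compares a deliberately simple regex with the standard library's `html.parser`. The sample markup, the pattern `HREF_RE`, and the class `LinkCollector` are all assumptions made up for this example. They are not taken from the question, and the parser is only one possible alternative.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, made up for this illustration.
SAMPLE = """
<p><a href="https://example.com/a">first</a>
<a class="x" href='/b'>second</a>
<!-- <a href="/commented-out">hidden</a> -->
<A HREF=page-c.html>third</A></p>
"""

# A deliberately simple pattern: double-quoted href values only.
HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


regex_links = HREF_RE.findall(SAMPLE)

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
parser_links = collector.links
```

On this sample the regex misses the single-quoted and unquoted links and also picks up the one inside the comment, while the parser reports the three real links. The regex could be patched for each of these cases, but every patch covers only the variations someone thought of in advance.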

I wrote a blog article about this problem recently: Regular Expression Limitations
    
             
                                                        
            
            
              
                