Regex for links in html text

前端未结
关注
 8  1667
旧巷少年郎 2020-12-16 04:42
I hope this question is not a RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the tags). I hav

      
      
        
          8条回答        

        
                    
            
            
                         
                
              
              
                
                   隐瞒了意图╮
                                             
                
                
                (楼主)
            
              
              
                2020-12-16 05:29
              

            
            
                        

  Shoudln't a link be a well-defined regex? This is a rather theoretical question,


I second PEZ's answer:


  I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.


As far as I know, any HTML tag may contain any number of nested tags. For example:

stackoverflow
stackoverflow
stackoverflow
...


Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

BE
BBEE
BBBEEE
...
BBBBBBBBBBEEEEEEEEEE
...


where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is a context-free grammar, not a regular language.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它8个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复