You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
First of all, use a different text editor that doesn't use curly quotes.
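As a quick illustration of why curly ("smart") quotes break things: they are not valid string delimiters in Python, so a snippet pasted from a word processor fails to parse. The snippet string below is hypothetical.

```python
# find_all("a") written with curly quotes U+201C / U+201D instead of
# plain ASCII quotes; Python rejects it at compile time.
snippet = 'find_all(\u201ca\u201d)'

try:
    compile(snippet, '<snippet>', 'eval')
    print('parsed')
except SyntaxError:
    print('SyntaxError: curly quotes are not valid Python quotes')
```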
Second, remove the text=True flag from the soup.find_all call.
The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. In general, do not use the text parameter if a tag contains any other html elements besides text content.
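To make the failure concrete, here is a minimal sketch using the anchor from the question's HTML. Note the newlines around the 'h3' tag: they are extra children of the 'a' tag, which is what leaves its .string as None.

```python
from bs4 import BeautifulSoup

html = '''\
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>'''

soup = BeautifulSoup(html, 'html.parser')

# With text=True, .find_all() only keeps tags whose .string is truthy.
# The 'a' tag's .string is None because it has several children
# (whitespace, the 'h3' tag, more whitespace), so nothing matches.
print(soup.find_all('a', text=True))   # []

# Dropping text=True selects the tag as expected.
print(soup.find_all('a', href=True))
```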
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements, then add a condition in the loop to check whether they contain text.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
You can also use attrs to get the href attribute with a regex search:
import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
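As a self-contained sketch of the regex approach, reusing the anchor from the question's HTML:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/file-one/additional" class="file-link"><h3>File One</h3></a>'
soup = BeautifulSoup(html, 'html.parser')

# Passing a compiled pattern as href= keeps only anchors whose href
# matches it: a '/' followed by a letter and one or more word characters.
link = soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
print(link)  # /file-one/additional
```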