Why does this xpath fail using lxml in python?


爱一瞬间的悲伤 2020-12-03 12:56

Here is an example web page I am trying to get data from. http://www.makospearguns.com/product-p/mcffgb.htm

The XPath was taken from Chrome Developer Tools, and f

3 answers

既然无缘 (original poster)

2020-12-03 13:38

The XPath is simply wrong.

Here is a snippet from the page:


      

      
        
          
  
Home > 


You can see that the element with id "v65-product-parent" is of type table and has a tr subelement.

There can be only one element with a given id (otherwise the document would be invalid).

The XPath expects tbody as a child of that element (the table), and there is no tbody anywhere in the page.

This can be tested with:

>>> "tbody" in page.text
False
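
The same effect can be reproduced without touching the network. The following is a minimal sketch, assuming a made-up HTML fragment (SAMPLE is illustrative, not the real page) that reuses the id from the question. It suggests that lxml keeps the table as written and does not add a tbody element, so an XPath containing a tbody step matches nothing:

```python
from lxml import html

# Hypothetical stand-in for the downloaded page: a table whose rows
# are direct children of <table>, with no <tbody> in the source.
SAMPLE = """
<html><body>
<table id="v65-product-parent">
  <tr><td>first row</td></tr>
  <tr><td>second row</td></tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# XPath in the style Chrome produces: it assumes a tbody level.
with_tbody = tree.xpath('//*[@id="v65-product-parent"]/tbody/tr[2]/td/text()')

# The same path with the tbody step removed.
without_tbody = tree.xpath('//*[@id="v65-product-parent"]/tr[2]/td/text()')

print(with_tbody)
print(without_tbody)
```

Only the second query finds the cell, because the tr elements are direct children of the table in the parsed tree.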


How did Chrome come up with that XPath?

If you simply download this page with

$ wget http://www.makospearguns.com/product-p/mcffgb.htm


and review its content, you will see that it does not contain a single element named tbody.

But if you use Chrome Developer Tools, you will find some.

How do they get there?

This often happens when JavaScript comes into play and generates some of the page content in the browser. But as LegoStormtroopr noted, that is not the case here: this time it is the browser itself that modifies the document to make it valid.

How to get the content of a page as modified within the browser?

You have to give some sort of browser a chance to process it. For example, if you use Selenium, you get the modified document.

byselenium.py

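from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

# Let a real browser load the page so that it inserts the missing tbody elements
browser = webdriver.Firefox()
try:
    browser.get(url)
    html_source = browser.page_source  # the document as serialized by the browser
finally:
    browser.quit()  # always close the browser window
print("test tbody", "tbody" in html_source)

# Parse the browser-corrected HTML with lxml and evaluate the XPath
tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)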


which prints

$ python byselenium.py
test tbody True
['$149.95']


Conclusions

Selenium is great when you need the page as the browser sees it. However, it is a rather heavy tool, and if you can solve the problem in a simpler way, do that. LegoStormtroopr has proposed such a simpler solution that works on the plainly fetched web page.
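
The other answer is not reproduced in this thread, so the following is only a sketch of one common lightweight workaround, not necessarily the one proposed there: remove the tbody steps from the XPath copied out of the browser before evaluating it against the raw HTML. The helper name strip_tbody is hypothetical.

```python
import re


def strip_tbody(xpath: str) -> str:
    """Remove every '/tbody' step (with an optional [n] index) from an XPath string."""
    # The lookahead makes sure only a complete step named 'tbody' is removed,
    # not a longer element name that merely starts with 'tbody'.
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# Shortened, illustrative XPath in the style Chrome produces.
chrome_xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td'
print(strip_tbody(chrome_xpath))
```

This only helps when the raw HTML really has no tbody elements. If the page source does contain some of them, the blanket removal would break those steps, so check the downloaded source first.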