Why does this XPath fail using lxml in Python?

爱一瞬间的悲伤 2020-12-03 12:56

Here is an example web page I am trying to get data from. http://www.makospearguns.com/product-p/mcffgb.htm

The xpath was taken from chrome development tools, and f

3 answers
  •  既然无缘
    2020-12-03 13:38

    The XPath is simply wrong.

    Here is a snippet from the page:


      Home >

    You can see that the element with the id "v65-product-parent" is of type table and has tr as its direct subelement.

    There can be only one element with that id (otherwise the document would be broken XML).

    The XPath expects tbody as a child of that element (the table), and there is no tbody anywhere in the page.

    This can be tested with:

    >>> "tbody" in page.text
    False
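
The same effect can be reproduced offline. The sketch below uses a hypothetical, heavily simplified stand-in for the page markup (only the id and the price are borrowed from this thread) and the standard library's ElementTree rather than lxml, so that it runs without extra packages. For paths this simple, lxml's tree.xpath(...) should behave the same way: a path with a tbody step matches nothing when the served markup has no tbody.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the served markup: the table has
# tr elements directly under it, with no tbody in between.
SOURCE = """
<html><body>
  <table id="v65-product-parent">
    <tr><td>first row</td></tr>
    <tr><td>second row</td><td>$149.95</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SOURCE)

# Path in the style copied from the browser: it assumes a tbody under the table.
with_tbody = root.findall(".//table[@id='v65-product-parent']/tbody/tr[2]/td[2]")

# Same path with the tbody step dropped, matching the markup as served.
without_tbody = root.findall(".//table[@id='v65-product-parent']/tr[2]/td[2]")

print("tbody" in SOURCE)                  # False
print(with_tbody)                         # []
print([td.text for td in without_tbody])  # ['$149.95']
```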
    

    How did Chrome come up with that XPath?

    If you simply download this page with

    $ wget http://www.makospearguns.com/product-p/mcffgb.htm
    

    and review its content, you will not find a single element named tbody.

    But if you use Chrome Developer Tools, you will find some.

    How do they get there?

    This often happens when JavaScript comes into play and generates page content in the browser. But as LegoStormtroopr noted, that is not the case here: this time it is the browser itself that modifies the document to make it correct, inserting the tbody elements that the markup omits.

    How to get the content of a page as modified within the browser?

    You have to give some sort of browser a chance. For example, if you use Selenium, you get the modified document.

    byselenium.py

    from selenium import webdriver
    from lxml import html

    url = "http://www.makospearguns.com/product-p/mcffgb.htm"
    xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

    # Let a real browser load the page, so the source reflects the browser's DOM.
    browser = webdriver.Firefox()
    try:
        browser.get(url)
        html_source = browser.page_source
    finally:
        browser.quit()  # close the browser even if loading fails
    print("test tbody", "tbody" in html_source)

    # Parse the browser-corrected source with lxml and apply the XPath.
    tree = html.fromstring(html_source)
    text = tree.xpath(xpath)
    print(text)
    

    which prints:

    $ python byselenium.py
    test tbody True
    ['$149.95']
    

    Conclusions

    Selenium is great when the page is changed within the browser. However, it is a rather heavy tool, and if you can do the job in a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution, which works on the page as plainly fetched.
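
One lighter-weight route that follows from the analysis above is to drop the tbody steps from the XPath copied out of the browser, so that it matches the markup as served. This is offered as an assumption, not necessarily the solution referred to above, and the helper name strip_tbody is hypothetical. Whether the stripped path returns the price on the live page has not been verified here.

```python
import re


def strip_tbody(xpath: str) -> str:
    """Remove every tbody step (with an optional [n] index) from an XPath string."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


chrome_xpath = (
    '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td'
    '/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td'
    '/font/div/b/span/text()'
)

# The result can be passed to tree.xpath(...) on a plainly fetched page.
plain_xpath = strip_tbody(chrome_xpath)
print(plain_xpath)
```

The row and cell indices such as tr[2] are left untouched, on the assumption that each table's rows would all sit in a single browser-inserted tbody.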
