Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

别说谁变了你拦得住时间么 submitted on 2019-12-12 02:48:55

Question


I need to automatically scan lots of html documents for ad banners that are surrounded by an anchor tag, e.g.:

<a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
</a>

As a newbie with xpath, I can select such anchors via lxml like so:

import lxml.html

text = '''
    <a href="http://ad_network.com/abc.html">
        <img src="ad_banner.jpg">
    </a>'''

root = lxml.html.fromstring(text)
print(root.xpath('//a[contains(@href,("ad_network.")) or contains(@href,("other_ad_network."))][descendant::img]'))

In the example I check for two different domains: "ad_network." and "other_ad_network.". However, there are over 25 domains to check, and the xpath expression would get terribly long if all those contains() calls were joined with "or". I also fear the expression would be pretty inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values at once?

I could also get the relevant links via regex in a single line of code. Yet, although the html code is normalized by lxml, regex never seems to be a good choice for that kind of work ... Any help appreciated!


Answer 1:


It might not be that bad just to do a bunch of 'or's. Build the xpath with Python so that you don't get writer's cramp, and then precompile it. The actual xpath evaluation happens in libxml2, the C library underneath lxml, and should be fast.

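from lxml import etree

sites = ['aaa', 'bbb']
# Quote each site so xpath treats it as a string literal, not as an element name
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)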
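
For illustration, here is a minimal end-to-end sketch of that approach, applied to a small document. The domain list, the sample markup, and the helper name build_ad_anchor_xpath are made up for this example and are not from the original posts. It assumes that no domain fragment contains a double quote, since each fragment is embedded in the expression as a double-quoted xpath string literal.

```python
import lxml.html
from lxml import etree

# Hypothetical domain fragments; the real list would hold the 25+ ad networks.
AD_DOMAINS = ['ad_network.', 'other_ad_network.']


def build_ad_anchor_xpath(domains):
    """Compile one xpath matching <a> tags that wrap an <img> and whose
    href contains any of the given fragments."""
    # Assumption: no fragment contains a double quote, because each one
    # is inserted as a double-quoted xpath string literal.
    contains = ' or '.join('contains(@href, "%s")' % d for d in domains)
    return etree.XPath('//a[%s][descendant::img]' % contains)


# Compile once, then reuse the callable for every document that is scanned.
find_ad_anchors = build_ad_anchor_xpath(AD_DOMAINS)

html = '''
<div>
    <a href="http://ad_network.com/abc.html"><img src="ad_banner.jpg"></a>
    <a href="http://example.com/page.html"><img src="photo.jpg"></a>
    <a href="http://other_ad_network.net/x.html">text only</a>
</div>'''

root = lxml.html.fromstring(html)
matched_hrefs = [a.get('href') for a in find_ad_anchors(root)]
print(matched_hrefs)  # ['http://ad_network.com/abc.html']
```

Only the first anchor is returned: the second has a non-matching href, and the third matches a domain but contains no image. Compiling the expression once and calling it per document avoids re-parsing the long xpath string for each of the many html files.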


Source: https://stackoverflow.com/questions/17975960/select-all-anchor-tags-with-an-href-attribute-that-contains-one-of-multiple-valu
