Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

Question

I need to automatically scan lots of html documents for ad banners that are surrounded by an anchor tag, e.g.:

<a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
</a>

As a newbie with xpath, I can select such anchors via lxml like so:

import lxml.html

text = '''
    <a href="http://ad_network.com/abc.html">
        <img src="ad_banner.jpg">
    </a>'''

root = lxml.html.fromstring(text)
print(root.xpath('//a[contains(@href, "ad_network.") or contains(@href, "other_ad_network.")][descendant::img]'))

In the example I check for two different domains: "ad_network." and "other_ad_network.". However, there are over 25 domains to check, and the XPath expression would get terribly long if all those contains() calls were chained with "or". I also fear the expression would be quite inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values at once?

I could also get the relevant links via regex in a single line of code. Yet, although the HTML is normalized by lxml, regex never seems to be a good choice for that kind of work... Any help appreciated!

Answer 1:

It might not be that bad just to do a bunch of 'or's. Build the XPath expression with Python so that you don't get writer's cramp, and then precompile it. The actual XPath evaluation happens in libxml2 and should be fast.

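from lxml import etree

sites = ['aaa', 'bbb']
# Quote each value so XPath reads it as a string literal, not an element name
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)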
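To show the precompiled expression in use, here is a minimal sketch that builds the XPath from a list of domain fragments and applies it to a small document. The names AD_DOMAINS, build_ad_anchor_xpath and find_ad_anchors, as well as the sample HTML, are made up for illustration and are not part of the original answer. The sketch also assumes that no domain fragment contains a double quote.

```python
import lxml.html
from lxml import etree

# Hypothetical list; in practice this would hold all 25+ ad network fragments.
AD_DOMAINS = ['ad_network.', 'other_ad_network.']


def build_ad_anchor_xpath(domains):
    """Compile one XPath matching <a> elements that wrap an <img>
    and whose href contains any of the given fragments."""
    # Assumption: fragments contain no double quotes, so plain quoting is safe.
    conditions = ' or '.join('contains(@href, "%s")' % d for d in domains)
    return etree.XPath('//a[%s][descendant::img]' % conditions)


# Compile once, then reuse the callable for every document.
find_ad_anchors = build_ad_anchor_xpath(AD_DOMAINS)

html = '''
<div>
    <a href="http://ad_network.com/abc.html"><img src="ad_banner.jpg"></a>
    <a href="http://example.com/page.html"><img src="photo.jpg"></a>
    <a href="http://other_ad_network.net/x.html">text only</a>
</div>'''

root = lxml.html.fromstring(html)
matches = find_ad_anchors(root)
hrefs = [a.get('href') for a in matches]
print(hrefs)
```

Only the first anchor should match here: the second points to a domain that is not in the list, and the third has no img descendant. Because the compiled object is callable, the same find_ad_anchors can be applied to each parsed document without rebuilding the expression.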

Source: https://stackoverflow.com/questions/17975960/select-all-anchor-tags-with-an-href-attribute-that-contains-one-of-multiple-valu

Tags

python

xpath

operators

lxml

contains