Scrapy - select xpath with a regular expression

Question

Part of the HTML that I am scraping looks like this:

<h2> <span class="headline" id="Profile">Profile</span></h2>
<ul><li> <b>Name</b> Albert Einstein
</li><li> <b>Birth Name:</b> Alberto Ein
</li><li> <b>Birthdate:</b> December 24, 1986
</li><li> <b>Birthplace:</b> <a href="/Ulm" title="Dest">Ulm</a>, Germany
</li><li> <b>Height:</b> 178cm
</li><li> <b>Blood Type:</b> A
</li></ul>

I want to extract each field: name, birth name, birthdate, etc.

To extract the name I do:

a_name = response.xpath('//ul/li/b[contains(text(),"Name")]/../descendant::text()').extract()

then I check that a_name is not an empty list and I call:

"".join(a_name[2:]).strip()

I do this for consistency since in Birthplace, I just want to extract the text, excluding all the html attributes. So I would get Ulm, Germany.

The problem is that when I use contains(text(), "Name"), the entry for Birth Name also matches. How can I avoid this when building my selector?

With a regular expression I could specify something like "text() matches ^Name.*", since the text Name may or may not be followed by a colon and/or a space.

Is there a way to use regular expressions to solve this problem?

Answer 1:

If you want to use regex, you could try this:

response.xpath('//ul/li/b[text()[re:test(., "^Name.*")]]/../descendant::text()')

But you are better off using starts-with():

response.xpath('//ul/li/b[starts-with(text(),"Name")]/../descendant::text()')

Answer 2:

Try extracting the text of every li element, then parse the resulting list of strings, like this:

from scrapy.selector import Selector
source = '''
<h2> <span class="headline" id="Profile">Profile</span></h2>
<ul><li> <b>Name</b> Albert Einstein
</li><li> <b>Birth Name:</b> Alberto Ein
</li><li> <b>Birthdate:</b> December 24, 1986
</li><li> <b>Birthplace:</b> <a href="/Ulm" title="Dest">Ulm</a>, Germany
</li><li> <b>Height:</b> 178cm
</li><li> <b>Blood Type:</b> A
</li></ul>
'''

a_name = Selector(text=source).xpath('//ul/li//text()').extract()
all_li = ''.join(a_name).strip().split("\n")
print(all_li)

all_li will give you:

[u'Name Albert Einstein', u' Birth Name: Alberto Ein', u' Birthdate: December 24, 1986', u' Birthplace: Ulm, Germany', u' Height: 178cm', u' Blood Type: A']
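
The parsing step is not shown above. One possible sketch, assuming the set of labels is known in advance (the LABELS list and the names pattern and profile are hypothetical, not part of the original answer), uses a regular expression anchored at the start of each line so that Name cannot match inside Birth Name, and treats the colon as optional:

```python
import re

# Output of the snippet above, copied as plain strings
all_li = ['Name Albert Einstein', ' Birth Name: Alberto Ein',
          ' Birthdate: December 24, 1986', ' Birthplace: Ulm, Germany',
          ' Height: 178cm', ' Blood Type: A']

# Assumption: the labels that can appear on the page are known up front
LABELS = ["Birth Name", "Name", "Birthdate", "Birthplace", "Height", "Blood Type"]

# Anchored at the line start: optional whitespace, one label,
# an optional colon, then the value
pattern = re.compile(
    r'^\s*(%s)\s*:?\s*(.*)$' % "|".join(re.escape(label) for label in LABELS)
)

profile = {}
for line in all_li:
    match = pattern.match(line)
    if match:
        profile[match.group(1)] = match.group(2).strip()

print(profile["Name"])
print(profile["Birth Name"])
```

A known label list is needed here because the Name line has no colon to split on; if every label ended in a colon, a plain line.split(":", 1) would be enough.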

Source: https://stackoverflow.com/questions/45384382/scrapy-select-xpath-with-a-regular-expression

Tags

python

python-2.7

xpath

web-scraping

scrapy