Scrapy Extract number from page text with regex

Posted by ↘锁芯ラ on 2019-12-21 19:52:59

Question


I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. I have my spider set up as follows:

def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units

I would like to pull out the number after "Units: " on the pages. When I run Scrapy on a page with "Units: 351" in the body, I get only the page title, with a bunch of escape characters before and after it, and nothing for units.

I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after "Units:" and remove the extra escape characters ("u'\r\n\t...") from the title would be much appreciated.

EDIT: As requested in a comment, here is a partial HTML extract of an example page. Note that the text could be inside tags other than the p in this example:

<body>
<div> Some content and multiple Divs here </div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>

Based on the answer below, this is what got me most of the way there. I am still working on removing "Units: " and the extra escape characters.

units = response.xpath('string(//body)').re("(Units: [\d]+)")

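Answer 1:


Try taking the string value of the whole body, which concatenates the text of all descendant elements (so it also works when the label is inside a strong tag), and capture one or more digits with \d+ rather than a single \d:

response.xpath('string(//body)').re(r"Units: (\d+)")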
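
To see why the capture group matters and how the title can be cleaned up, here is a minimal sketch using only the standard library re module. The strings body_text and raw_title are hypothetical stand-ins for what the selectors would return in the spider, and the helper names extract_units and clean_title are made up for illustration. The \s* after the colon is an added assumption so that a missing or doubled space does not break the match.

```python
import re

# Hypothetical stand-ins for the text the spider would see:
# body_text plays the role of the string value of <body>,
# raw_title plays the role of the raw <title> text.
body_text = "\r\n Some content \r\n This is the count for Dala \r\n Number of Units: 801 \r\n"
raw_title = "\r\n\t  Example page title \r\n"


def extract_units(text):
    # Capture one or more digits after "Units:" and convert to int.
    # Returns None when the label is not present.
    match = re.search(r"Units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def clean_title(title):
    # str.split() with no arguments splits on any whitespace
    # (carriage returns, newlines, tabs, spaces), so joining the
    # pieces with single spaces removes the surrounding noise.
    return " ".join(title.split())


# A single \d captures only the first digit of the number.
single_digit = re.findall(r"Units: (\d)", body_text)

units = extract_units(body_text)
title = clean_title(raw_title)
print(title, units, single_digit)
```

Because only the parenthesised group is returned, the "Units: " prefix is dropped without any further string handling. In the spider, the same pattern could probably be passed to the selector's .re() or .re_first() method in place of re.search.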


Source: https://stackoverflow.com/questions/26723378/scrapy-extract-number-from-page-text-with-regex
