Question
I am extracting data using Scrapy and Python.
The data sometimes includes extra whitespace. I was using normalize-space with XPath to remove it, like this:
xpath('normalize-space(.//li[2]/strong/text())').extract()
It works very well. However, now I want to use normalize-space with a CSS selector.
I tried this:
car['Location'] = site.css('normalize-space(div[class=location]::text)').extract()
I got an empty result, though I get the correct result if I remove normalize-space.
How can I use it with a CSS selector?
I also tried:
def normalize_whitespace(str):
    import re
    str = str.strip()
    str = re.sub(r'\s+', ' ', str)
    return str
and I called this function like this:
car['Location'] = normalize_whitespace(site.css('div[class=location]::text').extract())
but I got an empty result. Why?
Answer 1:
Unfortunately, XPath functions are not available with CSS selectors in Scrapy.
You could first translate your div[class=location]::text CSS selector to the equivalent XPath expression and then wrap it in normalize-space() as input to .xpath().
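As a rough sketch of that translation, assuming the CSS selector is converted by hand: `.//div[@class="location"]/text()` should select the same text nodes as `div[class=location]::text`, and the wrapped expression can be built as below. The helper name `wrap_normalize_space` is hypothetical, and `site` in the commented line is the selector object from the question.

```python
# Hand-written XPath equivalent of the CSS selector div[class=location]::text
# (an assumption: translated manually, relative to the current node).
LOCATION_TEXT_XPATH = './/div[@class="location"]/text()'


def wrap_normalize_space(xpath_expr):
    """Wrap an XPath expression in the XPath 1.0 normalize-space() function."""
    return 'normalize-space({})'.format(xpath_expr)


query = wrap_normalize_space(LOCATION_TEXT_XPATH)
print(query)

# Inside a spider callback this would be used as (not run here):
# car['Location'] = site.xpath(query).extract()
```

Note that normalize-space() converts a node-set to a string by taking only its first node in document order, so if the div contains several text nodes, only the first one is normalized and returned.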
Anyhow, since you are only interested in the final whitespace-normalized string, you could achieve the same by applying a Python function to the output of the CSS selector's extract().
See for example http://snipplr.com/view/50410/normalize-whitespace/ :
import re

def normalize_whitespace(text):
    # Trim leading/trailing whitespace, then collapse each run to one space.
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text
If you include this function somewhere in your Scrapy project, you could use it like this:
car['Location'] = normalize_whitespace(
    u''.join(site.css('div[class=location]::text').extract()))
or
car['Location'] = normalize_whitespace(
    site.css('div[class=location]::text').extract()[0])
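The call in the question fails because extract() returns a list of strings rather than a single string, while normalize_whitespace expects a string. A minimal sketch of a wrapper that accepts the list directly follows; `normalize_extracted` is a hypothetical name, and the sample list only stands in for what extract() might return.

```python
import re


def normalize_whitespace(text):
    """Strip the ends and collapse internal whitespace runs to one space."""
    return re.sub(r'\s+', ' ', text.strip())


def normalize_extracted(parts):
    """Join the list returned by extract() and normalize the result.

    An empty list (no match) yields an empty string instead of an IndexError.
    """
    return normalize_whitespace(u''.join(parts))


# Stand-in for site.css('div[class=location]::text').extract()
sample = [u'\n   New   York,\n', u'  NY  ']
location = normalize_extracted(sample)
print(location)
```

Joining keeps the text from every matched node, whereas indexing with [0] keeps only the first node and raises IndexError when the selector matches nothing.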
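Answer 2:
css() compiles the selector to XPath and returns a selector list, so you can chain an xpath() call onto it to normalize the whitespace. normalize-space() itself must stay out of the css() argument, because it is not valid CSS. Change your code to:
car['Location'] = site.css('div[class=location]').xpath('normalize-space(text())').extract()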
Source: https://stackoverflow.com/questions/21118582/normalize-space-just-works-with-xpath-not-css-selector