Question
I am trying to scrape the product description from this link, but how do I scrape the whole text, including the text inside nested tags? Here is the hxs call I am using:
hxs.select('//div[@class="overview"]/div/text()').extract()
The original HTML is:
These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>
If I use the hxs call shown above, I get this:
hxs.select('//div[@class="overview"]/div/text()').extract()
Output:
[u'These classic sneakers from ',
u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
u' A vulcanized non-slip rubber sole that is ',
u' sportswear, jeans and tees.',
u' Gently brush away dust or dirt using a soft cleaning brush.',
u'\r\nUse a leather conditioner/wax and a brush for added shine.',
u'Avoid contact with liquids.\xa0']
What I want is this:
These classic sneakers from Puma are best known for their neat and simple design. These
basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper. A vulcanized non-slip rubber sole
that is abrasion resistant ensures good traction.
As you can see, the text between the <b> tags is missing. How do I extract the whole text from the page?
Answer 1:
Try taking the whole content of the tag with
//div[@class="overview"]/div
and then remove the tags from it with a regex, or leave them in if they are not a problem.
Something like this regex:
re.sub('<[^>]*>', '', mystring)
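As a minimal sketch of the regex approach above, the snippet below strips the tags from a shortened copy of the question's markup and then collapses the leftover whitespace. The wrapping <div>, the variable name raw_html, and the helper strip_tags are assumptions made for illustration. The whitespace cleanup is an extra step that the answer does not mention.

```python
import re

# Shortened sample of the markup from the question.
# The wrapping <div> is an assumption about the surrounding element.
raw_html = (
    '<div>These classic sneakers from\n'
    '<b>Puma</b>\n'
    'are best known for their neat and simple design.</div>'
)


def strip_tags(markup):
    """Remove anything that looks like a tag, then collapse whitespace."""
    # Same pattern as in the answer: "<", any run of non-">" characters, ">".
    text = re.sub(r'<[^>]*>', '', markup)
    # Extra step (not in the answer): turn newlines and repeated spaces
    # into single spaces and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()


print(strip_tags(raw_html))
```

With the hxs object from the question, the input would presumably be the joined result of hxs.select('//div[@class="overview"]/div').extract(), since extract() returns a list of strings. A regex like this is only a rough tool: it does not decode entities such as &amp;nbsp; and can misfire on a literal < or > inside the text.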
Source: https://stackoverflow.com/questions/17406992/how-to-scrap-text-included-between-various-tags-using-scrapy