How to scrape text included between various tags using scrapy

别等时光非礼了梦想. Submitted on 2019-12-18 18:37:57

Question


I am trying to scrape the product description from this link. How do I get the whole text, including the text inside nested tags? I am selecting with hxs.select('//div[@class="overview"]/div/text()').extract(), but the original HTML looks like this:

These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>

If I use the selector above, I get this:

hxs.select('//div[@class="overview"]/div/text()').extract()
Output: 
[u'These classic sneakers from ',
 u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
 u' A vulcanized non-slip rubber sole that is ',
 u' sportswear, jeans and tees.',
 u' Gently brush away dust or dirt using a soft cleaning brush.',
 u'\r\nUse a leather conditioner/wax and a brush for added shine.',
 u'Avoid contact with liquids.\xa0']

What I want is this:

These classic sneakers from Puma are best known for their neat and simple design. These
 basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper.A vulcanized non-slip rubber sole 
that is abrasion resistant ensures good traction.

As you can see, the text inside the <b> tags is missing. How do I extract the whole text from the page?


Answer 1:


Try selecting the whole element, tags included, with

 //div[@class="overview"]/div

You can then strip the tags from the extracted markup with a regex, or leave them in if they are not a problem.

Something like this regex:

 re.sub('<[^>]*>', '', mystring)
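To make the suggestion concrete, here is a minimal sketch of the regex approach applied to a shortened copy of the markup from the question. It assumes Python 3 and uses only the standard library; `raw_html` is a hypothetical stand-in for the string that hxs.select('//div[@class="overview"]/div').extract() would return, and `strip_tags` is an illustrative helper name, not part of Scrapy.

```python
import re
from html import unescape

# Hypothetical stand-in for the first item returned by
# hxs.select('//div[@class="overview"]/div').extract()
raw_html = (
    '<div>These classic sneakers from <b>Puma</b> are best known for their '
    'neat and simple design. The pair is equipped with a '
    '<b>leather and synthetic upper.</b> A vulcanized non-slip rubber sole '
    'that is <b>abrasion resistant ensures good traction.</b></div>'
)


def strip_tags(markup):
    """Remove anything that looks like a tag, then tidy the whitespace."""
    # Same pattern as in the answer: drop every <...> run.
    text = re.sub(r'<[^>]*>', '', markup)
    # Decode entities such as &nbsp; or &amp; into real characters.
    text = unescape(text)
    # Collapse runs of whitespace (including \r\n and non-breaking spaces).
    return re.sub(r'\s+', ' ', text).strip()


description = strip_tags(raw_html)
print(description)
```

A regex like this is fine for simple markup, but it can misfire on comments, script blocks, or attribute values containing `>`. An alternative worth trying is to select all descendant text nodes in XPath, for example //div[@class="overview"]/div//text(), and join the extracted pieces.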


Source: https://stackoverflow.com/questions/17406992/how-to-scrap-text-included-between-various-tags-using-scrapy
