Get xpath() to return empty values

Submitted by 主宰稳场 on 2020-06-24 07:47:47

Question


I have a situation where I have a lot of <b> tags:

<b>12</b>
<b>13</b>
<b>14</b>
<b></b>
<b>121</b>

As you can see, the second-to-last tag is empty. When I call:

sel.xpath('b/text()').extract()

I get:

['12', '13', '14', '121']

I would like to have:

['12', '13', '14', '', '121']

Is there a way to get the empty value?


My current workaround is to call:

sel.xpath('b').extract()

and then parse each HTML tag myself (the empty tags are included in that output, which is what I want).


Answer 1:


This is a case where it is fine to strip the tags manually and keep the text. You can use the remove_tags() function provided by w3lib:

>>> from w3lib.html import remove_tags
>>> [remove_tags(b) for b in sel.xpath('//b').extract()]
['12', '13', '14', '', '121']

Note that w3lib is a Scrapy dependency and is used internally. No need to install it separately.
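
The reason 'b/text()' skips the empty tag is that text() selects text nodes, and an empty <b></b> element has no text node to select. Visiting each element and asking for its text individually keeps one result per tag. The following is a minimal sketch of that idea using only the standard library's xml.etree.ElementTree rather than Scrapy, so it runs without Scrapy installed; the wrapping <div> is an assumption added to make the fragment well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper <div> so the five <b> tags form one well-formed document.
doc = ET.fromstring("<div><b>12</b><b>13</b><b>14</b><b></b><b>121</b></div>")

# One result per <b> element: join all text inside it, which is '' when empty.
values = ["".join(b.itertext()) for b in doc.findall("b")]
print(values)
```

In Scrapy the analogous approach would presumably be to iterate over sel.xpath('b') and evaluate 'string()' on each selector, which should likewise yield an empty string for the empty tag.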

It would be even better to use Scrapy's input and output processors here. Keep using sel.xpath('b') and define an input processor. For example, you can attach one to a specific Field of the Item class:

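# Current Scrapy releases import MapCompose from itemloaders.processors;
# older releases used scrapy.contrib.loader.processor.
from itemloaders.processors import MapCompose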
from scrapy.item import Item, Field
from w3lib.html import remove_tags

class MyItem(Item):
    my_field = Field(input_processor=MapCompose(remove_tags)) 
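
To make the effect of such an input processor concrete, here is a simplified pure-Python sketch of what MapCompose(remove_tags) does to the extracted values. Both strip_tags and map_compose are hypothetical stand-ins written for illustration, not the w3lib or Scrapy implementations; the real MapCompose also drops None results and flattens iterables, which this sketch omits.

```python
import re


def strip_tags(html):
    """Hypothetical stand-in for w3lib's remove_tags: delete anything tag-shaped."""
    return re.sub(r"<[^>]+>", "", html)


def map_compose(*functions):
    """Simplified imitation of MapCompose: run each function over every value in turn."""
    def process(values):
        for func in functions:
            values = [func(v) for v in values]
        return values
    return process


# Strings shaped like the output of sel.xpath('b').extract() for the question's markup.
raw = ["<b>12</b>", "<b>13</b>", "<b>14</b>", "<b></b>", "<b>121</b>"]

# Strip the tags, then trim surrounding whitespace; the empty tag stays as ''.
clean = map_compose(strip_tags, str.strip)(raw)
print(clean)
```

Because the processor works on whole elements rather than on text nodes, the empty tag is preserved as an empty string in its original position.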


Source: https://stackoverflow.com/questions/24459820/get-xpath-to-return-empty-values
