How to extract a list of label/value pairs with Scrapy when HTML tags are missing

Submitted by 南笙酒味 on 2020-01-15 20:13:53

Question


I am currently processing a document with

<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
....

I can't figure out a clean XPath approach with Scrapy. Here is my best implementation so far:

hxs = HtmlXPathSelector(response)

section = hxs.select(..............)
values = section.select("text()[preceding-sibling::b/text()]")
labels = section.select("text()/preceding-sibling::b/text()")

However, I am not comfortable matching the nodes of the two lists by index. I would rather iterate through one list (values or labels) and query the matching node with a relative XPath, such as:

values = section.select("text()[preceding-sibling::b/text()]")
for value in values:
    value.select("/preceding-sibling::b/text()")

I have been tweaking this expression, but it always returns no matches.

UPDATE

I am looking for a robust method that will tolerate "noise", e.g.:

garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div>

Answer 1:


Edit: sorry, I used lxml here, but it will work the same with Scrapy's own selectors.

For the specific HTML you have given this will work:

>>> s = """<b> label1 </b>
... value1 <br>
... <b> label2 </b>
... value2 <br>
... """
>>> 
>>> import lxml.html
>>> lxml.html.fromstring(s)
<Element span at 0x10fdcadd0>
>>> soup = lxml.html.fromstring(s)
>>> soup.xpath("//text()")
[' label1 ', '\nvalue1 ', ' label2 ', '\nvalue2 ']
>>> res = soup.xpath("//text()")
>>> for i in xrange(0, len(res), 2):
...     print res[i:i+2]
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
>>> 

Edit 2: for the updated HTML with noise (here "etree" is assumed to hold the tree parsed from that HTML):

>>> bs = etree.xpath("//text()[preceding-sibling::b/text()]")
>>> for b in bs:
...     if b.getparent().tag == "b":
...         print [b.getparent().text, b]
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
[' label3 ', '\nvalue3 ']
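The loop above works because of how lxml models text: the text that follows an element is stored as that element's "tail", and calling getparent() on a text result returns the element that owns it. A value is therefore kept only when it is the tail of a <b>. Below is a minimal Python 3 sketch of the same idea. The wrapping <div> and the names noisy, doc, and pairs are assumptions made for illustration, not part of the original answer.

```python
import lxml.html

# The noisy sample from the question, wrapped in a <div> (an assumption)
# so that the fragment has a single root element.
noisy = """<div>garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div></div>"""

doc = lxml.html.fromstring(noisy)

pairs = []
for text in doc.xpath("//text()[preceding-sibling::b]"):
    owner = text.getparent()  # element whose text or tail this string is
    # Keep only text that directly trails a <b>; text trailing a <br>
    # (the garbage and the bare newlines) is skipped.
    if text.is_tail and owner.tag == "b":
        pairs.append([owner.text, str(text)])

for pair in pairs:
    print(pair)
```

The is_tail check is an extra guard that the original loop does not have; with this sample the result is the same either way.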

Also, for what it's worth: when you loop over selected elements, use a relative XPath such as "./foo" inside the for loop, not "/foo", which is evaluated from the document root.
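As a sketch of that relative-XPath advice applied to the question, one option is to iterate over the <b> elements rather than the text nodes and ask each one for the first text node that follows it. This uses lxml, as in the rest of the answer. The wrapping <div> and the names root and pairs are illustrative assumptions, and the same expression has not been verified here against Scrapy's selectors.

```python
import lxml.html

# Noisy sample from the question, wrapped in a <div> (an assumption)
# so that the fragment has a single root element.
html = """<div>garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div></div>"""

root = lxml.html.fromstring(html)

pairs = []
for b in root.xpath(".//b"):
    # Relative to this <b>: the first sibling text node after it.
    following = b.xpath("following-sibling::text()[1]")
    label = (b.text or "").strip()
    value = following[0].strip() if following else ""
    pairs.append((label, value))

print(pairs)
```

One caveat: if a <b> is immediately followed by another tag with no text in between, following-sibling::text()[1] picks up a later, unrelated text node, so this sketch assumes every label is directly followed by its value.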



Source: https://stackoverflow.com/questions/16745209/how-to-extract-a-list-of-label-value-with-scrapy-when-html-tag-are-missing
