lxml.html extract a string by searching for a keyword

Posted by 坚强是说给别人听的谎言 on 2020-01-06 19:43:22

Question


I have a fragment of HTML like the one below:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to get the string "The Keyword: The text".

I know I can get the XPath of this HTML with Chrome's inspector or Firebug in Firefox, call select(xpath).extract(), and then strip the HTML tags to get the string. However, that approach is not generic enough, because the XPath is not consistent across pages.

So I am considering the following approach. First, search for "The Keyword:" (the code uses Scrapy's HtmlXPathSelector, as I am not sure how to do the same in lxml.html):

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get a match:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

My question is how to get the wanted string, "The Keyword: The text". I am thinking about how to determine the XPath; once the XPath is known, I can of course get the string.

I am open to any solution other than lxml.html.

Thanks.


Answer 1:


from lxml import html

s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'

# Parse the fragment, then collect all text beneath the root element
tree = html.fromstring(s)
text = tree.text_content()
print(text)
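
On a full page, calling text_content() on the root would return the text of the whole document, and it concatenates text nodes without a separator. A minimal sketch of one way around both points, assuming the keyword sits in an element whose parent also holds the value; the page string and the helper name extract_by_keyword are hypothetical, not part of the original answer:

```python
from lxml import html

# Hypothetical page: the fragment from the question embedded in more markup.
page = (
    '<html><body><ul>'
    '<li><label>Other:</label><span>ignored</span></li>'
    '<li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li>'
    '</ul></body></html>'
)


def extract_by_keyword(tree, keyword):
    """Return the text of the parent of the first element matching keyword."""
    # $kw is an XPath variable, so the keyword needs no manual quoting.
    matches = tree.xpath('//*[contains(text(), $kw)]', kw=keyword)
    if not matches:
        return None
    parent = matches[0].getparent()
    if parent is None:
        # The match is the root element; fall back to the match itself.
        parent = matches[0]
    # itertext() yields each text node; joining with a space keeps the
    # label and the value apart.
    parts = (t.strip() for t in parent.itertext())
    return ' '.join(p for p in parts if p)


tree = html.fromstring(page)
result = extract_by_keyword(tree, 'The Keyword:')
print(result)
```

Because the lookup starts from the keyword rather than from a fixed path, it should not depend on where the li sits in the page, only on the label and the value sharing a parent.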



Answer 2:


You can modify the XPath slightly to work with your current structure: get the parent of the label, then look beneath it for the a element and take its text.

>>> tree.xpath('//*[contains(text(), "The Keyword:")]/..//a/text()')
['The text']

But that may not be flexible enough...
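
If the value is not always wrapped in an a element, one possible variation is to take whatever element directly follows the matched one. This is only a sketch under that assumption (the value is the next sibling of the element containing the keyword); the variable names are illustrative:

```python
from lxml import html

s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'
tree = html.fromstring(s)

keyword = 'The Keyword:'

# Find the element whose own text contains the keyword.
label = tree.xpath('//*[contains(text(), $kw)]', kw=keyword)[0]

# string(...) returns the concatenated text of the first following sibling,
# regardless of its tag or how deeply the text is nested inside it.
value = label.xpath('string(following-sibling::*[1])').strip()

combined = '{} {}'.format(label.text.strip(), value)
print(combined)
```

This avoids naming the span or a tags, but it still assumes the value immediately follows the label, so it may need adjusting for pages laid out differently.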



Source: https://stackoverflow.com/questions/14004623/lxml-html-extract-a-string-by-searching-for-a-keyword
