XPath taking text with hyperlinks (Python)

Submitted by 自闭症网瘾萝莉.ら on 2019-12-14 03:03:44

Question


I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page.

Take, for instance, the Python page (https://en.wikipedia.org/wiki/Python_(programming_language)).

If I load it into a variable:

import requests
from lxml import html

page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)

Then I know the desired paragraph is at the XPath /html/body/div[3]/div[3]/div[4]/div/p[1]

So I read its text into a variable:

first = tree.xpath("/html/body/div[3]/div[3]/div[4]/div/p[1]/text()")

This results in the following output:

[' is an ', ' ', ' for ', '. Created by ', ' and first released in 1991, Python has a design philosophy that emphasizes ', ', notably using ', '. It provides constructs that enable clear programming on both small and large scales.', '\n']

As you can see, I'm missing the words and sentences that are inside hyperlinks.


Answer 1:


The links are themselves nodes, so you need to descend into them:

/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
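
To make the difference concrete, here is a minimal sketch that runs the two forms against a small inline snippet instead of the live page. The snippet, its href value, and the variable names are made up for illustration; they are not the real Wikipedia markup.

```python
from lxml import html

# Hypothetical stand-in for the Wikipedia paragraph (not the real page markup).
snippet = (
    '<div><p><b>Python</b> is an '
    '<a href="/wiki/x">interpreted</a> language.</p></div>'
)
root = html.fromstring(snippet)

# /text() selects only the text nodes that are direct children of <p>.
direct = root.xpath('.//p[1]/text()')

# //text() also selects the text nodes nested inside <b> and <a>.
nested = root.xpath('.//p[1]//text()')

# Joining the pieces gives the paragraph as one string.
full_text = ''.join(nested)

print(direct)     # [' is an ', ' language.']
print(full_text)  # Python is an interpreted language.
```

The relative form (.//p[1]) is used here only because the snippet has no surrounding page structure; the absolute path from the question works the same way.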



Answer 2:


Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives in other nodes (children of the link elements) and is therefore excluded.

  1. To descend, use //text() as suggested; this retrieves the text of every descendant node, starting from the node in question.
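
One way to see where each piece of text lives is to walk the element tree directly. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs; lxml follows the same text/tail model. The snippet is a made-up stand-in, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the paragraph (not the real Wikipedia markup).
p = ET.fromstring(
    '<p><b>Python</b> is an <a href="/wiki/x">interpreted</a> language.</p>'
)

# Text before the first child element belongs to <p> itself (None here,
# because the paragraph starts with <b>).
print(repr(p.text))

# Each child holds its own text, plus the "tail" text that follows it.
# The tails are the direct text children of <p>; the link text is not.
for child in p:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() walks the whole subtree, much like //text() does in XPath.
pieces = list(p.itertext())
print(''.join(pieces))
```

The tails printed for b and a are the strings that a plain /text() query returns, while the text values inside b and a are the ones it skips.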

    /html/body/div[3]/div[3]/div[4]/div/p[1]//text()
    
  2. Alternatively, select the node itself and call lxml's text_content() method on it, which returns the element's text including the text of all its child nodes.

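from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
tree = html.fromstring(page.content)
# xpath() returns a list of matching elements; take the first one.
firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
# text_content() includes the text inside nested elements such as links.
print(firstp[0].text_content())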
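
As a rough offline sketch of the same idea, the block below applies text_content() to an inline snippet and guards against an empty match, since a long absolute path may stop matching if the page layout changes. The snippet and variable names are assumptions for illustration. The XPath string() function, which is not mentioned in the answers above, is shown as one more way to get the same string.

```python
from lxml import html

# Hypothetical stand-in for the page (not the real Wikipedia markup).
snippet = (
    '<div><p><b>Python</b> is an '
    '<a href="/wiki/x">interpreted</a>\n language.</p></div>'
)
root = html.fromstring(snippet)

# xpath() returns a list, which is empty when nothing matches.
matches = root.xpath('.//p[1]')
if matches:
    raw = matches[0].text_content()
    # Collapse runs of whitespace (including newlines) into single spaces.
    clean = ' '.join(raw.split())
else:
    raw = clean = ''

# string() returns the concatenated text of the first matched node.
via_xpath = root.xpath('string(.//p[1])')

print(clean)
```

Both raw and via_xpath hold the unnormalized paragraph text; clean is the version with whitespace tidied up.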


Source: https://stackoverflow.com/questions/51354279/xpath-taking-text-with-hyperlinks-python
