XPath taking text with hyperlinks (Python)

Question

I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page.

Take, for instance, the Python page (https://en.wikipedia.org/wiki/Python_(programming_language)).

If I load it into a variable:

import requests
from lxml import html

page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)

Then I know the desired paragraph is at the XPath /html/body/div[3]/div[3]/div[4]/div/p[1]

So I read that text into a variable:

first = tree.xpath("/html/body/div[3]/div[3]/div[4]/div/p[1]/text()")

This results in the following output:

[' is an ', ' ', ' for ', '. Created by ', ' and first released in 1991, Python has a design philosophy that emphasizes ', ', notably using ', '. It provides constructs that enable clear programming on both small and large scales.', '\n']

As you can see, I'm missing the words and phrases that are inside links.

Answer 1:

The links are themselves nodes that you need to descend into:

/html/body/div[3]/div[3]/div[4]/div/p[1]//text()

Answer 2:

Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives in other nodes (the link elements) and is therefore excluded.

To descend, use //text() as suggested; this retrieves the text of every descendant node, starting from the node in question.
```
/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
```
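To see the difference without hitting the network, here is a minimal sketch on a made-up HTML fragment. The `snippet` string and the variable names are only illustrative assumptions, not the real Wikipedia markup:

```python
from lxml import html

# Hypothetical stand-in for the Wikipedia paragraph (not the real page markup).
snippet = (
    '<div><p><b>Python</b> is an '
    '<a href="/wiki/Interpreted_language">interpreted</a> language.</p></div>'
)
tree = html.fromstring(snippet)

# /text() returns only the text nodes that are direct children of <p>.
direct = tree.xpath(".//p[1]/text()")

# //text() also descends into <b>, <a>, and any other nested element.
everything = tree.xpath(".//p[1]//text()")

print(direct)               # [' is an ', ' language.']
print("".join(everything))  # Python is an interpreted language.
```

Joining the list with an empty string puts the pieces back together in document order.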
Alternatively, you can select the node itself and call lxml's text_content() method, which returns its text including the text of all descendant nodes.

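from lxml import html
import requests

# Download the page and parse it into an element tree
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
tree = html.fromstring(page.content)

# xpath() returns a list of matching elements, so take the first one
firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
print(firstp[0].text_content())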
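
As a rough, offline illustration of the same idea, the sketch below runs text_content() on a made-up fragment, guards against an empty match list, and collapses whitespace. The `snippet` string and variable names are hypothetical; the XPath string() function is shown as an equivalent way to get the same value in one expression:

```python
from lxml import html

# Hypothetical fragment standing in for the downloaded page.
snippet = (
    '<div><p><b>Python</b> is an '
    '<a href="#">interpreted</a>\n language.</p></div>'
)
root = html.fromstring(snippet)

# xpath() returns a list; it is empty if the path matches nothing.
matches = root.xpath(".//p[1]")
first_text = matches[0].text_content() if matches else ""

# XPath's string() gives the string value of the first matched node.
via_xpath = root.xpath("string(.//p[1])")

# Collapse newlines and runs of spaces into single spaces.
clean = " ".join(first_text.split())
print(clean)  # Python is an interpreted language.
```

Checking for an empty list matters here because a position-based path such as the one above can stop matching if the page layout changes.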

Source: https://stackoverflow.com/questions/51354279/xpath-taking-text-with-hyperlinks-python

Tags

python

html

xpath

lxml