Question
I'm new to XPath (and a relative beginner at Python in general). I'm trying to use it to extract the text of the first paragraph of a Wikipedia page.
Take, for instance, the Python page (https://en.wikipedia.org/wiki/Python_(programming_language)).
If I load it into a variable
import requests
from lxml import html
page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)
Then I know the desired paragraph is at the XPath /html/body/div[3]/div[3]/div[4]/div/p[1]
So I take that text into a variable
first = tree.xpath("/html/body/div[3]/div[3]/div[4]/div/p[1]/text()")
Resulting in this output:
[' is an ', ' ', ' for ', '. Created by ', ' and first released in 1991, Python has a design philosophy that emphasizes ', ', notably using ', '. It provides constructs that enable clear programming on both small and large scales.', '\n']
As you can see, I'm missing the words and phrases that are inside hyperlinks.
Answer 1:
The links are themselves nodes, and you need to descend into them:
/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
Answer 2:
Your XPath query matches only the text nodes that are direct children of that node. The text of the embedded links lives in other nodes (the link elements) and is therefore excluded.
To descend use
//text()
as suggested; this retrieves the text of every descendant node, starting from the node in question:
/html/body/div[3]/div[3]/div[4]/div/p[1]//text()
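To make the difference concrete without depending on the live Wikipedia page, here is a minimal sketch on a small inline snippet. The snippet and the variable names are made up for illustration, and the short path //p[1] stands in for the long absolute path above.

```python
from lxml import html

# Hypothetical stand-in for the Wikipedia paragraph: some text sits
# directly in <p>, some sits inside child elements (<b>, <a>).
snippet = '<div><p><b>Python</b> is an <a href="#">interpreted</a> language.</p></div>'
tree = html.fromstring(snippet)

# /text() returns only the text nodes that are direct children of <p>.
direct = tree.xpath("//p[1]/text()")

# //text() returns the text nodes of <p> and of all its descendants.
all_text = tree.xpath("//p[1]//text()")

print(direct)             # the link and bold text are missing
print("".join(all_text))  # the full sentence
```

Joining the list returned by //text() gives the paragraph as a single string.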
Alternatively, you can select the node in question itself and call the lxml element method
text_content()
to retrieve its text, including the text of all child nodes.
from lxml import html
import requests
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
tree = html.fromstring(page.content)
firstp = tree.xpath('/html/body/div[3]/div[3]/div[4]/div/p[1]')
print(firstp[0].text_content())
Source: https://stackoverflow.com/questions/51354279/xpath-taking-text-with-hyperlinks-python
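As a quick check that the two approaches agree, the following sketch compares text_content() with joining the //text() results, again on a made-up inline snippet rather than the real page. It also shows the XPath string() function as a third option, which is not mentioned in the answers above.

```python
from lxml import html

# Hypothetical stand-in for the Wikipedia paragraph.
snippet = '<div><p><b>Python</b> is an <a href="#">interpreted</a> language.</p></div>'
tree = html.fromstring(snippet)
first_p = tree.xpath("//p[1]")[0]

# Element method: concatenated text of the element and its descendants.
via_method = first_p.text_content()

# XPath relative to the element: collect descendant text nodes, then join.
via_join = "".join(first_p.xpath(".//text()"))

# XPath string() function: string value of the first matched node.
via_string = tree.xpath("string(//p[1])")

print(via_method)
print(via_method == via_join == via_string)
```

text_content() and string() both return a single string, whereas //text() returns a list whose pieces have to be joined.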