lxml requests on repl.it

Question

I'm trying to use lxml and requests on Repl.it, and I don't understand why it isn't working. The program keeps running until the maximum number of retries is reached, and then I get this error:

Traceback (most recent call last):
  File "python", line 6, in <module>
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.presidency.ucsb.edu', port=80): Max retries exceeded with url: /ws/index.php?pid=29400.html (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))

My code is quite straightforward:

from lxml import html
import requests

url = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29400.html'

r = requests.get(url)
tree = html.fromstring(r.content)

text = tree.xpath('//span[@class="displaytext"]/text()')

print(text)

How can I get this to run? I'm trying to get the content of that page that sits inside the span with class "displaytext". I've been using this Python guide for reference.

Python version 3.5

Answer 1:

I'm an engineer at Repl.it and this is a limitation with our platform. We don't currently allow outgoing network requests.
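
On a platform that blocks outgoing requests, the call cannot succeed, but the script can at least fail fast with a readable message instead of sitting through the retries. The following is only a minimal sketch: `fetch_html` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice, not something from the question.

```python
import requests


def fetch_html(url, timeout=5):
    """Return the response body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # ConnectionError (the DNS failure shown in the question) is a
        # subclass of RequestException, so it is caught here as well.
        print("Request failed:", exc)
        return None
    return response.content
```

A caller would then check for `None` before handing the result to `html.fromstring`.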

Answer 2:

Switching to an answer, since it lets me lay things out better.

Have a look at the HTML of the website you are targeting. With this command you are selecting only one specific tag:

text = tree.xpath('//span[@class="displaytext"]/text()')

This points to one specific span with class "displaytext", and text() returns only the text nodes directly inside it.

You could change your code to:

text = tree.xpath('//span[@class="displaytext"]/..')
# text[0] is the span's parent; iterating over it yields its child elements
for element in text[0]:
    print(element)

This selects the span with class "displaytext" and then the parent of that span. Inside the for loop you print all children of that parent.

This also shows the real problem: the paragraph elements are not in that list. Sorry, I don't know the answer to that.
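
One possible reason for text going missing is that `/text()` only returns text nodes that are direct children of the span, while text inside nested elements is skipped. The sketch below illustrates that difference on made-up markup; `SAMPLE` is hypothetical and not copied from the real page, so whether this matches the actual site structure is an assumption.

```python
from lxml import html

# Hypothetical markup (not the real page): the text is split between the
# span itself and an element nested inside it.
SAMPLE = (
    '<div><span class="displaytext">First part. '
    '<i>Second part.</i> Third part.</span></div>'
)

tree = html.fromstring(SAMPLE)

# /text() returns only text nodes that are direct children of the span.
direct = tree.xpath('//span[@class="displaytext"]/text()')

# //text() returns text nodes at any depth below the span, in document order.
nested = tree.xpath('//span[@class="displaytext"]//text()')

# text_content() joins all descendant text into a single string.
span = tree.xpath('//span[@class="displaytext"]')[0]
full_text = span.text_content()

print(direct)
print(nested)
print(full_text)
```

If the nested text is what is missing, `//text()` or `text_content()` would be the calls to try on the real response.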

Source: https://stackoverflow.com/questions/41162897/lxml-requests-on-repl-it

Tags

python

web-scraping

lxml