lxml requests on repl.it

大兔子大兔子 submitted on 2019-12-12 01:57:11

Question


I'm trying to use lxml with requests on Repl.it, and I don't understand why it isn't working. The program keeps running until it hits the maximum number of retries, and then I get this error:

Traceback (most recent call last):
  File "python", line 6, in <module>
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.presidency.ucsb.edu', port=80): Max retries exceeded with url: /ws/index.php?pid=29400.html (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))

My code is quite straightforward:

from lxml import html
import requests

url = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29400.html'

r = requests.get(url)
tree = html.fromstring(r.content)

text = tree.xpath('//span[@class="displaytext"]/text()')

print(text)

How can I get this to run? I'm trying to get the content of that page, which sits inside the span with class "displaytext". I've been using this Python guide for reference.

Python version 3.5


Answer 1:


I'm an engineer at Repl.it and this is a limitation with our platform. We don't currently allow outgoing network requests.
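
The "[Errno -2] Name or service not known" part of the traceback is a DNS lookup failure, which is consistent with outgoing network access being blocked. As a rough sketch (the helper name can_resolve is made up for illustration and is not part of requests or lxml), a hostname can be checked with the standard library alone before any HTTP request is attempted:

```python
import socket


def can_resolve(hostname, port=80):
    """Return True if hostname resolves to at least one address.

    Hypothetical helper for illustration only.
    """
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # socket.gaierror covers lookup failures such as
        # "[Errno -2] Name or service not known".
        return False
```

If can_resolve('www.presidency.ucsb.edu') returns False in a given environment, requests.get will not be able to connect there either, whatever the parsing code looks like.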




Answer 2:


Switching to an answer, since it lets me lay things out better.

Have a look at the HTML of the website you are targeting. With this command you are selecting only one specific tag:

text = tree.xpath('//span[@class="displaytext"]/text()')

It points to a specific span with class "displaytext".

You could change your code to:

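# Select the parent of the span with class "displaytext"
text = tree.xpath('//span[@class="displaytext"]/..')
# Iterating over an element yields its child elements
for element in text[0]:
    print(element)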

This selects the span with class "displaytext" and then the parent of that span. The for loop prints all children of that parent.

This also reveals the real problem: the paragraph elements are not in that list. Sorry, I don't know the answer to that.
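
One thing that may be worth checking, as a guess rather than a diagnosis of the real page: in XPath, /text() returns only the text nodes that are direct children of the selected element, while //text() also returns text inside nested elements. The sketch below uses a made-up HTML snippet (SAMPLE is hypothetical, not the actual page markup) so it runs without any network access:

```python
from lxml import html

# Hypothetical markup for illustration; the real page may be structured differently.
SAMPLE = (
    '<html><body><div>'
    '<span class="displaytext">Hello <b>bold</b> world</span>'
    '</div></body></html>'
)

tree = html.fromstring(SAMPLE)

# Only text nodes that are direct children of the span
direct = tree.xpath('//span[@class="displaytext"]/text()')

# Text nodes at any depth below the span
all_text = tree.xpath('//span[@class="displaytext"]//text()')

# text_content() joins all descendant text into one string
span = tree.xpath('//span[@class="displaytext"]')[0]
joined = span.text_content()

print(direct)    # ['Hello ', ' world']
print(all_text)  # ['Hello ', 'bold', ' world']
print(joined)    # Hello bold world
```

If the text of interest sits inside nested elements, //text() or text_content() may pick it up where /text() does not.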



Source: https://stackoverflow.com/questions/41162897/lxml-requests-on-repl-it
