HTTPError: Not Found in urllib2 and BeautifulSoup?

Posted by 巧了我就是萌 on 2019-12-12 03:58:39

Question


from lxml import html
import requests

# Initial attempt to scrape HTML from the link using requests and lxml

obama_4427 = requests.get('http://millercenter.org/president/obama/speech-4427')
obama_4427_tree = html.fromstring(obama_4427.text)

# The speech text is stored in the <p> elements inside the <div> with
# id "transcript", i.e. at the XPath '//*[@id="transcript"]/p'

obama_4427_text = obama_4427_tree.xpath('//div[@id="transcript"]/p')
print(obama_4427_text)

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString

obama_4427_url = 'http://millercenter.org/president/obama/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()

# Second attempt, using User-Agent

import httplib
httplib.HTTPConnection.debuglevel = 1

import urllib2
request = urllib2.Request(obama_4427_url)
opener = urllib2.build_opener()
feeddata = opener.open(request).read()

I end up getting the following error code:

HTTPError: Not Found

In the second attempt, I tried to identify myself with a User-Agent so that I could scrape this speech, but was unsuccessful. What am I missing here?

I'm running Python 2.7 in Anaconda Spyder, by the way.
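
For reference, the second attempt above calls urllib2.Request(obama_4427_url) without a headers argument, so no custom User-Agent is set on the request and only the library's default is sent. Below is a minimal sketch of attaching one explicitly and reporting the HTTP status on failure. It assumes Python 3, where urllib.request and urllib.error replace urllib2. The helper names build_request and fetch and the User-Agent string are hypothetical, chosen only for illustration. Whether a different User-Agent changes the server's response is not something the question establishes.

```python
# Sketch only: Python 3's urllib.request stands in for urllib2 (an assumption,
# since the question targets Python 2.7). build_request and fetch are
# hypothetical helper names, and the User-Agent string is an arbitrary example.
import urllib.error
import urllib.request

EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/0.1)"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Return the response body as bytes, or None on an HTTP error status."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (for example 404); err.reason is the text.
        print("HTTP error:", err.code, err.reason)
        return None
```

Calling fetch('http://millercenter.org/president/obama/speech-4427') would send the request with the explicit header. If the server still answers 404 Not Found, the cause may be the URL itself (for example, a page that has moved) rather than the headers.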

Source: https://stackoverflow.com/questions/32593031/httperror-not-found-in-urllib2-and-beautifulsoup
