Question
from lxml import html
import requests
# Initial attempt to scrape the HTML from the link, using requests and lxml
obama_4427 = requests.get('http://millercenter.org/president/obama/speech-4427')
obama_4427_tree = html.fromstring(obama_4427.text)
# The speech text is stored in <p> elements inside the <div> with
# id "transcript", i.e. at the XPath '//*[@id="transcript"]/p'
obama_4427_text = obama_4427_tree.xpath('//div[@id="transcript"]/p')
print(obama_4427_text)
import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://millercenter.org/president/obama/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# Second attempt, using User-Agent
import httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request(obama_4427_url)
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
I end up getting the following error code:
HTTPError: Not Found
In the second attempt, I tried to identify myself with a User-Agent so that I could access and scrape this speech, but it did not work. What am I missing here?
I'm running Python 2.7 in Anaconda Spyder, by the way.
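For reference, the second attempt builds a plain `urllib2.Request` and never sets a `User-Agent` header, so the server still sees the library's default identification. Below is a minimal sketch of attaching the header and checking the HTTP status. It is written for Python 3 (`urllib.request`), which is an assumption since the question uses Python 2.7; there the corresponding calls would be `urllib2.Request(url, headers=...)` and `urllib2.urlopen`. The `USER_AGENT` string and the helper names `build_request` and `fetch` are hypothetical, and nothing here confirms that a missing header is the actual cause of the 404.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; speech-scraper-example)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url and return (status, body).

    When the server answers with an HTTP error, return (error_code, None)
    so the caller can tell a 404 (page not found) from, say, a 403.
    Network-level failures (urllib.error.URLError) are not caught here.
    """
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, None
```

Calling `fetch(obama_4427_url)` would return the status code together with the page body, or `None` for the body when the server answers with an error. If the status is still 404 with the header set, the URL itself may simply no longer be valid.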
Source: https://stackoverflow.com/questions/32593031/httperror-not-found-in-urllib2-and-beautifulsoup