Question
I'm trying to scrape an excel file from a government "muster roll" database. However, the URL I have to access this excel file:
http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal
requires that I have a session cookie from the government site attached to the request.
How can I grab the session cookie with an initial request to the landing page (which is where the cookie is issued) and then use it to hit the URL above and fetch the Excel file? I'm on Google App Engine using Python.
I tried this:
import urllib2
import cookielib
url = 'http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal'
def grab_data_with_cookie(cookie_jar, url):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    data = opener.open(url)
    return data
cj = cookielib.CookieJar()
#grab the data
data1 = grab_data_with_cookie(cj, url)
#the second time we do this, we get back the excel sheet.
data2 = grab_data_with_cookie(cj, url)
stuff2 = data2.read()
I'm pretty sure this isn't the best way to do this. How could I do this more cleanly, or even using the requests library?
Answer 1:
Using requests, this is a trivial task:
>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print r.cookies
{'requests-is': 'awesome'}
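For the asker's actual two-request pattern, a `requests.Session` stores and replays cookies automatically. Here is a minimal, self-contained sketch (Python 3): the local `CookieGate` server is purely hypothetical, standing in for the government site, which hands out a session cookie on the first hit and only serves the file once that cookie comes back.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class CookieGate(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the remote site: sets a session cookie on
    the first request, and serves the 'report' only when it is sent back."""
    def do_GET(self):
        if "session=abc123" in self.headers.get("Cookie", ""):
            body = b"EXCEL-DATA"
            self.send_response(200)
        else:
            body = b"no cookie yet"
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CookieGate)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

s = requests.Session()   # the Session object keeps a cookie jar for us
first = s.get(url)       # first request: the server issues the cookie
second = s.get(url)      # second request: the cookie is replayed automatically
server.shutdown()
```

Against the real site you would replace the two `s.get(url)` calls with one request to the landing page and one to the Excel URL, sharing the same `Session`.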
Answer 2:
Using cookies and urllib2:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# use opener to open different urls
You can use the same opener for several connections:
data = [opener.open(url).read() for url in urls]
Or install it globally:
urllib2.install_opener(opener)
In the latter case the rest of the code looks the same with or without cookies support:
data = [urllib2.urlopen(url).read() for url in urls]
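Note that `urllib2` and `cookielib` exist only in Python 2. A sketch of the same opener pattern under Python 3, where the modules were renamed to `urllib.request` and `http.cookiejar`:

```python
import http.cookiejar
import urllib.request

# a jar that the opener will fill with any Set-Cookie headers it receives
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# reuse this opener for every request so cookies persist:
#   data = [opener.open(url).read() for url in urls]

# or install it globally so plain urllib.request.urlopen() uses the jar:
urllib.request.install_opener(opener)
```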
Source: https://stackoverflow.com/questions/9754807/scrape-a-web-page-that-requires-they-give-you-a-session-cookie-first