Python: urllib/urllib2/httplib confusion

I'm trying to test the functionality of a web app by scripting a login sequence in Python, but I'm having some troubles.

Here's what I need to do:

Do a POST with a few parameters and headers.
Follow a redirect
Retrieve the HTML body.

Now, I'm relatively new to python, but the two things I've tested so far haven't worked. First I used httplib, with putrequest() (passing the parameters within the URL), and putheader(). This didn't seem to follow the redirects.

Then I tried urllib and urllib2, passing both headers and parameters as dicts. This seems to return the login page, instead of the page I'm trying to login to, I guess it's because of lack of cookies or something.

Am I missing something simple?

Thanks.

Focus on urllib2 for this, it works quite well. Don't mess with httplib, it's not the top-level API.

What you're noting is that urllib2 doesn't follow the redirect.

You need to fold in an instance of HTTPRedirectHandler that will catch and follow the redirects.

Further, you may want to subclass the default HTTPRedirectHandler to capture information that you'll then check as part of your unit testing.

cookie_handler= urllib2.HTTPCookieProcessor( self.cookies )
redirect_handler= HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)

You can then use this opener object to POST and GET, handling redirects and cookies properly.

You may want to add your own subclass of HTTPHandler to capture and log various error codes, also.

Here's my take on this issue.

#!/usr/bin/env python

import urllib
import urllib2


class HttpBot:
    """an HttpBot represents one browser session, with cookies."""
    def __init__(self):
        cookie_handler= urllib2.HTTPCookieProcessor()
        redirect_handler= urllib2.HTTPRedirectHandler()
        self._opener = urllib2.build_opener(redirect_handler, cookie_handler)

    def GET(self, url):
        return self._opener.open(url).read()

    def POST(self, url, parameters):
        return self._opener.open(url, urllib.urlencode(parameters)).read()


if __name__ == "__main__":
    bot = HttpBot()
    ignored_html = bot.POST('https://example.com/authenticator', {'passwd':'foo'})
    print bot.GET('https://example.com/interesting/content')
    ignored_html = bot.POST('https://example.com/deauthenticator',{})

@S.Lott, thank you. Your suggestion worked for me, with some modification. Here's how I did it.

data = urllib.urlencode(params)
url = host+page
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)

cookies = CookieJar()
cookies.extract_cookies(response,request)

cookie_handler= urllib2.HTTPCookieProcessor( cookies )
redirect_handler= HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)

response = opener.open(request)

I had to do this exact thing myself recently. I only needed classes from the standard library. Here's an excerpt from my code:

from urllib import urlencode
from urllib2 import urlopen, Request

# encode my POST parameters for the login page
login_qs = urlencode( [("username",USERNAME), ("password",PASSWORD)] )

# extract my session id by loading a page from the site
set_cookie = urlopen(URL_BASE).headers.getheader("Set-Cookie")
sess_id = set_cookie[set_cookie.index("=")+1:set_cookie.index(";")]

# construct headers dictionary using the session id
headers = {"Cookie": "session_id="+sess_id}

# perform login and make sure it worked
if "Announcements:" not in urlopen(Request(URL_BASE+"login",headers=headers), login_qs).read():
    print "Didn't log in properly"
    exit(1)

# here's the function I used after this for loading pages
def download(page=""):
    return urlopen(Request(URL_BASE+page, headers=headers)).read()

# for example:
print download(URL_BASE + "config")

I'd give Mechanize (http://wwwsearch.sourceforge.net/mechanize/) a shot. It may well handle your cookie/headers transparently.

Try twill - a simple language that allows users to browse the Web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and most standard Web features. More to the point, twill is written in Python and has a python API, e.g:

from twill import get_browser
b = get_browser()

b.go("http://www.python.org/")
b.showforms()

Besides the fact that you may be missing a cookie, there might be some field(s) in the form that you are not POSTing to the webserver. The best way would be to capture the actual POST from a web browser. You can use LiveHTTPHeaders or WireShark to snoop the traffic and mimic the same behaviour in your script.

Funkload is a great web app testing tool also. It wraps webunit to handle the browser emulation, then gives you both functional and load testing features on top.

来源：https://stackoverflow.com/questions/301924/python-urllib-urllib2-httplib-confusion

标签

python

http

urllib2