Question
I have written web crawlers in Python before, but this site has resisted my efforts so far. I am scraping it with Python and BeautifulSoup in two steps: generate a list of pages to be indexed, then parse those pages. The parsing part is easy, but I haven't figured out how to navigate the .aspx pages so that I can generate the links in Python. For now I save the search pages manually and scrape those, but I would like to automate the entire process if possible.
The page in question: http://cookcountyassessor.com/Property_Search/Property_Search.aspx
I need to use the form to select a Township, then a Neighborhood and Property Class, which leads through a few .aspx files before reaching the search results. I used BeautifulSoup to get a list of all <input> and <select> tags to submit as form data, modified the field I need, and sent the request, but I do not get the expected results when I open the next page (http://www.cookcountyassessor.com/Property_Search/nbhd_search.aspx?town=19).
Relevant code from the class I am constructing:
    # Imports at the top of the module
    import http.cookiejar
    import urllib.parse
    import urllib.request

    from bs4 import BeautifulSoup

    # Opener that keeps cookies between requests
    self.jar = http.cookiejar.CookieJar()
    self.opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(self.jar))
    self.page = ['http://cookcountyassessor.com/Property_Search/Property_Search.aspx']
    self.page = self.page + ['http://cookcountyassessor.com/Property_Search/nbhd_search.aspx?town=19']  # Lemont

    # Collect every <input> and <select> on the search page as form data
    soup = BeautifulSoup(self.opener.open(self.page[0]))
    inputs = soup.findAll("input") + soup.findAll("select")
    params = {"__EVENTTARGET": "", "__EVENTARGUMENT": "", "__LASTFOCUS": ""}
    for i in inputs:
        try:
            params[i['name']] = i['value']
        except KeyError:
            # Tags without a value attribute (e.g. <select>) are sent empty
            params[i['name']] = ''
    params['ctl00$BodyContent$town1'] = self.code
    self.params = params

    # POST the form back to the search page, then request the next page
    params = urllib.parse.urlencode(params)
    params = params.encode()
    self.opener.open(self.page[0], params)
    self.page1 = BeautifulSoup(self.opener.open(self.page[1]))
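One thing worth checking in the loop above: a `<select>` tag has no `value` attribute of its own, and a browser leaves out submit buttons that were not clicked as well as unchecked checkboxes. The sketch below collects fields roughly the way a browser would. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages, and the sample HTML is made up for illustration: only the `ctl00$BodyContent$town1` field name and town code 19 come from the code above, while `btnGo` and the `__VIEWSTATE` value are hypothetical. Whether this is what the real page needs is an assumption that would have to be checked against the request the browser actually sends.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical stand-in for the search page; the real markup will differ.
SAMPLE_HTML = """
<form method="post" action="Property_Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <select name="ctl00$BodyContent$town1">
    <option value="">Select</option>
    <option value="19" selected>Lemont</option>
  </select>
  <input type="submit" name="ctl00$BodyContent$btnGo" value="Go" />
</form>
"""

# Input types that are not submitted unless the user activates them.
SKIPPED_INPUT_TYPES = {"submit", "button", "image", "reset", "file"}


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs roughly the way a browser would submit them."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select_name = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            name = attrs.get("name")
            kind = (attrs.get("type") or "text").lower()
            if not name or kind in SKIPPED_INPUT_TYPES:
                return
            if kind in ("checkbox", "radio") and "checked" not in attrs:
                return
            self.fields[name] = attrs.get("value") or ""
        elif tag == "select":
            self._select_name = attrs.get("name")
        elif tag == "option" and self._select_name:
            # Simplification: options without a value attribute are sent as "".
            value = attrs.get("value") or ""
            # The first option is the default; a selected option overrides it.
            if "selected" in attrs or self._select_name not in self.fields:
                self.fields[self._select_name] = value

    def handle_endtag(self, tag):
        if tag == "select":
            self._select_name = None


def collect_form_fields(html):
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = collect_form_fields(SAMPLE_HTML)
body = urlencode(fields).encode()  # bytes, ready to pass as POST data
print(fields)
```

With BeautifulSoup the same idea would mean looking at the `<option>` children of each `<select>` rather than reading `i['value']` on the `<select>` itself.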
When I submit the form manually, the .aspx page seems to set a few cookies, then use a header redirect to a different page. Submitting with Python, I have no cookies in the jar, and the page does not seem to be accepting my post data. Am I missing something here, or will this be a royal pain in the neck to get around? I guess I'll start plugging in headers and see if it gets me anywhere...
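A minimal sketch of what "plugging in headers" could look like with `urllib.request`, assuming the server cares about browser-like headers at all (that is a guess, not something confirmed for this site). The helper name `build_post_request` and the User-Agent string are hypothetical; the URL and field name are the ones from the code above.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "http://cookcountyassessor.com/Property_Search/Property_Search.aspx"


def build_post_request(url, fields, referer=None):
    """Build a form POST with a few browser-like headers (hypothetical helper)."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {
        # Placeholder value; a real browser sends a much longer string.
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    if referer:
        headers["Referer"] = referer
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(url, data=data, headers=headers)


req = build_post_request(
    SEARCH_URL,
    {"ctl00$BodyContent$town1": "19"},
    referer=SEARCH_URL,
)
print(req.get_method(), req.header_items())

# Not executed here because it needs network access:
#   response = opener.open(req)       # opener built with HTTPCookieProcessor(jar)
#   print(response.geturl())          # final URL after any redirects
#   for cookie in jar:
#       print(cookie.name, cookie.value)
```

Comparing `response.geturl()` and the contents of the jar with what the browser's network tab shows for a manual submit should indicate whether the redirect and cookies are happening at all.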
Source: https://stackoverflow.com/questions/12591348/data-scraping-aspx