How to scrape aspx pages with python


Question


I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term such as "Andrew", the results are paginated, and because the request type is POST the URL does not change. The sessions also time out very quickly: if I wait ten minutes and refresh the search URL, I get a timeout error.

I only got started with scraping recently, so I have mostly dealt with GET requests where I can decipher the URL. I have realized that here I will have to look at the DOM. Using Chrome DevTools I have found the headers, and from the Network tab I have also identified the following form data that is passed from the search page to the results page:

__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search

All of the fields whose names are in caps are hidden fields. I have also managed to figure out the structure of the results.
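A common way to drive an ASP.NET form like this without a browser is to fetch the page, harvest the hidden state fields (__VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEGENERATOR) and replay them in the POST. The code below is only a minimal sketch of that pattern using requests and BeautifulSoup, not the original post's code: it assumes the POST target is the SearchResultsMP.aspx URL from my script further down, it skips the guest-login request, and the field values are the ones captured above, so everything should be confirmed against the Network tab.

# Minimal sketch only: replay the hidden ASP.NET state fields in a POST.
# The guest-login request is omitted, and fields such as BrowserWidth,
# BrowserHeight and scrollPos are normally filled in by JavaScript, so they
# may need to be added by hand. Verify all of this against the Network tab.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.searchiqs.com/nybro/"
SEARCH_URL = BASE_URL + "SearchResultsMP.aspx"  # assumed POST target

session = requests.Session()
soup = BeautifulSoup(session.get(BASE_URL).text, "html.parser")

# Harvest __VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEGENERATOR, etc.
payload = {
    field["name"]: field.get("value", "")
    for field in soup.select("input[type=hidden]")
    if field.get("name")
}

# Add the visible search fields captured in the Network tab.
payload.update({
    "ctl00$ContentPlaceHolder1$txtName": "david",
    "ctl00$ContentPlaceHolder1$chkIgnorePartyType": "on",
    "ctl00$ContentPlaceHolder1$cboDocGroup": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboDocType": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboTown": "(ALL)",
    "ctl00$ContentPlaceHolder1$cmdSearch": "Search",
})

response = session.post(SEARCH_URL, data=payload)
results_soup = BeautifulSoup(response.text, "html.parser")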

My script so far is really rudimentary, as I am completely blank on what to do next. I still have to submit the form, analyze the pagination and scrape the results, but I have absolutely no idea how to proceed.

import re
import urlparse
import mechanize

from bs4 import BeautifulSoup

class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent', 
                               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    def scrape(self):
        # TODO: submit the search form
        # TODO: get the results URL
        # TODO: scrape the results
        # TODO: analyze the pagination
        pass


if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()

Any help would be dearly appreciated.


Answer 1:


I disabled JavaScript and visited https://www.searchiqs.com/nybro/: the Log In and Log In as Guest buttons on the form are disabled. This makes it impossible for Mechanize to work, because Mechanize cannot process JavaScript and you won't be able to submit the form.

For this kind of problem you can use Selenium, which simulates a full browser, with the disadvantage of being slower than Mechanize.

This code should log you in using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

usr = ""
pwd = ""

driver = webdriver.Firefox() 
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
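From there, a possible continuation (not part of the original answer) is to fill in the search field, submit, and walk the paginated results. The element ids below are assumptions derived from the control names listed in the question (ASP.NET usually turns ctl00$ContentPlaceHolder1$txtName into the id ctl00_ContentPlaceHolder1_txtName), so verify them in the page source first:

from bs4 import BeautifulSoup

# Hypothetical continuation of the snippet above; the ids are assumptions
# based on ASP.NET's naming convention and must be checked against the page.
search_box = driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName")
search_box.send_keys("david")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()

# Hand the rendered results page to BeautifulSoup for parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
# To page through the results, locate the pager's next-page link (ASP.NET
# renders these as __doPostBack anchors), click it with Selenium, and parse
# driver.page_source again for each page.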


Source: https://stackoverflow.com/questions/38546665/how-to-scrape-aspx-pages-with-python
