Question
I'm crawling through some directories built with ASP.NET via Scrapy.
The links to the pages to crawl are encoded like this:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something as simple as allow=('Page$') or attrs='__doPostBack' to the SgmlLinkExtractor (SLE) rules, but my guess is that I have to be trickier in order to pull the info out of the javascript "link".
If it's easier to "unmask" each of the absolute links from the javascript code and save those to a CSV, then use that CSV to load requests into a new scraper, that's okay, too.
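For reference, the target and argument of each __doPostBack call can be pulled out of the href attributes with a plain regular expression. Here is a minimal sketch of that "save them to a CSV" fallback (the selector, helper name and file name are illustrative, not from the question):

import csv
import re

# matches javascript:__doPostBack('ctl00$MainContent$List','Page$X')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']+)','(Page\$\d+)'\)")

def dump_postback_args(response, path='pages.csv'):
    # collect (event_target, event_argument) pairs from every link on the page
    rows = []
    for href in response.xpath('//a/@href').extract():
        match = POSTBACK_RE.search(href)
        if match:
            rows.append(match.groups())
    with open(path, 'wb') as f:  # 'wb' for the Python 2 csv module
        csv.writer(f).writerows(rows)

Note, however, that these are not real URLs: a postback is a POST of the whole form with __EVENTTARGET/__EVENTARGUMENT set, which is why the answer below replays the form instead of following links.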
Answer 1:
This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:
- the idea is to follow the pagination page by page, passing the current page number around in the Request.meta dictionary
- a regular BaseSpider is used, since there is some logic involved in the pagination
- it is important to provide headers pretending to be a real browser
- it is important to yield FormRequests with dont_filter=True, since we are basically making a POST request to the same URL but with different parameters
The code:
import re
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider
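# the X-MicrosoftAjax header asks ASP.NET to return a compact partial-rendering
# ("delta") response for the UpdatePanel instead of a full HTML page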
HEADERS = {
'X-MicrosoftAjax': 'Delta=true',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'
class ExitRealtySpider(BaseSpider):
name = "exit_realty"
allowed_domains = ["exitrealty.com"]
start_urls = [URL]
def parse(self, response):
# submit a form (first page)
self.data = {}
for form_input in response.css('form#aspnetForm input'):
name = form_input.xpath('@name').extract()[0]
try:
value = form_input.xpath('@value').extract()[0]
except IndexError:
value = ""
self.data[name] = value
self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
self.data['__EVENTARGUMENT'] = 'Page$1'
return FormRequest(url=URL,
method='POST',
callback=self.parse_page,
formdata=self.data,
meta={'page': 1},
dont_filter=True,
headers=HEADERS)
def parse_page(self, response):
current_page = response.meta['page'] + 1
# parse agents (TODO: yield items instead of printing)
for agent in response.xpath('//a[@class="regtext"]/text()'):
print agent.extract()
print "------"
# request the next page
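# the async ("delta") response embeds refreshed __VIEWSTATE and __EVENTVALIDATION
# values in a pipe-delimited payload; they must be echoed back on the next POST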
data = {
'__EVENTARGUMENT': 'Page$%d' % current_page,
'__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
'__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
'__ASYNCPOST': 'true',
'__EVENTTARGET': 'ctl00$MainContent$agentList',
'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
'': ''
}
return FormRequest(url=URL,
method='POST',
formdata=data,
callback=self.parse_page,
meta={'page': current_page},
dont_filter=True,
headers=HEADERS)
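If the regular expressions above ever prove fragile, the async ("delta") payload can also be parsed properly: it is a flat run of length|type|id|content| records, where length is the character count of content (which may itself contain '|'). A minimal sketch of such a helper, assuming the same Python 2 / response.body setup as above (parse_delta is a name I'm introducing, not part of Scrapy or the answer):

def parse_delta(body):
    # returns {id: content} for the 'hiddenField' records, e.g. __VIEWSTATE
    fields = {}
    pos = 0
    while pos < len(body):
        sep1 = body.index('|', pos)
        length = int(body[pos:sep1])          # size of the content field
        sep2 = body.index('|', sep1 + 1)
        rtype = body[sep1 + 1:sep2]           # e.g. 'hiddenField', 'updatePanel'
        sep3 = body.index('|', sep2 + 1)
        rid = body[sep2 + 1:sep3]             # e.g. '__VIEWSTATE'
        content = body[sep3 + 1:sep3 + 1 + length]
        if rtype == 'hiddenField':
            fields[rid] = content
        pos = sep3 + 1 + length + 1           # skip the trailing '|'
    return fields

With that in place, the lookups in parse_page could read, for example:

# hidden = parse_delta(response.body)
# data['__VIEWSTATE'] = hidden['__VIEWSTATE']
# data['__EVENTVALIDATION'] = hidden['__EVENTVALIDATION']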
Source: https://stackoverflow.com/questions/28974838/crawling-through-pages-with-postback-data-javascript-python-scrapy