Question
I am trying to fetch some information from this website: http://www.go-on.fi/tyopaikat. As you can see, the table is paginated, so whenever you click the second or third page the link changes to something like http://www.go-on.fi/tyopaikat?start=20 (with "start=" at the end). This is my spider code:
allowed_domains = ["go-on.fi"]
start_urls = ["http://www.go-on.fi/tyopaikat?start=0"]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    titles = hxs.select("//tr")
    for row in titles:
        item = JobData()
        item['header'] = row.select("./td[1]/a/text()").extract()
        item['link'] = row.select("./td[1]/a/@href").extract()
        items.append(item)
    return items
So my question is: how can I make the spider go through every page of that paginated table?
Answer 1:
What you could do is set start_urls to the main page, then read the number of pages from the footer pagination (in this case 3) and use a loop to yield a Request for each page:
# Requires: from scrapy.http import Request
allowed_domains = ["go-on.fi"]
start_urls = ["http://www.go-on.fi/tyopaikat"]

def parse(self, response):
    # Read the page count from the footer pagination; this assumes the
    # last <li> is a "next" link, so the one before it holds the highest
    # page number.
    pages = int(response.xpath(
        '//ul[@class="pagination"]/li[last()-1]/a/text()').extract()[0])
    page = 1
    start = 0
    while page <= pages:
        url = "http://www.go-on.fi/tyopaikat?start=" + str(start)
        start += 20
        page += 1
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    titles = hxs.select("//tr")
    for row in titles:
        item = JobData()
        item['header'] = row.select("./td[1]/a/text()").extract()
        item['link'] = row.select("./td[1]/a/@href").extract()
        items.append(item)
    return items
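As a side note, if you'd rather not read the page count from the markup up front, a common alternative is to paginate recursively: scrape each page, then request the next offset until a page comes back empty. Below is a minimal sketch of that idea as a parse method for the same spider class, reusing the JobData item from the question; the row selector, the 20-rows-per-page step, and the empty-page stop condition are assumptions on my part, not something confirmed for this site:

from scrapy.http import Request

def parse(self, response):
    # Rows whose first cell contains a link (skips header rows).
    rows = response.xpath("//tr[td[1]/a]")
    for row in rows:
        item = JobData()
        item['header'] = row.xpath("./td[1]/a/text()").extract()
        item['link'] = row.xpath("./td[1]/a/@href").extract()
        yield item
    # Keep paging while the current page still had rows (assumption:
    # an out-of-range offset returns an empty table).
    if rows:
        if "start=" in response.url:
            next_start = int(response.url.split("start=")[-1]) + 20
        else:
            next_start = 20
        yield Request("http://www.go-on.fi/tyopaikat?start=%d" % next_start,
                      callback=self.parse)

The trade-off versus the counted loop above is one extra (empty) request at the end, in exchange for not depending on the pagination markup at all.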
Source: https://stackoverflow.com/questions/28410071/start-urls-in-scrapy