Question
I have a website with many pages like this:
mywebsite/?page=1
mywebsite/?page=2
...
...
...
mywebsite/?page=n
Each page has links to players. When you click on any link, you go to that player's page.
Users can add players, so I will end up with this situation:
Player1 has a link in page=1.
Player10 has a link in page=2.
After an hour, because users have added new players, I will have this situation:
Player1 has a link in page=3.
Player10 has a link in page=4.
The new players, like Player100 and Player101, have links in page=1.
I want to scrape all players to get their information. However, I don't want to scrape players that I have already scraped. My question is: how do I use the BaseDupeFilter in Scrapy to identify which players have already been scraped and which have not? Remember, I want to keep scraping the pages of the website because each page will contain different players each time.
Thank you.
Answer 1:
I'd take another approach and try not to query for the last player during the spider run, but rather launch the spider with a pre-calculated argument holding the last scraped player:
scrapy crawl <my spider> -a last_player=X
Then your spider may look like this:
from scrapy.spider import BaseSpider
from scrapy.http import Request


class MySpider(BaseSpider):
    start_urls = ["http://....mywebsite/?page=1"]
    ...

    def parse(self, response):
        ...
        last_player_met = False
        player_links = sel.xpath(....)
        for player_link in player_links:
            # extract the player id from the link
            player_id = player_link.split(....)
            # self.last_player comes from the -a command-line argument
            if player_id < self.last_player:
                yield Request(url=player_link, callback=self.scrape_player)
            else:
                last_player_met = True

        if not last_player_met:
            # try to xpath for 'Next' in the pagination,
            # or use meta={} in the request to loop over pages like
            # "http://....mywebsite/?page=" + page_number
            yield Request(url=..., callback=self.parse)
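
If it helps, here is a minimal sketch of how the pre-calculated argument could be produced before launching the spider. It assumes the previous run exported items to a JSON feed named players.json with a numeric id field, and that the spider is named myspider; the file name, field name, and spider name are all hypothetical, so adapt them to your own setup:

import json
import subprocess

# read the items exported by the previous run (hypothetical file and field names)
with open("players.json") as f:
    last_player = max(int(item["id"]) for item in json.load(f))

# launch the spider with the pre-calculated argument;
# -a arguments show up on the spider as string attributes (self.last_player)
subprocess.call(["scrapy", "crawl", "myspider",
                 "-a", "last_player=%d" % last_player])

Note that arguments passed with -a arrive as strings on the spider, so you will probably want to cast self.last_player (and the extracted player id) to int before comparing them.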
Source: https://stackoverflow.com/questions/21203920/python-scrapy-how-to-use-basedupefilter