IMDB scrapy get all movie data

Question

I am working on a class project and trying to get all IMDB movie data (titles, budgets, etc.) up until 2016. I adapted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.

My idea is: for i in range(1874, 2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), direct the program to the corresponding year's page and grab the data from that URL.

The problem is that each year's page shows only 50 movies. After crawling those 50 movies, how can I move on to the next page? And after crawling each year, how can I move on to the next year? This is my code for the URL-parsing part so far, but it can only crawl 50 movies for one particular year.

class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"] 

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request

Answer 1:

You can use CrawlSpiders to simplify your task. As you'll see below, start_requests dynamically generates the list of URLs while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is done by the rules attribute.

I agree with @Padraic Cunningham that hard-coding values is not a great idea. I've added spider arguments so that you can call: scrapy crawl imdb -a start=1950 -a end=1980 (the scraper will default to 1874-2016 if it doesn't get any arguments).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from imdbyear.items import MovieItem

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
            process_links=lambda links: filter(lambda l: 'Next' in l.text, links),
            callback='parse_page',
            follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        # spider arguments arrive as strings, so convert them to int
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing this typo
            # (you will need to change it in items.py as well)
            item['MainPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass
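
The `process_links` lambda above can also be written as a named function, which is easier to test on its own. The following is only a minimal sketch: `keep_next_links` is a hypothetical helper name, and `Link` is a stand-in namedtuple (not Scrapy's own class) that assumes the link objects expose `url` and `text` attributes.

```python
from collections import namedtuple

# Stand-in for the link objects a LinkExtractor produces; used here only so
# the sketch runs without Scrapy installed (an assumption for illustration).
Link = namedtuple("Link", ["url", "text"])


def keep_next_links(links):
    """Return a list of only those links whose visible text contains 'Next'."""
    return [link for link in links if "Next" in link.text]


# Hypothetical pager links, as they might be extracted from a results page.
pager_links = [
    Link(url="/search/title?start=1", text="Prev"),
    Link(url="/search/title?start=101", text="Next"),
]

next_links = keep_next_links(pager_links)
```

In the spider, a function like this could be passed to the `Rule` as `process_links=keep_next_links` in place of the lambda.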

Answer 2:

You can use the code below to follow the link to the next page.
# 'a.lister-page-next.next-page::attr(href)' is the selector for the next-page link

next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)  # resolve the href against the current page URL
    yield scrapy.Request(next_page, callback=self.parse)  # call parse again on the next page
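
To illustrate what the `response.urljoin` step is doing, here is a small sketch using the standard library's `urljoin`, which resolves a relative href against a base URL in much the same way. The `next_href` value is a made-up example, not taken from a real IMDB page.

```python
from urllib.parse import urljoin  # on Python 2 this function lives in the urlparse module

# URL of the page currently being parsed.
current_page = "http://www.imdb.com/search/title?year=2014,2014&title_type=feature"

# Hypothetical href, as it might appear on the next-page link.
next_href = "/search/title?year=2014,2014&title_type=feature&start=51"

# Resolve the site-relative href into an absolute URL that can be requested.
next_page = urljoin(current_page, next_href)
print(next_page)
```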

Answer 3:

I figured out a very dumb way to solve this: I put all the links in start_urls. A better solution would be very much appreciated!

class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = []
    for i in xrange(1874, 2017):
        # since the largest number of movies for any year is 11,400 (2016)
        for j in xrange(1, 11501, 50):
            start_url = "http://www.imdb.com/search/title?sort=moviemeter,asc&start=" + str(j) + "&title_type=feature&year=" + str(i) + "," + str(i)
            start_urls.append(start_url)

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
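
The URL generation above can be pulled out into a plain function so it can be checked without running the spider. This is a sketch only: `build_start_urls` and its parameters are hypothetical names, and the URL pattern is copied from the code above.

```python
def build_start_urls(first_year, last_year, max_results, page_size=50):
    """Build one search URL per (year, page offset) pair."""
    template = ("http://www.imdb.com/search/title?sort=moviemeter,asc"
                "&start={start}&title_type=feature&year={year},{year}")
    return [
        template.format(start=start, year=year)
        for year in range(first_year, last_year + 1)
        # offsets 1, 51, 101, ... up to max_results
        for start in range(1, max_results + 1, page_size)
    ]


# Small example: two years, three pages each.
urls = build_start_urls(2015, 2016, max_results=120)
```

Calling it with `build_start_urls(1874, 2016, max_results=11500)` would reproduce the list built in the class body. Years with far fewer titles than the maximum would likely produce many requests for empty result pages, which is one reason following the 'Next' link is usually preferable.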

Answer 4:

The code that @Greg Sadetsky provided needs one minor change, in the first line of the parse_page method.

    Just change the XPath in the for loop from:
    response.xpath("//*[@class='results']/tr/td[3]"):
    to:
    response.xpath("//*[contains(@class,'lister-item-content')]/h3"):

This worked like a charm for me!

Source: https://stackoverflow.com/questions/35819404/imdb-scrapy-get-all-movie-data

Tags

python

python-2.7

scrapy

scrapy-spider