Scrapy Tutorial Example

深忆病人 2021-01-24 07:33

Looking to see if someone can point me in the right direction regarding using Scrapy in Python.

I've been trying to follow the example for several days and still can't get it to work.

2 Answers
  •  误落风尘
    2021-01-24 08:00

    Here is the corrected Scrapy code to extract site details from DMOZ:

    import scrapy
    
    class MozSpider(scrapy.Spider):
        name = "moz"
        allowed_domains = ["www.dmoz.org"]
        start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                      'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

        def parse(self, response):
            # Each listing sits in a div with class "title-and-desc"
            sites = response.xpath('//div[@class="title-and-desc"]')
            for site in sites:
                name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
                url = site.xpath('a/@href').extract_first()
                # '' as the default avoids calling .strip() on None
                description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()

                yield {'Name': name, 'URL': url, 'Description': description}
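
    To test these XPath expressions interactively before running the spider, Scrapy's shell is handy (a quick aside rather than part of the original example; the URL is simply the first start URL above):

    scrapy shell 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    >>> response.xpath('//div[@class="title-and-desc"]/a/div[@class="site-title"]/text()').extract_first()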
    


    To export the results to CSV, open the project folder in your Terminal/CMD and run:

    scrapy crawl moz -o result.csv
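
    Alternatively (a minimal sketch, assuming Scrapy 2.1 or newer where the FEEDS setting is available), the CSV feed can be declared on the spider itself by adding a custom_settings attribute to MozSpider, so the -o flag is not needed:

    class MozSpider(scrapy.Spider):
        # name, allowed_domains, start_urls and parse() exactly as defined above

        # Write the scraped items straight to result.csv (FEEDS requires Scrapy >= 2.1)
        custom_settings = {
            'FEEDS': {
                'result.csv': {'format': 'csv'},
            },
        }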
    



    Here is another basic Scrapy example that extracts company details from YellowPages:

    import scrapy
    
    class YlpSpider(scrapy.Spider):
        name = "ylp"
        allowed_domains = ["www.yellowpages.com"]
        start_urls = ['http://www.yellowpages.com/search?search_terms=Translation&geo_location_terms=Virginia+Beach%2C+VA']

        def parse(self, response):
            # Each search result is wrapped in an element with class "info"
            companies = response.xpath('//*[@class="info"]')

            for company in companies:
                name = company.xpath('h3/a/span[@itemprop="name"]/text()').extract_first()
                phone = company.xpath('div/div[@class="phones phone primary"]/text()').extract_first()
                website = company.xpath('div/div[@class="links"]/a/@href').extract_first()

                yield {'Name': name, 'Phone': phone, 'Website': website}
    


    To export the results to CSV, open the project folder in your Terminal/CMD and run:

    scrapy crawl ylp -o result.csv
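
    If you need more than the first page of results, parse() can also follow the "next page" link. This is only a sketch; the selector below is an assumption about YellowPages' markup, not something taken from the original answer:

            # ...at the end of parse(), after the for-loop over companies:
            next_page = response.xpath('//a[contains(@class, "next")]/@href').extract_first()
            if next_page:
                # response.follow resolves the relative URL and re-enters parse()
                yield response.follow(next_page, callback=self.parse)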
    



    This Scrapy code extracts company details from Yelp:

    import scrapy
    
    class YelpSpider(scrapy.Spider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = ['https://www.yelp.com/search?find_desc=Java+Developer&find_loc=Denver,+CO']
    
    
        def parse(self, response):
            companies = response.xpath('//*[@class="biz-listing-large"]')
    
            for company in companies:
                name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
                address1 = company.xpath('.//address/text()').extract_first('').strip()
                address2 = company.xpath('.//address/text()[2]').extract_first('').strip()  # '' means the default attribute if not found to avoid adding None.
                address = address1 + " - " + address2
                phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
                website = "https://www.yelp.com" + company.xpath('.//@href').extract_first()

                yield {'Name': name, 'Address': address, 'Phone': phone, 'Website': website}
    


    To export the results to CSV, open the project folder in your Terminal/CMD and run:

    scrapy crawl yelp -o result.csv
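
    If you would rather launch the spider from a plain Python script instead of the scrapy CLI, a minimal sketch using Scrapy's CrawlerProcess (assuming Scrapy 2.1+ for the FEEDS key, and that the YelpSpider class above is importable) looks like this:

    from scrapy.crawler import CrawlerProcess

    # Run the Yelp spider programmatically and write the items to result.csv
    process = CrawlerProcess(settings={
        'FEEDS': {'result.csv': {'format': 'csv'}},
    })
    process.crawl(YelpSpider)
    process.start()  # blocks until the crawl is finished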
    



    • This is a comprehensive online course on Scrapy:

    https://www.udemy.com/scrapy-tutorial-web-scraping-with-python/?couponCode=STACK39243009-SCRAPY


    All the best!
