scrape multiple addresses from multiple files in scrapy


Question


I have some JSON files in a directory. Each of these files contains some information I need. The first property I need is the list of links to use as "start_urls" in Scrapy.

Every file belongs to a different process, so its output must be kept separate. That means I can't put the links from all the JSON files into start_urls and run them together; I have to run the spider for every file.

How can I do this? Here is my code so far:

import scrapy
from os import listdir
from os.path import isfile, join
import json
class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']
    # get start urls from links list of every file
    files = [f for f in listdir('lastsecond/hotels/')
             if isfile(join('lastsecond/hotels/', f))]
    with open('lastsecond/hotels/' + files[0], 'r') as hotel_info:
        hotel = json.load(hotel_info)
    start_urls = hotel["links"]

    def parse(self, response):
        print("all good")

Answer 1:


I see two methods:


First:

Run the spider many times with different parameters. This needs less code.

You can create a batch file with many lines, each with different arguments added manually.

The first argument is the output filename, -o result1.csv, which Scrapy will create automatically.
The second argument is the input filename with the links, -a filename=process1.csv.

scrapy crawl hotel_info -o result1.csv -a filename=process1.csv
scrapy crawl hotel_info -o result2.csv -a filename=process2.csv
scrapy crawl hotel_info -o result3.csv -a filename=process3.csv
...

The spider only needs to get filename in __init__:

import scrapy
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def __init__(self, filename, *args, **kwargs): # <-- filename
        super().__init__(*args, **kwargs)

        filename = join('lastsecond/hotels/', filename) 

        if isfile(filename):
            with open(filename) as f:
                data = json.load(f)
                self.start_urls = data['links']

    def parse(self, response):
        print('url:', response.url)

        yield {'url': response.url, 'other': ...}

You can also use a Python script with CrawlerProcess to run the spider many times.

from scrapy.crawler import CrawlerProcess
from your_project_name.spiders.hotel_info import HotelInfoSpider  # adjust to your project's module path
from os import listdir
from os.path import isfile, join
import json

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    c = CrawlerProcess({'FEED_FORMAT': 'csv', 'FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file)  # e.g. filename='process1.csv'
    # note: Twisted's reactor can't be restarted, so a loop like this only
    # processes the first file; running one scrapy command per file (as in
    # the batch file above) is the more reliable option
    c.start()

Or use scrapy.cmdline.execute():

import scrapy.cmdline
from os import listdir
from os.path import isfile, join
import json

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    # note: execute() calls sys.exit() when the crawl ends, so to process
    # every file in one run, launch each command in its own subprocess
    scrapy.cmdline.execute(["scrapy", "crawl", "hotel_info", "-o", output_file, "-a", "filename=" + input_file])

Second:

It needs more code because you have to create a pipeline exporter which saves the results to different files.

You have to use start_requests() and Request(..., meta=...) to create the start requests with extra data in meta, which you can use later to save to different files.

In parse() you have to get this extra value from meta and add it to the item.

In the pipeline exporter you have to get extra from the item and open a different file.

import scrapy
from os import listdir
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def start_requests(self):

        # get start urls from links list of every file
        files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

        for i, filename in enumerate(files):
            with open('lastsecond/hotels/' + filename) as f:
                data = json.load(f)
                links = data["links"]
                for url in links:
                    yield scrapy.Request(url, meta={'extra': i})

    def parse(self, response):
        print('url:', response.url)
        extra = response.meta['extra']
        print('extra:', extra)

        yield {'url': response.url, 'extra': extra, 'other': ...}

pipelines.py

import csv

class MyExportPipeline(object):

    def process_item(self, item, spider):

        # get extra and use it in filename
        filename = 'result{}.csv'.format(item['extra'])

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements - skip `extra`
            row = [item['url'], item['other']]
            writer.writerow(row)

        return item

settings.py

ITEM_PIPELINES = {
   'your_project_name.pipelines.MyExportPipeline': 300,
}



Answer 2:


You could manage all the files with a dict:

d_hotel_info = {}
for file in files:
    with open('lastsecond/hotels/' + file, 'r') as hotel_info:
        hotel = json.load(hotel_info)
    d_hotel_info[file] = hotel

and then, when you want to produce the output, you reference the keys of d_hotel_info; a minimal sketch follows below.
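
For example, a minimal sketch of that lookup, assuming the dict is built as above and each JSON file has a "links" list (the output naming scheme here is only an illustration):

import json
from os import listdir
from os.path import isfile, join

hotels_dir = 'lastsecond/hotels/'
files = [f for f in listdir(hotels_dir) if isfile(join(hotels_dir, f))]

# build the dict keyed by filename, as above
d_hotel_info = {}
for file in files:
    with open(join(hotels_dir, file)) as hotel_info:
        d_hotel_info[file] = json.load(hotel_info)

# reference the keys: each key names one source file, so it can be used
# to derive a separate output name and keep every file's results apart
for filename, hotel in d_hotel_info.items():
    output_file = filename.rsplit('.', 1)[0] + '_result.csv'  # hypothetical naming scheme
    print(output_file, '->', len(hotel['links']), 'links')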



Source: https://stackoverflow.com/questions/48122080/scrape-multiple-addresses-from-multiple-files-in-scrapy
