Scrapy read list of URLs from file to scrape?

Asked 2020-12-23 18:16 by 执念已碎 · 3 answers · 1732 views

I've just installed Scrapy and followed their simple dmoz tutorial, which works. I then looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file.

3 Answers
  • 2020-12-23 18:51

    I ran into a similar question while writing my Scrapy hello world. Besides reading URLs from a file, you might also need to pass the file name in as an argument. This can be done with Scrapy's spider argument mechanism (the -a option of scrapy crawl).

    My example:

    import json

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my'

        def __init__(self, config_file=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # Load the URL list from the JSON config file passed in
            # via -a config_file=... on the command line.
            with open(config_file) as f:
                self._config = json.load(f)
            self._url_list = self._config['url_list']

        def start_requests(self):
            for url in self._url_list:
                yield scrapy.Request(url=url, callback=self.parse)
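    For reference, here is the shape of config file this spider assumes. Only the `url_list` key comes from the snippet above; the file name (`config.json`) and the URLs are made-up placeholders:

    ```python
    import json

    # A hypothetical config.json matching the spider above;
    # "url_list" is the key __init__ reads.
    config_text = """
    {
        "url_list": [
            "http://example.org",
            "http://example.com/page"
        ]
    }
    """

    config = json.loads(config_text)
    print(config["url_list"])
    ```

    You would then run the spider with something like `scrapy crawl my -a config_file=config.json`, which Scrapy forwards to `__init__` as the `config_file` keyword argument.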
    
  • 2020-12-23 18:54

    You were pretty close.

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    

    ...better still would be to use a context manager, which ensures the file is closed even if an exception is raised:

    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]
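    Building on the same idea, a small helper can also skip blank lines (and, as my own convention here, lines commented out with `#`, which the original question does not mention):

    ```python
    # Read start URLs from a file, one per line, ignoring blank
    # lines and '#' comment lines (the comment convention is an
    # assumption, not part of the original question).
    def load_start_urls(path):
        with open(path, "rt") as f:
            return [
                line.strip()
                for line in f
                if line.strip() and not line.lstrip().startswith("#")
            ]
    ```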
    
  • 2020-12-23 18:57

    Whatever the file contains, you have to call strip() on each line; otherwise you get a '\n' at the end of each URL.

    from scrapy.spider import BaseSpider  # Scrapy's old (pre-1.0) spider base class

    class DmozSpider(BaseSpider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [l.strip() for l in open('urls.txt').readlines()]
    

    Example in Python 2.7

    >>> open('urls.txt').readlines()
    ['http://site.org\n', 'http://example.org\n', 'http://example.com/page\n']
    >>> [l.strip() for l in open('urls.txt').readlines()]
    ['http://site.org', 'http://example.org', 'http://example.com/page']
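    On Python 3, the same result can be had without leaking the open file handle by using pathlib (a modernization of my own, not part of the original answer; the file contents below are made up so the sketch is self-contained):

    ```python
    from pathlib import Path

    # Write a small urls.txt so the example runs on its own.
    Path("urls.txt").write_text("http://site.org\nhttp://example.org\n")

    # read_text() opens and closes the file for us;
    # splitlines() drops the trailing newlines.
    start_urls = Path("urls.txt").read_text().splitlines()
    print(start_urls)
    ```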
    