Scrapy read list of URLs from file to scrape?

Submitted by 空扰寡人 on 2019-12-18 10:24:40

Question


I've just installed Scrapy and followed their simple Dmoz tutorial, which works. I then looked up basic file handling in Python and tried to get the crawler to read a list of URLs from a file, but got some errors. This is probably wrong, but I gave it a shot. Would someone please show me an example of reading a list of URLs into Scrapy? Thanks in advance.

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("urls.txt")
    start_urls = f

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Answer 1:


You were pretty close.

f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()

...better still would be to use the context manager to ensure the file's closed as expected:

with open("urls.txt", "rt") as f:
    start_urls = [url.strip() for url in f.readlines()]
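As a minimal sketch of the same idea (the file name `urls.txt` and the helper name `load_start_urls` are assumptions for illustration), you can wrap the reading logic in a small function that also skips blank lines, so a trailing empty line in the file doesn't produce an empty URL:

```python
def load_start_urls(path):
    """Read a file of URLs, one per line; strip whitespace and skip blanks."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

The spider would then set `start_urls = load_start_urls("urls.txt")` at class-definition time, just as the answer above does inline.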



Answer 2:


Since the spider expects clean URLs in the list, you have to call strip on each line. Otherwise you get a '\n' at the end of each URL.

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [l.strip() for l in open('urls.txt').readlines()]

Example in Python 2.7

>>> open('urls.txt').readlines()
['http://site.org\n', 'http://example.org\n', 'http://example.com/page\n']
>>> [l.strip() for l in open('urls.txt').readlines()]
['http://site.org', 'http://example.org', 'http://example.com/page']
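As a side note, `str.splitlines()` removes the trailing newlines in one step, so reading the whole file and splitting gives the same clean list (the literal URLs below mirror the example above and are illustrative only):

```python
# Simulate the file contents from the example above.
text = "http://site.org\nhttp://example.org\nhttp://example.com/page\n"

# splitlines() splits on newlines and drops them, unlike readlines().
urls = text.splitlines()
```

With a real file this would be `urls = open('urls.txt').read().splitlines()`.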



Answer 3:


I ran into a similar question when writing my Scrapy hello-world. Besides reading URLs from a file, you might also need to pass the file name in as an argument. This can be done via the Spider argument mechanism.

My example:

import json

import scrapy


class MySpider(scrapy.Spider):
    name = 'my'

    def __init__(self, config_file=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Load the URL list from a JSON config file passed as a spider argument.
        with open(config_file) as f:
            self._config = json.load(f)
        self._url_list = self._config['url_list']

    def start_requests(self):
        for url in self._url_list:
            yield scrapy.Request(url=url, callback=self.parse)
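For this spider, the config file would need a `url_list` key holding the URLs. A sketch of what loading such a file amounts to (the file contents and URLs below are hypothetical):

```python
import json

# Hypothetical contents of config.json, matching the structure
# the spider's __init__ expects.
config_text = '{"url_list": ["http://example.com", "http://example.org"]}'

config = json.loads(config_text)
url_list = config["url_list"]
```

The spider is then started with `scrapy crawl my -a config_file=config.json`, where `-a` passes keyword arguments through to the spider's `__init__`.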


Source: https://stackoverflow.com/questions/8376630/scrapy-read-list-of-urls-from-file-to-scrape
