I've just installed Scrapy and followed their simple dmoz tutorial, which works. I just looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file.
I ran into a similar question when writing my Scrapy hello-world. Besides reading URLs from a file, you might also need to pass the file name in as an argument. This can be done with Scrapy's spider argument mechanism.
My example:
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'my'

    # config_file is supplied as a spider argument (see the command below)
    def __init__(self, config_file=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the config file is a JSON document with a 'url_list' key
        with open(config_file) as f:
            self._config = json.load(f)
        self._url_list = self._config['url_list']

    def start_requests(self):
        for url in self._url_list:
            yield scrapy.Request(url=url, callback=self.parse)
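The file name is then supplied on the command line through Scrapy's -a option. The file name config.json and its contents below are placeholders, shaped to match the 'url_list' key the spider reads:

scrapy crawl my -a config_file=config.json

where config.json might look like:

{"url_list": ["http://example.org", "http://example.com/page"]}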
You were pretty close.
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]  # strip the trailing '\n' from each line
f.close()
...better still would be to use a context manager, which ensures the file is closed even if an exception occurs:
with open("urls.txt", "rt") as f:
    start_urls = [url.strip() for url in f]  # iterating the file yields lines directly
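For context, here is a minimal sketch of where that snippet can live, executed at class-definition time inside a spider; the spider name, the urls.txt path, and the parse body are placeholders:

import scrapy

class UrlListSpider(scrapy.Spider):
    name = "url_list"  # hypothetical spider name

    # read the URL list once, when the class is defined
    with open("urls.txt", "rt") as f:
        start_urls = [url.strip() for url in f]

    def parse(self, response):
        # real extraction logic would go here
        self.logger.info("visited %s", response.url)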
If Dmoz expects just URLs in the list, you have to call strip() on each line; otherwise you get a '\n' at the end of each URL.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [l.strip() for l in open('urls.txt').readlines()]
Example in Python 2.7
>>> open('urls.txt').readlines()
['http://site.org\n', 'http://example.org\n', 'http://example.com/page\n']
>>> [l.strip() for l in open('urls.txt').readlines()]
['http://site.org', 'http://example.org', 'http://example.com/page']
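As an aside, read().splitlines() drops the line endings for you, so the explicit strip can be skipped when the file contains nothing but one URL per line:

>>> open('urls.txt').read().splitlines()
['http://site.org', 'http://example.org', 'http://example.com/page']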