scrapy-spider

How to pass custom settings through CrawlerProcess in scrapy?

魔方 西西 submitted on 2019-12-05 19:29:59
I have two CrawlerProcess instances, each calling a different spider. I want to pass custom settings to one of these processes to save the spider's output to CSV. I thought I could do this:

storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
process = CrawlerProcess(get_project_settings())
process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings)
process.start()

and in my spider I read them as an argument:

def __init__(self, crawl_links=None, allowed_domains=None, customom_settings=None, *args, **kwargs):
    self.start_urls = crawl_links
    self.allowed_domains =
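A hedged sketch of one common workaround (not necessarily the accepted answer): Scrapy reads `custom_settings` as a spider class attribute before the spider is instantiated, so per-run overrides are usually merged into the settings object handed to `CrawlerProcess` rather than passed as a crawl argument. The spider name 'ABC' and `main_links` are taken from the question; everything else is illustrative.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Start from the project settings, then overlay the per-run feed options
# instead of handing them to the spider as an argument.
settings = get_project_settings()
settings.set('FEED_FORMAT', 'csv')
settings.set('FEED_URI', 'foo.csv')

process = CrawlerProcess(settings)
# Spider arguments such as crawl_links still go through crawl();
# only the storage options move into the settings object.
process.crawl('ABC', crawl_links=main_links)
process.start()
```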

Scrape multiple accounts aka multiple logins

为君一笑 submitted on 2019-12-05 16:24:52
I successfully scrape data for a single account. I want to scrape multiple accounts on a single website; multiple accounts need multiple logins. What is a good way to manage login/logout? dangra: you can scrape multiple accounts in parallel by using one cookiejar per account session, see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py:

MY_ACCOUNTS = [
    {'login': 'my_login_1', 'pwd': 'my_pwd_1'},
    {'login': 'my_login_2', 'pwd': 'my_pwd_2'},
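A minimal sketch of the cookiejar approach described above, assuming a hypothetical login URL and form field names; each account gets its own cookiejar index so the sessions stay isolated:

```python
import scrapy


class MultiAccountSpider(scrapy.Spider):
    name = 'multi_account'

    def start_requests(self):
        accounts = self.settings.get('MY_ACCOUNTS', [])
        for index, account in enumerate(accounts):
            # One cookiejar per account keeps the login sessions separate.
            yield scrapy.FormRequest(
                'https://example.com/login',          # hypothetical login URL
                formdata={'user': account['login'], 'pass': account['pwd']},
                meta={'cookiejar': index},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Propagate the same cookiejar on every follow-up request for this account.
        yield scrapy.Request(
            'https://example.com/private-page',       # hypothetical page behind the login
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_private,
        )

    def parse_private(self, response):
        self.logger.info('Scraped page with cookiejar %s', response.meta['cookiejar'])
```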

How many items have been scraped per start_url

六眼飞鱼酱① submitted on 2019-12-05 14:29:27
I use scrapy to crawl 1000 urls and store the scraped items in MongoDB. I'd like to know how many items have been found for each url. From the scrapy stats I can see 'item_scraped_count': 3500. However, I need this count for each start_url separately. There is also a referer field for each item that I might use to count each url's items manually:

2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)

But I wonder if there is built-in support for this in scrapy. eLRuLL: challenge accepted! there isn't
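The excerpt is cut off before the answer, so here is a hedged sketch of one way to do it (not necessarily what eLRuLL proposed): carry the originating start URL in `request.meta` and bump a per-URL key in Scrapy's stats collector whenever an item is produced.

```python
import scrapy


class CountingSpider(scrapy.Spider):
    name = 'counting'
    start_urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

    def start_requests(self):
        for url in self.start_urls:
            # Remember which start_url this crawl branch originated from.
            yield scrapy.Request(url, meta={'start_url': url}, callback=self.parse)

    def parse(self, response):
        start_url = response.meta['start_url']
        for item in self.extract_items(response):
            # One stats key per start_url, e.g. items_per_start_url/https://example.com/page1
            self.crawler.stats.inc_value('items_per_start_url/%s' % start_url)
            yield item

    def extract_items(self, response):
        # Placeholder: yield whatever your real parsing produces.
        return []
```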

AttributeError: 'module' object has no attribute 'Spider'

非 Y 不嫁゛ submitted on 2019-12-05 12:15:50
I just started learning scrapy, so I followed the scrapy documentation and wrote the first spider mentioned on that site:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Upon running the scrapy crawl dmoz command in the project's root directory, it shows the error below.
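The excerpt stops before the answer. A common cause of this AttributeError (offered here as an assumption, not the confirmed fix) is either an old Scrapy release that predates scrapy.Spider, or a local file named scrapy.py shadowing the installed package. A quick diagnostic:

```python
import scrapy

# If __file__ points into your own project folder instead of site-packages,
# a local scrapy.py is shadowing the real library; rename or delete it
# (and its .pyc file).
print(scrapy.__file__)

# Very old Scrapy versions (before 0.22) expose BaseSpider rather than scrapy.Spider,
# so upgrading Scrapy also resolves this error in that case.
print(scrapy.__version__)
```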

IMDB scrapy get all movie data

回眸只為那壹抹淺笑 submitted on 2019-12-05 08:09:15
Question: I am working on a class project and trying to get all IMDB movie data (titles, budgets, etc.) up until 2016. I adopted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py. My thought is: for i in range(1874, 2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), direct the program to the corresponding year's website and grab the data from that url. But the problem is that each page for a given year only shows 50 movies, so after crawling
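The excerpt is truncated, but the 50-results-per-page problem it describes is usually handled by following the year listing's "Next" link instead of stopping at the first page. A hedged sketch (the CSS selectors are assumptions; IMDB's markup changes often):

```python
import scrapy


class ImdbYearSpider(scrapy.Spider):
    name = 'imdb_year'
    # One listing page per year, 1874 through 2016.
    start_urls = ['http://www.imdb.com/year/%d/' % year for year in range(1874, 2017)]

    def parse(self, response):
        # Follow each of the (up to 50) movie links on the current listing page.
        for href in response.css('.lister-item-header a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_movie)

        # Follow the "Next" pagination link, if present, to reach movies 51+.
        next_page = response.css('a.lister-page-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_movie(self, response):
        # Placeholder for extracting title, budget, etc. from the movie page.
        yield {'title': response.css('h1::text').get(default='').strip(), 'url': response.url}
```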

Unable to use proxies one by one until there is a valid response

百般思念 submitted on 2019-12-05 06:05:39
I've written a script in Python's scrapy to make proxied requests using one of the proxies newly generated by the get_proxies() method. I used the requests module to fetch the proxies in order to reuse them in the script. However, the problem is that the proxy my script chooses may not always be a good one, so sometimes it doesn't fetch a valid response. How can I let my script keep trying different proxies until there is a valid response? My script so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.http.request import
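A hedged sketch of one way to keep rotating proxies until a valid response arrives (this is an illustration, not the asker's code): re-yield the request with another proxy from the pool whenever the request errors out or the response looks bad.

```python
import random

import scrapy

PROXIES = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']   # placeholder proxy list


class ProxyRetrySpider(scrapy.Spider):
    name = 'proxy_retry'
    start_urls = ['https://example.com/target']             # placeholder target URL

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_request(url)

    def make_request(self, url):
        # Pick a fresh proxy for every attempt.
        proxy = random.choice(PROXIES)
        return scrapy.Request(
            url,
            callback=self.parse,
            errback=self.retry_with_new_proxy,
            meta={'proxy': proxy},
            dont_filter=True,   # allow the same URL to be retried
        )

    def retry_with_new_proxy(self, failure):
        # The connection failed or timed out: try again through another proxy.
        yield self.make_request(failure.request.url)

    def parse(self, response):
        if not response.css('title'):   # crude "valid response" check; adjust as needed
            yield self.make_request(response.url)
            return
        yield {'url': response.url, 'title': response.css('title::text').get()}
```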

Callback for redirected requests Scrapy

久未见 submitted on 2019-12-05 04:36:08
Question: I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected url requests, though it works fine for the non-redirected ones. I have the following code in the start_requests function:

for user in users:
    yield scrapy.Request(url=userBaseUrl+str(user['userId']), cookies=cookies, headers=headers, dont_filter=True, callback=self.parse_p)

But self.parse_p is called only for the non-302 requests. Answer 1: I
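The answer text is cut off. A commonly suggested explanation (stated here as an assumption, not the confirmed answer) is that RedirectMiddleware follows the 302 before your callback ever sees it; if you want the 302 response itself delivered to the callback, you can ask for it via handle_httpstatus_list. A sketch with placeholder data:

```python
import scrapy

# Placeholders standing in for the question's userBaseUrl, users, cookies and headers.
userBaseUrl = 'https://example.com/user/'
users = [{'userId': 1}, {'userId': 2}]
cookies = {}
headers = {}


class RedirectAwareSpider(scrapy.Spider):
    name = 'redirect_aware'

    def start_requests(self):
        for user in users:
            yield scrapy.Request(
                url=userBaseUrl + str(user['userId']),
                cookies=cookies,
                headers=headers,
                dont_filter=True,
                callback=self.parse_p,
                # Hand 301/302 responses to parse_p instead of letting
                # RedirectMiddleware follow them silently.
                meta={'handle_httpstatus_list': [301, 302]},
            )

    def parse_p(self, response):
        # 302 responses now arrive here too; the Location header holds the redirect target.
        self.logger.info('Got %s for %s', response.status, response.url)
```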

docker running splash container but localhost does not load (windows 10)

不羁岁月 submitted on 2019-12-05 02:50:32
Question: I am following this tutorial to use Splash to help with scraping webpages. I installed Docker Toolbox and ran these two steps:

$ docker pull scrapinghub/splash
$ docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

I think it is running correctly, based on the message in the Docker window. However, when I open localhost:8050 in a web browser, it says localhost is not working. What might have gone wrong in this case? Thanks! Answer 1: You have
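The answer is truncated; the usual explanation with Docker Toolbox on Windows (an assumption here, not the confirmed answer) is that containers run inside a VirtualBox VM, so Splash is reachable at the docker-machine IP, often 192.168.99.100, rather than at localhost. A quick reachability check plus the matching scrapy-splash setting:

```python
# Assumes the default Docker Toolbox VM address; run `docker-machine ip` to confirm it.
import requests

SPLASH_HOST = 'http://192.168.99.100:8050'
print(requests.get(SPLASH_HOST).status_code)   # 200 means the Splash UI is up

# In a Scrapy project that uses scrapy-splash, point SPLASH_URL at the same
# address in settings.py:
# SPLASH_URL = 'http://192.168.99.100:8050'
```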

Submit form that renders dynamically with Scrapy?

混江龙づ霸主 submitted on 2019-12-04 20:07:38
I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML of the page that corresponds to a successful login. I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element in the DOM, but I was wondering if it would be possible to "give control back" to Scrapy after getting the full HTML, in order to allow it to carry out the form submission and save the necessary cookies, session data, etc. so it can scrape the page. Basically, the only reason I thought Selenium was necessary was
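One common pattern for handing control back to Scrapy after a Selenium login, sketched here under the assumption of a hypothetical login page and field names: perform the login in the browser, copy the resulting session cookies into a Scrapy request, and continue the crawl from there.

```python
import scrapy
from selenium import webdriver


class SeleniumLoginSpider(scrapy.Spider):
    name = 'selenium_login'

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get('https://example.com/login')                       # hypothetical login page
        driver.find_element_by_name('username').send_keys('user')     # hypothetical field names
        driver.find_element_by_name('password').send_keys('secret')
        driver.find_element_by_css_selector('button[type=submit]').click()

        # Copy the authenticated session cookies from the browser into Scrapy.
        cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        driver.quit()

        yield scrapy.Request(
            'https://example.com/dashboard',                           # hypothetical post-login page
            cookies=cookies,
            callback=self.parse_dashboard,
        )

    def parse_dashboard(self, response):
        # From here on, plain Scrapy requests reuse the logged-in session.
        yield {'title': response.css('title::text').get()}
```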

Pyinstaller scrapy error:

落爺英雄遲暮 submitted on 2019-12-04 18:53:19
After installing all dependencies for scrapy on Windows 32-bit, I tried to build an executable from my scrapy spider. The spider script "runspider.py" works fine when run as "python runspider.py". Building the executable with "pyinstaller --onefile runspider.py":

C:\Users\username\Documents\scrapyexe>pyinstaller --onefile runspider.py
19 INFO: wrote C:\Users\username\Documents\scrapyexe\runspider.spec
49 INFO: Testing for ability to set icons, version resources...
59 INFO: ... resource update available
59 INFO: UPX is not available.
89 INFO: Processing hook hook-os
279 INFO: Processing hook hook-time
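The log excerpt ends before the actual error, so the fix below is only an assumption based on a frequent cause: PyInstaller missing Scrapy's dynamically imported modules and its package data (such as the VERSION file). A custom hook file is one typical remedy:

```python
# hook-scrapy.py -- a PyInstaller hook placed next to runspider.py and enabled with
# --additional-hooks-dir=. ; assumes the build failure comes from Scrapy's dynamic imports.
from PyInstaller.utils.hooks import collect_data_files, collect_submodules

# Bundle every scrapy submodule plus package data (VERSION, default settings, mime types)
# that Scrapy loads at runtime and PyInstaller cannot discover statically.
hiddenimports = collect_submodules('scrapy')
datas = collect_data_files('scrapy')
```

With the hook in place, the build command becomes: pyinstaller --onefile --additional-hooks-dir=. runspider.py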