scrapy-spider

Scrapy Shell: twisted.internet.error.ConnectionLost although USER_AGENT is set

假如想象 submitted on 2019-12-03 22:44:15
Question: When I try to scrape a certain web site (with both the spider and the shell), I get the following error:

    twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]

I found out that this can happen when no user agent is set, but after setting it manually I still got the same error. You can see the whole output of scrapy shell here: http://pastebin.com/ZFJZ2UXe Notes: I am not
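
A minimal way to check that the User-Agent (and other browser-like headers) actually reach the request is to build the Request explicitly inside the shell; some servers drop the connection unless several headers look like a real browser, not just the User-Agent. This is only a sketch: the URL and header values below are placeholders, not the site from the question.

    # Run `scrapy shell` first, then paste this in; `fetch` is a helper the
    # shell provides. URL and header values are illustrative placeholders.
    from scrapy import Request

    req = Request(
        "https://example.com",  # placeholder, not the site from the question
        headers={
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        },
    )
    fetch(req)          # the shell prints the response status on success
    # `request.headers` can then be inspected to confirm what was actually sent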

docker running splash container but localhost does not load (windows 10)

浪子不回头ぞ submitted on 2019-12-03 20:54:19
I am following this tutorial to use Splash to help with scraping web pages. I installed Docker Toolbox and did these two steps:

    $ docker pull scrapinghub/splash
    $ docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

I think it is running correctly, based on the message shown in the Docker window, which looks like this: However, when I open localhost:8050 in a web browser, it says the localhost is not working. What might have gone wrong in this case? Thanks!

VonC: You have mapped the port to your Docker host (the VM), but you have not port-forwarded that same port to your
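
With Docker Toolbox the container runs inside a VirtualBox VM, so Splash is reachable on the VM's IP rather than on the Windows host's localhost (unless port forwarding is also added in VirtualBox). A small sketch to test this, assuming the usual docker-machine default address 192.168.99.100 (check the output of `docker-machine ip default` for the real value):

    # Sketch: hit Splash's render.html endpoint on the VM's address instead of
    # localhost. 192.168.99.100 is only the common Docker Toolbox default.
    import requests

    SPLASH = "http://192.168.99.100:8050"   # assumed docker-machine IP
    resp = requests.get(
        SPLASH + "/render.html",
        params={"url": "https://example.com", "wait": 0.5},
    )
    print(resp.status_code)   # 200 means Splash is reachable and rendering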

Callback for redirected requests Scrapy

痴心易碎 submitted on 2019-12-03 20:22:14
I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests; it works fine for the non-redirected ones. I have the following code in the start_requests function:

    for user in users:
        yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies, headers=headers, dont_filter=True, callback=self.parse_p)

But self.parse_p is called only for the non-302 requests. I guess you get a callback for the final page (after the redirect); redirects are taken care of by the
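
If the goal is to have parse_p receive the 302 response itself instead of the page Scrapy follows to, the redirect middleware can be switched off per request. A sketch based on the loop from the question (users, userBaseUrl, cookies and headers are assumed to be defined as in the original code):

    # Inside start_requests(); the meta keys below are standard Scrapy knobs.
    for user in users:
        yield scrapy.Request(
            url=userBaseUrl + str(user['userId']),
            cookies=cookies,
            headers=headers,
            dont_filter=True,
            callback=self.parse_p,
            meta={
                'dont_redirect': True,           # RedirectMiddleware leaves the 302 alone
                'handle_httpstatus_list': [302], # let the callback see the 302 response
            },
        )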

How to scrape contents from multiple tables in a webpage

吃可爱长大的小学妹 submitted on 2019-12-03 20:04:17
I want to scrape contents from multiple tables in a webpage, and the HTML code goes like this:

    <div class="fixtures-table full-table-medium" id="fixtures-data">
      <h2 class="table-header"> Date 1 </h2>
      <table class="table-stats">
        <tbody>
          <tr class='preview' id='match-row-EFBO755307'>
            <td class='details'>
              <p>
                <span class='team-home teams'>
                  <a href='random_team'>team 1</a>
                </span>
                <span class='team-away teams'>
                  <a href='random_team'>team 2</a>
                </span>
              </p>
            </td>
          </tr>
          <tr class='preview' id='match-row-EFBO755307'>
            <td class='match-details'>
              <p>
                <span class='team-home teams'>
                  <a href='random_team'
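
One way to walk each table under #fixtures-data together with the <h2> heading that precedes it, sketched from the fragment above. The spider name and start URL are placeholders, and the selectors assume the markup continues in the same pattern as the excerpt.

    import scrapy

    class FixturesSpider(scrapy.Spider):
        name = 'fixtures'
        start_urls = ['https://example.com/fixtures']  # placeholder

        def parse(self, response):
            for table in response.css('div#fixtures-data table.table-stats'):
                # the date is the <h2 class="table-header"> just before each table
                date = table.xpath(
                    'preceding-sibling::h2[@class="table-header"][1]/text()'
                ).get()
                for row in table.css('tr.preview'):
                    yield {
                        'date': (date or '').strip(),
                        'home': row.css('span.team-home a::text').get(),
                        'away': row.css('span.team-away a::text').get(),
                    }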

Scrapy: Extract links and text

落爺英雄遲暮 submitted on 2019-12-03 15:33:28
Question: I am new to Scrapy and I am trying to scrape the Ikea website, the basic page with the list of locations given here. My items.py file is given below:

    import scrapy

    class IkeaItem(scrapy.Item):
        name = scrapy.Field()
        link = scrapy.Field()

And the spider is given below:

    import scrapy
    from ikea.items import IkeaItem

    class IkeaSpider(scrapy.Spider):
        name = 'ikea'
        allowed_domains = ['http://www.ikea.com/']
        start_urls = ['http://www.ikea.com/']

        def parse(self, response):
            for sel in
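
The spider is cut off above; a sketch of how the truncated parse() might continue, yielding the IkeaItem fields declared in items.py, could look roughly like the following. The XPath is an assumption and has to be adapted to the real Ikea markup. (As a side note, allowed_domains is normally a list of bare domains such as 'ikea.com', not full URLs.)

    def parse(self, response):
        # assumed selector -- adjust to the actual store-list markup
        for sel in response.xpath('//div[@class="storeList"]//a'):
            item = IkeaItem()
            item['name'] = sel.xpath('normalize-space(text())').extract_first()
            item['link'] = sel.xpath('@href').extract_first()
            yield item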

Scrapy - Understanding CrawlSpider and LinkExtractor

泄露秘密 submitted on 2019-12-03 15:03:38
Question: So I'm trying to use CrawlSpider and understand the following example from the Scrapy docs:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule
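
The docs snippet is cut off above; the sketch below is not a verbatim continuation, but it shows how Rule and LinkExtractor typically fit together in a complete CrawlSpider. The URL patterns and the parse_item output are illustrative.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Follow 'category.php' links; no callback means follow=True by default.
            Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
            # Parse 'item.php' pages with parse_item.
            Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            self.logger.info('Item page: %s', response.url)
            yield {'url': response.url, 'title': response.css('title::text').get()}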

Multiple inheritance in scrapy spiders

北战南征 submitted on 2019-12-03 13:34:13
Question: Is it possible to create a spider which inherits functionality from two base spiders, namely SitemapSpider and CrawlSpider? I have been trying to scrape data from various sites and realized that not all sites have a listing of every page on the website, hence the need to use CrawlSpider. But CrawlSpider goes through a lot of junk pages and is kind of overkill. What I would like to do is something like this: start my spider, which is a subclass of SitemapSpider, and pass regex matched
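
The question is truncated, but Python itself allows the combination it asks about; the subtlety is that SitemapSpider and CrawlSpider each define their own request and parsing plumbing, so the two have to be wired together deliberately. Below is a heavily hedged sketch of one way this is sometimes attempted, routing sitemap-matched pages through CrawlSpider's parse so the link-extraction rules also apply to them; the URLs and patterns are placeholders, and this is an illustration rather than a drop-in solution.

    from scrapy.spiders import CrawlSpider, SitemapSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class HybridSpider(SitemapSpider, CrawlSpider):
        name = 'hybrid'
        sitemap_urls = ['http://www.example.com/sitemap.xml']   # placeholder
        # Send sitemap-matched pages to CrawlSpider.parse so the rules below are
        # applied to them as well (assumed to be the behaviour the question wants).
        sitemap_rules = [(r'/shop/', 'parse')]
        rules = (
            Rule(LinkExtractor(allow=(r'/product/',)), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            yield {'url': response.url}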

How to prevent getting blacklisted while scraping Amazon [closed]

大城市里の小女人 submitted on 2019-12-03 10:17:01
Question: I am trying to scrape Amazon with Scrapy, but I get this error:

    DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> (failed 1 times): 503 Service Unavailable

I think that it's because Amazon is very good
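
The question is cut off, but the usual first line of defence against 503 throttling is simply crawling more slowly and looking less like a bot. A sketch of settings.py values that are commonly tried follows; the numbers are illustrative, nothing here guarantees Amazon will not block the crawler, and Amazon's terms of use still apply.

    # settings.py -- illustrative values only
    USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'  # browser-like UA
    DOWNLOAD_DELAY = 3                     # seconds between requests to one domain
    RANDOMIZE_DOWNLOAD_DELAY = True
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    AUTOTHROTTLE_ENABLED = True            # back off automatically under load
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_MAX_DELAY = 60
    RETRY_HTTP_CODES = [503]               # keep retrying throttled responses
    COOKIES_ENABLED = False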

Scrapy: catch responses with specific HTTP server codes

99封情书 submitted on 2019-12-03 08:06:20
We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504, etc. Something like this:

    class Spider(...):
        def parse(...):
            # processes HTTP 200
        def parse_500(...):
            # processes HTTP 500 errors
        def parse_502(...):
            # processes HTTP 502 errors
        ...

How can we do that?

alecxe: By default, Scrapy only handles responses with status codes 200-300. Let Scrapy handle 500 and 502:

    class Spider(...):
        handle_httpstatus_list = [500, 502]

Then, in the parse() callback, check response.status:

    def parse(response):
        if response.status ==
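
Expanding the answer above into a self-contained sketch; the status codes handled and the per-status methods are illustrative.

    import scrapy

    class StatusAwareSpider(scrapy.Spider):
        name = 'status_aware'
        start_urls = ['http://www.example.com']        # placeholder
        handle_httpstatus_list = [500, 502, 503, 504]  # let these reach parse()

        def parse(self, response):
            if response.status == 500:
                return self.parse_500(response)
            if response.status == 502:
                return self.parse_502(response)
            self.logger.info('OK %s: %s', response.status, response.url)

        def parse_500(self, response):
            self.logger.warning('HTTP 500 at %s', response.url)

        def parse_502(self, response):
            self.logger.warning('HTTP 502 at %s', response.url)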

Scrapy: AttributeError: 'list' object has no attribute 'iteritems'

和自甴很熟 submitted on 2019-12-03 06:38:38
Question: This is my first question on Stack Overflow. Recently I wanted to use linked-in-scraper, so I downloaded it, ran "scrapy crawl linkedin.com", and got the error message below. For your information, I use Anaconda 2.3.0 and Python 2.7.11. All the related packages, including scrapy and six, were updated with pip before executing the program.

    Traceback (most recent call last):
      File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
        sys.exit(execute())
      File "/Users/byeongsuyu/anaconda
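
The traceback is cut off above, so the cause here is only a guess: this particular AttributeError is often seen when a settings value that current Scrapy expects to be a dict, typically ITEM_PIPELINES, is still declared as a list in an older project such as linked-in-scraper. A sketch of the dict form, with a purely illustrative pipeline path:

    # settings.py -- old list form that can trigger the iteritems error:
    # ITEM_PIPELINES = ['linkedin.pipelines.LinkedinPipeline']

    # dict form expected by current Scrapy; the path and the order value (300)
    # are illustrative
    ITEM_PIPELINES = {
        'linkedin.pipelines.LinkedinPipeline': 300,
    }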