scrapy

Scrapy: effective way to test inline requests

Submitted by 半世苍凉 on 2021-02-06 12:48:44
Question: I wrote a spider using the scrapy-inline-requests library, so the parse method in my spider looks something like this:

    @inline_requests
    def parse(self, response1):
        item = MyItem()
        loader = ItemLoader(item=item, response=response1)
        # extracting some data from response1
        try:
            response2 = yield Request(some_url)
            # extracting some other data from response2
        except Exception:
            self.logger.warning("Failed request to: %s", some_url)
        yield loader.load_item()

I want to effectively test this method. I can
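
One way to exercise such a method offline is to drive the generator by hand: a minimal sketch, assuming the undecorated generator is reachable (for example by keeping the body in a plain helper method, since @inline_requests wraps it). The drive helper, MySpider, parse_logic, and the URLs below are illustrative assumptions, not part of the question.

    from scrapy.http import HtmlResponse, Request

    def drive(gen, canned):
        # Resume the generator with a canned response for every yielded
        # Request; collect everything else as a scraped item.
        items = []
        try:
            sent = next(gen)
            while True:
                if isinstance(sent, Request):
                    sent = gen.send(canned[sent.url])
                else:
                    items.append(sent)
                    sent = next(gen)
        except StopIteration:
            pass
        return items

    spider = MySpider()
    start = HtmlResponse(url="http://example.com", body=b"<html/>", encoding="utf-8")
    canned = {"http://example.com/detail": HtmlResponse(
        url="http://example.com/detail", body=b"<html/>", encoding="utf-8")}
    # parse_logic: hypothetical undecorated version of the parse generator
    items = drive(spider.parse_logic(start), canned)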

Nonblocking Scrapy pipeline to database

Submitted by 允我心安 on 2021-02-06 11:56:14
Question: I have a web scraper in Scrapy that gets data items. I want to asynchronously insert them into a database as well. For example, I have a transaction that inserts some items into my db using SQLAlchemy Core:

    def process_item(self, item, spider):
        with self.connection.begin() as conn:
            conn.execute(insert(table1).values(item['part1']))
            conn.execute(insert(table2).values(item['part2']))

I understand that it's possible to use SQLAlchemy Core asynchronously with Twisted with alchimia. The
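
One non-blocking option, a thread-pool workaround rather than the alchimia approach the question asks about: Scrapy allows process_item to return a Twisted Deferred, so the blocking transaction can be pushed to a worker thread. A minimal sketch, reusing the question's connection/table1/table2 names:

    from sqlalchemy import insert
    from twisted.internet.threads import deferToThread

    class DbPipeline:
        def _insert(self, item):
            # Runs in a worker thread, so blocking here is fine.
            # Assumes self.connection is a thread-safe SQLAlchemy engine.
            with self.connection.begin() as conn:
                conn.execute(insert(table1).values(item['part1']))
                conn.execute(insert(table2).values(item['part2']))
            return item

        def process_item(self, item, spider):
            # Returning a Deferred keeps the reactor free; the item is
            # passed to later pipelines once the insert finishes.
            return deferToThread(self._insert, item)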

How to solve 403 error in scrapy

Submitted by 陌路散爱 on 2021-02-06 11:10:11
Question: I'm new to Scrapy and I made a Scrapy project to scrape data. I'm trying to scrape the data from the website, but I'm getting the following error logs:

    2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines: []
    2016-08-29 13:55:03 [scrapy] INFO: Spider opened
    2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
    2016-08-29 13:55:04 [scrapy]
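
Since even the robots.txt fetch is answered with 403, the site is most likely rejecting Scrapy's default User-Agent. A minimal sketch of the usual settings.py changes, not a guaranteed fix, and check the site's terms before disabling robots.txt compliance:

    # settings.py
    # Present a browser-like User-Agent instead of Scrapy's default.
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/91.0.4472.124 Safari/537.36')
    # The robots.txt request itself returns 403 here; disable the check
    # only if crawling the site is permitted.
    ROBOTSTXT_OBEY = False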

CSS Selector to get the element attribute value

Submitted by 此生再无相见时 on 2021-02-06 10:50:45
Question: The HTML structure is like this:

    <td class='hey'>
        <a href="https://example.com">First one</a>
    </td>

This is my selector:

    m_URL = sel.css("td.hey a:nth-child(1)[href]").extract()

My selector now outputs <a href="https://example.com">First one</a>, but I only want it to output the link itself: https://example.com. How can I do that?

Answer 1: Get the attribute value with ::attr(href) on the a tag. Demo (using Scrapy shell):

    $ scrapy shell index.html
    >>> response.css('td.hey a:nth-child(1)::attr(href)').extract
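
For context, a hedged sketch of the same extraction inside a spider callback; .get() and .attrib are the modern parsel spellings, and the yielded field name is illustrative:

    def parse(self, response):
        # ::attr(href) selects the attribute value instead of the element.
        m_url = response.css("td.hey a:nth-child(1)::attr(href)").get()
        # Equivalent, via the selector's attribute mapping:
        # m_url = response.css("td.hey a")[0].attrib["href"]
        yield {"url": m_url}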

For scrapy/selenium is there a way to go back to a previous page?

Submitted by 一世执手 on 2021-02-05 20:22:37
Question: I essentially have a start_url that has my JavaScript search form and button, hence the need for Selenium. I use Selenium to select the appropriate items in my select box objects and click the search button. On the following page, I do some Scrapy magic. However, now I want to go BACK to the original start_url, fill out a different object, etc., and repeat until there are no more. Essentially, I have tried making a for-loop and trying to get the browser to go back to the original response.url, but
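
A minimal sketch of one approach, outside the question's code: reload start_url at the top of each loop pass rather than relying on browser history. All element IDs, option values, and the URL here are hypothetical.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Firefox()
    start_url = "http://example.com/search"       # hypothetical URL

    for value in ["option1", "option2"]:          # hypothetical option values
        driver.get(start_url)                     # fresh copy of the form
        Select(driver.find_element(By.ID, "search-box")).select_by_value(value)
        driver.find_element(By.ID, "search-button").click()
        # ... scrape the results page (e.g. hand driver.page_source to Scrapy) ...

    # driver.back() also returns to the previous page, but re-issuing get()
    # avoids stale elements on JavaScript-rendered forms.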

How to extract data from tags that are children of another tag using Scrapy and Python?

Submitted by 浪尽此生 on 2021-02-05 12:23:01
Question: This is the HTML code from which I want to extract data, but whenever I run my spider I get some random values. Can anyone please help me with this? I want to extract the following: Mumbai, Maharashtra, 1958, government, UGC, and Indian Institute of Technology, Bombay.

HTML:

    <div class="instituteInfo">
        <ul class="clg-info">
            <li>
                <a href="link here" target="_blank">Mumbai</a>,
                <a href="link here" target="_blank">Maharashtra</a>
            </li>
            <li>Estd : <span>1958</span></li>
            <li>Ownership : <span
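
A hedged sketch of selectors for the snippet above; the third li is truncated in the excerpt, so the ownership selector is an assumption, and getall()/get() are the modern spellings of extract()/extract_first():

    def parse(self, response):
        info = response.css("div.instituteInfo ul.clg-info")
        # First li holds two links: city and state.
        city, state = info.css("li:nth-child(1) a::text").getall()[:2]
        estd = info.css("li:nth-child(2) span::text").get()
        ownership = info.css("li:nth-child(3) span::text").get()  # assumed layout
        yield {"city": city, "state": state,
               "estd": estd, "ownership": ownership}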

Scrapy - set delay to retry middleware

Submitted by 邮差的信 on 2021-02-04 19:44:46
Question: I'm using Scrapy-Splash and I have a problem with memory. I can clearly see that the memory used by the docker python3 process gradually increases until the PC freezes. I can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way three HTML pages consume 10 GB of RAM. So there is a workaround: set maxrss to some reasonable value, and when RAM usage reaches that value, docker is restarted and the RAM is flushed. But the problem is that while docker is down, scrapy continues sending
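
The question's title points at the usual remedy: make the retry middleware wait before re-issuing requests, so the restarted container has time to come back. The following is a community-style workaround rather than an official API, and its behavior may vary across Scrapy versions: subclass RetryMiddleware and return a Deferred that fires with the retried request after a delay.

    from twisted.internet import reactor
    from twisted.internet.task import deferLater
    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class DelayedRetryMiddleware(RetryMiddleware):
        RETRY_DELAY = 10  # seconds; roughly the container restart time

        def _retry(self, request, reason, spider):
            retried = super()._retry(request, reason, spider)
            if retried is None:
                return None
            # Deferring the retried request gives Splash time to restart;
            # the middleware chain waits for the Deferred to fire.
            return deferLater(reactor, self.RETRY_DELAY, lambda: retried)

Enable it in settings.py in place of the built-in middleware (the project path is hypothetical):

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'myproject.middlewares.DelayedRetryMiddleware': 550,
    }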