scrapy

Scrapy: effective way to test inline requests

Submitted by 半世苍凉 on 2021-02-06 12:48:44
Question: I wrote a spider using the scrapy-inline-requests library, so the parse method in my spider looks something like this:

    @inline_requests
    def parse(self, response1):
        item = MyItem()
        loader = ItemLoader(item=item, response=response1)
        # extracting some data from response1
        try:
            response2 = yield Request(some_url)
            # extracting some other data from response2
        except Exception:
            self.logger.warning("Failed request to: %s", some_url)
        yield loader.load_item()

I want to effectively test this method. I can
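
One way to exercise such a method offline is to drive the generator by hand: a minimal sketch, assuming the undecorated generator is reachable (for example by keeping the body in a plain helper method, since @inline_requests wraps it). The drive helper, MySpider, parse_logic, and the URLs below are illustrative assumptions, not part of the question.

    from scrapy.http import HtmlResponse, Request

    def drive(gen, canned):
        # Resume the generator with a canned response for every yielded
        # Request; collect everything else as a scraped item.
        items = []
        try:
            sent = next(gen)
            while True:
                if isinstance(sent, Request):
                    sent = gen.send(canned[sent.url])
                else:
                    items.append(sent)
                    sent = next(gen)
        except StopIteration:
            pass
        return items

    spider = MySpider()
    start = HtmlResponse(url="http://example.com", body=b"<html/>", encoding="utf-8")
    canned = {"http://example.com/detail": HtmlResponse(
        url="http://example.com/detail", body=b"<html/>", encoding="utf-8")}
    # parse_logic: hypothetical undecorated version of the parse generator
    items = drive(spider.parse_logic(start), canned)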

Nonblocking Scrapy pipeline to database

Submitted by 允我心安 on 2021-02-06 11:56:14
Question: I have a web scraper in Scrapy that gets data items. I want to asynchronously insert them into a database as well. For example, I have a transaction that inserts some items into my db using SQLAlchemy Core:

    def process_item(self, item, spider):
        with self.connection.begin() as conn:
            conn.execute(insert(table1).values(item['part1']))
            conn.execute(insert(table2).values(item['part2']))

I understand that it's possible to use SQLAlchemy Core asynchronously with Twisted with alchimia. The
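
One non-blocking option, a thread-pool workaround rather than the alchimia approach the question asks about: Scrapy allows process_item to return a Twisted Deferred, so the blocking transaction can be pushed to a worker thread. A minimal sketch, reusing the question's connection/table1/table2 names:

    from sqlalchemy import insert
    from twisted.internet.threads import deferToThread

    class DbPipeline:
        def _insert(self, item):
            # Runs in a worker thread, so blocking here is fine.
            # Assumes self.connection is a thread-safe SQLAlchemy engine.
            with self.connection.begin() as conn:
                conn.execute(insert(table1).values(item['part1']))
                conn.execute(insert(table2).values(item['part2']))
            return item

        def process_item(self, item, spider):
            # Returning a Deferred keeps the reactor free; the item is
            # passed to later pipelines once the insert finishes.
            return deferToThread(self._insert, item)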

How to solve 403 error in scrapy

Submitted by 陌路散爱 on 2021-02-06 11:10:11
Question: I'm new to Scrapy and I made a Scrapy project to scrape data. I'm trying to scrape the data from the website, but I'm getting the following error logs:

    2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines: []
    2016-08-29 13:55:03 [scrapy] INFO: Spider opened
    2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
    2016-08-29 13:55:04 [scrapy]
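
Since even the robots.txt fetch is answered with 403, the site is most likely rejecting Scrapy's default User-Agent. A minimal sketch of the usual settings.py changes, not a guaranteed fix, and check the site's terms before disabling robots.txt compliance:

    # settings.py
    # Present a browser-like User-Agent instead of Scrapy's default.
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/91.0.4472.124 Safari/537.36')
    # The robots.txt request itself returns 403 here; disable the check
    # only if crawling the site is permitted.
    ROBOTSTXT_OBEY = False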

CSS Selector to get the element attribute value

Submitted by 此生再无相见时 on 2021-02-06 10:50:45
Question: The HTML structure is like this:

    <td class='hey'>
        <a href="https://example.com">First one</a>
    </td>

This is my selector:

    m_URL = sel.css("td.hey a:nth-child(1)[href]").extract()

My selector now outputs <a href="https://example.com">First one</a>, but I only want it to output the link itself: https://example.com. How can I do that?

Answer 1: Get the attribute value with ::attr(href) on the a tag. Demo (using Scrapy shell):

    $ scrapy shell index.html
    >>> response.css('td.hey a:nth-child(1)::attr(href)').extract
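
For context, a hedged sketch of the same extraction inside a spider callback; .get() and .attrib are the modern parsel spellings, and the yielded field name is illustrative:

    def parse(self, response):
        # ::attr(href) selects the attribute value instead of the element.
        m_url = response.css("td.hey a:nth-child(1)::attr(href)").get()
        # Equivalent, via the selector's attribute mapping:
        # m_url = response.css("td.hey a")[0].attrib["href"]
        yield {"url": m_url}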

For scrapy/selenium is there a way to go back to a previous page?

Submitted by 一世执手 on 2021-02-05 20:22:37
Question: I essentially have a start_url that has my JavaScript search form and button, hence the need for Selenium. I use Selenium to select the appropriate items in my select box objects and click the search button. On the following page, I do some Scrapy magic. However, now I want to go BACK to the original start_url, fill out a different object, etc., and repeat until there are no more. Essentially, I have tried making a for-loop and trying to get the browser to go back to the original response.url, but
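
A minimal sketch of one approach, outside the question's code: reload start_url at the top of each loop pass rather than relying on browser history. All element IDs, option values, and the URL here are hypothetical.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Firefox()
    start_url = "http://example.com/search"       # hypothetical URL

    for value in ["option1", "option2"]:          # hypothetical option values
        driver.get(start_url)                     # fresh copy of the form
        Select(driver.find_element(By.ID, "search-box")).select_by_value(value)
        driver.find_element(By.ID, "search-button").click()
        # ... scrape the results page (e.g. hand driver.page_source to Scrapy) ...

    # driver.back() also returns to the previous page, but re-issuing get()
    # avoids stale elements on JavaScript-rendered forms.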

How to extract data from tags that are children of another tag using Scrapy and Python?

Submitted by 浪尽此生 on 2021-02-05 12:23:01
Question: This is the HTML code from which I want to extract data, but whenever I run my spider I get some random values. Can anyone please help me with this? I want to extract the following: Mumbai, Maharashtra, 1958, government, UGC, and Indian Institute of Technology, Bombay.

HTML:

    <div class="instituteInfo">
        <ul class="clg-info">
            <li>
                <a href="link here" target="_blank">Mumbai</a>,
                <a href="link here" target="_blank">Maharashtra</a>
            </li>
            <li>Estd : <span>1958</span></li>
            <li>Ownership : <span
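
A hedged sketch of selectors for the snippet above; the third li is truncated in the excerpt, so the ownership selector is an assumption, and getall()/get() are the modern spellings of extract()/extract_first():

    def parse(self, response):
        info = response.css("div.instituteInfo ul.clg-info")
        # First li holds two links: city and state.
        city, state = info.css("li:nth-child(1) a::text").getall()[:2]
        estd = info.css("li:nth-child(2) span::text").get()
        ownership = info.css("li:nth-child(3) span::text").get()  # assumed layout
        yield {"city": city, "state": state,
               "estd": estd, "ownership": ownership}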

Scrapy - set delay to retry middleware

Submitted by 邮差的信 on 2021-02-04 19:44:46
Question: I'm using Scrapy-Splash and I have a problem with memory. I can clearly see that the memory used by the docker python3 process gradually increases until the PC freezes. I can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way three HTML pages consume 10 GB of RAM. So there is a workaround: set maxrss to some reasonable value, and when RAM usage reaches that value, docker is restarted and the RAM is flushed. But the problem is that while docker is down, scrapy continues sending
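
The question's title points at the usual remedy: make the retry middleware wait before re-issuing requests, so the restarted container has time to come back. The following is a community-style workaround rather than an official API, and its behavior may vary across Scrapy versions: subclass RetryMiddleware and return a Deferred that fires with the retried request after a delay.

    from twisted.internet import reactor
    from twisted.internet.task import deferLater
    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class DelayedRetryMiddleware(RetryMiddleware):
        RETRY_DELAY = 10  # seconds; roughly the container restart time

        def _retry(self, request, reason, spider):
            retried = super()._retry(request, reason, spider)
            if retried is None:
                return None
            # Deferring the retried request gives Splash time to restart;
            # the middleware chain waits for the Deferred to fire.
            return deferLater(reactor, self.RETRY_DELAY, lambda: retried)

Enable it in settings.py in place of the built-in middleware (the project path is hypothetical):

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'myproject.middlewares.DelayedRetryMiddleware': 550,
    }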