scrapy

Scrapy Body Text Only

情到浓时终转凉″ posted on 2020-06-11 20:11:40
Question: I am trying to scrape only the text from the body of a page using Python Scrapy, but haven't had any luck yet. I'm hoping someone here can help me scrape all the text from the <body> tag.

Answer 1: Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:

    x.select("//body").extract()  # extract body

You can find more information
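A minimal sketch of the same idea with the current selector API (response.xpath replaces the legacy x.select); the spider name and start URL are hypothetical:

    import scrapy

    class BodyTextSpider(scrapy.Spider):
        name = "body_text"
        start_urls = ["http://example.com"]  # hypothetical URL

        def parse(self, response):
            # //body//text() matches every text node nested under <body>
            texts = response.xpath("//body//text()").getall()
            yield {"body_text": " ".join(t.strip() for t in texts if t.strip())}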

Is it possible to scrape all text messages from Whatsapp Web with Scrapy?

房东的猫 posted on 2020-06-11 05:45:40
Question: I've been experimenting with web scraping using Scrapy, and I am interested in retrieving all text messages from all chats on WhatsApp to use as training data for a machine learning project. I know there are websites that block web crawlers/scrapers, so I would like to know whether it is possible to use Scrapy to obtain these messages, and if it isn't possible, what are some alternatives I can use? I understand that I can click on the "Email chat" option for each chat, but this might not be

Removing null values from scraped data without removing the entire item

怎甘沉沦 posted on 2020-06-01 07:38:07
Question: I am using Scrapy to scrape data from the New York Times website, but the scraped data is full of null values I don't want, so I modified the pipeline.py script to clean my extracted data. When I extract just one or two values it works like a charm, but when I extract multiple values, since there is at least one null value in each extracted row, the pipeline ends up deleting almost all my data. Is there a way to stop this from happening? Here is my spider file: # -
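A minimal sketch of a pipeline that cleans empty fields in place instead of dropping the whole item; the placeholder value, and the choice to replace rather than discard, are assumptions about the desired behavior:

    class CleanNullValuesPipeline:
        """Replace null/empty fields with a placeholder instead of
        dropping the entire item when any single field is missing."""

        def process_item(self, item, spider):
            for field, value in item.items():
                if value in (None, "", []):
                    item[field] = "N/A"  # hypothetical placeholder
            return item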

scrapy spider add health check before starting spider

六眼飞鱼酱① posted on 2020-05-31 05:42:10
Question: I would like to not start the spider job if the external APIs it depends on (Cassandra, MySQL, etc.) are not reachable:

    import json
    import logging

    from cassandra.cluster import Cluster

    class HealthCheck:
        @staticmethod
        def is_healthy():
            # configHelper is the asker's own configuration module
            config = json.loads(configHelper.get_data())
            cassandra_config = config['cassandra']
            cluster = Cluster(cassandra_config['hosts'], port=cassandra_config['port'])
            session = cluster.connect(cassandra_config['keyspace'])
            try:
                session.execute('SELECT 1')
            except Exception as e:
                logging.error(e)
                return False  # unreachable: report unhealthy, not healthy
            return True

I can invoke the is_healthy inside
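A minimal sketch of gating the crawl on that check from a runner script; HealthCheck comes from the question, while the spider class and exit handling are hypothetical:

    import sys

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # HealthCheck and MySpider would be imported from the project (hypothetical paths)

    if __name__ == "__main__":
        if not HealthCheck.is_healthy():
            sys.exit("Dependencies unreachable; not starting the spider.")
        process = CrawlerProcess(get_project_settings())
        process.crawl(MySpider)  # hypothetical spider class
        process.start()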

Send email alert using Scrapy after multiple spiders have finished crawling

谁都会走 posted on 2020-05-31 03:36:01
Question: Just wondering what the best way to implement this is. I have 2 spiders, and I want to send an email alert depending on what is scraped after the 2 spiders have finished crawling. I'm using a script based on the tutorial to run both spiders like so:

    if __name__ == "__main__":
        process = CrawlerProcess(get_project_settings())
        process.crawl(NqbpSpider)
        process.crawl(GladstoneSpider)
        process.start()  # the script will block here until the crawling is finished

Is it best to call an email function
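Since process.start() blocks until every scheduled crawl has finished, one approach is simply to send the alert after it returns. A minimal sketch using the standard library's smtplib, because the Twisted reactor has already stopped by that point (scrapy.mail.MailSender needs a running reactor); the addresses and SMTP host are hypothetical:

    import smtplib
    from email.message import EmailMessage

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # NqbpSpider and GladstoneSpider are the asker's spider classes

    if __name__ == "__main__":
        process = CrawlerProcess(get_project_settings())
        process.crawl(NqbpSpider)
        process.crawl(GladstoneSpider)
        process.start()  # blocks until both spiders have finished

        msg = EmailMessage()
        msg["Subject"] = "Crawl finished"
        msg["From"] = "alerts@example.com"  # hypothetical addresses
        msg["To"] = "you@example.com"
        msg.set_content("Both spiders have finished crawling.")
        with smtplib.SMTP("localhost") as smtp:  # hypothetical SMTP host
            smtp.send_message(msg)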

Scrapy Installation (Microsoft Visual C++ 14.0 is required)

戏子无情 posted on 2020-05-30 19:15:26
Question: I have been trying to install Scrapy for days now with the command pip install scrapy. After downloading the requirements, I get this error:

    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
    ----------------------------------------
    ERROR: Command errored out with exit status 1: 'c:\users\pancore builders\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools,

Scrapy - Use feed exporter for a particular spider (and not others) in a project

生来就可爱ヽ(ⅴ<●) posted on 2020-05-29 12:00:10
Question: ENVIRONMENT: Windows 7, Python 3.6.5, Scrapy 1.5.1. PROBLEM DESCRIPTION: I have a Scrapy project called project_github which contains 3 spiders: spider1, spider2, spider3. Each of these spiders scrapes data from a particular website specific to that spider. I am trying to automatically export a JSON file when a particular spider is executed, named in the format NameOfSpider_TodaysDate.json, so that from the command line I can execute scrapy crawl spider1, which returns spider1
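A minimal sketch of one way to do this per spider with custom_settings, using the FEED_URI/FEED_FORMAT settings supported by the asker's Scrapy 1.5 (newer releases use the FEEDS setting instead); note that %(time)s expands to a full UTC timestamp rather than just today's date:

    import scrapy

    class Spider1(scrapy.Spider):
        name = "spider1"

        # custom_settings applies only to this spider, so the other
        # spiders in the project keep their own (or no) feed exports.
        custom_settings = {
            "FEED_FORMAT": "json",
            # %(name)s and %(time)s are built-in feed URI placeholders
            "FEED_URI": "%(name)s_%(time)s.json",
        }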

scrapy: convert html string to HtmlResponse object

空扰寡人 posted on 2020-05-24 08:57:06
Question: I have a raw HTML string that I want to convert to a Scrapy HtmlResponse object so that I can use the css and xpath selectors, similar to Scrapy's response. How can I do it?

Answer 1: First of all, if it is for debugging or testing purposes, you can use the Scrapy shell:

    $ cat index.html
    <div id="test">
        Test text
    </div>
    $ scrapy shell index.html
    >>> response.xpath('//div[@id="test"]/text()').extract()[0].strip()
    u'Test text'

There are different objects available in the shell during the session,
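For use inside a script, a minimal sketch of building the response object directly from a string; HtmlResponse requires a url argument, which can be any placeholder:

    from scrapy.http import HtmlResponse

    raw_html = '<div id="test"> Test text </div>'
    response = HtmlResponse(url="http://example.com", body=raw_html, encoding="utf-8")

    # .get() is the newer spelling of .extract_first()
    print(response.xpath('//div[@id="test"]/text()').get().strip())  # Test text
    print(response.css("#test::text").get().strip())                 # Test text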

ReactorNotRestartable with scrapy when using Google Cloud Functions

不问归期 posted on 2020-05-23 11:14:51
Question: I am trying to send multiple crawl requests with Google Cloud Functions. However, I keep getting the ReactorNotRestartable error. From other posts on Stack Overflow, such as this one, I understand that this happens because the Twisted reactor cannot be restarted, in particular when crawling in a loop. The way to solve this is to put start() outside the for loop. However, with Cloud Functions this is not possible, as each request should technically be independent. Is the CrawlerProcess
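A commonly suggested workaround (an assumption here, not something the question confirms) is to run each crawl in a child process, so every invocation gets a fresh interpreter and therefore a fresh reactor. A minimal sketch, with the entry point and spider class hypothetical:

    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_cls):
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_cls)
        process.start()  # safe here: this child process's reactor starts only once

    def crawl(request):  # hypothetical Cloud Functions entry point
        p = Process(target=_run_spider, args=(MySpider,))  # hypothetical spider class
        p.start()
        p.join()
        return "done"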