web-crawler

What is the easiest way to run Python scripts on a cloud server?

Submitted by 空扰寡人 on 2019-12-03 05:26:08
I have a web-crawling Python script that takes hours to complete and is infeasible to run in its entirety on my local machine. Is there a convenient way to deploy this to a simple web server? The script basically downloads webpages into text files. How would this best be accomplished? Thanks! Since you said that performance is a problem and you are doing web scraping, the first thing to try is the Scrapy framework; it is a very fast and easy-to-use web-scraping framework. The scrapyd tool would allow you to distribute the crawling: you can have multiple scrapyd services running on different servers.
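For illustration, a minimal sketch of the kind of Scrapy spider the answer has in mind: it fetches pages, writes each response body to a text file, and follows links. The seed URL and the file-naming scheme are placeholder assumptions, not part of the original question.

    import scrapy

    class PageDumpSpider(scrapy.Spider):
        name = "pagedump"
        start_urls = ["https://example.com/"]   # placeholder seed URL

        def parse(self, response):
            # save the page body as a text file
            filename = response.url.split("/")[-1] or "index"
            with open(f"{filename}.txt", "w", encoding="utf-8") as f:
                f.write(response.text)
            # follow links on the page and crawl them too
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Packaged as a Scrapy project, such a spider can then be scheduled on one or more machines running scrapyd through its schedule.json endpoint (port 6800 by default), which is what makes the distributed setup mentioned above possible.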

Typical politeness factor for a web crawler?

Submitted by China☆狼群 on 2019-12-03 05:19:42
Question: What is a typical politeness factor for a web crawler, apart from always obeying robots.txt (both the "Disallow:" and the non-standard "Crawl-delay:" directives)? If a site does not specify an explicit crawl-delay, what should the default value be set to?
Answer 1: The algorithm we use is:
// If we are blocked by robots.txt, make sure it is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// and an email address to be added to so that we never even consider their domain in …
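Where no Crawl-delay is given, one simple policy is to fall back to a fixed per-host default. A minimal Python sketch of that idea, assuming a hypothetical one-second default (not a value taken from the answer above):

    import time
    from urllib import robotparser

    DEFAULT_DELAY = 1.0  # assumed fallback when the site sets no Crawl-delay

    def polite_delay(robots_url: str, user_agent: str) -> float:
        # Honour an explicit Crawl-delay if present, else use the default.
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        delay = rp.crawl_delay(user_agent)   # None when no directive exists
        return float(delay) if delay else DEFAULT_DELAY

    # between two requests to the same host:
    # time.sleep(polite_delay("https://example.com/robots.txt", "MyBot"))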

How to Programmatically take Snapshot of Crawled Webpages (in Ruby)?

Submitted by ぐ巨炮叔叔 on 2019-12-03 05:06:17
What is the best solution to programmatically take a snapshot of a webpage? The situation is this: I would like to crawl a bunch of webpages and take thumbnail snapshots of them periodically, say once every few months, without having to go to each one manually. I would also like to be able to take JPG/PNG snapshots of websites that might be completely Flash/Flex, so I'd have to wait until the page has loaded before taking the snapshot somehow. It would be nice if there were no limit to the number of thumbnails I could generate (within reason, say 1,000 per day). Any ideas how to do this in Ruby? Seems pretty …
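The question asks for Ruby; purely as an illustration of the general approach (drive a headless browser, wait for the page to settle, then save an image), here is a sketch in Python with Selenium. The fixed five-second wait and the output path are arbitrary assumptions:

    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")
        time.sleep(5)                           # crude wait for dynamic content
        driver.save_screenshot("example.png")   # full-window PNG snapshot
    finally:
        driver.quit()

The same pattern is available from Ruby through the Selenium browser-automation bindings, which is closer to what the asker wants.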

Alternative to HtmlUnit

Submitted by 给你一囗甜甜゛ on 2019-12-03 04:55:57
Question: I have been researching the headless browsers available to date and found that HtmlUnit is used pretty extensively. Is there any alternative to HtmlUnit with possible advantages compared to HtmlUnit? Thanks, Nayn
Answer 1: As far as I know, HtmlUnit is the most powerful headless browser. What are your issues with it?
Answer 2: There are many other libraries that you can use for this. If you need to scrape XML-based data, use JTidy. If you need to scrape specific data from HTML, you can use Jsoup.

Robots.txt: allow only major SE

Submitted by 纵饮孤独 on 2019-12-03 04:39:53
Is there a way to configure robots.txt so that the site accepts visits ONLY from the Google, Yahoo! and MSN spiders?

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-Agent: msnbot
    Disallow:

(Slurp is Yahoo's robot.) Why? Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary. But, if you insist on doing it anyway, that's what the User-agent: line in robots.txt is for:

    User-agent: googlebot
    Disallow:

    User-agent: *
    Disallow: /
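Following the pattern from the answer, a complete file that allows only the three crawlers named in the question might look like this (a sketch; the exact user-agent tokens should be checked against each engine's documentation):

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: *
    Disallow: /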

How do Scrapy rules work with crawl spider

Submitted by 落花浮王杯 on 2019-12-03 04:19:24
Question: I have a hard time understanding Scrapy crawl spider rules. I have an example that doesn't work as I would like it to, so it can be one of two things: I don't understand how rules work, or I formed an incorrect regex that prevents me from getting the results I need. OK, here is what I want to do: I want to write a crawl spider that will get all available statistics information from the http://www.euroleague.net website. The website page that hosts all the information that I need for the start is here. Step 1: First …
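For reference, a minimal sketch of how CrawlSpider rules fit together; the allow= patterns and the callback name below are placeholder assumptions, not the asker's actual regexes:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class StatsSpider(CrawlSpider):
        name = "euroleague_stats"
        allowed_domains = ["euroleague.net"]
        start_urls = ["http://www.euroleague.net"]

        rules = (
            # follow=True: keep extracting and following links that match
            Rule(LinkExtractor(allow=r"/main/results/"), follow=True),
            # matching links are fetched and handed to parse_stats
            Rule(LinkExtractor(allow=r"/showgame"), callback="parse_stats"),
        )

        def parse_stats(self, response):
            # CrawlSpider reserves parse() for its own link logic,
            # so the callback must have a different name.
            yield {"url": response.url, "title": response.css("title::text").get()}

Each rule is applied to every downloaded page: the link extractor pulls matching links, and the rule decides whether they are followed further, passed to a callback, or both.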

PhantomJS using too many threads

Submitted by 故事扮演 on 2019-12-03 04:10:32
I wrote a PhantomJS app to crawl over a site I built and check that a JavaScript file is included. The JavaScript is similar to Google, where some inline code loads in another JS file. The app looks for that other JS file, which is why I used Phantom. What's the expected result? The console output should read through a ton of URLs and then tell whether the script is loaded or not. What's really happening? The console output reads as expected for about 50 requests and then just starts spitting out this error: 2013-02-21T10:01:23 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a …

Is it possible to plug a JavaScript engine with Ruby and Nokogiri?

Submitted by 情到浓时终转凉″ on 2019-12-03 04:06:09
I'm writing an application to crawl some websites and scrape data from them. I'm using Ruby, Curl and Nokogiri to do this. In most cases it's straightforward: I only need to ping a URL and parse the HTML data, and the setup works perfectly fine. However, in some scenarios the websites retrieve data based on user input on some radio buttons. This invokes some JavaScript which fetches more data from the server; the generated URL and posted data are determined by JavaScript code. Is it possible to use a JavaScript library along with this setup which would be able to determine and execute the …

Scrapy Spider for JSON Response

Submitted by 蓝咒 on 2019-12-03 04:05:38
I am trying to write a spider which crawls through the following JSON response: http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json How would the spider look if I wanted to crawl all the titles of the videos? None of my spiders work.

    from scrapy.spider import BaseSpider
    import json
    from youtube.items import YoutubeItem

    class MySpider(BaseSpider):
        name = "youtubecrawler"
        allowed_domains = ["gdata.youtube.com"]
        start_urls = ['http://www.gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

        def parse(self, response):
            items = []
            jsonresponse = json …
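A sketch of how the title extraction could look, written against the current scrapy.Spider API and assuming the old GData v2 JSON layout of feed → entry → title → $t (that API has since been retired, so this is illustrative only):

    import json
    import scrapy

    class YoutubeTitleSpider(scrapy.Spider):
        name = "youtubecrawler"
        allowed_domains = ["gdata.youtube.com"]
        start_urls = [
            "http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json"
        ]

        def parse(self, response):
            data = json.loads(response.text)
            # each feed entry carries its title under the GData "$t" key
            for entry in data["feed"]["entry"]:
                yield {"title": entry["title"]["$t"]}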

How to collect data from multiple pages into a single data structure with scrapy

Submitted by 两盒软妹~` on 2019-12-03 04:04:10
I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of data: for example, people with names, ages, and occupations. My problem is that this data is split across two levels of the website. The first page is, say, a list of names and ages with a link to each person's profile page, and the profile page lists their occupation. I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple paginations. But how can I collect the data from the inner pages while keeping it linked to the …
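The usual Scrapy pattern for this is to build the item on the listing page and hand it to the detail-page callback through the request's meta, filling in the remaining fields there. A sketch with placeholder selectors and URLs (not the asker's actual site):

    import scrapy

    class PeopleSpider(scrapy.Spider):
        name = "people"
        start_urls = ["https://example.com/people?page=1"]

        def parse(self, response):
            for row in response.css("div.person"):
                item = {
                    "name": row.css("a.name::text").get(),
                    "age": row.css("span.age::text").get(),
                }
                profile_url = row.css("a.name::attr(href)").get()
                # carry the half-built item along to the profile page
                yield response.follow(profile_url, callback=self.parse_profile,
                                      meta={"item": item})
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_profile(self, response):
            item = response.meta["item"]
            item["occupation"] = response.css("span.occupation::text").get()
            yield item

Each profile request carries its own copy of the dictionary, so the occupation ends up attached to the right person even though the fields come from two different pages.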