web-crawler

Prevent site data from being crawled and ripped

こ雲淡風輕ζ submitted on 2019-11-27 14:40:01
Question: I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search. What measures can I take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether. For example, I thought about randomly changing small bits of the HTML structure used to display my data, but I guess it wouldn't really be effective. Answer 1: Any site that it…
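
The answer above is truncated, but one widely used countermeasure is server-side rate limiting per client IP. A minimal sketch, assuming a Flask application and a naive in-memory counter (a real deployment would use shared storage such as Redis and account for proxies/NAT):

    import time
    from collections import defaultdict
    from flask import Flask, abort, request

    app = Flask(__name__)
    WINDOW = 60        # seconds
    MAX_REQUESTS = 30  # allowed requests per IP per window
    hits = defaultdict(list)  # ip -> timestamps of recent requests

    @app.before_request
    def throttle():
        now = time.time()
        recent = [t for t in hits[request.remote_addr] if now - t < WINDOW]
        if len(recent) >= MAX_REQUESTS:
            abort(429)  # HTTP 429 Too Many Requests
        recent.append(now)
        hits[request.remote_addr] = recent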

Crawling the Google Play store

不想你离开。 submitted on 2019-11-27 14:31:45
I want to crawl the Google Play store to download the web pages of all the Android applications (all the pages under the base URL https://play.google.com/store/apps/ ). I checked the robots.txt file of the Play Store and it disallows crawling these URLs. Also, when I browse the Google Play store I can only see the top applications, up to 3 pages for each category. How can I get the other application pages? If anyone has tried crawling Google Play, please let me know the following: a) Were you successful in crawling the Play Store? If yes, please let me know how you…
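
As a starting point, you can check programmatically which paths the Play Store's robots.txt disallows. A small sketch using only the Python standard library (the example app URL is hypothetical):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://play.google.com/robots.txt")
    rp.read()

    url = "https://play.google.com/store/apps/details?id=com.example.app"
    print(rp.can_fetch("*", url))  # False means crawling this URL is disallowed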

Python Package For Multi-Threaded Spider w/ Proxy Support?

∥☆過路亽.° submitted on 2019-11-27 14:27:53
Question: Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few, such as Twisted, Scrapy, and libcurl, but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks! Answer 1: It's simple to implement this in Python. The urlopen() function works transparently with proxies which do not require authentication. In a…
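
A minimal sketch of the stdlib approach the answer begins to describe: a ProxyHandler-based opener shared by a thread pool (the proxy address and URLs are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    PROXY = {"http": "http://127.0.0.1:8080"}  # hypothetical proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))

    def fetch(url):
        with opener.open(url, timeout=10) as resp:
            return url, resp.read()

    urls = ["http://example.com/page/%d" % i for i in range(10)]
    with ThreadPoolExecutor(max_workers=5) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body))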

Running Multiple spiders in scrapy for 1 website in parallel?

折月煮酒 submitted on 2019-11-27 14:11:43
Question: I want to crawl a website with 2 parts, and my script is not as fast as I need. Is it possible to launch 2 spiders, one for scraping the first part and the second one for the second part? I tried having 2 different classes and running them with scrapy crawl firstSpider and scrapy crawl secondSpider, but I don't think that's the smart way to do it. I read the documentation of scrapyd but I don't know if it's good for my case. Answer 1: I think what you are looking for is something like this: import scrapy from scrapy.crawler…
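
The answer's code is cut off above; a minimal sketch of the CrawlerProcess pattern it appears to introduce, which runs both spiders in the same process in parallel (spider classes and URLs are illustrative):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class FirstSpider(scrapy.Spider):
        name = "firstSpider"
        start_urls = ["http://example.com/part1/"]

        def parse(self, response):
            yield {"url": response.url}

    class SecondSpider(scrapy.Spider):
        name = "secondSpider"
        start_urls = ["http://example.com/part2/"]

        def parse(self, response):
            yield {"url": response.url}

    process = CrawlerProcess()
    process.crawl(FirstSpider)   # schedule both spiders...
    process.crawl(SecondSpider)
    process.start()              # ...then run them concurrently in one reactor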

How to force scrapy to crawl duplicate url?

老子叫甜甜 submitted on 2019-11-27 13:53:37
Question: I am learning Scrapy, a web crawling framework. By default it does not crawl duplicate URLs, or URLs it has already crawled. How do I make Scrapy crawl duplicate URLs, or URLs that have already been crawled? I tried to find out on the internet but could not find relevant help. I found DUPEFILTER_CLASS = RFPDupeFilter and SgmlLinkExtractor from Scrapy - Spider crawls duplicate urls, but that question is the opposite of what I am looking for. Answer 1: You're probably looking for the dont_filter=True…
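
A minimal sketch of the dont_filter usage the answer points to: passing dont_filter=True on a Request bypasses the duplicate filter, so the same URL can be scheduled again (the revisit counter is only there to keep this example from looping forever):

    import scrapy

    class RevisitSpider(scrapy.Spider):
        name = "revisit"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            visits = response.meta.get("visits", 0)
            if visits < 3:
                # dont_filter=True bypasses the RFPDupeFilter,
                # so this already-seen URL is crawled again.
                yield scrapy.Request(response.url, callback=self.parse,
                                     dont_filter=True,
                                     meta={"visits": visits + 1})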

How to print html source to console with phantomjs

耗尽温柔 submitted on 2019-11-27 13:44:43
I just downloaded and installed PhantomJS on my machine. I copied and pasted the following script into a file called hello.js:

    var page = require('webpage').create();
    var url = 'https://www.google.com';

    page.onLoadStarted = function () {
        console.log('Start loading...');
    };

    page.onLoadFinished = function (status) {
        console.log('Loading finished.');
        phantom.exit();
    };

    page.open(url);

I'd like to print the complete HTML source (in this case from the Google page) to a file or to the console. How do I do this? Answer: Spend some time reading the documentation; it should be obvious afterwards. The page.content property holds the rendered source:

    var page = require('webpage').create();

    page.open('https://www.google.com', function (status) {
        console.log(page.content);  // prints the full HTML source
        phantom.exit();
    });

Passing arguments to process.crawl in Scrapy python

我的梦境 submitted on 2019-11-27 13:05:09
Question: I would like to get the same result as this command line:

    scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

    import scrapy
    from linkedin_anonymous_spider import LinkedInAnonymousSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    spider = LinkedInAnonymousSpider(None, "James", "Bond")
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider)  ## <-------------- (1)
    process.start()
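
The excerpt stops at process.start(); the usual fix at line (1), based on Scrapy's documented CrawlerProcess.crawl(spidercls, *args, **kwargs) signature, is to pass the spider class together with its constructor arguments instead of an already-built instance:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from linkedin_anonymous_spider import LinkedInAnonymousSpider

    process = CrawlerProcess(get_project_settings())
    # Scrapy instantiates the spider itself, forwarding these arguments
    # to LinkedInAnonymousSpider.__init__.
    process.crawl(LinkedInAnonymousSpider, None, "James", "Bond")
    process.start()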

How to get a web page's source code from Java [duplicate]

南楼画角 submitted on 2019-11-27 12:55:30
This question already has an answer here: How do you Programmatically Download a Webpage in Java (10 answers). I just want to retrieve any web page's source code from Java. I have found lots of solutions so far, but I couldn't find any code that works for all the links below:

http://www.cumhuriyet.com.tr?hn=298710
http://www.fotomac.com.tr/Yazarlar/Olcay%20%C3%87ak%C4%B1r/2011/11/23/hesap-makinesi
http://www.sabah.com.tr/Gundem/2011/12/23/basbakan-konferansta-konusuyor#

The main problem for me is that some code retrieves the page source, but with parts missing. For example, the code below does not…

python: [Errno 10054] An existing connection was forcibly closed by the remote host

我的梦境 submitted on 2019-11-27 11:55:52
I am writing Python to crawl Twitter using Twitter-py. I have set the crawler to sleep for a while (2 seconds) between requests to api.twitter.com. However, after it has run for some time (around 1), and while Twitter's rate limit has not yet been exceeded, I get this error: [Errno 10054] An existing connection was forcibly closed by the remote host. What are the possible causes of this problem, and how can I solve it? I have searched and found that the Twitter server itself may forcibly close the connection due to too many requests. Thank you very much in advance. Answer: This can be caused by the two…
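
The answer is truncated; a common mitigation, independent of the specific cause, is to catch the reset and retry with an increasing delay. A sketch assuming the requests library rather than Twitter-py:

    import time
    import requests

    def get_with_retry(url, attempts=4, base_delay=2.0):
        for attempt in range(attempts):
            try:
                return requests.get(url, timeout=15)
            except requests.exceptions.ConnectionError:
                # covers "connection forcibly closed" / [Errno 10054]
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff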

Selenium wait for Ajax content to load - universal approach

梦想与她 submitted on 2019-11-27 11:50:22
Is there a universal approach for Selenium to wait until all Ajax content has loaded, one not tied to a specific website, so that it works for every Ajax-heavy site? Answer: You need to wait for JavaScript and jQuery to finish loading. Execute JavaScript to check whether jQuery.active is 0 and document.readyState is complete, which means the JS and jQuery load is complete.

    public boolean waitForJSandJQueryToLoad() {
        WebDriverWait wait = new WebDriverWait(driver, 30);

        // wait for jQuery to load
        ExpectedCondition<Boolean> jQueryLoad = new ExpectedCondition<Boolean>() {
            @Override
            public Boolean apply(WebDriver driver)…
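
The Java code above is cut off; for reference, the same two checks are compact in the Python Selenium bindings (a sketch adapted from the answer's approach, not the original code):

    from selenium.webdriver.support.ui import WebDriverWait

    def wait_for_js_and_jquery(driver, timeout=30):
        wait = WebDriverWait(driver, timeout)
        # jQuery.active == 0 means no Ajax requests are in flight
        # (pages without jQuery pass trivially).
        wait.until(lambda d: d.execute_script(
            "return (window.jQuery || {active: 0}).active == 0;"))
        # document.readyState === 'complete' means the page itself has loaded
        wait.until(lambda d: d.execute_script(
            "return document.readyState === 'complete';"))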