web-crawler

How to override and perform actions on a webpage with AbotX JavaScript rendering

一曲冷凌霜 submitted on 2019-12-13 16:08:56
Question: I am trying to use the AbotX crawler to crawl a site where I need to render the JavaScript and then press a span tag on it. I've used the Abot crawler a lot and expected to have to override some of the classes, just as on previous occasions I have had to extend, for instance, the CrawlDecisionMaker. But I can't seem to find where to start. I expect I have to write something like:

    var implemnts = new ImplementationOverride(config);
    implemnts.JavascriptRenderer = new

Encoding issues crawling non-English websites

≡放荡痞女 submitted on 2019-12-13 14:33:29
Question: I'm trying to get the contents of a webpage as a string, and I found this question addressing how to write a basic web crawler, which claims to (and seems to) handle the encoding issue. However, the code provided there, which works for US/English websites, fails to properly handle other languages. Here is a full Java class that demonstrates what I'm referring to:

    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.io.UnsupportedEncodingException;
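The Java class above is cut off, but the underlying fix is language-agnostic: decode the response bytes using the charset declared in the Content-Type header rather than the platform default. A minimal Python sketch of that idea (the helper names are mine, not from the question):

```python
from urllib.request import urlopen

def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip().strip('"') or default
    return default

def fetch_text(url):
    """Download a page and decode it with the server-declared charset."""
    with urlopen(url) as resp:
        ctype = resp.headers.get("Content-Type", "")
        return resp.read().decode(charset_from_content_type(ctype),
                                  errors="replace")
```

Pages that declare the charset only in a meta tag (or declare it wrongly) still need sniffing; libraries exist for that, but reading the header already fixes most non-English sites.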

Best practices for parallelizing a web crawler in .NET 4.0

邮差的信 submitted on 2019-12-13 14:11:22
Question: I need to download a lot of pages through proxies. What is best practice for building a multi-threaded web crawler? Is Parallel.For/ForEach good enough, or is it better suited to heavy CPU-bound tasks? What do you say about the following code?

    var multyProxy = new MultyProxy();
    multyProxy.LoadProxyList();
    Task[] taskArray = new Task[1000];
    for(int i = 0; i < taskArray.Length; i++)
    {
        taskArray[i] = new Task(
            (obj) => { multyProxy.GetPage((string)obj); },
            (object)"http://google.com"
        );
        taskArray[i].Start();
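Starting 1000 Tasks at once, as above, mostly measures scheduler overhead; downloading pages is I/O-bound, so the usual advice is a bounded worker pool (in .NET, for example, Parallel.ForEach with a MaxDegreeOfParallelism limit). A cross-language sketch of the same idea in Python, with the fetch function injected so the structure is clear:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, max_workers=20):
    """Download many pages concurrently with a bounded thread pool.

    I/O-bound work (HTTP through proxies) suits threads: workers sit in
    network waits, so a modest pool saturates the connection without
    spawning one task per URL. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

For CPU-heavy post-processing of the downloaded pages, a process pool (or, in .NET, Parallel.ForEach) is the better fit; the download stage itself rarely is.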

Googlebot is crawling my site and entering ratings on my rating system

核能气质少年 submitted on 2019-12-13 12:49:44
Question: My rating system allows anonymous users to add ratings, but Google's crawler is rating things. How can I ensure that Googlebot won't follow the link? Answer 1: You shouldn't accept a GET request for any action that modifies data (voting, editing a post, etc.). Your voting should be done via a POST request, which Googlebot won't perform. More information in this SO post: When do you use POST and when do you use GET? Answer 2: Use a robots.txt file to point out links that bots shouldn't follow. For example,
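Answer 1's point can be shown with a framework-agnostic sketch: route the vote through a handler that mutates state only on POST, so a crawler following the link with GET changes nothing (handler and parameter names here are illustrative, not from the question):

```python
def handle_vote(method, params, ratings):
    """Apply a rating only for POST requests.

    Crawlers discover and follow links with GET, so a GET must never
    mutate state; returning 405 here is what keeps Googlebot out.
    """
    if method != "POST":
        return 405  # Method Not Allowed
    ratings.append(int(params["stars"]))
    return 201  # Created
```

In a real application the same rule usually falls out of the framework's routing, by registering the vote endpoint for POST only.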

Is there any Python module that helps to crawl data from a DOM loaded by JavaScript?

杀马特。学长 韩版系。学妹 submitted on 2019-12-13 12:38:43
Question: I want to scrape data from a page which loads DOM elements via an Ajax call. I have tried the old PyQt4-based scraping approach, which reads the DOM after it has fully loaded, but the problem is that I need to make a POST request and that approach only supports GET. The newer Python module ghost.py has timeout issues: when it fetches a large DOM tree it raises a timeout exception. If anyone knows any specific way or tools that can help me to do a POST request and grab the data after fully
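If the Ajax endpoint the page calls can be found in the browser's network tab, a full JavaScript renderer is often unnecessary: the POST can be made directly. A standard-library sketch (the endpoint URL and payload are hypothetical; substitute the ones the page actually uses):

```python
import json
from urllib import parse, request

def build_post(url, payload):
    """Attaching a data= body makes urllib issue a POST instead of a GET."""
    body = parse.urlencode(payload).encode("utf-8")
    return request.Request(url, data=body)

def post_json(url, payload):
    """Send the same POST the page's Ajax call makes; parse the JSON reply."""
    with request.urlopen(build_post(url, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

This only works when the endpoint is reachable without a rendered page (some sites require session cookies or CSRF tokens first); when it isn't, a browser-driving tool remains the fallback.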

How to web crawl some sites [closed]

时光总嘲笑我的痴心妄想 submitted on 2019-12-13 11:18:39
Question: Closed. This question is opinion-based and is not currently accepting answers. Closed 5 years ago. I am starting a new project that crawls websites to retrieve and store data internally using a web service. I looked up some information and came across Scrapy and the Beevolve web-crawling service. My question is: is it best to just create my own crawler with no prior experience

How to convert mysql to mysqli as I have some problems [duplicate]

孤者浪人 submitted on 2019-12-13 10:49:02
Question: This question already has answers here: Why shouldn't I use mysql_* functions in PHP? (15 answers). Closed 3 years ago. The problem: it says "Deprecated: mysql_pconnect(): The mysql extension is deprecated and will be removed in the future: use mysqli or PDO instead in /Applications/XAMPP/xamppfiles/htdocs/...". I think I have to change mysql to mysqli, but how?

    <?php
    $database = "sphider_db";
    $mysql_user = "root";
    $mysql_password = "";
    $mysql_host = "localhost";
    $mysql_table_prefix = "";
    $success =

How can I tell Google not to crawl a set of URLs

不想你离开。 submitted on 2019-12-13 09:46:09
Question: How do I stop Google from crawling certain URLs in my application? For example, I want Google to stop crawling all the URLs that start with http://www.myhost-test.com/. What should I add to my robots.txt? Answer 1: The answer can be found directly here: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 — in short, you add a "Disallow" line with your URL path. Source: https://stackoverflow.com/questions/11542918/how-can-i-tell-google-not-to-crawl-a-set-of-urls
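The rule can be checked locally with Python's urllib.robotparser before deploying. A robots.txt that blocks the whole test host for every crawler would look like this (a sketch; note that Disallow only asks well-behaved bots not to crawl, it does not guarantee already-indexed URLs disappear):

```python
from urllib.robotparser import RobotFileParser

# Contents to serve at http://www.myhost-test.com/robots.txt:
# block every path on this host for every user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.can_fetch("Googlebot", "http://www.myhost-test.com/any/page"))  # False
```

To block only a URL prefix instead of the whole host, the Disallow value becomes that path, e.g. `Disallow: /private/`.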

Fill out an online form

随声附和 submitted on 2019-12-13 09:29:05
Question: I'm trying to fill out the form located at https://idp.ncedcloud.org/idp/AuthnEngine#/authn with a username and password, and I want to know whether it went through successfully or not. I tried it with Python 2, but I couldn't get it to work.

    #!/usr/bin/env python
    import urllib
    import urllib2
    name = "username"
    name2 = "password"
    data = {
        "description" : name,
        "ember501": name2
    }
    encoded_data = urllib.urlencode(data)
    content = urllib2.urlopen("https://idp.ncedcloud.org/idp/AuthnEngine#/authn", encoded_data)
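The #/authn fragment suggests a JavaScript single-page app, and "ember501"-style field IDs are generated per page load, so the snippet's field names are unlikely to be stable; a browser-driving tool such as Selenium is usually the practical route for such pages. For a plain HTML form, though, a Python 3 sketch that also answers the "did it go through" part by checking the status code (URL and field names are illustrative):

```python
from urllib import error, parse, request

def login_ok(url, fields):
    """POST url-encoded form fields; True on a 2xx reply.

    Caveat: many login pages answer 200 even for bad credentials and
    signal failure in the body, so inspecting the response text for a
    success marker is often needed as well.
    """
    body = parse.urlencode(fields).encode("utf-8")
    try:
        with request.urlopen(request.Request(url, data=body)) as resp:
            return 200 <= resp.status < 300
    except error.HTTPError:
        return False
```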

How to generically crawl different websites using Python?

余生颓废 submitted on 2019-12-13 08:27:14
Question: I want to extract comments from Dawn.com as well as from Tribune.com from any article. The way I'm extracting comments is to target the class <div class="comment__body cf"> on Dawn and class="content" on Tribune.com. How can I do it generically? That is, there is no shared pattern across these websites that a single class could match. Shall I write separate code for each website? Answer 1: It is not so easy to write an algorithm that can generically grab the wanted content from a
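One common compromise is a single generic extractor driven by a per-site selector table, so supporting a new site means adding one config entry rather than new code. A standard-library sketch of that idea (exact-string class matching; real pages may order class names differently, so a CSS-selector library such as BeautifulSoup would be more robust):

```python
from html.parser import HTMLParser

# Per-site configuration: one generic extractor, site-specific selectors.
SITE_CLASSES = {
    "dawn.com": "comment__body cf",
    "tribune.com.pk": "content",
}

class CommentExtractor(HTMLParser):
    """Collect the text inside every <div> whose class matches a target."""

    def __init__(self, target_class):
        super().__init__()
        self.target = target_class
        self.depth = 0          # div-nesting depth inside a matching div
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1 if tag == "div" else 0
        elif tag == "div" and dict(attrs).get("class") == self.target:
            self.depth = 1
            self.comments.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.comments[-1] += data

def extract_comments(html, site):
    parser = CommentExtractor(SITE_CLASSES[site])
    parser.feed(html)
    return [c.strip() for c in parser.comments]
```

The crawler then stays identical for every site; only the SITE_CLASSES table grows.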