web-crawler

Get a list of URLs from a site [closed]

蓝咒 submitted on 2019-11-27 09:11:15
Question: I'm deploying a replacement site for a client, but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous. So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. The problem is, I need …
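One way to realize this, as a minimal sketch rather than the asker's code (the excerpt doesn't show the platform, so Flask and the URL mapping below are assumptions): catch 404s, look up the old path, and answer with a 301.

```python
# A minimal sketch, assuming a Flask app: a 404 handler that maps old
# URLs to new ones and issues a permanent (301) redirect.
from flask import Flask, redirect, request

app = Flask(__name__)

# Hypothetical old-to-new mapping; in practice this might come from a
# database table or a set of rewrite rules.
LEGACY_REDIRECTS = {
    "/old/products.asp": "/products",
    "/old/about.asp": "/about",
}

@app.errorhandler(404)
def handle_missing_page(error):
    target = LEGACY_REDIRECTS.get(request.path)
    if target:
        return redirect(target, code=301)  # 301 tells crawlers the move is permanent
    return "Page not found", 404
```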

HtmlUnit Only Displays Host HTML Page for GWT App

Deadly submitted on 2019-11-27 09:09:04
I am using the HtmlUnit API to add crawler support to my GWT app as follows:

```java
PrintWriter out = null;
try {
    resp.setCharacterEncoding(CHAR_ENCODING);
    resp.setContentType("text/html");
    url = buildUrl(req);
    out = resp.getWriter();
    WebClient webClient = webClientProvider.get();
    // set options
    WebClientOptions options = webClient.getOptions();
    options.setCssEnabled(false);
    options.setThrowExceptionOnScriptError(false);
    options.setThrowExceptionOnFailingStatusCode(false);
    options.setRedirectEnabled(true);
    options.setJavaScriptEnabled(true);
    // set timeouts
    webClient.setJavaScriptTimeout(0);
    webClient…
```

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

蹲街弑〆低调 submitted on 2019-11-27 08:27:54
Question: I created a PHP page that is only accessible by means of a token/pass received through $_GET. Therefore, if you go to the following URL you'll get a generic or blank page: http://fakepage11.com/secret_page.php However, if you use the link with the token, it shows you special content: http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4 Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessed through …
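For reference, a minimal robots.txt along the lines the question describes, using the asker's example path:

```
User-agent: *
Disallow: /secret_page.php
```

Note that Disallow only stops compliant crawlers from fetching the page; a disallowed URL can still appear in search results if other sites link to it. A noindex directive (via a robots meta tag or an X-Robots-Tag response header) removes the page from the index, but the page must remain crawlable for the directive to be seen, so the two mechanisms should not be combined on the same URL.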

Scrapy CrawlSpider doesn't crawl the first landing page

a 夏天 submitted on 2019-11-27 08:07:20
I am new to Scrapy, working on a scraping exercise, and I am using the CrawlSpider. Although the Scrapy framework works beautifully and follows the relevant links, I can't seem to make the CrawlSpider scrape the very first link (the home page / landing page). Instead it goes directly to scrape the links determined by the rules, but doesn't scrape the landing page the links are on. I don't know how to fix this, since it is not recommended to override the parse method for a CrawlSpider. Modifying follow=True/False also doesn't yield any good results. Here is the snippet of code: …
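The usual fix, per Scrapy's own CrawlSpider documentation, is the parse_start_url() hook: it is called for the start URLs and can delegate to the same callback the rules use, so parse() stays untouched. A minimal sketch (the spider name, URLs, and selectors are hypothetical):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["https://example.com/"]
    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
    )

    def parse_start_url(self, response):
        # Called for each response in start_urls; reuse the rule callback
        # so the landing page is scraped like every other page.
        return self.parse_item(response)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```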

Scrapy - Reactor not Restartable

谁说胖子不能爱 submitted on 2019-11-27 07:30:17
With:

```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
```

I've always run this process successfully:

```python
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
```

But since I moved this code into a web_crawler(self) function, like so:

```python
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
```

and started calling the method …
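This is the classic ReactorNotRestartable situation: Twisted's reactor can be started only once per process, so a second call to process.start() fails. One common workaround (a sketch, not the only solution) is to run each crawl in a fresh child process, which gets a fresh reactor:

```python
# A minimal sketch of the run-in-a-subprocess workaround for
# ReactorNotRestartable; crawl_once is a hypothetical helper name.
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def _run_spider(spider_cls):
    # Runs in the child process, where the Twisted reactor is fresh.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes, then the child exits

def crawl_once(spider_cls):
    p = Process(target=_run_spider, args=(spider_cls,))
    p.start()
    p.join()
    # Results can be passed back via a file, a database, or a
    # multiprocessing.Queue rather than a return value.
```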

How do web crawlers handle JavaScript?

≯℡__Kan透↙ submitted on 2019-11-27 07:07:22
A lot of content on the Internet today is generated by JavaScript (specifically by background AJAX calls). I was wondering how web crawlers like Google's handle it. Are they aware of JavaScript? Do they have a built-in JavaScript engine? Or do they simply ignore all JavaScript-generated content on the page (which I would guess is quite unlikely)? Do people use specific techniques to get content indexed that would otherwise only reach a normal Internet user through background AJAX requests?

JavaScript is handled by both the Bing and Google crawlers. Yahoo uses the Bing crawler data, so it should …
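One historical technique for the last part of the question, now deprecated by Google, was the AJAX crawling scheme: crawlers rewrote #! URLs into an _escaped_fragment_ query parameter, and the server answered with a prerendered HTML snapshot. A minimal sketch of the server side (Flask and the snapshot function are assumptions):

```python
# A sketch of the deprecated _escaped_fragment_ convention, for illustration.
from flask import Flask, request

app = Flask(__name__)

def render_snapshot(fragment: str) -> str:
    # Hypothetical: in practice this HTML would be captured ahead of time
    # by a headless browser that executed the page's JavaScript.
    return f"<html><body>Prerendered content for #!{fragment}</body></html>"

@app.route("/")
def index():
    fragment = request.args.get("_escaped_fragment_")
    if fragment is not None:
        # A crawler asked for the static snapshot of an AJAX state.
        return render_snapshot(fragment)
    # Normal users get the JavaScript-driven page (assumes a static folder).
    return app.send_static_file("index.html")
```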

HttpWebResponse + StreamReader Very Slow

Deadly submitted on 2019-11-27 06:45:55
I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I have also tried using StreamReader.Read() in a loop to build my HTML string. I'm only downloading pages of about 5-10 KB. It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds! All sites should be very fast, as they are very close to my location and have fast servers (in Internet Explorer the same pages download practically instantly), and I am not using any proxy.

Web crawler that can interpret JavaScript [closed]

我是研究僧i submitted on 2019-11-27 06:42:11
I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in Firebug's HTML window. The best example is Kayak.com, where you cannot see the resulting DOM in the browser via 'view source' but can save the resulting HTML through Firebug. How would I go about doing this? What tools exist that would help me?

Ruby's Capybara is an integration-test library, but it can also be used to write stand-alone web crawlers. Given that it uses backends like Selenium or …
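A minimal sketch of the same idea in Python, assuming Selenium and a local ChromeDriver (not the asker's Java/PHP setup): let a real browser engine execute the page's JavaScript, then dump the rendered DOM.

```python
# Load a page in headless Chrome, let its JavaScript run, and print the
# rendered DOM -- roughly what Firebug's HTML panel shows.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.kayak.com/")
    # For heavy AJAX pages you would add an explicit wait here before reading.
    rendered_html = driver.page_source  # DOM after scripts have executed
    print(rendered_html[:500])
finally:
    driver.quit()
```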

How to identify a web crawler?

爱⌒轻易说出口 submitted on 2019-11-27 06:25:23
How can I filter out hits from web crawlers etc., i.e. hits that are not human? I use maxmind.com to look up the city for each IP. It is not exactly cheap if I have to pay for ALL hits, including web crawlers, robots, etc.

There are two general ways to detect robots, and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder. Polite: These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often you are crawled. Politeness is ensured through a robots.txt file in which you specify which bots, if any, should …
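As a first, cheap filter before the paid lookup, a keyword match on the User-Agent header catches most well-behaved crawlers. A minimal sketch (the keyword list is illustrative, not exhaustive):

```python
import re

# Illustrative keywords only; real crawler lists are longer and change
# over time, so this should be treated as a starting point.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp|bingpreview", re.IGNORECASE)

def is_probable_bot(user_agent: str) -> bool:
    """True if the User-Agent looks like a well-behaved crawler."""
    return bool(user_agent) and BOT_PATTERN.search(user_agent) is not None

# Example: skip the paid MaxMind lookup for obvious crawlers.
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_probable_bot(ua))  # True
```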

Detecting honest web crawlers

半腔热情 submitted on 2019-11-27 06:05:31
I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user-agent string against keywords like 'bot', but that seems awkward, incomplete, and unmaintainable. So does anyone have more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents? In case you're curious: I'm not trying to do anything against any search-engine policy. We have a section of the site where a user is …
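One approach more solid than bare keyword matching is the reverse-DNS verification that Google and Bing document for their crawlers: resolve the client IP to a hostname, check the domain, then resolve the hostname back and confirm it maps to the same IP. A sketch for Googlebot:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-DNS check per Google's 'Verifying Googlebot' guidance."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # IP -> hostname
    except socket.herror:
        return False
    # A spoofed User-Agent won't have a googlebot.com / google.com rDNS name.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # hostname -> IPs; the forward lookup must include the original IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

Because this costs two DNS lookups per check, results are usually cached; the user-agent keyword match remains useful as a fast pre-filter before the DNS verification.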