web-crawler

What does the dollar sign mean in robots.txt

半城伤御伤魂 submitted on 2019-12-01 18:27:20
Question: I am curious about a website and want to do some web crawling at the /s path. Its robots.txt is:

    User-Agent: *
    Allow: /$
    Allow: /debug/
    Allow: /qa/
    Allow: /wiki/
    Allow: /cgi-bin/loginpage
    Disallow: /

My questions are: what does the dollar sign mean in this case, and is it appropriate to crawl the URL /s with respect to the robots.txt file?

Answer 1: If you follow the original robots.txt specification, $ has no special meaning, and there is no Allow field defined. A conforming bot would have to …
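Note: major crawlers such as Googlebot extend the original specification so that $ anchors a rule to the end of the URL path; under that reading, Allow: /$ permits only the homepage itself, while Disallow: / blocks everything else, including /s. Below is a minimal sketch of that extended, longest-match interpretation; the helper names are made up for illustration, and this is not any particular crawler's implementation.

    import re

    RULES = [
        ("allow", "/$"),
        ("allow", "/debug/"),
        ("allow", "/qa/"),
        ("allow", "/wiki/"),
        ("allow", "/cgi-bin/loginpage"),
        ("disallow", "/"),
    ]

    def rule_to_regex(path):
        # Google-style extensions: '*' matches any run of characters and a
        # trailing '$' anchors the rule to the end of the URL path.
        pattern = re.escape(path).replace(r"\*", ".*")
        if pattern.endswith(r"\$"):
            pattern = pattern[:-2] + "$"
        return re.compile("^" + pattern)

    def is_allowed(url_path):
        # The longest (most specific) matching rule wins; Allow wins a tie.
        best = None
        for kind, path in RULES:
            if rule_to_regex(path).match(url_path):
                key = (len(path), kind == "allow")
                if best is None or key > best:
                    best = key
        return True if best is None else best[1]

    print(is_allowed("/"))   # True  -> only the homepage is allowed
    print(is_allowed("/s"))  # False -> /s falls under Disallow: /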

Does Google ignore whatever is after the hash fragment (#) while crawling our website?

假装没事ソ submitted on 2019-12-01 18:08:47
Question: We use the information after the hash fragment to display different pages with JavaScript, so the browser does not have to reload the whole page. For example, a direct link to a page could look like this (book_id/page_id): www.example.com/book#1234/5678. Since we don't have direct links to each page, only to the books, we are thinking of adding these direct links to sitemap.xml. My question is whether Google considers that a separate link or just ignores …
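For context, the fragment is purely client-side: browsers do not send it to the server, and a crawler that normalises URLs will typically strip it, so URLs differing only after the # resolve to the same fetched document. A small illustration of that normalisation step (a standard-library sketch, not Google's actual pipeline):

    from urllib.parse import urldefrag

    links = [
        "http://www.example.com/book#1234/5678",
        "http://www.example.com/book#1234/9999",
        "http://www.example.com/book",
    ]

    # urldefrag() splits the fragment off; what is actually fetched over HTTP
    # is only the part before '#'.
    fetchable = {urldefrag(link).url for link in links}
    print(fetchable)  # {'http://www.example.com/book'} -> one document, not three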

Is it legal to crawl Amazon? [closed]

那年仲夏 submitted on 2019-12-01 17:26:54
I want to get specific information from Amazon, like product name and description. Is it legal to crawl Amazon, or does Amazon provide any API, paid or unpaid, for getting its data?

Answer: Amazon's "Product Advertising API" allows this. You should read the license agreement closely, as it is highly restrictive about what they allow you to do with it.

Source: https://stackoverflow.com/questions/11080584/is-it-legal-to-crawl-amazon

PHP: cannot change max_execution_time in XAMPP

独自空忆成欢 submitted on 2019-12-01 17:07:06
I've tried everything to change the max_execution_time of a PHP crawler script so that it can run for an unlimited amount of time. I have changed the php.ini setting max_execution_time to 0 and to 100000000, but nothing changes. I've also tried setting it from the PHP script itself using ini_set('max_execution_time', 0);. All PHP scripts throw the same error, Fatal error: Maximum execution time of 3000 seconds exceeded. What could I be missing, and how can I make sure there is no execution time limit? PHP script:

    <?php
    ini_set('MAX_EXECUTION_TIME', -1);
    error_reporting(E_ALL); // turn on all …

Any possibility to crawl open web browser data using Aperture?

被刻印的时光 ゝ submitted on 2019-12-01 15:32:04
I know how to crawl a website using Aperture. If I open http://demo.crawljax.com/ in the Mozilla Firefox web browser, how can I crawl the open browser's content using Aperture? Steps:
1. Open http://demo.crawljax.com/ in Mozilla Firefox.
2. Run a Java program to crawl the open Firefox tab.

Answer (Kumar): It seems you need to crawl a JavaScript/Ajax page, so you actually need a crawler like Googlebot; see this: Googlebot can crawl JavaScript pages. You can also do it with some other drivers/crawlers; a similar question was found here, and you can try the best answer from it.

Answer (BasK): It's impossible to crawl the open web browser …

How do I ignore file types in a web crawler?

半腔热情 submitted on 2019-12-01 14:32:22
Question: I'm writing a web crawler and want to ignore URLs which link to binary files:

    $exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)

How can I check the URI against one of these endings? @url = URI.parse(url) should only be set if the URL doesn't end with any of the suffixes above.

Answer 1: Use URI#path:

    unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
      puts "downloading #{url}..."
    end

Answer 2 (the Tin Man): Ruby lacks a really useful module that Perl has, …
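As an aside (not part of the original answers), the same suffix check translates directly to other languages; here is a quick Python sketch of the idea, with an abbreviated exclusion list for brevity:

    from pathlib import PurePosixPath
    from urllib.parse import urlparse

    # Abbreviated version of the question's exclusion list.
    EXCLUDE = {"flv", "swf", "png", "jpg", "gif", "zip", "rar", "pdf", "exe", "mp3", "mp4"}

    def is_binary_link(url):
        # Take the extension of the URL *path* only, so query strings and
        # fragments do not interfere with the check.
        suffix = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
        return suffix in EXCLUDE

    print(is_binary_link("http://example.com/video.flv?x=1"))  # True
    print(is_binary_link("http://example.com/article"))        # False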

Scrapy crawl all sitemap links

让人想犯罪 __ submitted on 2019-12-01 14:23:05
I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap; now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            print response.url

Answer: You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many as you want. For instance, say you have a page named http:/ …
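For reference, sitemap_rules is a list of (regex, callback-name) pairs that routes sitemap URLs to callbacks, checked in order with the first match winning. A minimal sketch follows; the /product/ pattern and parse_product callback are made-up examples, not taken from the original answer:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
        # Route URLs matching a pattern to a specific callback; URLs that
        # match no earlier rule fall through to the catch-all entry.
        sitemap_rules = [
            ("/product/", "parse_product"),  # hypothetical pattern
            ("", "parse"),                   # empty pattern matches everything
        ]

        def parse_product(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

        def parse(self, response):
            yield {"url": response.url}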

How to get crawled content in Crawljax

断了今生、忘了曾经 submitted on 2019-12-01 14:10:58
I have crawled a dynamic web page using Crawljax. I am able to get the current crawl id, status, and DOM, but I can't get the website content. Can anyone help me?

    CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
    builder.addPlugin(new OnNewStatePlugin() {
        @Override
        public String toString() {
            return "Our example plugin";
        }

        @Override
        public void onNewState(CrawlerContext cc, StateVertex sv) {
            LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
            String name = cc.getCurrentState().getName();
            String url = cc.getBrowser() …

Extract links for a certain section only from Blogspot using BeautifulSoup

Deadly submitted on 2019-12-01 13:37:16
I am trying to extract links for a certain section only from a Blogspot page, but the output shows that the code extracts all the links on the page. Here is the code:

    import urlparse
    import urllib
    from bs4 import BeautifulSoup

    url = "http://ellywonderland.blogspot.com/"
    urls = [url]
    visited = [url]

    while len(urls) > 0:
        try:
            htmltext = urllib.urlopen(urls[0]).read()
        except:
            print urls[0]
        soup = BeautifulSoup(htmltext)
        urls.pop(0)
        print len(urls)
        for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
            for tag in soup.findAll('a', href=True):
                tag['href'] = urlparse.urljoin(url, tag['href'])
                if …
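The inner loop searches the whole soup (soup.findAll('a', href=True)) rather than the matched heading, so every anchor on the page is collected; scoping the anchor search to each matched element restricts it to that section. A minimal Python 3 sketch of that fix, using only the class name from the question (the rest is illustrative):

    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://ellywonderland.blogspot.com/"
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")

    # Look for anchors *inside* each matched title element, not in the whole page.
    for heading in soup.find_all(attrs={"class": "post-title entry-title"}):
        for anchor in heading.find_all("a", href=True):
            print(urljoin(url, anchor["href"]))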

Nutch not crawling URLs except the one specified in seed.txt

做~自己de王妃 submitted on 2019-12-01 13:33:06
I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt:

    +^https://www.mywebsite.com/abc-def/(.+)*$

When I try to run the following crawl command:

    /bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3

it crawls and indexes just the one seed.txt URL, and in the second iteration it just says: Generator: starting at 2017 …