web-crawler

What does the dollar sign mean in robots.txt

半城伤御伤魂 submitted on 2019-12-01 18:27:20
Question: I am curious about a website and want to do some web crawling at the /s path. Its robots.txt is:

    User-Agent: *
    Allow: /$
    Allow: /debug/
    Allow: /qa/
    Allow: /wiki/
    Allow: /cgi-bin/loginpage
    Disallow: /

My questions are: what does the dollar sign mean in this case, and is it appropriate to crawl the URL /s with respect to the robots.txt file?

Answer 1: If you follow the original robots.txt specification, $ has no special meaning, and there is no Allow field defined. A conforming bot would have to …
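Note: major crawlers such as Googlebot extend the original specification so that $ anchors a rule to the end of the URL path; under that reading, Allow: /$ permits only the homepage itself, while Disallow: / blocks everything else, including /s. Below is a minimal sketch of that extended, longest-match interpretation; the helper names are made up for illustration, and this is not any particular crawler's implementation.

    import re

    RULES = [
        ("allow", "/$"),
        ("allow", "/debug/"),
        ("allow", "/qa/"),
        ("allow", "/wiki/"),
        ("allow", "/cgi-bin/loginpage"),
        ("disallow", "/"),
    ]

    def rule_to_regex(path):
        # Google-style extensions: '*' matches any run of characters and a
        # trailing '$' anchors the rule to the end of the URL path.
        pattern = re.escape(path).replace(r"\*", ".*")
        if pattern.endswith(r"\$"):
            pattern = pattern[:-2] + "$"
        return re.compile("^" + pattern)

    def is_allowed(url_path):
        # The longest (most specific) matching rule wins; Allow wins a tie.
        best = None
        for kind, path in RULES:
            if rule_to_regex(path).match(url_path):
                key = (len(path), kind == "allow")
                if best is None or key > best:
                    best = key
        return True if best is None else best[1]

    print(is_allowed("/"))   # True  -> only the homepage is allowed
    print(is_allowed("/s"))  # False -> /s falls under Disallow: /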

Does Google ignore whatever is after the hash fragment (#) while crawling our website?

假装没事ソ submitted on 2019-12-01 18:08:47
Question: We use the information after the hash fragment to display different pages with JavaScript, so the browser does not have to reload the whole page. For example, a direct link to a page could look like this (book_id/page_id): www.example.com/book#1234/5678. Since we don't have direct links to each page, only to the books, we are thinking of adding these direct links to sitemap.xml. My question is whether Google considers that a separate link or just ignores …
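For context, the fragment is purely client-side: browsers do not send it to the server, and a crawler that normalises URLs will typically strip it, so URLs differing only after the # resolve to the same fetched document. A small illustration of that normalisation step (a standard-library sketch, not Google's actual pipeline):

    from urllib.parse import urldefrag

    links = [
        "http://www.example.com/book#1234/5678",
        "http://www.example.com/book#1234/9999",
        "http://www.example.com/book",
    ]

    # urldefrag() splits the fragment off; what is actually fetched over HTTP
    # is only the part before '#'.
    fetchable = {urldefrag(link).url for link in links}
    print(fetchable)  # {'http://www.example.com/book'} -> one document, not three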

Is it legal to crawl Amazon? [closed]

那年仲夏 submitted on 2019-12-01 17:26:54
I want to get specific information from Amazon, like product name and description. Is it legal to crawl Amazon, or does Amazon provide any API, paid or unpaid, for getting its data?

Answer: Amazon's "Product Advertising API" allows this. You should read the license agreement closely, as it is highly restrictive about what they allow you to do with it.

Source: https://stackoverflow.com/questions/11080584/is-it-legal-to-crawl-amazon

PHP: cannot change max_execution_time in XAMPP

独自空忆成欢 submitted on 2019-12-01 17:07:06
I've tried everything to change the max_execution_time of a PHP crawler script so that it can run for an unlimited amount of time. I have changed the php.ini setting max_execution_time to 0 and to 100000000, but nothing changes. I've also tried setting it from the PHP script itself using ini_set('max_execution_time', 0);. All PHP scripts throw the same error, Fatal error: Maximum execution time of 3000 seconds exceeded. What could I be missing, and how can I make sure there is no execution time limit? PHP script:

    <?php
    ini_set('MAX_EXECUTION_TIME', -1);
    error_reporting(E_ALL); // turn on all …

Any possibility to crawl open web browser data using Aperture?

被刻印的时光 ゝ submitted on 2019-12-01 15:32:04
I know how to crawl a website using Aperture. If I open http://demo.crawljax.com/ in the Mozilla Firefox web browser, how can I crawl the open browser's content using Aperture? Steps:
1. Open http://demo.crawljax.com/ in Mozilla Firefox.
2. Run a Java program to crawl the open Firefox tab.

Answer (Kumar): It seems you need to crawl a JavaScript/Ajax page, so you actually need a crawler like Googlebot; see this: Googlebot can crawl JavaScript pages. You can also do it with some other drivers/crawlers; a similar question was found here, and you can try the best answer from it.

Answer (BasK): It's impossible to crawl the open web browser …

How do I ignore file types in a web crawler?

半腔热情 submitted on 2019-12-01 14:32:22
Question: I'm writing a web crawler and want to ignore URLs which link to binary files:

    $exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)

How can I check the URI against one of these endings? @url = URI.parse(url) should only be set if the URL doesn't end with any of the suffixes above.

Answer 1: Use URI#path:

    unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
      puts "downloading #{url}..."
    end

Answer 2 (the Tin Man): Ruby lacks a really useful module that Perl has, …
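As an aside (not part of the original answers), the same suffix check translates directly to other languages; here is a quick Python sketch of the idea, with an abbreviated exclusion list for brevity:

    from pathlib import PurePosixPath
    from urllib.parse import urlparse

    # Abbreviated version of the question's exclusion list.
    EXCLUDE = {"flv", "swf", "png", "jpg", "gif", "zip", "rar", "pdf", "exe", "mp3", "mp4"}

    def is_binary_link(url):
        # Take the extension of the URL *path* only, so query strings and
        # fragments do not interfere with the check.
        suffix = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
        return suffix in EXCLUDE

    print(is_binary_link("http://example.com/video.flv?x=1"))  # True
    print(is_binary_link("http://example.com/article"))        # False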

Scrapy crawl all sitemap links

让人想犯罪 __ submitted on 2019-12-01 14:23:05
I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap; now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

        def parse(self, response):
            print response.url

Answer: You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many as you want. For instance, say you have a page named http:/ …
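For reference, sitemap_rules is a list of (regex, callback-name) pairs that routes sitemap URLs to callbacks, checked in order with the first match winning. A minimal sketch follows; the /product/ pattern and parse_product callback are made-up examples, not taken from the original answer:

    from scrapy.spiders import SitemapSpider

    class MySpider(SitemapSpider):
        name = "xyz"
        allowed_domains = ["xyz.nl"]
        sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]
        # Route URLs matching a pattern to a specific callback; URLs that
        # match no earlier rule fall through to the catch-all entry.
        sitemap_rules = [
            ("/product/", "parse_product"),  # hypothetical pattern
            ("", "parse"),                   # empty pattern matches everything
        ]

        def parse_product(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

        def parse(self, response):
            yield {"url": response.url}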

How to get crawled content in Crawljax

断了今生、忘了曾经 submitted on 2019-12-01 14:10:58
I have crawled a dynamic web page using Crawljax. I am able to get the current crawl id, status, and DOM, but I can't get the website content. Can anyone help me?

    CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
    builder.addPlugin(new OnNewStatePlugin() {
        @Override
        public String toString() {
            return "Our example plugin";
        }

        @Override
        public void onNewState(CrawlerContext cc, StateVertex sv) {
            LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
            String name = cc.getCurrentState().getName();
            String url = cc.getBrowser() …

Extract links for a certain section only from Blogspot using BeautifulSoup

Deadly submitted on 2019-12-01 13:37:16
I am trying to extract links for a certain section only from a Blogspot page, but the output shows that the code extracts all the links on the page. Here is the code:

    import urlparse
    import urllib
    from bs4 import BeautifulSoup

    url = "http://ellywonderland.blogspot.com/"
    urls = [url]
    visited = [url]

    while len(urls) > 0:
        try:
            htmltext = urllib.urlopen(urls[0]).read()
        except:
            print urls[0]
        soup = BeautifulSoup(htmltext)
        urls.pop(0)
        print len(urls)
        for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
            for tag in soup.findAll('a', href=True):
                tag['href'] = urlparse.urljoin(url, tag['href'])
                if …
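The inner loop searches the whole soup (soup.findAll('a', href=True)) rather than the matched heading, so every anchor on the page is collected; scoping the anchor search to each matched element restricts it to that section. A minimal Python 3 sketch of that fix, using only the class name from the question (the rest is illustrative):

    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://ellywonderland.blogspot.com/"
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")

    # Look for anchors *inside* each matched title element, not in the whole page.
    for heading in soup.find_all(attrs={"class": "post-title entry-title"}):
        for anchor in heading.find_all("a", href=True):
            print(urljoin(url, anchor["href"]))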

Nutch not crawling URLs except the one specified in seed.txt

做~自己de王妃 submitted on 2019-12-01 13:33:06
I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt:

    +^https://www.mywebsite.com/abc-def/(.+)*$

When I try to run the following crawl command:

    /bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3

it crawls and indexes just the one seed.txt URL, and in the second iteration it just says: Generator: starting at 2017 …