robots.txt

Python, Mechanize - request disallowed by robots.txt even after set_handle_robots and add_headers

时光怂恿深爱的人放手 Submitted on 2019-12-01 21:31:17
I have made a web crawler that collects all links down to the first level of a page, and from those pages it collects all links and text, plus image links and alt text. Here is the whole code:

    import urllib
    import re
    import time
    from threading import Thread
    import MySQLdb
    import mechanize
    import readability
    from bs4 import BeautifulSoup
    from readability.readability import Document
    import urlparse

    url = ["http://sparkbrowser.com"]
    i = 0

    while i < len(url):
        counterArray = [0]
        levelLinks = []
        linkText = ["homepage"]
        levelLinks = []

        def scraper(root, steps):
            urls = [root]
            visited = [root]
            counter = 0
            while counter < steps:
                step_url =
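The excerpt cuts off before the mechanize setup, but one common cause of "request disallowed by robots.txt" even after calling set_handle_robots(False) is that the flag is set on a different Browser instance than the one actually fetching, or that custom headers never reach the request (mechanize takes them from the addheaders list). A minimal sketch of the usual pattern, assuming mechanize under Python 2 and a placeholder User-Agent string:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # do not fetch or obey robots.txt
    br.set_handle_refresh(False)     # avoid hanging on meta-refresh loops
    br.addheaders = [("User-Agent",
                      "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/60.0")]

    response = br.open("http://sparkbrowser.com")
    html = response.read()

The same configured br object must then be reused for every subsequent open() call inside the scraper.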

What does the dollar sign mean in robots.txt

半城伤御伤魂 Submitted on 2019-12-01 18:27:20
Question: I am curious about a website and want to do some web crawling at the /s path. Its robots.txt:

    User-Agent: *
    Allow: /$
    Allow: /debug/
    Allow: /qa/
    Allow: /wiki/
    Allow: /cgi-bin/loginpage
    Disallow: /

My questions are: what does the dollar sign mean in this case, and is it appropriate to crawl the URL /s with respect to the robots.txt file?

Answer 1: If you follow the original robots.txt specification, $ has no special meaning, and there is no Allow field defined. A conforming bot would have to
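Under the wildcard extension implemented by Google, Bing, and most major crawlers (it is not part of the original 1994 spec), $ anchors a pattern to the end of the URL, so Allow: /$ permits only the root path itself, and everything else falls back to Disallow: /. A rough Python sketch of how such an extended matcher would evaluate /s against this file, assuming longest-match precedence with Allow winning ties, as Google documents it:

    import re

    rules = [
        ("allow", "/$"),
        ("allow", "/debug/"),
        ("allow", "/qa/"),
        ("allow", "/wiki/"),
        ("allow", "/cgi-bin/loginpage"),
        ("disallow", "/"),
    ]

    def pattern_to_regex(pattern):
        # '*' matches any run of characters; a trailing '$' anchors at the URL end.
        regex = re.escape(pattern).replace(r"\*", ".*")
        if regex.endswith(r"\$"):
            regex = regex[:-2] + "$"
        return re.compile("^" + regex)

    def is_allowed(path):
        # Longest matching pattern wins; on a tie, Allow beats Disallow.
        matches = [(len(pat), kind) for kind, pat in rules
                   if pattern_to_regex(pat).match(path)]
        if not matches:
            return True
        matches.sort(key=lambda m: (m[0], m[1] == "allow"), reverse=True)
        return matches[0][1] == "allow"

    print(is_allowed("/"))    # True  -> "/$" (longer) beats "Disallow: /"
    print(is_allowed("/s"))   # False -> only "Disallow: /" matches

So under the extended interpretation, /s would not be allowed by this file.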

Regexp for robots.txt

只愿长相守 Submitted on 2019-12-01 16:03:52
Question: I am trying to set up my robots.txt, but I am not sure about the regexps. I've got four different pages, all available in three different languages. Instead of listing each page three times, I figured I could use a regexp.

    nav.aspx
    page.aspx/changelang (might have a query string attached, such as "?toLang=fr")
    mypage.aspx?id (=12346?... etc. - different each time)
    login.aspx/logoff

All four in 3 different languages, e.g.:

    www.example.com/es/nav.aspx
    www.example.com/it/nav.aspx
    www.example.com
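Worth noting before any answer: robots.txt does not support regular expressions at all. The original specification matches plain path prefixes, and the widely supported extension adds only * (any run of characters) and $ (end of URL). Assuming the goal is to block these pages under every language folder, the rules might look roughly like this sketch (the exact paths are guesses based on the excerpt):

    User-agent: *
    Disallow: /*/nav.aspx
    Disallow: /*/page.aspx/changelang
    Disallow: /*/mypage.aspx?id=
    Disallow: /*/login.aspx/logoff

One Disallow line per page covers all three languages, since * matches the /es/, /it/, etc. segment for crawlers that honor the wildcard extension.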

Where to put robots.txt file? [closed]

廉价感情. Submitted on 2019-12-01 04:50:39
Where should I put robots.txt: domainname.com/robots.txt or domainname/public_html/robots.txt? I placed the file at domainname.com/robots.txt, but it doesn't open when I type that in the browser. (Screenshot: http://shup.com/Shup/358900/11056202047-My-Desktop.png)

Where the file goes in your filesystem depends on what host you're using, so it's hard for us to give a specific answer about that. The best description is: put it wherever the index.html (or index.php, or whatever) file is that represents your homepage. If that's domainname/public_html/index.html, for example, put it in domainname/public_html
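In other words, public_html (or the equivalent web root) is the directory on disk, and the result must be reachable as http://domainname.com/robots.txt. A quick way to check that it is actually being served, sketched in Python 3 with the question's placeholder domain:

    import urllib.request

    with urllib.request.urlopen("http://domainname.com/robots.txt") as resp:
        print(resp.status)                 # expect 200, not 404
        print(resp.read().decode()[:200])  # first few rules of the file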

How to add route to dynamic robots.txt in ASP.NET MVC?

一曲冷凌霜 Submitted on 2019-12-01 04:02:08
I have a robots.txt that is not static but generated dynamically. My problem is creating a route from root/robots.txt to my controller action.

This works:

    routes.MapRoute(
        name: "Robots",
        url: "robots",
        defaults: new { controller = "Home", action = "Robots" });

This doesn't work:

    routes.MapRoute(
        name: "Robots",
        url: "robots.txt", /* this is the only thing I've changed */
        defaults: new { controller = "Home", action = "Robots" });

Apparently the ".txt" causes ASP to barf.

You need to add the following to your web.config file to allow the route with a file extension to execute.

    <?xml version="1
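The web.config snippet is cut off, but the commonly cited fix for routes ending in a file extension is to hand robots.txt to the managed pipeline with a handler entry, roughly like this (a sketch, not necessarily the answer's exact snippet):

    <configuration>
      <system.webServer>
        <handlers>
          <!-- let MVC routing handle GET /robots.txt instead of the static-file handler -->
          <add name="RobotsTxt"
               path="robots.txt"
               verb="GET"
               type="System.Web.Handlers.TransferRequestHandler"
               preCondition="integratedMode,runtimeVersionv4.0" />
        </handlers>
      </system.webServer>
    </configuration>

The coarser alternative is runAllManagedModulesForAllRequests="true" on the <modules> element, which pushes every static request through the managed pipeline.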

How to block search engines from indexing all urls beginning with origin.domainname.com

倖福魔咒の Submitted on 2019-12-01 03:30:16
Question: I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way I can prevent all URLs under origin.domainname.com from getting indexed? Is there some rule in robots.txt to do it? Both hostnames point to the same folder. I also tried redirecting origin.domainname.com to www.domainname.com in the .htaccess file, but it doesn't seem to work. If anyone has had a similar kind of problem and can help, I shall be grateful. Thanks

Answer 1: You can rewrite robots
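The answer is cut off, but the rewrite it starts to describe is usually done by keeping a second, fully blocking file in the same codebase and serving it only on the origin host. A sketch for an Apache .htaccess (the file name robots_origin.txt is an assumption):

    RewriteEngine On
    # Serve the blocking file only when the request arrives via origin.domainname.com
    RewriteCond %{HTTP_HOST} ^origin\.domainname\.com$ [NC]
    RewriteRule ^robots\.txt$ robots_origin.txt [L]

with robots_origin.txt containing:

    User-agent: *
    Disallow: /

Requests for www.domainname.com/robots.txt keep getting the normal file, while the origin hostname gets the blocking one.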

Where to put robots.txt file? [closed]

被刻印的时光 ゝ Submitted on 2019-12-01 01:53:25
Question: Where should I put robots.txt: domainname.com/robots.txt or domainname/public_html/robots.txt? I placed the file at domainname.com/robots.txt, but it doesn't open when I type that in the browser. (Screenshot: http://shup.com/Shup/358900/11056202047-My-Desktop.png)

Answer 1: Where the file goes in your filesystem depends on

Angular2 + webpack do not deploy robots.txt

一世执手 Submitted on 2019-12-01 00:14:09
Question: I am creating a web site with Angular2@2.1.2. I am using Webpack with default settings (as a dependency). Here is my package.json:

    "dependencies": {
        "@angular/common": "2.1.2",
        "@angular/compiler": "2.1.2",
        "@angular/core": "2.1.2",
        "@angular/forms": "2.1.2",
        "@angular/http": "2.1.2",
        "@angular/platform-browser": "2.1.2",
        "@angular/platform-browser-dynamic": "2.1.2",
        "@angular/platform-server": "2.1.2",
        "@angular/router": "3.1.2",
        "@ngrx/core": "1.2.0",
        "@ngrx/effects": "2.0.0",
        "@ngrx/store":
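The excerpt stops before the build configuration, but with a plain webpack setup the usual reason robots.txt never reaches the output folder is that webpack only emits files something imports. The common fix (an assumption about this project, since its config is not shown) is to copy static files explicitly with copy-webpack-plugin:

    // webpack.config.js - sketch using the plugin's 2016-era array API
    const CopyWebpackPlugin = require('copy-webpack-plugin');

    module.exports = {
      // ...existing entry, output, and loader settings...
      plugins: [
        new CopyWebpackPlugin([
          { from: 'src/robots.txt', to: 'robots.txt' }
        ])
      ]
    };

If the project is actually built with angular-cli, listing robots.txt in the assets array of angular-cli.json achieves the same thing.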

Disallow or Noindex on Subdomain with robots.txt

心不动则不痛 Submitted on 2019-11-30 22:31:09
Question: I have dev.example.com and www.example.com hosted on different subdomains. I want crawlers to drop all records of the dev subdomain but keep them on www. I am using git to store the code for both, so ideally I'd like both sites to use the same robots.txt file. Is it possible to use one robots.txt file and have it exclude crawlers from the dev subdomain?

Answer 1: Sorry, this is most likely not possible. The general rule is that each sub-domain is treated separately and thus would both need robots
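The answer is cut short, but since robots.txt is always fetched per host, a workaround that keeps a single file in git is to leave robots.txt alone and have only the dev vhost send a noindex header, which the major crawlers honor via X-Robots-Tag. A sketch for the Apache dev vhost, assuming mod_headers is enabled:

    <IfModule mod_headers.c>
        # dev.example.com only: tell crawlers not to index or follow anything
        Header set X-Robots-Tag "noindex, nofollow"
    </IfModule>

The www vhost simply omits this block, so its pages stay indexable.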

Facebook and Crawl-delay in Robots.txt?

跟風遠走 Submitted on 2019-11-30 12:49:46
Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?

We don't have a crawler. We have a scraper that scrapes metadata on pages that have Like buttons / are shared on FB.

No, it doesn't respect robots.txt. Contrary to other answers here, facebookexternalhit behaves like the meanest of crawlers. Whether it got the URLs it requests from crawling or from Like buttons doesn't matter so much when it goes through every one of them at an insane rate. We sometimes get several hundred hits per second as it goes through almost every URL on our site. It kills our servers
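For reference, the directive being asked about would be written as below; per the answers above, facebookexternalhit ignores it, and Crawl-delay is a non-standard extension in any case (honored by some crawlers such as Bing and Yandex, not by Google):

    User-agent: facebookexternalhit
    Crawl-delay: 5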