robots.txt

Robots.txt Disallow Certain Folder Names

依然范特西╮ submitted on 2019-12-05 09:06:45
I want to disallow robots from crawling any folder named this-folder, at any position in the URL. Examples to disallow:

http://mysite.com/this-folder/
http://mysite.com/houses/this-folder/
http://mysite.com/some-other/this-folder/
http://mysite.com/no-robots/this-folder/

This is my attempt:

Disallow: /.*this-folder/

Will this work? Officially, globbing and regular expressions are not supported ( http://www.robotstxt.org/robotstxt.html ), but apparently some search engines support this.

Source: https://stackoverflow.com/questions/3501661/robots-txt-disallow-certain-folder-names
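For crawlers that do honor wildcards (Google and Bing document the * and $ operators, even though robotstxt.org does not), a minimal sketch of such a rule could look like this; * already matches any sequence of characters, so the regex-style /.* form is not needed:

User-agent: *
Disallow: /*this-folder/

Crawlers that read the value literally will never match it against a real URL, so this fails in the "allow" direction rather than blocking unintended pages.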

robots.txt allow all except few sub-directories

点点圈 submitted on 2019-12-05 07:52:28
I want my site to be indexed by search engines, except for a few sub-directories. These are my robots.txt settings:

robots.txt in the root directory:

User-agent: *
Allow: /

Separate robots.txt in the sub-directory (to be excluded):

User-agent: *
Disallow: /

Is this the correct way, or will the root-directory rule override the sub-directory rule?

unor: No, this is wrong. You can't have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host. If you want to disallow crawling of URLs whose paths begin with /foo , use this record in your robots.txt ( http://example
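A minimal sketch of the single root-level robots.txt the answer describes, using /foo/ and /bar/ as placeholder names for the sub-directories to exclude; everything not listed under Disallow is crawlable by default, so a blanket Allow: / line is not required:

User-agent: *
Disallow: /foo/
Disallow: /bar/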

Googlebots Ignoring robots.txt? [closed]

点点圈 submitted on 2019-12-05 03:19:08
Closed. This question is off-topic. It is not currently accepting answers. Closed 7 years ago.

I have a site with the following robots.txt in the root:

User-agent: *
Disabled: /

User-agent: Googlebot
Disabled: /

User-agent: Googlebot-Image
Disallow: /

And pages within this site are getting scanned by Googlebots all day long. Is there something wrong with my file or with Google?

It should be Disallow: , not Disabled: .

Maybe give the Google robots.txt checker a try. Google have an analysis tool for checking
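A corrected version of that file along the lines the answer suggests, with the misspelled Disabled: replaced by the valid Disallow: directive (Disallow: / blocks the whole site for the named agents, whereas an empty Disallow: would allow everything):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Image
Disallow: /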

Can I use the “Host” directive in robots.txt?

∥☆過路亽.° submitted on 2019-12-05 03:16:27
Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.myhost.com

Also, the Wikipedia article states that Google understands the Host directive too, but there wasn't much (i.e. no) information. At robotstxt.org I didn't find anything on Host (or Crawl-delay, as stated on Wikipedia). Is it encouraged to use the Host directive at all? Are there any resources at Google on this robots.txt specific? How is

Ban robots from website [closed]

若如初见. submitted on 2019-12-04 23:49:25
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 5 years ago.

My website is often down because a spider is accessing too many resources. This is what the hosting provider told me. They told me to ban these IP addresses:

46.229.164.98
46.229.164.100
46.229.164.101

But I have no idea how to do this. I've googled a bit, and I've now added these lines to .htaccess in the root:

# allow

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

回眸只為那壹抹淺笑 submitted on 2019-12-04 22:42:16
Below is a sample robots.txt file to allow multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and would be different in a real robots.txt file. I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method.

Questions:
(1) Can each user agent have its own crawl-delay? (I assume yes)
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?
(3) Does there have to be a blank
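A sketch of the layout the question is asking about, with one group per user agent and an illustrative Crawl-delay inside each group. The delay values are placeholders, and some crawlers (Googlebot among them) ignore Crawl-delay entirely; within a group, directive order generally does not matter to parsers that support it, so placing Crawl-delay before or after the Allow / Disallow lines is a style choice:

User-agent: bingbot
Crawl-delay: 5
Allow: /
Disallow: /private/

User-agent: Yandex
Crawl-delay: 10
Allow: /
Disallow: /private/

User-agent: *
Disallow: /private/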

Does the user agent string have to be exactly as it appears in my server logs?

天大地大妈咪最大 submitted on 2019-12-04 13:39:41
When using a robots.txt file, does the user agent string have to be exactly as it appears in my server logs? For example, when trying to match Googlebot, can I just use googlebot ? Also, will a partial match work, for example just using Google ?

Yes, the user agent has to be an exact match. From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"

At least for googlebot, the user-agent is not case-sensitive. Read the 'Order of precedence for user-agents' section: https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html
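A short sketch of how the robots.txt token relates to what the server logs show, per the Google documentation linked above: the group is selected by the crawler's product token (here Googlebot), not by the full Mozilla/5.0-style user-agent line recorded in the logs, and for Google's crawlers that token match is case-insensitive:

User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow: /no-bots/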

Rails robots.txt folders

北慕城南 submitted on 2019-12-04 09:03:59
Question: I'm about to launch a Rails app and, as the last task, I want to set up the robots.txt file. I couldn't find information about how the paths should be written properly for a Rails app. Is the starting path always the root path of the Rails app, or the app folder? How would I then disallow e.g. the img folder? Do I have to write the paths as I see them in the app folder, or as the paths appear on the live site, e.g. http://example.com/admin ?

Answer 1: you have to put your robots.txt in
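A hedged sketch assuming the usual Rails convention: robots.txt lives in the public/ directory (public/robots.txt), since everything in public/ is served from the site root, and the paths inside it are URL paths as a visitor sees them, not filesystem paths under app/. For example, to keep crawlers out of http://example.com/admin and out of an image folder exposed at /img (hypothetical path):

User-agent: *
Disallow: /admin
Disallow: /img/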

robots.txt parser java

筅森魡賤 submitted on 2019-12-04 06:19:19
I want to know how to parse robots.txt in Java. Is there already any code for this?

Heritrix is an open-source web crawler written in Java. Looking through their javadoc, I see that they have a utility class, Robotstxt, for parsing the robots.txt file.

There is also the jrobotx library hosted at SourceForge. (Full disclosure: I spun off the code that forms that library.)

anastluc: There is also a new release of crawler-commons: https://github.com/crawler-commons/crawler-commons The library aims to implement functionality common to any web crawler, and this includes a very handy robots.txt parser.

Source: https:/

Do related subfolders need to be disallowed separately in robots.txt?

耗尽温柔 submitted on 2019-12-04 06:10:22
Question: Will disallowing a certain folder in robots.txt also disallow its subfolders?

Example:

Disallow: /folder/

Will it match:

/folder/page
/folder/subfolder/page

Or will it just match:

/folder/page

So if the second case is true, do I need to disallow the second and subsequent subfolders separately?

Disallow: /folder/
Disallow: /folder/subfolder/
Disallow: /folder/subfolder/onemorefolder

Answer 1: Robots.txt has no concept of "folders"; it's just strings. Whatever you specify in Disallow is the beginning of the
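A sketch of the prefix matching the answer describes: the Disallow value is compared against the start of the URL path, so a single rule covers every deeper path and the extra per-subfolder lines are redundant. With the record below, /folder/page, /folder/subfolder/page and /folder/subfolder/onemorefolder/page are all blocked for compliant crawlers:

User-agent: *
Disallow: /folder/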