robots.txt

Robots.txt Disallow Certain Folder Names

依然范特西╮ submitted on 2019-12-05 09:06:45
I want to disallow robots from crawling any folder named this-folder, at any position in the URL. Examples to disallow:

http://mysite.com/this-folder/
http://mysite.com/houses/this-folder/
http://mysite.com/some-other/this-folder/
http://mysite.com/no-robots/this-folder/

This is my attempt:

Disallow: /.*this-folder/

Will this work? Officially, globbing and regular expressions are not supported ( http://www.robotstxt.org/robotstxt.html ), but apparently some search engines support this.

Source: https://stackoverflow.com/questions/3501661/robots-txt-disallow-certain-folder-names
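For crawlers that do honor wildcards (Google and Bing document the * and $ operators, even though robotstxt.org does not), a minimal sketch of such a rule could look like this; * already matches any sequence of characters, so the regex-style /.* form is not needed:

User-agent: *
Disallow: /*this-folder/

Crawlers that read the value literally will never match it against a real URL, so this fails in the "allow" direction rather than blocking unintended pages.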

robots.txt allow all except few sub-directories

点点圈 submitted on 2019-12-05 07:52:28
I want my site to be indexed by search engines, except for a few sub-directories. These are my robots.txt settings:

robots.txt in the root directory:

User-agent: *
Allow: /

Separate robots.txt in the sub-directory (to be excluded):

User-agent: *
Disallow: /

Is this the correct way, or will the root-directory rule override the sub-directory rule?

unor: No, this is wrong. You can't have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host. If you want to disallow crawling of URLs whose paths begin with /foo , use this record in your robots.txt ( http://example
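A minimal sketch of the single root-level robots.txt the answer describes, using /foo/ and /bar/ as placeholder names for the sub-directories to exclude; everything not listed under Disallow is crawlable by default, so a blanket Allow: / line is not required:

User-agent: *
Disallow: /foo/
Disallow: /bar/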

Googlebots Ignoring robots.txt? [closed]

点点圈 submitted on 2019-12-05 03:19:08
Closed. This question is off-topic. It is not currently accepting answers. Closed 7 years ago.

I have a site with the following robots.txt in the root:

User-agent: *
Disabled: /

User-agent: Googlebot
Disabled: /

User-agent: Googlebot-Image
Disallow: /

And pages within this site are getting scanned by Googlebots all day long. Is there something wrong with my file or with Google?

It should be Disallow: , not Disabled: .

Maybe give the Google robots.txt checker a try. Google have an analysis tool for checking
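A corrected version of that file along the lines the answer suggests, with the misspelled Disabled: replaced by the valid Disallow: directive (Disallow: / blocks the whole site for the named agents, whereas an empty Disallow: would allow everything):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Image
Disallow: /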

Can I use the “Host” directive in robots.txt?

∥☆過路亽.° submitted on 2019-12-05 03:16:27
Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.myhost.com

Also, the Wikipedia article states that Google understands the Host directive too, but there wasn't much (i.e. no) information. At robotstxt.org I didn't find anything on Host (or Crawl-delay, as stated on Wikipedia). Is it encouraged to use the Host directive at all? Are there any resources at Google on this robots.txt specific? How is

Ban robots from website [closed]

若如初见. submitted on 2019-12-04 23:49:25
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 5 years ago.

My website is often down because a spider is accessing too many resources. This is what the hosting provider told me. They told me to ban these IP addresses:

46.229.164.98
46.229.164.100
46.229.164.101

But I have no idea how to do this. I've googled a bit, and I've now added these lines to .htaccess in the root:

# allow

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

回眸只為那壹抹淺笑 submitted on 2019-12-04 22:42:16
Below is a sample robots.txt file to allow multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and would be different in a real robots.txt file. I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method.

Questions:
(1) Can each user agent have its own crawl-delay? (I assume yes)
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?
(3) Does there have to be a blank
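A sketch of the layout the question is asking about, with one group per user agent and an illustrative Crawl-delay inside each group. The delay values are placeholders, and some crawlers (Googlebot among them) ignore Crawl-delay entirely; within a group, directive order generally does not matter to parsers that support it, so placing Crawl-delay before or after the Allow / Disallow lines is a style choice:

User-agent: bingbot
Crawl-delay: 5
Allow: /
Disallow: /private/

User-agent: Yandex
Crawl-delay: 10
Allow: /
Disallow: /private/

User-agent: *
Disallow: /private/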

Does the user agent string have to be exactly as it appears in my server logs?

天大地大妈咪最大 submitted on 2019-12-04 13:39:41
When using a robots.txt file, does the user agent string have to be exactly as it appears in my server logs? For example, when trying to match Googlebot, can I just use googlebot ? Also, will a partial match work, for example just using Google ?

Yes, the user agent has to be an exact match. From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"

At least for googlebot, the user-agent is not case-sensitive. Read the 'Order of precedence for user-agents' section: https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html
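A short sketch of how the robots.txt token relates to what the server logs show, per the Google documentation linked above: the group is selected by the crawler's product token (here Googlebot), not by the full Mozilla/5.0-style user-agent line recorded in the logs, and for Google's crawlers that token match is case-insensitive:

User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow: /no-bots/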

Rails robots.txt folders

北慕城南 submitted on 2019-12-04 09:03:59
Question: I'm about to launch a Rails app and, as the last task, I want to set up the robots.txt file. I couldn't find information about how the paths should be written properly for a Rails app. Is the starting path always the root path of the Rails app, or the app folder? How would I then disallow e.g. the img folder? Do I have to write the paths as I see them in the app folder, or as the paths appear on the live site, e.g. http://example.com/admin ?

Answer 1: you have to put your robots.txt in
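A hedged sketch assuming the usual Rails convention: robots.txt lives in the public/ directory (public/robots.txt), since everything in public/ is served from the site root, and the paths inside it are URL paths as a visitor sees them, not filesystem paths under app/. For example, to keep crawlers out of http://example.com/admin and out of an image folder exposed at /img (hypothetical path):

User-agent: *
Disallow: /admin
Disallow: /img/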

robots.txt parser java

筅森魡賤 submitted on 2019-12-04 06:19:19
I want to know how to parse robots.txt in Java. Is there already any code for this?

Heritrix is an open-source web crawler written in Java. Looking through their javadoc, I see that they have a utility class, Robotstxt, for parsing the robots.txt file.

There is also the jrobotx library hosted at SourceForge. (Full disclosure: I spun off the code that forms that library.)

anastluc: There is also a new release of crawler-commons: https://github.com/crawler-commons/crawler-commons The library aims to implement functionality common to any web crawler, and this includes a very handy robots.txt parser.

Source: https:/

Do related subfolders need to be disallowed separately in robots.txt?

耗尽温柔 submitted on 2019-12-04 06:10:22
Question: Will disallowing a certain folder in robots.txt also disallow its subfolders?

Example:

Disallow: /folder/

Will it match:

/folder/page
/folder/subfolder/page

Or will it just match:

/folder/page

So if the second case is true, do I need to disallow the second and subsequent subfolders separately?

Disallow: /folder/
Disallow: /folder/subfolder/
Disallow: /folder/subfolder/onemorefolder

Answer 1: Robots.txt has no concept of "folders"; it's just strings. Whatever you specify in Disallow is the beginning of the
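A sketch of the prefix matching the answer describes: the Disallow value is compared against the start of the URL path, so a single rule covers every deeper path and the extra per-subfolder lines are redundant. With the record below, /folder/page, /folder/subfolder/page and /folder/subfolder/onemorefolder/page are all blocked for compliant crawlers:

User-agent: *
Disallow: /folder/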