robots.txt

robots.txt file for different domains of the same site

百般思念 submitted on 2019-11-28 17:05:31
I have an ASP.NET MVC 4 web application that can be accessed from multiple different domains. The site is fully localized based on the domain in the request (similar in concept to this question). I want to include a robots.txt file, and I want to localize it based on the domain, but I am aware that I can only have one physical robots.txt file in a site's file system directory. What is the easiest/best way (and is it even possible) to use the ASP.NET MVC framework to serve a robots.txt file on a per-domain basis, so that the same site installation serves content to…
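
One commonly suggested approach, sketched below under assumptions the question does not confirm (the RobotsController name and the per-host bodies are hypothetical): route requests for robots.txt to a controller action and pick the response body from the host the request arrived on.

    // RouteConfig.cs — register before the default route. Serving a *.txt URL
    // through MVC may also need runAllManagedModulesForAllRequests="true"
    // (or an explicit handler mapping) in web.config.
    routes.MapRoute(
        name: "Robots",
        url: "robots.txt",
        defaults: new { controller = "Robots", action = "Index" });

    // RobotsController.cs
    using System.Text;
    using System.Web.Mvc;

    public class RobotsController : Controller
    {
        public ContentResult Index()
        {
            // Choose a per-domain robots.txt body; a real site might load
            // these from files or resources keyed by host name.
            string host = Request.Url.Host.ToLowerInvariant();
            string body = host.EndsWith(".de")
                ? "User-agent: *\nDisallow: /en/"
                : "User-agent: *\nDisallow: /de/";
            return Content(body, "text/plain", Encoding.UTF8);
        }
    }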

Are robots.txt and metadata tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

ぃ、小莉子 submitted on 2019-11-28 14:10:15
I created a PHP page that is only accessible by means of a token/pass received through $_GET. If you go to the following URL you get a generic or blank page: http://fakepage11.com/secret_page.php However, if you use the link with the token, it shows you special content: http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4 Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and is only accessed through the provided link. Are dynamic pages that depend on $_GET variables indexed by Google and other…
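
A sketch of the mechanism usually recommended for this case (the question does not say which was chosen): have the page itself send a noindex signal, because a robots.txt Disallow only stops crawling, and a blocked URL that is linked elsewhere can still end up in search results.

    X-Robots-Tag: noindex

or, equivalently, in the page's HTML head:

    <meta name="robots" content="noindex">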

Is there any advantage to using X-Robots-Tag instead of robots.txt?

僤鯓⒐⒋嵵緔 submitted on 2019-11-28 14:08:56
It looks like there are two mainstream solutions for instructing crawlers what to index and what not to index: adding an X-Robots-Tag HTTP header, or providing a robots.txt file. Is there any advantage to using the former?

unor: With robots.txt you cannot disallow indexing of your documents. The two have different purposes: robots.txt can disallow crawling (with Disallow), while X-Robots-Tag can disallow indexing (with noindex). (And both offer additional, different features, e.g., linking to your Sitemap in robots.txt, disallowing following links in X-Robots-Tag, and many more.) Crawling means accessing…
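
A side-by-side sketch of the two mechanisms (host and path are placeholders):

    # robots.txt — disallows crawling
    User-agent: *
    Disallow: /private/

    # HTTP response header on a document — disallows indexing
    X-Robots-Tag: noindex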

I cannot access robots.txt in Spring MVC

随声附和 submitted on 2019-11-28 14:06:35
I am trying to give access to robots.txt in Spring MVC. To test the code I put robots.txt in WebContent, the root, and WEB-INF, but I cannot access any of them. I have already applied the answers to these questions 1, 2, 3 to no avail. My code:

    <mvc:resources mapping="/resources/**" location="/resources/" />
    <mvc:resources mapping="/robots.txt" location="/robots.txt" order="0" />
    <mvc:annotation-driven />

This works for me: put robots.txt directly under webapp, and in mvc-dispatcher-servlet.xml have:

    <mvc:default-servlet-handler/>
    <mvc:resources mapping="/resources/**" location="/, classpath:/META-INF/web…
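
Pulled together, a minimal sketch of the working configuration the answer describes (assuming robots.txt sits directly under src/main/webapp):

    <!-- mvc-dispatcher-servlet.xml -->
    <mvc:annotation-driven />
    <!-- Lets requests not matched by a controller, such as /robots.txt,
         fall through to the container's default servlet, which serves
         static files from the webapp root. -->
    <mvc:default-servlet-handler />
    <mvc:resources mapping="/resources/**" location="/resources/" />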

Multiple User Agents in Robots.txt

冷暖自知 submitted on 2019-11-28 13:34:53
In a robots.txt file I have the following sections:

    User-Agent: Bot1
    Disallow: /A

    User-Agent: Bot2
    Disallow: /B

    User-Agent: *
    Disallow: /C

Will the statement Disallow: /C be visible to Bot1 and Bot2?

unor: tl;dr: No, Bot1 and Bot2 will happily crawl paths starting with /C. Each bot only ever complies with at most a single record (block).

Original spec: In the original specification it says: "If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records."

Expired RFC draft: The original spec, including some additions (like Allow), became a draft for…
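
The effective policy per bot, spelled out:

    Bot1        → Disallow: /A   (only its own record applies)
    Bot2        → Disallow: /B
    all others  → Disallow: /C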

How to stop Google from indexing my GitHub repository

半腔热情 submitted on 2019-11-28 03:03:43
I use GitHub to store the text of one of my web sites, but the problem is that Google indexes the text on GitHub as well, so the same text shows up both on my site and on GitHub: in this search, the top hit is my site and the second hit is the GitHub repository. I don't mind if people see the sources, but I don't want Google to index it (and maybe penalize me for duplicate content). Is there any way, besides taking the repository private, to tell Google to stop indexing it? What happens in the case of GitHub Pages? Those are sites where the source is in a GitHub repository. Do they have the same…
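
For the GitHub Pages half of the question, one relevant detail (an assumption of this sketch, not something the question text confirms): robots.txt is only honored at the root of a host, so a user site served from username.github.io can ship its own robots.txt in the repository root, whereas pages under github.com are governed by GitHub's own robots.txt.

    # robots.txt committed at the root of a username.github.io repository
    User-agent: *
    Disallow: /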

Wildcards in robots.txt

只愿长相守 submitted on 2019-11-28 02:14:15
If in a WordPress website I have categories in this order:

    -Parent
    --Child
    ---Subchild

and I have permalinks set to %category%/%postname%, let's use an example. I create a post with the post name "Sport game"; its slug is sport-game, and its full URL is domain.com/parent/child/subchild/sport-game. The reason I use this kind of permalink is precisely to make blocking some content in robots.txt easier. And now here is the part I have a question about. In robots.txt:

    User-agent: Googlebot
    Disallow: /parent/*
    Disallow: /parent/*/*
    Disallow: /parent/*/*/*

Is the meaning of this rule that it blocks domain.com/parent…
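
A note on the semantics (standard robots.txt behavior, not stated in the question): Disallow values are path prefixes, and Googlebot additionally treats * as matching any sequence of characters, so the three wildcard lines are redundant; a single prefix rule already covers every depth:

    User-agent: Googlebot
    Disallow: /parent/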

Order of directives in robots.txt: do they overwrite each other or complement each other?

风格不统一 submitted on 2019-11-28 01:37:26
    User-agent: Googlebot
    Disallow: /privatedir/

    User-agent: *
    Disallow: /

Now, what is disallowed for Googlebot: /privatedir/, or the whole website /?

According to the original robots.txt specification: a bot must follow the first record that matches its user-agent name. If such a record doesn't exist, it must follow the record with User-agent: * (this line may not appear in more than one record). If no such record exists either, it doesn't have to follow any record. So a bot never follows more than one record. For your example this means: a bot that matches the name "Googlebot" is not allowed…
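
The outcome the answer is spelling out, made explicit (this completion follows directly from the quoted rule):

    Googlebot       → only /privatedir/ is disallowed; everything else may be crawled
    all other bots  → the whole site (/) is disallowed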

How do I disallow a specific page in robots.txt

二次信任 submitted on 2019-11-28 00:39:23
I am creating two pages on my site that are very similar but serve different purposes. One is to thank users for leaving a comment and the other is to encourage users to subscribe. I don't want the duplicate content, but I do want the pages to be available. Can I set the sitemap to hide one? Would I do this in the robots.txt file? The disallow looks like this:

    Disallow: /wp-admin

How would I customize it for a specific page like http://sweatingthebigstuff.com/thank-you-for-commenting ?

AlexanderMP: Disallow: /thank-you-for-commenting in robots.txt. Take a look at last.fm's robots.txt file for…
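
A Disallow line only takes effect inside a User-agent record, so the complete file would look something like this (a sketch; the grouping is standard robots.txt syntax, not shown in the answer):

    User-agent: *
    Disallow: /wp-admin
    Disallow: /thank-you-for-commenting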

Can a relative sitemap URL be used in a robots.txt?

天涯浪子 submitted on 2019-11-27 17:05:25
In robots.txt, can I write the following relative URL for the sitemap file?

    sitemap: /sitemap.ashx

Or do I have to use the complete (absolute) URL for the sitemap file, like:

    sitemap: http://subdomain.domain.com/sitemap.ashx

Why I wonder: I own a new blog service, www.domain.com, that allows users to blog on accountname.domain.com. I use wildcards, so all subdomains (accounts) point to blog.domain.com. In blog.domain.com I put the robots.txt to let search engines find the sitemap. But, due to the wildcards, all user accounts share the same robots.txt file. That's why I can't use the second…
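
For reference, the sitemaps.org protocol asks for the full URL of the sitemap in the Sitemap directive, which is what makes the shared wildcard robots.txt awkward. One workaround (an assumption here, echoing the first question on this page rather than anything this question confirms) is to generate robots.txt dynamically and substitute the requested host:

    # emitted per request, with the host name filled in dynamically
    Sitemap: http://accountname.domain.com/sitemap.ashx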