robots.txt

How to add a route to a dynamic robots.txt in ASP.NET MVC?

Submitted by 最后都变了 on 2019-12-30 08:52:52
Question: I have a robots.txt that is not static but generated dynamically. My problem is creating a route from root/robots.txt to my controller action.

This works:

    routes.MapRoute(
        name: "Robots",
        url: "robots",
        defaults: new { controller = "Home", action = "Robots" });

This doesn't work:

    routes.MapRoute(
        name: "Robots",
        url: "robots.txt", /* this is the only thing I've changed */
        defaults: new { controller = "Home", action = "Robots" });

The ".txt" apparently causes ASP to barf.

Answer 1: You need to add
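The answer is truncated here. A common fix for this problem, offered as a sketch and not necessarily what the original answer said, is to make IIS hand the .txt request to the managed pipeline (for example with a web.config handler mapping for path "robots.txt" using System.Web.Handlers.TransferRequestHandler, or runAllManagedModulesForAllRequests="true") and then serve the file from the action:

    // Sketch of a dynamic robots.txt action (assumed names; the
    // "robots.txt" URL must reach MVC routing, e.g. via a
    // TransferRequestHandler mapping in web.config, before this
    // action is ever hit).
    using System.Text;
    using System.Web.Mvc;

    public class HomeController : Controller
    {
        public ActionResult Robots()
        {
            var sb = new StringBuilder();
            sb.AppendLine("User-agent: *");
            sb.AppendLine("Disallow: /admin/"); // rules built dynamically
            return Content(sb.ToString(), "text/plain", Encoding.UTF8);
        }
    }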

Robots.txt: Allow a subfolder but not the parent

Submitted by 对着背影说爱祢 on 2019-12-30 05:38:08
Question: Can anybody please explain the correct robots.txt command for the following scenario? I would like to allow access to /directory/subdirectory/.. but I would also like to restrict access to /directory/, notwithstanding the above exception.

Answer 1: Be aware that there is no real official standard and that any web crawler may happily ignore your robots.txt. According to a Google Groups post, the following works at least with GoogleBot:

    User-agent: Googlebot
    Disallow: /directory/
    Allow: /directory
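The excerpt cuts off mid-directive; filling in the path from the question (my completion, not verbatim from the answer), the rule set would be:

    User-agent: Googlebot
    Allow: /directory/subdirectory/
    Disallow: /directory/

Googlebot applies the most specific (longest) matching rule, so the Allow wins for anything under /directory/subdirectory/; for crawlers that honor rules in file order instead, putting the Allow line first keeps the behavior consistent.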

Facebook and Crawl-delay in Robots.txt?

Submitted by 牧云@^-^@ on 2019-12-30 04:07:25
Question: Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?

Answer 1: We don't have a crawler. We have a scraper that scrapes meta data on pages that have Like buttons / are shared on FB.

Answer 2: No, it doesn't respect robots.txt. Contrary to other answers here, facebookexternalhit behaves like the meanest of crawlers. Whether it got the URLs it requests from crawling or from Like buttons doesn't matter so much when it goes through every one of those at an insane rate. We
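The second answer is cut off; since Crawl-delay is ignored, the practical recourse it is heading toward is throttling on your own server. A rough sketch (apart from the user-agent string, every name and threshold here is an assumption):

    // Crude per-minute budget for facebookexternalhit; over budget,
    // answer 429 so the scraper backs off. The static counters are
    // not thread-safe; a real version would use Interlocked or a cache.
    using System;
    using System.Web.Mvc;

    public class FacebookThrottleAttribute : ActionFilterAttribute
    {
        private static int _hits;
        private static DateTime _windowStart = DateTime.UtcNow;

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var ua = filterContext.HttpContext.Request.UserAgent ?? "";
            if (ua.IndexOf("facebookexternalhit", StringComparison.OrdinalIgnoreCase) < 0)
                return;

            if ((DateTime.UtcNow - _windowStart).TotalMinutes >= 1)
            {
                _windowStart = DateTime.UtcNow;
                _hits = 0;
            }
            if (++_hits > 60) // assumed budget: 60 requests per minute
                filterContext.Result = new HttpStatusCodeResult(429, "Too Many Requests");
        }
    }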

Is the User-Agent line in robots.txt an exact match or a substring match?

Submitted by 瘦欲@ on 2019-12-29 08:49:14
Question: When a crawler reads the User-agent line of a robots.txt file, does it attempt to match it exactly against its own user agent, or does it match it as a substring of its user agent? Everything I have read fails to answer this question explicitly. According to another StackOverflow thread it is an exact match. However, the RFC draft makes me believe that it is a substring match. For example, User-agent: Google will match "Googlebot" and "Googlebot-News". Here is the relevant quotation from
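To make the difference concrete (this example is mine, not the post's): Google documents that its crawlers obey the group whose User-agent token is the most specific match for their name, so with

    User-agent: Googlebot-News
    Disallow: /archive/

    User-agent: Googlebot
    Disallow: /private/

Googlebot-News follows only the first group and plain Googlebot only the second, which is neither a strict exact match on the full user-agent string nor a naive substring match against all groups.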

How may I prevent search engines from crawling a subdomain on my website?

Submitted by 拥有回忆 on 2019-12-25 06:06:26
Question: I have cPanel installed on my website. I went to the Domains section in cPanel and clicked on Subdomains. I assigned the subdomain name (e.g. personal.mywebsite.com). It also wanted me to assign a document root folder, so I assigned mywebsite.com/personal. If I create robots.txt in my website root (e.g. website.com):

    User-agent: *
    Disallow: /personal/

Can it also block personal.mywebsite.com? What should I do? Thanks.

Answer 1: When you want to block URLs on personal.example.com, visit http://personal
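The answer is truncated; the underlying rule is that robots.txt is per host, so a file on mywebsite.com can never block personal.mywebsite.com. A sketch of the fix under this cPanel layout (paths assumed from the question): place a second robots.txt in the subdomain's document root, mywebsite.com/personal/, containing

    User-agent: *
    Disallow: /

Crawlers fetch it as http://personal.mywebsite.com/robots.txt and stop crawling the subdomain; the same file is also reachable at mywebsite.com/personal/robots.txt, but in that location it has no effect on the main domain's crawling.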

How should a robots.txt file be properly written for subdomains?

Submitted by China☆狼群 on 2019-12-25 05:21:54
Question: Can someone explain how I should write a robots.txt file if I want all crawlers to index the root and some specific subdomains?

    User-agent: *
    Allow: /
    Allow: /subdomain1/
    Allow: /subdomain2/

Is this right? And where should I put it? In the root (public_html) folder or in each subdomain folder?

Answer 1: There is no way to specify rules for different subdomains within a single robots.txt file. A given robots.txt file will only control crawling of the subdomain it was requested from. If you want to
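The answer breaks off; its logical continuation is that you need one robots.txt per subdomain, each placed in that subdomain's own document root. Under the cPanel-style layout from the question (folder names assumed), that would mean public_html/robots.txt for the root site plus, e.g., public_html/subdomain1/robots.txt, each with its own rules such as:

    User-agent: *
    Disallow:

An empty Disallow allows everything, which also makes the Allow: /subdomain1/ lines in the question unnecessary; paths in a robots.txt are always relative to the host it was fetched from, never to sibling subdomains.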

Implementing "Report this content" and detecting spammer- or robot-triggered events

Submitted by 一世执手 on 2019-12-25 04:56:23
Question: I'm creating a forum for a website and plan on implementing a "Report this content" function. In all honesty, I'm not sure how useful (lit. necessary) the feature will be, since a user account (created by an admin) will be required for posting, but the solution interests me. So, in short, this is the scenario: all users will have read-only access to all (non-restricted) content on the forum. For unidentified users, a reply button and a "report this content" button will be present. The
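The question is cut off, but for the robot-detection half a cheap first line of defense is a honeypot field: an input hidden from humans with CSS that form-filling bots tend to populate. A sketch in ASP.NET MVC to match the stack used elsewhere on this page (the post itself names no stack; all names here are invented):

    // Report endpoint with a honeypot parameter. The form renders an
    // extra input named "website" hidden via CSS; a human leaves it
    // blank, a naive bot fills it in, and we drop the report silently.
    using System.Web.Mvc;

    public class ForumController : Controller
    {
        [HttpPost]
        public ActionResult Report(int postId, string reason, string website)
        {
            if (!string.IsNullOrEmpty(website))
                return new HttpStatusCodeResult(204); // bot: pretend success

            // TODO: queue { postId, reason } for moderator review.
            return RedirectToAction("ReportReceived");
        }
    }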

Remove pages from Google (dynamic URL) - robots.txt

Submitted by 落爺英雄遲暮 on 2019-12-24 18:06:12
Question: I have a few links on Google of the form domain.com/results.php?name=a&address=b. The results page/parameters have now been renamed and I need to remove the existing links from Google etc. I tried

    User-agent: *
    Disallow: /results.php

in robots.txt, and then in Google Webmaster Tools I added the URL to be removed: domain.com/results.php. It says it was removed successfully; however, when I look at Google and type domain.com, the existing URLs with parameters are all still there. What am I doing wrong? There are
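A likely explanation (mine, not from the truncated thread): the removal tool removes the exact URL you submit, and domain.com/results.php without parameters is not the same URL as domain.com/results.php?name=a&address=b, so the parameterized pages stay indexed. Note also that Disallow only stops crawling; for the pages to be deindexed they must remain crawlable and serve a noindex signal, e.g.

    <meta name="robots" content="noindex">

(or the equivalent X-Robots-Tag: noindex response header), with the Disallow added only after the URLs have dropped out of the index.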

Google is ignoring my robots.txt [closed]

Submitted by 不羁岁月 on 2019-12-23 19:24:01
Question: (Closed as off-topic on Stack Overflow; no longer accepting answers.) Here is the content of my robots.txt file:

    User-agent: *
    Disallow: /images/
    Disallow: /upload/
    Disallow: /admin/

As you can see, I explicitly disallowed all robots from indexing the folders images, upload, and admin. The problem is that one of my clients sent a request for removing the content from the images folder
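The question is cut off, but the usual resolution (stated here as general background, not the thread's answer) is that robots.txt is not being ignored so much as misunderstood: Disallow only blocks crawling, and URLs Google already knows about can stay in the index, sometimes precisely because the block prevents Google from re-fetching them and seeing that they are gone. To actually get the images removed, use the URL removal tool in Search Console and, for a durable signal, serve a noindex on the image responses via the HTTP header

    X-Robots-Tag: noindex

which, again, only works while the files remain crawlable, so the Disallow: /images/ line has to come out first.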