robots.txt

Ignore URLs in robots.txt with specific parameters?

感情迁移 submitted on 2019-11-27 11:45:19
Question: I would like Google to ignore URLs like this one:
http://www.mydomain.com/new-printers?dir=asc&order=price&p=3
All URLs that have the parameters dir, order and price should be ignored, but I don't have any experience with robots.txt. Any ideas?

Answer 1: Here's a solution if you want to disallow query strings:

    Disallow: /*?*

Or, if you want to be more precise about your query string:

    Disallow: /*?dir=*&order=*&p=*

You can also add an Allow rule to the robots.txt for the URL that should stay crawlable:

    Allow: /new-printer$

The $ makes sure that only /new-printer exactly is allowed.

More info: http://code.google.com/web/controlcrawlindex/docs/robots
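Putting those directives together, a rough sketch of a complete robots.txt for this case (the combination is an illustration, not the answer verbatim):

    User-agent: *
    Allow: /new-printer$
    Disallow: /*?dir=*&order=*&p=*

With this combination /new-printer itself stays crawlable, while parameterized variants such as the example URL above are blocked.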

What is the smartest way to handle robots.txt in Express?

旧巷老猫 submitted on 2019-11-27 11:04:49
Question: I'm currently working on an application built with Express (Node.js), and I want to know the smartest way to handle different robots.txt files for different environments (development, production). This is what I have right now, but I'm not convinced by the solution; I think it is dirty:

    app.get '/robots.txt', (req, res) ->
      res.set 'Content-Type', 'text/plain'
      if app.settings.env == 'production'
        res.send 'User-agent: *\nDisallow: /signin\nDisallow: /signup\nDisallow: /signout\nSitemap:
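One arguably cleaner pattern is to keep the per-environment bodies in a plain object and read from it in a single route. A sketch in the same CoffeeScript style; the rule strings for each environment are illustrative assumptions, not the asker's actual policy:

    # Per-environment robots.txt bodies (the contents here are only an example).
    robots =
      production: 'User-agent: *\nDisallow: /signin\nDisallow: /signup\nDisallow: /signout'
      development: 'User-agent: *\nDisallow: /'

    # One route; unknown environments fall back to the development rules.
    app.get '/robots.txt', (req, res) ->
      res.type 'text/plain'
      res.send robots[app.settings.env] ? robots.development

This keeps the environment switch out of the route body and makes adding, say, a staging entry a one-line change.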

robots.txt file for different domains of the same site

痞子三分冷 submitted on 2019-11-27 10:22:03
Question: I have an ASP.NET MVC 4 web application that can be accessed from multiple different domains. The site is fully localized based on the domain in the request (similar in concept to this question). I want to include a robots.txt file and I want to localize it based on the domain, but I am aware that I can only have one physical "robots.txt" text file in a site's file system directory. What is the easiest/best way (and is it even possible) to use the ASP.NET MVC framework to

Robots.txt: Is this wildcard rule valid?

对着背影说爱祢 submitted on 2019-11-27 09:17:41
Simple question. I want to add:

    Disallow: */*details-print/

Basically, I want blocking rules of the form /foo/bar/dynamic-details-print, where foo and bar in this example can also be totally dynamic. I thought this would be simple, but then on www.robotstxt.org there is this message:

    Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

So we can't do that? Do search engines
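For what it's worth, Google and Bing document support for the * wildcard (and the $ end anchor) as extensions to the original robotstxt.org specification. In that extended syntax the rule would look roughly like this (a sketch, not part of the truncated answer above; the pattern must start with /, and crawlers that only follow the original spec will treat it as a literal path):

    User-agent: *
    Disallow: /*details-print/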

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

蹲街弑〆低调 submitted on 2019-11-27 08:27:54
Question: I created a PHP page that is only accessible by means of a token/pass received through $_GET. Therefore, if you go to the following URL you'll get a generic or blank page:
http://fakepage11.com/secret_page.php
However, if you use the link with the token, it shows you special content:
http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4
Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessed through
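At the robots.txt level, a minimal sketch for this page would be a prefix Disallow (the path comes from the question; note that Disallow only blocks crawling, so the URL can still be indexed without content if someone links to it, which is where the meta tags from the title come in):

    User-agent: *
    Disallow: /secret_page.php

Because Disallow is a prefix match, this covers both the bare URL and the token variants.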

How to noindex specific URLs?

妖精的绣舞 submitted on 2019-11-27 08:23:20
Question: I was searching around for how to noindex specific URLs, but I haven't found any specific info on the following. By adding the code below:

    <?php if(is_single(X)): ?>
    <meta name="robots" content="noindex,nofollow">
    <?php endif; ?>

I would be able to noindex the (X), where X could be the post ID, the post title of "Hello World" for example, or a post slug of "hello-world". Would it be possible to specify all URLs which start with the same post slug or title, for example, as in the example below? www

Is there any advantage to using X-Robots-Tag instead of robots.txt?

大城市里の小女人 submitted on 2019-11-27 08:14:30
Question: It looks like there are two mainstream solutions for instructing crawlers what to index and what not to index: adding an X-Robots-Tag HTTP header, or providing a robots.txt. Is there any advantage to using the former?

Answer 1: With robots.txt you cannot disallow indexing of your documents. They have different purposes:
- robots.txt can disallow crawling (with Disallow)
- X-Robots-Tag ¹ can disallow indexing (with noindex)
(And both offer additional different features, e.g., linking to your Sitemap
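A minimal side-by-side illustration of the two mechanisms (the /private/ path and the document it protects are hypothetical):

    # robots.txt – controls crawling
    User-agent: *
    Disallow: /private/

    # HTTP response header on a document – controls indexing
    X-Robots-Tag: noindex

A URL blocked only by robots.txt can still end up in the index (without content) if other pages link to it, while the header can only take effect on documents the crawler is actually allowed to fetch.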

I cannot access Robots.txt in Spring-MVC

て烟熏妆下的殇ゞ submitted on 2019-11-27 08:05:26
Question: I am trying to give access to robots.txt in Spring-MVC. To test the code, I put robots.txt in WebContent, Root and WEB-INF, but I cannot access any of them. I've already applied the answers to these questions (1, 2, 3) to no avail. My code:

    <mvc:resources mapping="/resources/**" location="/resources/" />
    <mvc:resources mapping="/robots.txt" location="/robots.txt" order="0" />
    <mvc:annotation-driven />

Answer 1: This works for me: put robots.txt directly under webapp. In mvc-dispatcher-servlet.xml have:
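The rest of that answer is cut off above. As a hedged sketch of another configuration that is often suggested for this situation (an assumption on my part, not the truncated answer's actual snippet): place robots.txt directly under the webapp folder and let the container's default servlet serve it by registering the default-servlet-handler in mvc-dispatcher-servlet.xml:

    <!-- Requests the DispatcherServlet cannot map (e.g. /robots.txt as a static
         file under webapp) fall through to the container's default servlet. -->
    <mvc:default-servlet-handler />
    <mvc:annotation-driven />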

Multiple User Agents in Robots.txt

寵の児 submitted on 2019-11-27 07:52:59
Question: In my robots.txt file I have the following sections:

    User-Agent: Bot1
    Disallow: /A

    User-Agent: Bot2
    Disallow: /B

    User-Agent: *
    Disallow: /C

Will the statement Disallow: /C be visible to Bot1 and Bot2?

Answer 1: tl;dr: No, Bot1 and Bot2 will happily crawl paths starting with C. Each bot only ever complies with at most a single record (block).

Original spec: In the original specification it says: If the value is '*', the record describes the default access policy for any robot that has not matched any of the other
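To illustrate the answer's point: a compliant Bot1 matches only the first record and stops there, so the rules it effectively applies are just (a restatement of the record above, not new directives):

    User-Agent: Bot1
    Disallow: /A

The catch-all * record is used only by robots that did not match any named record, which is why /C remains crawlable for Bot1 and Bot2.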

Should I remove meta-robots (index, follow) when I have a robots.txt?

笑着哭i submitted on 2019-11-27 07:30:36
Question: I'm a bit confused about whether I should remove the robots meta tag if I want search engines to follow my robots.txt rules. If the robots meta tag (index, follow) exists on a page, will search engines then ignore my robots.txt file and index the URLs disallowed in my robots.txt anyway? The reason I'm asking is that search engines (Google mainly) still index disallowed pages from my website.

Answer 1: If a search engine's bot honors your robots.txt, and you disallow
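For context on the interaction (the /private/ path below is hypothetical): with

    User-agent: *
    Disallow: /private/

in robots.txt, a compliant bot never fetches pages under /private/, so any robots meta tag inside those pages, whether index,follow or noindex, is never even read. The disallowed URL can still appear in results without content if other sites link to it, which is the usual reason "disallowed" pages keep showing up in Google.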