Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

Submitted by ぃ、小莉子 on 2019-11-28 14:10:15
unor

If a search engine bot finds the link with the token somehow¹, it may crawl and index it.

If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).
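
As a rough illustration of what "disallow crawling" means for a conforming bot, here is a minimal sketch using Python's standard `urllib.robotparser`. The path `/secret.php`, the token value, and the bot name `ExampleBot` are assumptions made up for the example; they are not taken from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; /secret.php is an assumed path for the "hidden" page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret.php
"""


def may_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules above allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl("https://example.com/secret.php?token=abc123"))  # False
print(may_crawl("https://example.com/index.php"))                # True
```

The first URL is blocked from crawling, but nothing in these rules keeps the URL itself out of an index; that is the gap described above.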

If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.

You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
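
The conflict can be sketched with a deliberately simplified model of a conforming bot. This is only an assumption about how such a bot reasons, not how any real search engine is implemented; the names `bot_decision`, `MetaRobotsFinder`, `ExampleBot`, and the URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsFinder(HTMLParser):
    """Collects the directives of any <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def bot_decision(robots_txt: str, url: str, html: str) -> str:
    """Simplified model of a conforming bot deciding what to do with `url`."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("ExampleBot", url):
        # The page is never downloaded, so its meta-robots element is never seen.
        return "not crawled; URL may still be indexed"
    finder = MetaRobotsFinder()
    finder.feed(html)
    if "noindex" in finder.directives:
        return "crawled; not indexed"
    return "crawled; may be indexed"


PAGE = '<html><head><meta name="robots" content="noindex"></head><body>secret</body></html>'
URL = "https://example.com/secret.php?token=abc123"

print(bot_decision("User-agent: *\nDisallow: /secret.php", URL, PAGE))
# not crawled; URL may still be indexed
print(bot_decision("User-agent: *\nDisallow:", URL, PAGE))
# crawled; not indexed
```

In the first call the `noindex` in the page is irrelevant because the bot never gets to read it; only when crawling is allowed does the directive take effect.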

¹ There are countless ways search engines might find a link. For example, a user who visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.

If your page isn't discoverable then it will not be indexed.

By "discoverable" we mean:

  1. it is a standard web page, i.e. index.*
  2. it is referenced by a link, either one of yours or one from another site

So in your case, by using the GET parameter for access, you avoid 1 but not necessarily 2, since someone may link to that URL and thereby expose the "hidden" page.

You can use the robots.txt that you gave; in that case the page will not be crawled by bots that respect it (not all do). Of course, keeping the page out of search results doesn't mean that the "hidden" URL will not be out in the wild.

Another issue, depending on your requirements, is that you use unencrypted HTTP, which means that your "hidden" URLs and the content of the pages are visible to every intermediary between your server and the user.

Apart from search engines, be aware that certain services fetch and cache content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL, try to extract metadata, and maybe cache it if applicable. Of course this scenario does not expose your URL to the public, but it does expose it, and with it the content that you have "hidden", to the systems of those services.

UPDATE: Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case the hidden page's URL will show up as the referrer in the logs of the server that hosts the linked page and thus become visible; the same applies to Google Analytics etc. So if you want to stay hidden, do not link to other pages from the hidden page.
