robots.txt

Ethics of robots.txt [closed]

核能气质少年 submitted on 2019-11-30 10:56:53
Question: Closed as opinion-based; it is not currently accepting answers (closed 6 years ago). I have a serious question: is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind: if someone puts a web site up, they're expecting some visits. Granted, web crawlers are using bandwidth without …

Which is the best programming language to write a web bot? [closed]

喜你入骨 submitted on 2019-11-30 10:08:53
I want to know which programming language provides a good number of libraries for programming a web bot — something like crawling a web page for data. Say I want to fetch the weather from the weather.yahoo.com website. Also, will the answer be the same for an AI desktop bot? Here is how you could do it in Python (Python 2, with BeautifulSoup 3):

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urlopen("http://weather.yahoo.com/").read())
for x in soup.find(attrs={"id": "myLocContainer"}).findAll("li"):
    print x.a["title"], x.em.contents

Prints:

Full forecast for Chicago, Illinois, United States (Haze) [u'35...47 °F']
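For reference, a minimal Python 3 sketch of the same idea, using urllib.request and BeautifulSoup 4 (pip install beautifulsoup4). The "myLocContainer" id is taken from the answer above and the Yahoo page layout has almost certainly changed since, so treat it as a placeholder:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://weather.yahoo.com/").read()
soup = BeautifulSoup(html, "html.parser")

container = soup.find(attrs={"id": "myLocContainer"})
if container is None:
    print("Expected element not found; the page layout has changed.")
else:
    for item in container.find_all("li"):
        # Mirror the original: print the link title and the <em> text.
        print(item.a["title"], item.em.get_text())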

Ethics of robots.txt [closed]

余生颓废 submitted on 2019-11-29 22:55:19
I have a serious question. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind: If someone puts a web site up they're expecting some visits. Granted, web crawlers are using bandwidth without clicking on ads that may support the site, but the site owner is putting their site on the web, so how reasonable is it for them to expect that they'll never get visited by a bot? Some sites apparently use a robots.txt exactly in order to keep their site from being crawled by Google or some other utility that might grab …

Which is the best programming language to write a web bot? [closed]

我只是一个虾纸丫 submitted on 2019-11-29 15:18:14
Question: Closed 9 years ago; as it stands, this question is not a good fit for the Q&A format and will likely solicit debate rather than answers supported by facts, references, or expertise. I want to know which programming language provides a good number of libraries for programming a web bot — something like crawling a web page for …

Is the User-Agent line in robots.txt an exact match or a substring match?

百般思念 submitted on 2019-11-29 15:05:43
When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly against its own User-Agent, or does it attempt to match it as a substring of its User-Agent? Nothing I have read answers this explicitly. According to another StackOverflow thread it is an exact match. However, the RFC draft makes me believe that it is a substring match: for example, User-Agent: Google would match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC: The robot must obey the first record in /robots.txt that contains a User-Agent line whose …
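As an illustration only, here is a minimal Python sketch of the substring interpretation the question describes: a record applies if the token from its User-Agent line appears, case-insensitively, as a substring of the crawler's own name. Real crawlers implement their own matching rules, so this models the question, not any particular bot:

def record_applies(record_user_agent, crawler_name):
    # Substring interpretation: "Google" applies to "Googlebot" and "Googlebot-News".
    return record_user_agent.lower() in crawler_name.lower()

for crawler in ("Googlebot", "Googlebot-News", "Bingbot"):
    print(crawler, record_applies("Google", crawler))
# Googlebot True
# Googlebot-News True
# Bingbot False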

How to add `nofollow, noindex` to all pages in robots.txt?

ⅰ亾dé卋堺 submitted on 2019-11-29 14:31:57
I want to add nofollow and noindex to my site whilst it's being built. The client has requested I use these rules. I am aware of <meta name="robots" content="noindex,nofollow">, but I only have access to the robots.txt file. Does anyone know the correct format I can use to apply noindex, nofollow rules via the robots.txt file? If you do not want search engines to crawl your site, simply put this in robots.txt:

User-agent: *
Disallow: /

Note that this blocks crawling rather than indexing, which is not quite the same as noindex and nofollow. There is a non-standard Noindex field, which Google (and likely no other consumer) supported as …
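For illustration, a robots.txt that combines the standard crawl block with the non-standard Noindex field mentioned above; the Noindex line was never part of the standard, and Google announced in 2019 that it no longer honors it, so don't rely on it:

# Standard: block crawling of the whole site
User-agent: *
Disallow: /
# Non-standard; Google dropped support for this in 2019
Noindex: /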

Stopping indexing of GitHub Pages

一笑奈何 submitted on 2019-11-29 06:54:30
Question: I have a GitHub page from my repository, username.github.io. However, I do not want Google to crawl my website, and I absolutely do not want it to show up in search results. Will just using robots.txt in GitHub Pages work? I know there are tutorials for stopping indexing of a GitHub repository, but what about the actual GitHub page? Answer 1: Will just using robots.txt in GitHub Pages work? If you're using the default GitHub Pages subdomain, then no, because Google would check https://github.io/robots.txt only.

How to disallow all dynamic URLs in robots.txt [closed]

拟墨画扇 submitted on 2019-11-29 04:29:35
How to disallow all dynamic URLs in robots.txt?

Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

I want to disallow everything that starts with /?q=. The answer to your question is to use

Disallow: /?q=

The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.) According to the …
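As a quick sanity check, Python's standard-library urllib.robotparser can confirm what a single Disallow: /?q= rule blocks; the example.com URLs below are placeholders:

from urllib import robotparser

# Hypothetical robots.txt containing only the rule suggested above.
rules = """
User-agent: *
Disallow: /?q=
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Dynamic URLs starting with /?q= are blocked...
print(rp.can_fetch("*", "http://example.com/?q=admin/"))  # False
# ...while ordinary paths stay crawlable.
print(rp.can_fetch("*", "http://example.com/node/123"))   # True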

How do I configure nginx to redirect to a URL for robots.txt & sitemap.xml?

╄→гoц情女王★ submitted on 2019-11-28 22:37:57
Question: I am running nginx 0.6.32 as a proxy front-end for CouchDB. I have my robots.txt in the database, reachable as http://www.example.com/prod/_design/mydesign/robots.txt. I also have my sitemap.xml, which is dynamically generated, at a similar URL. I have tried the following config:

server {
    listen 80;
    server_name example.com;
    location / {
        if ($request_method = DELETE) {
            return 444;
        }
        if ($request_uri ~* "^/robots.txt") {
            rewrite ^/robots.txt http://www.example.com/prod/_design/mydesign/robots …
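One possible alternative, sketched here with exact-match location blocks instead of if; it assumes location / proxies to CouchDB as in the question, and the sitemap.xml design-document path is a guess modeled on the robots.txt one:

location = /robots.txt {
    rewrite ^ /prod/_design/mydesign/robots.txt last;
}

location = /sitemap.xml {
    # Assumed path; adjust to wherever the generated sitemap actually lives.
    rewrite ^ /prod/_design/mydesign/sitemap.xml last;
}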

What is the smartest way to handle robots.txt in Express?

删除回忆录丶 submitted on 2019-11-28 19:04:24
I'm currently working on an application built with Express (Node.js), and I want to know the smartest way to handle different robots.txt files for different environments (development, production). This is what I have right now, but I'm not convinced by the solution; I think it is dirty:

app.get '/robots.txt', (req, res) ->
  res.set 'Content-Type', 'text/plain'
  if app.settings.env == 'production'
    res.send 'User-agent: *\nDisallow: /signin\nDisallow: /signup\nDisallow: /signout\nSitemap: /sitemap.xml'
  else
    res.send 'User-agent: *\nDisallow: /'

(NB: it is CoffeeScript.) There should be a better way …