robots.txt

How to stop search engines from crawling the whole website?

非 Y 不嫁゛ submitted on 2020-01-12 03:13:08

Question: I want to stop search engines from crawling my whole website. I have a web application for members of a company to use. It is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful. So I want to add another layer of security (in theory) to try and prevent unauthorized access by totally removing access to it for all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the …
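
For reference, the standard way to ask every compliant crawler to stay out of an entire site is a two-line robots.txt in the document root (a minimal sketch; robots.txt is advisory only, so it adds obscurity rather than real access control):

    # Block every crawler that honours the Robots Exclusion Protocol
    User-agent: *
    Disallow: /

Well-behaved bots honour it, but it does not stop malicious crawlers; genuine protection still requires authentication or network-level restrictions, and an X-Robots-Tag: noindex response header keeps already-discovered URLs out of search results.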

Python, Mechanize - request disallowed by robots.txt even after set_handle_robots and add_headers

匆匆过客 submitted on 2020-01-11 10:01:09

Question: I have made a web crawler which gets all links up to the first level of a page, and from them it gets all links and text plus image links and alt text. Here is the whole code:

    import urllib
    import re
    import time
    from threading import Thread
    import MySQLdb
    import mechanize
    import readability
    from bs4 import BeautifulSoup
    from readability.readability import Document
    import urlparse

    url = ["http://sparkbrowser.com"]
    i = 0
    while i < len(url):
        counterArray = [0]
        levelLinks = []
        linkText = ["homepage"]
        levelLinks = []
    …
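
A minimal sketch of how robots.txt handling is usually switched off in mechanize, for comparison with the code above (Python 2 style to match the question's urllib/urlparse imports; the user-agent string is an arbitrary example):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # do not fetch or obey robots.txt
    br.addheaders = [('User-agent',  # present a browser-like identity
                      'Mozilla/5.0 (X11; Linux x86_64)')]
    response = br.open("http://sparkbrowser.com")
    html = response.read()

If a "request disallowed by robots.txt" error still appears, it often means a different Browser (or urllib2 opener) instance without these settings is doing the actual fetching.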

Disallow all for all user agents except one user agent?

余生长醉 submitted on 2020-01-07 08:08:24

Question: How do I disallow everything for all user agents except one user agent? For example, disallow everything for every user agent, but allow Googlebot only?

Answer 1:

    User-agent: *
    Disallow: /

    User-agent: google
    Allow: /

This sample robots.txt tells crawlers that if they are not with Google, then it is preferred they don't crawl your site, while Google has been given the green light to crawl anything on the site. This file should be stored at www.example.com/robots.txt. Please read up on robots.txt. Source: https:/
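
A closely equivalent file that sticks to the original Disallow directive (Allow is an extension, although the major engines support it) and Google's documented crawler token would look roughly like this (a sketch, not part of the quoted answer):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow value means "nothing is disallowed"; because a crawler obeys the most specific group that matches its token, Googlebot follows its own group and ignores the catch-all one.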

how to bypass robots.txt with apache nutch 2.2.1

北城余情 submitted on 2020-01-07 06:46:10

Question: Can anyone please tell me if there is any way for Apache Nutch to ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that RobotRulesParser.java (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file to ignore robots.txt and go on with crawling? Or is there any other way to achieve the same? Answer 1: At first, we should respect the …

where to put robots.txt for a CodeIgniter

徘徊边缘 submitted on 2020-01-06 03:31:50

Question: Where do I place the robots.txt file in CodeIgniter? I don't know which folder to put it in.

    User-agent: *
    Disallow: /

Answer 1: The robots.txt file MUST be placed in the document root of the host. It will not work in other locations. If your host is example.com, it needs to be accessible at http://example.com/robots.txt. Source: https://stackoverflow.com/questions/30970833/where-to-put-robots-txt-for-a-codeigniter
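
In a typical CodeIgniter install that means putting the file next to index.php in the public web root, not inside application/ or system/ (an illustrative layout; your directory names may differ):

    /var/www/example.com/          <- document root
    ├── index.php
    ├── robots.txt                 <- place it here
    ├── application/
    └── system/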

React router v4 serve static file (robot.txt)

喜你入骨 submitted on 2020-01-04 02:05:10

Question: How can I serve my robots.txt file at the path www.domain.com/robots.txt? No server is used; it's only a frontend with React Router.

    robots.txt --> in root folder ./
    app.js     --> in src folder ./src/

    (...)
    export class App extends React.Component {
      render() {
        return (
          <div>
            <Switch>
              <Route exact path='/stuff' component={Stuff}/>
              <Route exact path='/' component={HomePage}/>
            </Switch>
          </div>
        )
      }
    }

If I test it locally, it works OK: localhost:4000/robots.txt opens the file properly in the browser.
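
The usual approach is to treat robots.txt as a static asset copied to the site root by the build tooling rather than as a route. In a Create React App-style project that is the public/ folder (an illustrative layout; folder names depend on your tooling):

    my-app/
    ├── public/
    │   ├── index.html
    │   └── robots.txt     <- served as-is at /robots.txt
    └── src/
        └── app.js

In production, the static host also needs to serve the real file before falling back to index.html for client-side routes.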

Subdomain disallow search bots via robots.txt

可紊 submitted on 2020-01-03 03:19:04

Question: I want to disallow search robots from accessing the entire domain, including subdomains, using robots.txt and potentially .htaccess. I want to make sure that any new subdomains in the future are blocked without having to create a robots.txt in the root of each subdomain every time. Is this possible? Answer 1: If you want to block robots via robots.txt, you'll have to create one for each subdomain. I suggest a script that monitors your zone file and then automatically creates one. Another solution is to use HTTP Basic Auth.
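
Another option, if all subdomains are served by the same Apache instance, is to alias /robots.txt on the wildcard virtual host to a single blocking file (a hedged sketch; the paths, port and ServerAlias pattern are assumptions, and mod_alias must be enabled):

    <VirtualHost *:80>
        ServerName example.com
        ServerAlias *.example.com
        # Every subdomain answers /robots.txt with the same blocking file
        Alias /robots.txt /var/www/shared/robots-disallow.txt
    </VirtualHost>

where robots-disallow.txt contains just "User-agent: *" and "Disallow: /".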

How to replace robots.txt with .htaccess

空扰寡人 submitted on 2020-01-02 10:09:18

Question: I have a situation where I have to remove my robots.txt file because I don't want any robot crawlers to get the links. I also want the links to remain accessible to users, and I don't want them to be cached by the search engines. I cannot add any user authentication, for various reasons. So I am thinking about using mod_rewrite to stop search engine crawlers from crawling the site while allowing everyone else to do so. The logic I am trying to implement is to write a condition to check if the …
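
A rough .htaccess sketch of that idea: refuse requests whose User-Agent matches well-known crawlers, and ask engines not to index or cache whatever they do manage to fetch (the bot list and header value are illustrative; mod_rewrite and mod_headers must be enabled):

    RewriteEngine On
    # Return 403 Forbidden to common crawler user agents
    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|duckduckbot|baiduspider) [NC]
    RewriteRule ^ - [F,L]

    # Discourage indexing and caching of pages that are fetched anyway
    Header set X-Robots-Tag "noindex, noarchive"

Note that user-agent filtering is easy to spoof, so this is obscurity rather than real access control.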

How to set Robots.txt or Apache to allow crawlers only at certain hours?

浪子不回头ぞ submitted on 2020-01-02 05:05:10

Question: As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during off-peak hours. Is there a method to achieve this?

Edit: Thanks for all the good advice. This is another solution we found. 2bits.com has an article on setting up an IPTables firewall to limit the number of connections from certain IP addresses. From the article, on the setting of IPTables: "Using connlimit: In newer Linux kernels, there is a connlimit module for iptables. It can be used …"
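
On the robots.txt side, one low-tech option is to swap the file on a schedule with cron (a hedged sketch; the paths and hours are assumptions, and crawlers may cache robots.txt for many hours, so the effect is approximate at best):

    # /etc/crontab entries (assumed file locations)
    # 08:00 - start of peak hours: publish the blocking version
    0 8  * * *  root  cp /var/www/conf/robots.block.txt /var/www/html/robots.txt
    # 20:00 - off-peak: publish the permissive version
    0 20 * * *  root  cp /var/www/conf/robots.allow.txt /var/www/html/robots.txt

Rate limiting at the firewall, as in the iptables connlimit approach quoted above, is generally more dependable because it does not rely on how often each bot re-reads robots.txt.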