robots.txt

Anybody got any C# code to parse robots.txt and evaluate URLs against it

[亡魂溺海] submitted on 2019-12-04 05:49:36
Short question: Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not? Long question: I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode. The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet. I would
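For reference, the check the asker describes is a straightforward parse-then-query flow. Below is a minimal Python sketch of that flow using the standard library's urllib.robotparser; the asker wants C#, so treat this only as an illustration of the logic, and the URLs are placeholders:

```python
# Minimal sketch: fetch and parse robots.txt, then test candidate URLs against it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file

for url in ["https://example.com/", "https://example.com/oprah"]:
    verdict = "allowed" if rp.can_fetch("*", url) else "excluded"
    print(url, verdict)
```

The same structure (fetch the file, group rules by User-agent, match each URL path against the Allow/Disallow rules) ports over to C# directly.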

How can I serve robots.txt on an SPA using React with Firebase hosting?

不打扰是莪最后的温柔 submitted on 2019-12-03 17:47:52
Question: I have an SPA built using create-react-app and wish to have a robots.txt like this: http://example.com/robots.txt I see on this page that: You need to make sure your server is configured to catch any URL after it's configured to serve from a directory. But for Firebase hosting, I'm not sure what to do. Answer 1: In my /public directory, I created a robots.txt. In my /src directory, I did the following: I created /src/index.js: import React from 'react' import ReactDOM from 'react-dom' import
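For context, Firebase Hosting serves an existing static file before it applies any rewrite rule, so a robots.txt placed in create-react-app's public/ folder (which is copied into build/ at build time) is returned as-is even with a catch-all SPA rewrite. A typical firebase.json for that setup looks roughly like this; the directory name assumes the default create-react-app build output:

```json
{
  "hosting": {
    "public": "build",
    "rewrites": [
      { "source": "**", "destination": "/index.html" }
    ]
  }
}
```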

Web Crawler - Ignore Robots.txt file?

风格不统一 submitted on 2019-12-03 16:32:08
Question: Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python. Answer 1: The documentation for mechanize has this sample code: br = mechanize.Browser() .... # Ignore robots.txt. Do not do this without thought and consideration. br.set_handle_robots(False) That does exactly what you want. Answer 2: This looks like what you need: from mechanize import Browser br =
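Putting the two (truncated) answers together, a minimal runnable sketch looks like this; mechanize must be installed and the target URL is a placeholder:

```python
# Sketch: a mechanize Browser that skips robots.txt handling entirely.
# Ignoring robots.txt can get your crawler blocked or violate a site's terms;
# do not do this without thought and consideration.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
response = br.open("http://example.com/")
print(response.geturl())
```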

Ban robots from website [closed]

坚强是说给别人听的谎言 submitted on 2019-12-03 15:35:59
My website is often down because a spider is accessing too many resources. This is what the hosting company told me. They told me to ban these IP addresses: 46.229.164.98 46.229.164.100 46.229.164.101 But I've no idea how to do this. I've googled a bit and I've now added these lines to .htaccess in the root: # allow all except those indicated here <Files *> order allow,deny allow from all deny from 46.229.164.98 deny from 46.229.164.100 deny from 46.229.164.101 </Files> Is this 100% correct? What could I do? Please help me. Really I don't have any idea about what I should do. Sharky: based on these
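The flattened .htaccess snippet above uses the Apache 2.2-style Order/Allow/Deny directives, which on Apache 2.4 only keep working through mod_access_compat. A sketch of the 2.4-native equivalent, using the IPs the host reported, would be:

```apache
# Apache 2.4+ (mod_authz_core): allow everyone except the listed IPs.
<RequireAll>
    Require all granted
    Require not ip 46.229.164.98
    Require not ip 46.229.164.100
    Require not ip 46.229.164.101
</RequireAll>
```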

Sitemap for a site with a large number of dynamic subdomains

房东的猫 submitted on 2019-12-03 13:33:43
Question: I'm running a site which allows users to create subdomains. I'd like to submit these user subdomains to search engines via sitemaps. However, according to the sitemaps protocol (and Google Webmaster Tools), a single sitemap can include URLs from a single host only. What is the best approach? At the moment I have the following structure: a sitemap index located at example.com/sitemap-index.xml that lists sitemaps for each subdomain (but located at the same host). Each subdomain has its own sitemap
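One commonly cited way around the single-host restriction is the cross-submission mechanism described at sitemaps.org: a sitemap hosted on the main domain may list a subdomain's URLs if that subdomain's robots.txt points to it. A sketch of what each dynamically created subdomain would serve (hostnames and file paths are illustrative):

```
# http://user1.example.com/robots.txt
User-agent: *
Disallow:

# Referencing the centrally hosted sitemap proves ownership for cross-host submission.
Sitemap: http://example.com/sitemaps/user1-example-com.xml
```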

How to set up a robots.txt which only allows the default page of a site

家住魔仙堡 submitted on 2019-12-03 08:35:24
Question: Say I have a site on http://example.com. I would really like to allow bots to see the home page, but any other page needs to be blocked, as it is pointless to spider. In other words, http://example.com & http://example.com/ should be allowed, but http://example.com/anything and http://example.com/someendpoint.aspx should be blocked. Further, it would be great if I could allow certain query strings to pass through to the home page: http://example.com?okparam=true but not http://example.com
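A pattern often suggested for this relies on the Allow directive and the $ end-of-URL anchor, both of which are extensions supported by Googlebot-style matchers rather than part of the original robots.txt standard, so treat it as a sketch to confirm in a robots.txt tester:

```
User-agent: *
Allow: /$            # exactly the root URL
Allow: /?okparam=    # the root plus the one permitted query string
Disallow: /          # everything else
```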

robots.txt allow root only, disallow everything else?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-03 08:12:46
Question: I can't seem to get this to work, but it seems really basic. I want the domain root to be crawled: http://www.example.com But nothing else to be crawled, and all subdirectories are dynamic: http://www.example.com/* I tried User-agent: * Allow: / Disallow: /*/ but the Google webmaster test tool says all subdirectories are allowed. Anyone have a solution for this? Thanks :) Answer 1: According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow
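The answer excerpt points at Google's documented precedence rules. As a way to see why the usual "root only" recipe (Allow: /$ followed by Disallow: /) behaves as intended under longest-match precedence, here is a toy Python sketch; it is an illustration only, not Google's parser, and the recipe itself relies on the Allow and $ extensions:

```python
# Toy illustration of Google-style precedence ("longest matching rule wins"),
# showing why "Allow: /$" plus "Disallow: /" leaves only the root crawlable.
# '*' is a wildcard and '$' anchors the end of the path.
import re

RULES = [("allow", "/$"), ("disallow", "/")]

def to_regex(pattern):
    # Escape the pattern, then restore the two robots.txt metacharacters.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

def verdict(path):
    best_kind, best_len = "allow", -1  # nothing matched yet: allowed by default
    for kind, pattern in RULES:
        if to_regex(pattern).match(path) and len(pattern) > best_len:
            best_kind, best_len = kind, len(pattern)
    return best_kind

for path in ["/", "/anything", "/some/dir/"]:
    print(path, verdict(path))  # "/" -> allow, the others -> disallow
```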

robots.txt to disallow all pages except one? Do they override and cascade?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-03 08:12:23
Question: I want one page of my site to be crawled and no others. Also, if it's any different from the answer above, I would also like to know the syntax for disallowing everything but the root (index) of the website. # robots.txt for http://example.com/ User-agent: * Disallow: /style-guide Disallow: /splash Disallow: /etc Disallow: /etc Disallow: /etc Disallow: /etc Disallow: /etc Or can I do it like this? # robots.txt for http://example.com/ User-agent: * Disallow: / Allow: /under-construction Also I
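On the second form: for Googlebot-style matchers the most specific (longest) matching rule wins, so Allow: /under-construction overrides the blanket Disallow: / and only that path stays crawlable. Some older parsers apply the first matching rule instead, so listing the Allow line first keeps the behaviour consistent either way. A sketch using the asker's paths:

```
User-agent: *
Allow: /under-construction
Disallow: /
```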

Django serving robots.txt efficiently

北城余情 submitted on 2019-12-03 07:58:37
Question: Here is my current method of serving robots.txt: url(r'^robots\.txt/$', TemplateView.as_view(template_name='robots.txt', content_type='text/plain')), I don't think that this is the best way. I think it would be better if it were just a pure static resource and served statically. But the way my Django app is structured is that the static root and all subsequent static files are located at http://my.domain.com/static/stuff-here Any thoughts? I'm an amateur at Django, but TemplateView.as_view
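If rendering a template feels like overkill, one lightweight alternative is to return the file's contents straight from a plain view. A minimal sketch assuming Django 2.0+ path() routing (the Disallow line is just an illustrative body), with the usual caveat that many production setups simply let nginx/Apache serve robots.txt directly from disk:

```python
# urls.py -- serve robots.txt from memory, bypassing the template engine.
from django.http import HttpResponse
from django.urls import path

ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"

urlpatterns = [
    path(
        "robots.txt",
        lambda request: HttpResponse(ROBOTS_TXT, content_type="text/plain"),
    ),
]
```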

Web Crawler - Ignore Robots.txt file?

梦想的初衷 submitted on 2019-12-03 06:42:32
Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python. The documentation for mechanize has this sample code: br = mechanize.Browser() .... # Ignore robots.txt. Do not do this without thought and consideration. br.set_handle_robots(False) That does exactly what you want. This looks like what you need: from mechanize import Browser br = Browser() # Ignore robots.txt br.set_handle_robots(False) but make sure you know what you're doing… Source: https:/