robots.txt

Anybody got any C# code to parse robots.txt and evaluate URLs against it

[亡魂溺海] submitted on 2019-12-04 05:49:36
Short question: Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not? Long question: I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode. The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet. I would
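For reference, the check the asker describes is a straightforward parse-then-query flow. Below is a minimal Python sketch of that flow using the standard library's urllib.robotparser; the asker wants C#, so treat this only as an illustration of the logic, and the URLs are placeholders:

```python
# Minimal sketch: fetch and parse robots.txt, then test candidate URLs against it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file

for url in ["https://example.com/", "https://example.com/oprah"]:
    verdict = "allowed" if rp.can_fetch("*", url) else "excluded"
    print(url, verdict)
```

The same structure (fetch the file, group rules by User-agent, match each URL path against the Allow/Disallow rules) ports over to C# directly.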

How can I serve robots.txt on an SPA using React with Firebase hosting?

不打扰是莪最后的温柔 submitted on 2019-12-03 17:47:52
Question: I have an SPA built using create-react-app and wish to have a robots.txt like this: http://example.com/robots.txt I see on this page that: You need to make sure your server is configured to catch any URL after it's configured to serve from a directory. But for Firebase hosting, I'm not sure what to do. Answer 1: In my /public directory, I created a robots.txt. In my /src directory, I did the following: I created /src/index.js: import React from 'react' import ReactDOM from 'react-dom' import
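For context, Firebase Hosting serves an existing static file before it applies any rewrite rule, so a robots.txt placed in create-react-app's public/ folder (which is copied into build/ at build time) is returned as-is even with a catch-all SPA rewrite. A typical firebase.json for that setup looks roughly like this; the directory name assumes the default create-react-app build output:

```json
{
  "hosting": {
    "public": "build",
    "rewrites": [
      { "source": "**", "destination": "/index.html" }
    ]
  }
}
```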

Web Crawler - Ignore Robots.txt file?

风格不统一 submitted on 2019-12-03 16:32:08
Question: Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python. Answer 1: The documentation for mechanize has this sample code: br = mechanize.Browser() .... # Ignore robots.txt. Do not do this without thought and consideration. br.set_handle_robots(False) That does exactly what you want. Answer 2: This looks like what you need: from mechanize import Browser br =
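Putting the two (truncated) answers together, a minimal runnable sketch looks like this; mechanize must be installed and the target URL is a placeholder:

```python
# Sketch: a mechanize Browser that skips robots.txt handling entirely.
# Ignoring robots.txt can get your crawler blocked or violate a site's terms;
# do not do this without thought and consideration.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
response = br.open("http://example.com/")
print(response.geturl())
```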

Ban robots from website [closed]

坚强是说给别人听的谎言 submitted on 2019-12-03 15:35:59
My website is often down because a spider is accessing too many resources. This is what the hosting company told me. They told me to ban these IP addresses: 46.229.164.98 46.229.164.100 46.229.164.101 But I've no idea how to do this. I've googled a bit and I've now added these lines to .htaccess in the root: # allow all except those indicated here <Files *> order allow,deny allow from all deny from 46.229.164.98 deny from 46.229.164.100 deny from 46.229.164.101 </Files> Is this 100% correct? What could I do? Please help me. Really I don't have any idea about what I should do. Sharky: based on these
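The flattened .htaccess snippet above uses the Apache 2.2-style Order/Allow/Deny directives, which on Apache 2.4 only keep working through mod_access_compat. A sketch of the 2.4-native equivalent, using the IPs the host reported, would be:

```apache
# Apache 2.4+ (mod_authz_core): allow everyone except the listed IPs.
<RequireAll>
    Require all granted
    Require not ip 46.229.164.98
    Require not ip 46.229.164.100
    Require not ip 46.229.164.101
</RequireAll>
```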

Sitemap for a site with a large number of dynamic subdomains

房东的猫 submitted on 2019-12-03 13:33:43
Question: I'm running a site which allows users to create subdomains. I'd like to submit these user subdomains to search engines via sitemaps. However, according to the sitemaps protocol (and Google Webmaster Tools), a single sitemap can include URLs from a single host only. What is the best approach? At the moment I have the following structure: a sitemap index located at example.com/sitemap-index.xml that lists sitemaps for each subdomain (but located at the same host). Each subdomain has its own sitemap
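One commonly cited way around the single-host restriction is the cross-submission mechanism described at sitemaps.org: a sitemap hosted on the main domain may list a subdomain's URLs if that subdomain's robots.txt points to it. A sketch of what each dynamically created subdomain would serve (hostnames and file paths are illustrative):

```
# http://user1.example.com/robots.txt
User-agent: *
Disallow:

# Referencing the centrally hosted sitemap proves ownership for cross-host submission.
Sitemap: http://example.com/sitemaps/user1-example-com.xml
```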

How to set up a robots.txt which only allows the default page of a site

家住魔仙堡 submitted on 2019-12-03 08:35:24
Question: Say I have a site on http://example.com. I would really like to allow bots to see the home page, but any other page needs to be blocked, as it is pointless to spider. In other words, http://example.com & http://example.com/ should be allowed, but http://example.com/anything and http://example.com/someendpoint.aspx should be blocked. Further, it would be great if I could allow certain query strings to pass through to the home page: http://example.com?okparam=true but not http://example.com
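A pattern often suggested for this relies on the Allow directive and the $ end-of-URL anchor, both of which are extensions supported by Googlebot-style matchers rather than part of the original robots.txt standard, so treat it as a sketch to confirm in a robots.txt tester:

```
User-agent: *
Allow: /$            # exactly the root URL
Allow: /?okparam=    # the root plus the one permitted query string
Disallow: /          # everything else
```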

robots.txt allow root only, disallow everything else?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-03 08:12:46
Question: I can't seem to get this to work, but it seems really basic. I want the domain root to be crawled: http://www.example.com But nothing else to be crawled, and all subdirectories are dynamic: http://www.example.com/* I tried User-agent: * Allow: / Disallow: /*/ but the Google webmaster test tool says all subdirectories are allowed. Anyone have a solution for this? Thanks :) Answer 1: According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow
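The answer excerpt points at Google's documented precedence rules. As a way to see why the usual "root only" recipe (Allow: /$ followed by Disallow: /) behaves as intended under longest-match precedence, here is a toy Python sketch; it is an illustration only, not Google's parser, and the recipe itself relies on the Allow and $ extensions:

```python
# Toy illustration of Google-style precedence ("longest matching rule wins"),
# showing why "Allow: /$" plus "Disallow: /" leaves only the root crawlable.
# '*' is a wildcard and '$' anchors the end of the path.
import re

RULES = [("allow", "/$"), ("disallow", "/")]

def to_regex(pattern):
    # Escape the pattern, then restore the two robots.txt metacharacters.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

def verdict(path):
    best_kind, best_len = "allow", -1  # nothing matched yet: allowed by default
    for kind, pattern in RULES:
        if to_regex(pattern).match(path) and len(pattern) > best_len:
            best_kind, best_len = kind, len(pattern)
    return best_kind

for path in ["/", "/anything", "/some/dir/"]:
    print(path, verdict(path))  # "/" -> allow, the others -> disallow
```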

robots.txt to disallow all pages except one? Do they override and cascade?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-03 08:12:23
Question: I want one page of my site to be crawled and no others. Also, if it's any different from the answer above, I would also like to know the syntax for disallowing everything but the root (index) of the website. # robots.txt for http://example.com/ User-agent: * Disallow: /style-guide Disallow: /splash Disallow: /etc Disallow: /etc Disallow: /etc Disallow: /etc Disallow: /etc Or can I do it like this? # robots.txt for http://example.com/ User-agent: * Disallow: / Allow: /under-construction Also I
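On the second form: for Googlebot-style matchers the most specific (longest) matching rule wins, so Allow: /under-construction overrides the blanket Disallow: / and only that path stays crawlable. Some older parsers apply the first matching rule instead, so listing the Allow line first keeps the behaviour consistent either way. A sketch using the asker's paths:

```
User-agent: *
Allow: /under-construction
Disallow: /
```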

Django serving robots.txt efficiently

北城余情 submitted on 2019-12-03 07:58:37
Question: Here is my current method of serving robots.txt: url(r'^robots\.txt/$', TemplateView.as_view(template_name='robots.txt', content_type='text/plain')), I don't think that this is the best way. I think it would be better if it were just a pure static resource and served statically. But the way my Django app is structured is that the static root and all subsequent static files are located at http://my.domain.com/static/stuff-here Any thoughts? I'm an amateur at Django, but TemplateView.as_view
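If rendering a template feels like overkill, one lightweight alternative is to return the file's contents straight from a plain view. A minimal sketch assuming Django 2.0+ path() routing (the Disallow line is just an illustrative body), with the usual caveat that many production setups simply let nginx/Apache serve robots.txt directly from disk:

```python
# urls.py -- serve robots.txt from memory, bypassing the template engine.
from django.http import HttpResponse
from django.urls import path

ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"

urlpatterns = [
    path(
        "robots.txt",
        lambda request: HttpResponse(ROBOTS_TXT, content_type="text/plain"),
    ),
]
```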

Web Crawler - Ignore Robots.txt file?

梦想的初衷 submitted on 2019-12-03 06:42:32
Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python. The documentation for mechanize has this sample code: br = mechanize.Browser() .... # Ignore robots.txt. Do not do this without thought and consideration. br.set_handle_robots(False) That does exactly what you want. This looks like what you need: from mechanize import Browser br = Browser() # Ignore robots.txt br.set_handle_robots(False) but make sure you know what you're doing… Source: https:/