search-engine

Designing a web crawler

Submitted by 半城伤御伤魂 on 2019-11-27 04:09:24
Question: I have come across the interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it. How does it all begin? Say Google started with some hub pages, hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from page to page, does it keep a hash table of visited pages to make sure it doesn't follow them again? What if the …
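The visited-set idea from the question can be sketched in a few lines. This is a minimal illustration, not how Google actually does it: `fetch_links` is a hypothetical callback that returns the links found on a page, and the normalization step exists so that trivially different spellings of the same URL (fragment, case of the host) hash to the same key.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so trivially different forms hash the same:
    drop the #fragment, lowercase scheme and host, default the path."""
    url, _frag = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl that never revisits a normalized URL."""
    seen = set()                      # the "hash table" of visited pages
    frontier = [normalize(u) for u in seeds]
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue                  # already crawled: this breaks cycles
        seen.add(url)
        order.append(url)
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:
                frontier.append(link)
    return order
```

Even if two pages link to each other (a cycle), the `seen` set guarantees each page is fetched at most once, and `max_pages` caps the crawl as a second line of defense.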

Redirect to 404 page or display 404 message?

Submitted by 戏子无情 on 2019-11-27 03:33:01
Question: I am using a CMS, and file-not-found errors can be handled in different ways:
1. The page is not redirected, but an error message is displayed as content (using the default layout with menu/footer).
2. The page is redirected to error.php (the page looks the same as in 1., but the address changes).
3. The page is redirected to an existing page, e.g. sitemap.php.
Is one method to be preferred with regard to search engines, or does it make no difference? Answer 1: If it's not found, then you …
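Option 1 above, serving the error content at the original URL with a real 404 status and no redirect, is what lets crawlers drop the dead URL from their index. A minimal WSGI sketch of that behavior (the `PAGES` dict is a hypothetical stand-in for the CMS's router):

```python
# Hypothetical page table; in a real CMS this would be the router or DB.
PAGES = {"/": b"<h1>Home</h1>", "/sitemap.php": b"<h1>Sitemap</h1>"}

def app(environ, start_response):
    """Serve the error page in place: real 404 status, no redirect,
    so search engines see the URL itself as dead."""
    path = environ.get("PATH_INFO", "/")
    if path in PAGES:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [PAGES[path]]
    # Same layout/content as a normal page, but with a 404 status line.
    start_response("404 Not Found", [("Content-Type", "text/html")])
    return [b"<h1>Page not found</h1><p>Try the sitemap.</p>"]
```

Redirecting to error.php or sitemap.php instead would typically answer 302/200, which tells a crawler the missing URL is alive and worth keeping.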

How can I generate an API key for Baidu China for a website store locator?

Submitted by 萝らか妹 on 2019-11-27 02:57:19
Question: I have been asked by our developers to give them an API key for Baidu Maps so they can set up our on-site store locator, and I'm not really sure how to go about doing this. I tried to set up an account on Baidu, but it asked for a Chinese mobile number. Do I have to get one of these before I can get the key? And how easy is it to work out how to obtain the key once I've got an account? Can anyone advise on the best way to set this up? Thanks in advance! Answer 1: Update 2016: It now appears to be …

Can search engine spiders see content I add using jQuery?

Submitted by 落花浮王杯 on 2019-11-27 02:34:51
Question: I currently have something like this: <p class="test"></p> <script type="text/javascript"> $(document).ready(function() { $(".test").html("hey"); }); </script> Will search engines be able to spider the "hey" text? And if so, what method can I use to prevent that? Answer 1: Despite what is stated here in other answers, and totally contrary to Google's own FAQ, a Google employee named JohnMu recently answered a question in Google Groups about how the GoogleBot came to follow a non-existent URL. …
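A crawler that does not execute JavaScript only ever sees the markup exactly as the server sends it, so the "hey" string exists solely as a script literal, never as element text. A quick way to check what such a spider would index is to strip the script blocks and search the remainder (a crude sketch, not a real HTML sanitizer):

```python
import re

# The markup exactly as the server sends it (from the question above):
raw_html = '''<p class="test"></p>
<script type="text/javascript">
$(document).ready(function() { $(".test").html("hey"); });
</script>'''

def visible_text_contains(html, needle):
    """Crude check: drop <script>...</script> blocks, then search
    what a non-JS crawler would treat as page content."""
    without_scripts = re.sub(r"<script.*?</script>", "", html,
                             flags=re.DOTALL | re.IGNORECASE)
    return needle in without_scripts

print(visible_text_contains(raw_html, "hey"))   # text added by jQuery is absent
```

Whether modern Googlebot runs the script and sees the inserted text is exactly what the answer below debates; this snippet only shows what a non-executing spider gets.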

What is the difference between web-crawling and web-scraping? [duplicate]

Submitted by 孤街醉人 on 2019-11-26 23:59:45
Question: This question already has an answer here: crawler vs scraper (4 answers). Is there a difference between crawling and web scraping? If there is, what is the best method to use in order to collect some web data to supply a database for later use in a customised search engine? Answer 1: Crawling is essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so scrapers are coded quite …
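The contrast in the answer can be made concrete: a crawler follows links broadly, while a scraper pulls one specific field out of pages whose layout it already knows. A stdlib sketch of the scraping side, assuming a hypothetical price-comparison layout where prices sit in elements with `class="price"`:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Targeted scraper: collects only the text of elements whose
    class attribute is "price" (a hypothetical site layout)."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

page = ('<li><b>Widget</b> <span class="price">$9.99</span></li>'
        '<li><b>Gadget</b> <span class="price">$24.50</span></li>')
scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)   # ['$9.99', '$24.50']
```

A crawler, by contrast, would ignore the prices entirely and extract only the `href` attributes to decide where to go next.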

What's the best Django search app? [closed]

Submitted by 安稳与你 on 2019-11-26 23:45:35
Question: I'm building a Django project that needs search functionality, and until there's a django.contrib.search, I have to choose a search app. So, which is the best? By "best" I mean:
- easy to install / set up
- has a Django-friendly, or at least Python-friendly, API
- can perform reasonably complex searches
Here are some apps I've heard of; please suggest others if you know of any: djangosearch, django-sphinx. I'd also like to avoid using a third-party search engine (like Google SiteSearch), because some of the …

Which are the best alternatives to Lucene? [closed]

Submitted by 雨燕双飞 on 2019-11-26 22:22:04
Question: It may run on Unix, and it will be used for email searching (Dovecot, Postfix and maildir). Lucene is not a problem; I'm just analyzing some alternatives. Answer 1: For simple things, the native full-text search of your RDBMS:
- full-text search in PostgreSQL
- FTS2 in SQLite
- full-text search in MySQL
- Oracle Text in Oracle DB
- full-text search in Microsoft SQL Server
Answer 2: I would need to know what problems you're having with Lucene, but Xapian is worth a look. Answer 3: The ones I can come up with now are native …
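To show what "native full-text search of your RDBMS" looks like in the smallest case, here is an in-memory SQLite demo. It uses FTS5 (the current generation of the feature the answer calls FTS2); the mail data is made up, and it assumes your SQLite build was compiled with FTS5, which standard CPython builds normally are:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A virtual table whose columns are automatically full-text indexed.
con.execute("CREATE VIRTUAL TABLE mail USING fts5(subject, body)")
con.executemany(
    "INSERT INTO mail VALUES (?, ?)",
    [("Meeting moved", "The standup is now at 10am"),
     ("Lunch?", "Anyone up for noodles near the office?"),
     ("Re: standup", "10am works for me")],
)
# MATCH searches the inverted index; ORDER BY rank sorts by relevance.
hits = con.execute(
    "SELECT subject FROM mail WHERE mail MATCH 'standup' ORDER BY rank"
).fetchall()
print([s for (s,) in hits])
```

For a maildir-sized corpus this kind of embedded index is often enough, which is why the answer suggests trying the RDBMS route before reaching for a dedicated engine.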

How do I save the original HTML files with Apache Nutch?

Submitted by 蹲街弑〆低调 on 2019-11-26 21:40:27
Question: I'm new to search engines and web crawlers. I want to store all the original pages of a particular web site as HTML files, but with Apache Nutch I can only get the binary database files. How do I get the original HTML files with Nutch? Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.) Answer 1: Well, Nutch will write the crawled data in binary form, so if you want that saved in HTML format, you will …
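Since Nutch stores fetched content in its own segment format, one alternative when all you need is raw HTML on disk is a small standalone fetcher that mirrors pages as files. A hedged sketch of the file-layout half of that (fetching itself could use `urllib.request.urlopen(url).read()`; it is left out here so the example needs no network access):

```python
import pathlib
from urllib.parse import urlsplit

def url_to_path(url, root="mirror"):
    """Map a URL to a safe local file path, e.g.
    http://example.com/a/b  ->  mirror/example.com/a/b.html"""
    parts = urlsplit(url)
    path = parts.path.strip("/") or "index"
    if not path.endswith(".html"):
        path += ".html"
    return pathlib.Path(root) / parts.netloc / path

def save_page(url, html, root="mirror"):
    """Write one fetched page to disk as plain HTML."""
    dest = url_to_path(url, root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(html, encoding="utf-8")
    return dest
```

This loses Nutch's distribution and politeness machinery, of course; it only illustrates that "save the original HTML" is the easy part once you control the fetch loop.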

Block all bots/crawlers/spiders for a specific directory with .htaccess

Submitted by 老子叫甜甜 on 2019-11-26 18:59:56
I'm trying to block all bots/crawlers/spiders for a specific directory. How can I do that with .htaccess? I searched a little and found a solution based on blocking by user agent: RewriteCond %{HTTP_USER_AGENT} googlebot Now I would need more user agents (for all known bots), and the rule should apply only to my separate directory. I already have a robots.txt, but not all crawlers look at it. Blocking by IP address is not an option. Are there other solutions? I know about password protection, but I have to ask first whether that would be an option. Nevertheless, I'm looking for a …
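One common way to scope the rule to a single directory (assuming Apache with mod_rewrite enabled) is to put the `.htaccess` file inside that directory itself, so the rules apply only there. The user-agent list below is illustrative, not exhaustive; no list can cover every bot, which is why this complements rather than replaces robots.txt:

```apache
# /private-dir/.htaccess -- applies only to requests under this directory
RewriteEngine On
# Match a few well-known crawler user agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|baiduspider|yandex) [NC]
# Answer 403 Forbidden instead of serving the file
RewriteRule .* - [F]
```

Bots that spoof a browser user agent will still get through, which is the fundamental limit of any user-agent-based block and the reason password protection is the only airtight option.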

Do Google or other search engines execute JavaScript?

Submitted by 时间秒杀一切 on 2019-11-26 14:46:47
Question: I am just wondering whether Google or other search engines execute JavaScript on your web page. For example, if you set the title tag using JavaScript, does the Google search engine see that? Answer 1: There have been some experiments performed for SEO purposes which indicate that at least the big players (Google, for example) can and do follow some simple JavaScript. They avoid sneaky redirects and such, but some basic content manipulation does seem to get through. (I don't have a link handy for …