search-engine

Methods for preventing search engines from indexing irrelevant content on a page

这一生的挚爱 submitted on 2019-11-27 23:34:29
Question: I'm looking for ways to prevent parts of a page from being indexed. Specifically, I mean the comments on a page, since they heavily skew how entries are weighted depending on what users have written. As a result, a Google search of the site returns lots of irrelevant pages. Here are the options I'm considering so far: 1) Load comments using JavaScript to prevent search engines from seeing them. 2) Use user agent sniffing to simply not output comments for crawlers. 3) Use search engine-specific markup to hide parts of the page. This

Designing a web crawler

戏子无情 submitted on 2019-11-27 16:36:27
I came across an interview question, "If you were designing a web crawler, how would you avoid getting into infinite loops?", and I am trying to answer it. How does it all start? Say Google started with some hub pages, hundreds of them (how these hub pages were found in the first place is a separate sub-question). As Google follows links from one page to the next, does it maintain a hash table to make sure it doesn't revisit pages it has already visited? What if the same page has two names (URLs), as can happen these days with URL shorteners and the like? I have taken Google
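
The hash-table idea raised in the question can be sketched as follows. This is a minimal illustration rather than how any real search engine works: `normalize`, `crawl`, `FAKE_WEB`, and `fake_get_links` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access. URLs are normalized before they are checked against a visited set, so trivially different spellings of the same address are fetched only once.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form (hypothetical, deliberately simple)."""
    url, _fragment = urldefrag(url)          # "#section" never changes the page
    parts = urlsplit(url)
    path = parts.path or "/"                 # treat "" and "/" as the same path
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl that never visits the same normalized URL twice."""
    visited = set()                          # the "hash table" from the question
    queue = deque(normalize(u) for u in seeds)
    order = []
    while queue and len(order) < max_pages:  # hard cap as a second safety net
        url = queue.popleft()
        if url in visited:
            continue                         # already seen: this breaks cycles
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            link = normalize(link)
            if link not in visited:
                queue.append(link)
    return order


# Hypothetical link graph containing cycles and a differently spelled duplicate.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://EXAMPLE.com/b#top"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}


def fake_get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return FAKE_WEB.get(url, [])


if __name__ == "__main__":
    print(crawl(["http://example.com"], fake_get_links))
```

Normalization of this kind does not cover the URL-shortener case from the question, because a shortened URL and its target share no textual similarity. Handling that would presumably require following redirects to the final URL, or comparing a hash of the fetched content, before recording the page as visited.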

Which are the best alternatives to Lucene? [closed]

可紊 submitted on 2019-11-27 12:23:19
It may run on Unix and it will be used for email searching (Dovecot, Postfix and maildir). Lucene is not a problem; I'm just analyzing some alternatives.

vartec: For simple things, the native full-text search of your RDBMS: full text search in PostgreSQL; FTS2 in SQLite; full text search in MySQL; Oracle Text in Oracle DB; full text search in Microsoft SQL Server.

Would need to know what problems you're having with Lucene, but Xapian is worth a look.

The ones I can come up with now are native DBMS full-text indexing (MSSQL and MySQL both have implementations of it) as well as Sphinx http://www.sphinxsearch

Google-like Search Engine in PHP/mySQL [closed]

时光总嘲笑我的痴心妄想 submitted on 2019-11-27 10:21:02
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 5 years ago.

We have OCRed thousands of pages of newspaper articles. The newspaper, issue, date, page number and OCRed text of each page have been put into a MySQL database. We now want to build a Google-like search engine in PHP to find the pages given a query. It's got to be fast, and take no

Do Google or other search engines execute JavaScript?

天大地大妈咪最大 submitted on 2019-11-27 09:30:27
I am just wondering if Google or other search engines execute JavaScript on your web page. For example, if you set the title tag using JavaScript, does the Google search engine see that?

David: There have been some experiments performed for SEO purposes which indicate that at least the big players (Google, for example) can and do follow some simple JavaScript. They avoid sneaky redirects and such, but some basic content manipulation does seem to get through. (I don't have a link handy for Google themselves confirming or denying this, it's just various posts I've come across when dealing with

Building a web search engine [closed]

家住魔仙堡 submitted on 2019-11-27 09:01:32
Question: I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects? I understand it's a huge undertaking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a subset of sites that I might be interested in.

Answer 1: There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks,

Can search engines index JavaScript generated web pages?

帅比萌擦擦* submitted on 2019-11-27 08:39:05
Can search engines such as Google index JavaScript-generated web pages? When you right-click and select View Source on a page that is generated by JavaScript (e.g. using GWT), you do not see the dynamically generated HTML. I suppose that if a search engine also cannot see the generated HTML, then there is not much to index, right?

Your suspicion is correct - JS-generated content cannot be relied on to be visible to search bots. It also can't be seen by anyone with JS turned off - and, last time I added some tests to a site I was working on (which was a large, mainstream-audience site, with

Marking up a search result list with HTML5 semantics

南笙酒味 submitted on 2019-11-27 08:36:47
Making a search result list (like in Google) is not very hard if you just need something that works. Now, however, I want to do it perfectly, using the benefits of HTML5 semantics. The goal is to define the de facto way of marking up a search result list that could potentially be used by any future search engine. For each hit, I want to: order them by increasing number; display a clickable title; show a short summary; display additional data like categories, publishing date and file size. My first idea is something like this: <ol> <li> <article> <header> <h1> <a href="url-to-the-page.html">

How to prevent search engines from indexing a single page of my website?

故事扮演 submitted on 2019-11-27 05:39:52
Question: I don't want search engines to index my imprint page. How can I do that?

Answer 1: You need a simple robots.txt file. Basically, it's a text file that tells search engines not to crawl particular pages. You don't need to include it in the header of your page; as long as it's in the root directory of your website, it will be picked up by crawlers. Create it in the root folder of your website and put the following text in:

    User-Agent: *
    Disallow: /imprint-page.htm

Note that you'd replace
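
A rule like the one quoted in the answer can be checked locally before it is deployed. The sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed`, the `ROBOTS_TXT` constant, and the `example.com` URLs are assumptions made for illustration. It shows only how a crawler that honors robots.txt would interpret the rule, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The two directives from the answer, one per line as robots.txt requires.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /imprint-page.htm
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse in memory, no HTTP request
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for page in ("https://example.com/imprint-page.htm",
                 "https://example.com/index.html"):
        print(page, is_allowed(ROBOTS_TXT, "Googlebot", page))
```

With the rule above, the imprint page is reported as blocked and every other path as allowed, for any user agent, because `User-Agent: *` applies to all crawlers.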

What is the fastest substring search method in Java?

↘锁芯ラ submitted on 2019-11-27 04:37:24
Question: I need to implement a way to search for substrings (needles) in a list of strings (haystack) using Java. More specifically, my app has a list of user profiles. If I type some letters, for example "Ja", and then search, all the users whose names contain "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu". As far as I know, Java has 3 built-in ways to search for a substring in a string: string.contains(), string.indexOf(), and regular expressions. it is