I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.
Testing M
The basic data structure for a text retrieval system is an Inverted Index. This is essentially a list of words found in the document collection with a list of the documents they occur in. It can also have metadata about the occurrence for each document, such as the number of times the word appears.
Documents containing the words can be queried by matching on the search terms. To determine relevance, a heuristic known as a Cosine Ranking is calculated on the hits. This works by constructing n-dimensional vector with one component for each of the n search terms. You can also weight the search terms if desired. This vector gives a point in n-dimensional space that corresponds to your search terms.
A similar vector based on the weighted occurrences in each document can be constructed from the inverted index with each axis in the vector corresponding with the axis for each search term. If you calculate a dot product of these vectors you get the cosine of the angle between them. 1.0 is equivalent to cos (0), which would assume the vectors occupy a common line from the origin. The closer the vectors together, the smaller the angle and the closer the cosine is to 1.0.
If you sort the search results by the cosine (or bung them into a priority queue as mg does) you get the most relevant. Cleverer relevance algorithms tend to fiddle with the weights of the search terms, skewing the dot product in favour of terms with high relevance.
If you want to dig a little, Managing Gigabytes by Bell and Moffet discusses the internal architecture of text retrieval systems.