问题
In a pretty typical scenario, I have a 'Search' text box on my web application which has user input passed directly to a stored procedure which then uses full text indexing to search on two fields in two tables, which are joined using appropriate keys.
I am using the CONTAINS predicate to search the fields. Before passing the search string in, I do the following:
SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'
Changing the castle to "the*" OR "castle*", for example. This is necessary because I want people to be able to search on cas and get results for castle.
WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
The problem is that now that I have appended a wildcard to the end of each word, noise words (e.g. the) also have a wildcard appended and therefore no longer appear to get dropped. This means that a search for the castle will return items with words such as theatre etc.
Changing OR to AND was my first thought, but that appears to simply return no matches if a noise word is then used in the query.
All I am trying to achieve is to allow the user to enter multiple, space separated words that respresent either the entirety or a prefix of the words they are searching on, in any order - and drop noise words such as the from their input (otherwise when they search for the castle they get a big list of items with the result they need somewhere in the middle of the list.
I could go ahead and implement my own noise word removal procedure, but it seems like something that full text indexing ought to be able to handle.
Grateful for any help!
Jamie
回答1:
Noise words are stripped out before the indexing is stored. So it is impossible to write a query that searches on a stop word. If you REALLY want to enable this behavior, you need to edit the list of stop words. (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then re-build your index.
回答2:
I had the same question and after a thorough search I've come to the conclusion there's no good solution.
As a compromise, I'm implementing the brute force solution:
1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all the text in there.
2) Paste into a code file in the application, replace line breaks with "," to get a List initializer like this:
public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };
3) Before you submit the search string, break it into words and remove any words in the noise words like this:
List<string> goodWords = new List<string>();
string[] words = searchString.Split(' ');
foreach (string word in words)
{
if (!_noiseWords.Contains(word))
goodWords.Add(word);
}
Not an ideal solution, but should work as long as the noise words file doesn't change. Multiple language support would use a dictionary of lists by language.
回答3:
Here's a working function. The file noiseENU.txt
is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData
.
Public Function StripNoiseWords(ByVal s As String) As String
Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
Return Result
End Function
回答4:
You can also remove noise words BEFORE making the query. List of language id: http://msdn.microsoft.com/en-us/library/ms190303.aspx
Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)
Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String
Dim r As New System.Text.StringBuilder
Try
If inputText.Contains(CChar("""")) Then
r.Append(inputText)
Else
Using cn As New SqlConnection(cnStr)
Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
cn.Open()
Dim cmd As New SqlCommand(q, cn)
With cmd.Parameters
.Add(New SqlParameter("@q", """" & inputText & """"))
.Add(New SqlParameter("@l", languageID))
End With
Dim dr As SqlDataReader = cmd.ExecuteReader
While dr.Read
If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
r.Append(dr.Item("display_term").ToString)
r.Append(" ")
End If
End While
End Using
End If
Catch ex As Exception
' ...
End Try
Return r.ToString
End Function
回答5:
Similar to my approach.
While I wish to use Full Text Indexing for its ability to perform stemming, its speed and multiple word search and so on I'm actually only indexing a couple of nvarchar(100) fields in two tables. Each table will easily remain below 50,000 rows.
My solution was to remove all noise words from the text file and allow the indexer to compile an index including all words. It still only consists of a few thousand entries.
I then do the replace on spaces in the search string as described in my original post to get CONTAINS to work on multiple words, and to stem words individually.
Seems to work really well, but I will be keeping a close eye on performance.
来源:https://stackoverflow.com/questions/469516/dropping-noise-words-in-sql-server-2005-full-text-indexing