Question
I need to compare strings for a search mechanism on a web site. I use C#. I tried two ways:
consultants.Where(x =>
x.Description.ToLower().Contains(vm.Description.ToLower()));
and
consultants.Where(x =>
Regex.IsMatch(x.Description, vm.Description, RegexOptions.IgnoreCase));
Both work fine for all English characters. So if I search for, say, "english", that's no problem. But as soon as I try searching for a string that contains non-English characters, it doesn't work. For example, if I try searching for the word "språk" ("language" in Swedish) it returns nothing.
Why is that, and how can I solve it?
Answer 1:
Use
String.Equals(c, vm, StringComparison.OrdinalIgnoreCase)
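or
c.IndexOf(vm, StringComparison.OrdinalIgnoreCase) >= 0
Ordinal means a culture-independent comparison of the raw UTF-16 code units. OrdinalIgnoreCase does the same after mapping each character to uppercase with invariant-culture rules. IndexOf returns -1 when nothing is found, hence the >= 0 test.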
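A minimal sketch of how the ordinal approach could be wrapped for the search in the question. The class and method names (OrdinalSearch, ContainsOrdinalIgnoreCase, Filter) are made up for illustration, and plain strings stand in for the consultant descriptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalSearch
{
    // Hypothetical helper: true when 'text' contains 'term', ignoring case,
    // using ordinal (culture-independent) comparison.
    public static bool ContainsOrdinalIgnoreCase(string text, string term)
    {
        if (text == null || term == null) return false;
        // IndexOf returns -1 when there is no match, so test for >= 0.
        return text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Keeps only the descriptions that contain the search term.
    public static List<string> Filter(IEnumerable<string> descriptions, string term)
    {
        return descriptions.Where(d => ContainsOrdinalIgnoreCase(d, term)).ToList();
    }
}
```

Unlike ToLower().Contains(...), this allocates no lowercase copies of the strings and does not depend on the current culture of the web server thread.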
Answer 2:
To compare non-English characters properly, you should use the appropriate culture rules. string.Contains has no overload that accepts a StringComparer, so for a substring search you can take the CompareInfo of the Swedish culture and use its case-insensitive IndexOf instead:
// requires: using System.Globalization;
var swedish = new CultureInfo("sv-SE").CompareInfo;
consultants = consultants
.Where(x =>
swedish.IndexOf(x.Description, vm.Description, CompareOptions.IgnoreCase) >= 0
).ToList();
Answer 3:
Joel Spolsky has written an introduction to the character set problem. It is a very interesting read.
In short, the web page needs to tell you what character set it is using at the very beginning of the page. C# uses Unicode (UTF-16 encoded by default) for strings; an explanation of what that means can be found in C# in Depth.
Hope this helps.
Answer 4:
What are you searching: an XML file, a db4o file, a SQL database? The character encoding of your data store matters. With XML you can handle it by setting the file's UTF encoding. db4o is already safe because it works on objects. On the SQL side you have to set the character encoding yourself.
If your database holds the values as char(50) or varchar(50), it may lose characters that fall outside the column's code page. To store such characters you should use nchar or nvarchar in your SQL database. Do not forget to check your database's character encoding, even if it does not seem necessary.
Answer 5:
What kind of list are you working on? A plain list or an ORM? Use string.Compare() if it's a plain list.
Answer 6:
Indexing is a big part of searching. I think you would be best served by using something ready and solid, like Lucene or Solr.
If you still insist on searching with regexes on non-ASCII characters, you should probably learn more about Unicode categories and then use them to strip any accent marks before searching for that word in the text (for example, strip with \p{M}, the combining-mark category; \p{P} matches punctuation, should you want to strip that as well).
Note: You will also probably need to normalize your strings first. Use the FormD flag to decompose characters so the marks can be stripped; FormC composes them again and is the form to compare in afterwards.
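A minimal sketch of the strip-the-marks idea, assuming decomposition is done with NormalizationForm.FormD (the decomposing form) and that only combining marks (\p{M}) are removed. The names AccentFolding, StripMarks and ContainsIgnoringMarks are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentFolding
{
    // Matches any Unicode combining mark (general category M).
    private static readonly Regex CombiningMarks = new Regex(@"\p{M}", RegexOptions.Compiled);

    public static string StripMarks(string text)
    {
        // FormD splits "å" into "a" + U+030A so the mark can be removed on its own.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        string stripped = CombiningMarks.Replace(decomposed, string.Empty);
        // Recompose whatever is left into the usual composed form.
        return stripped.Normalize(NormalizationForm.FormC);
    }

    // Case-insensitive substring test that also ignores accent marks on both sides.
    public static bool ContainsIgnoringMarks(string text, string term)
    {
        return StripMarks(text).IndexOf(StripMarks(term), StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this, a search for "sprak" would also find "språk". Whether that is desirable depends on the language: in Swedish, å and a are different letters, so folding them together trades precision for recall.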
Answer 7:
Thanks to all who offered suggestions, but unfortunately they seem to be irrelevant. As it turns out, Contains() has no problem with non-English characters at all. The problem was that the database field in question held HTML-encoded text, so I needed to use HtmlDecode to compare the strings in the controller:
if (vm.Description != "")
{
//HttpUtility.HtmlDecode needed because text in Description field is HtmlEncoded!
consultants = consultants.Where(x => HttpUtility.HtmlDecode(x.Description).ContainsCaseInsensitive(vm.Description)).ToList();
}
I discovered this because the Contains() code worked fine when searching another field with non-English characters.
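For readers who want to reproduce the effect outside the controller, here is a small sketch. It swaps in System.Net.WebUtility.HtmlDecode so that it runs without a reference to System.Web, and the ContainsCaseInsensitive extension shown is only a guess at what the one used above does, since its implementation is not part of this answer. The class name HtmlAwareSearch and the DecodedContains method are hypothetical too.

```csharp
using System;
using System.Net;

public static class HtmlAwareSearch
{
    // Hypothetical stand-in for the ContainsCaseInsensitive extension used above.
    public static bool ContainsCaseInsensitive(this string text, string term)
    {
        return text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static bool DecodedContains(string storedHtml, string term)
    {
        // "spr&aring;k" and "spr&#229;k" both decode to "språk".
        string decoded = WebUtility.HtmlDecode(storedHtml);
        return decoded.ContainsCaseInsensitive(term);
    }
}
```

Searching the raw value "Svenska spr&amp;aring;k"-style text for "språk" fails because the stored string contains the entity rather than the letter å; after decoding, the same search succeeds. ASCII-only terms such as "english" are unaffected by the encoding, which matches the behaviour described in the question.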
Source: https://stackoverflow.com/questions/5578304/compare-strings-with-non-english-characters