Compare strings with non-English characters?

Posted by 笑着哭i on 2020-01-21 12:56:13

Question


I need to compare strings for a search mechanism on a web site. I use C#. I tried two ways:

consultants.Where(x => 
    x.Description.ToLower().Contains(vm.Description.ToLower()));

and

consultants.Where(x => 
    Regex.IsMatch(x.Description, vm.Description, RegexOptions.IgnoreCase));

Both work fine for all English characters. So if I search for, say, "english", that's no problem. But as soon as I try searching for a string that contains non-English characters, it doesn't work. For example, if I try searching for the word "språk" ("language" in Swedish) it returns nothing.

Why is that, and how can I solve it?


Answer 1:


Use

String.Equals(c, vm, StringComparison.OrdinalIgnoreCase)

or

c.IndexOf(vm, StringComparison.OrdinalIgnoreCase)

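Ordinal means a culture-independent comparison of the strings' UTF-16 code units; OrdinalIgnoreCase does the same after a culture-independent case conversion. Note that IndexOf returns -1 when there is no match, so test the result with >= 0.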
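As a minimal sketch of the IndexOf approach above: the class name, the helper name ContainsOrdinalIgnoreCase and the sample strings are made up for illustration and are not part of the original answer.

```csharp
using System;

public static class OrdinalSearchDemo
{
    // Hypothetical helper: true when `text` contains `term`, ignoring case,
    // using ordinal (culture-independent) comparison rules.
    public static bool ContainsOrdinalIgnoreCase(string text, string term)
    {
        if (text == null || term == null) return false;
        // IndexOf returns -1 when the term is not found.
        return text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Case differs, but the letters are the same: match.
        Console.WriteLine(ContainsOrdinalIgnoreCase("Svenska SPRÅK och kultur", "språk")); // True
        // "a" and "å" are different code units, so ordinal comparison does not match.
        Console.WriteLine(ContainsOrdinalIgnoreCase("Svenska språk", "sprak"));            // False
    }
}
```

Ordinal comparison never treats "å" and "a" as equal; if the search should be that lenient, a culture-aware or accent-stripping approach is needed instead.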




Answer 2:


To compare non-English characters properly, use the rules of the appropriate culture. For example, you can run a case-insensitive substring search through the Swedish culture's CompareInfo (string.Contains has no overload that accepts a StringComparer):

// Requires: using System.Globalization; using System.Linq;
var swedish = new CultureInfo("sv-SE").CompareInfo;

consultants = consultants
    .Where(x =>
        // IndexOf returns -1 when the search term is not found
        swedish.IndexOf(x.Description, vm.Description, CompareOptions.IgnoreCase) >= 0
    ).ToList();



Answer 3:


Here is an introduction to the character set problem by Joel Spolsky. A very interesting read.

In short, the web page needs to declare which character set it uses at the very beginning of the page. C# uses Unicode (UTF-16 encoding) for strings; an explanation of what that means can be found in C# in Depth.
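To make the UTF-16 point concrete, here is a small sketch (the class name and sample word are illustrative only) that compares the length of a string in UTF-16 code units with its size in bytes under two encodings:

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    public static void Main()
    {
        // "språk", with å written as an escape so the source file's encoding does not matter.
        string word = "spr\u00E5k";

        Console.WriteLine(word.Length);                         // 5 (UTF-16 code units)
        Console.WriteLine(Encoding.UTF8.GetByteCount(word));    // 6 (å takes two bytes in UTF-8)
        Console.WriteLine(Encoding.Unicode.GetByteCount(word)); // 10 (two bytes per UTF-16 code unit)
    }
}
```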

Hope this will help you.




Answer 4:


What are you searching: an XML file, a db4o file, a SQL database? The character encoding of your data store is important. With XML you can handle it by setting the file's UTF encoding; db4o is already safe because it works on objects; on the SQL side you have to set the character encoding.

If your database holds the values as char(50) or varchar(50), it may lose characters outside the column's code page; to store such characters, use nchar or nvarchar in your SQL database. Also remember to check your database's character encoding, even if it is not strictly necessary.




Answer 5:


What kind of list are you working on? A plain list or an ORM? Use string.Compare() if it's a plain list.




Answer 6:


Indexing is a big part of searching. I think you would be best served by using something ready and solid, like Lucene or Solr.

If you still insist on searching with regexes on non-ASCII characters, you should probably learn more about Unicode categories and then use them to strip any accent marks (for example, with \p{M}, which matches combining marks; \p{P} matches punctuation) before searching for that word in the text.

Note: You will probably also need to normalize your strings with NormalizationForm.FormD first, so that accented letters are decomposed into a base letter plus combining marks that can then be stripped. FormC does the opposite and composes them.
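The decompose-and-strip idea can be sketched as follows. This is only one way to do it, under the assumption that folding "å" to "a" is acceptable for the search; the class and method names are invented for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentFolding
{
    // Hypothetical helper: removes combining marks (accents) from a string.
    public static string StripMarks(string input)
    {
        // FormD splits e.g. "å" into "a" followed by U+030A (combining ring above).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        // \p{M} matches every character in the Unicode "Mark" categories.
        string stripped = Regex.Replace(decomposed, @"\p{M}", "");
        // Recompose whatever is left into the usual composed form.
        return stripped.Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: accent- and case-insensitive substring test.
    public static bool ContainsIgnoringAccents(string text, string term)
    {
        return StripMarks(text)
            .IndexOf(StripMarks(term), StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        Console.WriteLine(StripMarks("spr\u00E5k"));                                // sprak
        Console.WriteLine(ContainsIgnoringAccents("Svenska spr\u00E5k", "SPRAK")); // True
    }
}
```

Letters that have no decomposition (for example "ø") pass through unchanged, so this is a leniency aid for search rather than a full transliteration.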




Answer 7:


Thanks to all who offered suggestions, but unfortunately they turned out to be irrelevant. Contains() has no problem with non-English characters at all. The problem was that the database field in question held HTML-encoded text, so I needed to call HtmlDecode before comparing the strings in the controller:

        if (vm.Description != "")
        {
            //HttpUtility.HtmlDecode needed because text in Description field is HtmlEncoded!
            consultants = consultants.Where(x => HttpUtility.HtmlDecode(x.Description).ContainsCaseInsensitive(vm.Description)).ToList();
        }

I discovered this because the Contains() code worked fine when searching another field with non-English characters.
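The effect can be reproduced in isolation. In this sketch, ContainsCaseInsensitive is a hypothetical stand-in for the extension method used above (its real implementation is not shown in the post), and WebUtility.HtmlDecode from System.Net is swapped in for HttpUtility.HtmlDecode so the example does not depend on System.Web; the stored value is an invented sample.

```csharp
using System;
using System.Net;

public static class HtmlSearchDemo
{
    // Hypothetical stand-in for the ContainsCaseInsensitive extension used in the answer.
    public static bool ContainsCaseInsensitive(this string text, string term)
    {
        return text != null && term != null
            && text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Assumed sample of what the database field holds: "å" stored as an HTML entity.
        string stored = "Svenska spr&#229;k";

        // The raw field never contains the character "å", so the search fails.
        Console.WriteLine(stored.ContainsCaseInsensitive("spr\u00E5k"));                       // False

        // After decoding, the entity becomes "å" and the search succeeds.
        Console.WriteLine(WebUtility.HtmlDecode(stored).ContainsCaseInsensitive("SPR\u00C5K")); // True
    }
}
```

Decoding every row on each search runs in memory rather than in the database, so for larger tables it may be worth storing a decoded copy of the field instead.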



Source: https://stackoverflow.com/questions/5578304/compare-strings-with-non-english-characters
