Compare strings with non-English characters?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-02 01:25:16

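To compare non-English characters properly, use the comparison rules of the appropriate culture. For example, you could take the CompareInfo for Swedish and use its IndexOf method for a case-insensitive substring check. (string.Contains has no overload that accepts a StringComparer, so a comparer built with StringComparer.Create cannot be passed to it when Description is a string.)

var swedish = new CultureInfo("sv-SE").CompareInfo;

consultants = consultants
    .Where(x =>
        // >= 0 means vm.Description was found somewhere in x.Description
        swedish.IndexOf(x.Description, vm.Description, CompareOptions.IgnoreCase) >= 0
    ).ToList();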

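For a whole-string match, use

String.Equals(c, vm, StringComparison.OrdinalIgnoreCase)

or, for a substring match,

c.IndexOf(vm, StringComparison.OrdinalIgnoreCase) >= 0

Ordinal means a culture-independent comparison of the UTF-16 code units of the two strings, one by one. OrdinalIgnoreCase does the same after uppercasing both strings with invariant-culture rules.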
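As a rough illustration of the ordinal approach, here is a minimal sketch. The class and method names are made up for this example and are not part of the .NET library; only String.Equals and IndexOf with StringComparison.OrdinalIgnoreCase come from the framework.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class OrdinalSketch
{
    // Whole-string, case-insensitive, culture-independent equality.
    public static bool EqualsIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Substring test: IndexOf returns -1 when the value is not found,
    // so the result has to be compared against 0.
    public static bool ContainsIgnoreCase(string text, string value) =>
        text.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
}
```

Under these rules "Smörgåsbord" contains "SMÖRGÅS", because only case is ignored, but it does not contain "smorgas": ordinal comparison treats ö and o as different characters.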

Here is an introduction to the character set problem by Joel Spolsky. A very interesting read.

In short, the web page needs to declare which character set it uses at the very beginning of the page. C# uses Unicode (encoded as UTF-16) for strings; an explanation of what that means can be found in C# in Depth.

Hope this will help you.

What are you searching: an XML file, a db4o file, a SQL database? The character encoding of your data store is important. With XML you can handle it by setting the UTF encoding of the file; db4o is already safe because it works on objects; on the SQL side you have to set the character encoding yourself.

If your database holds values as char(50) or varchar(50), it may lose characters that the column's code page cannot represent. To store such characters, use nchar or nvarchar in your SQL database. Do not forget to check your database's character encoding as well, even if it rarely turns out to be necessary.

What kind of list are you working on? A plain list or an ORM? Use string.Compare() if it's a plain list.

Indexing is a big part of searching. I think you would be best served by using something ready and solid, like Lucene or Solr.

If you still insist on searching with regexes on non-ASCII characters, you should probably learn more about Unicode categories and then use them to strip any accent marks before searching for that word in the text (for example, strip the combining marks matched by \p{M}; \p{P} matches punctuation, which you may also want to remove).

Note: You will also probably need to normalize your strings with the FormD flag first (FormC composes characters, FormD decomposes them), so that accents become separate combining marks that can be stripped and searched more effectively.
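The decompose-then-strip idea could look roughly like the sketch below. The class and method names are hypothetical; String.Normalize, NormalizationForm and the \p{M} regex category are standard .NET. The sketch uses FormD, the decomposing form, before stripping and recomposes with FormC afterwards.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AccentSketch
{
    public static string StripAccents(string text)
    {
        // FormD splits "ö" into "o" plus a combining diaeresis (U+0308).
        string decomposed = text.Normalize(NormalizationForm.FormD);

        // \p{M} matches the combining marks left behind by the decomposition.
        string stripped = Regex.Replace(decomposed, @"\p{M}+", string.Empty);

        // Recompose whatever is left into the usual precomposed form.
        return stripped.Normalize(NormalizationForm.FormC);
    }
}
```

Both the stored text and the search term would need to go through the same function before comparing. Letters that have no canonical decomposition, such as ø, pass through unchanged.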

Thanks to all who offered suggestions, but unfortunately they turned out not to be relevant. Contains() has no problem with non-English characters at all. The problem was that the database field in question held HTML-encoded text, so I needed to use HtmlDecode before comparing the strings in the controller:

        if (vm.Description != "")
        {
            //HttpUtility.HtmlDecode needed because text in Description field is HtmlEncoded!
            consultants = consultants.Where(x => HttpUtility.HtmlDecode(x.Description).ContainsCaseInsensitive(vm.Description)).ToList();
        }

I discovered this because the Contains() code worked fine when searching another field with non-English characters.
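For readers who want to try this outside the controller, here is a small self-contained sketch. ContainsCaseInsensitive is not a framework method, so the version below is only a guess at what such an extension might look like. The sketch also swaps in System.Net.WebUtility.HtmlDecode for HttpUtility.HtmlDecode so that it does not depend on System.Web; DecodedContains is a hypothetical name.

```csharp
using System;
using System.Net;

// Hypothetical helpers, for illustration only.
public static class SearchSketch
{
    // Assumed shape of the ContainsCaseInsensitive extension used above.
    public static bool ContainsCaseInsensitive(this string source, string value) =>
        source != null && value != null &&
        source.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;

    // Decode entities such as &#246; back to "ö" before searching.
    public static bool DecodedContains(string htmlEncoded, string value) =>
        WebUtility.HtmlDecode(htmlEncoded ?? string.Empty)
            .ContainsCaseInsensitive(value);
}
```

Searching the raw stored value for "smörgås" fails because the field literally contains the entity text rather than the letters ö and å; decoding first makes the same search succeed.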
