What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

China☆狼群 提交于 2019-12-03 00:03:42

It's all a bit of a fudge really.

First the similarities

  • The SQL Server nchar/nvarchar/ntext data types store text as a string of 2-byte characters. It doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).
  • The CLR String data type also stores text as a string of 2-byte Chars. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).

Now the differences

  • .NET allows you to access the actual Unicode code points in a CLR string via the StringInfo class.
  • .NET has tons of support for encoding and decoding text data in a variety of encodings. When converting an arbitrary byte stream to a String, it will always encode the string as UTF-16 (with full multilingual plane support).

In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.

I don't expect that treating the text as UCS-2 would cause many problems.

Case conversions should not be a problem, because (AFAIK) there are no case mappings above the BMP (except the identity mapping, of course!), and, obviously, the surrogate characters are going to map to themselves.

Blanking every other character is just asking for trouble. In reality, doing these sorts of transformations without consideration of the character values is always a dangerous activity. I can see it happening legitimately with string truncations. But if any unmatched surrogates show up in the result, this itself is not a huge problem. Any system that receives such data—and cares—will probably just replace the unmatched surrogate with a replacement character, if it bothers to do anything about it at all.

Obviously, string length is going to be bytes/2 rather than number-of-characters, but number-of-characters is not a very useful value anyway, once you start plumbing the depths of the Unicode code charts. For example, you aren't going to get good results in monospaced display once you leave the ASCII range, because of combining characters, RTL languages, directional control characters, tags, and several kinds of space characters. The high code points are going to be the least of your problems.

Just to be on the safe side, you should probably store your cuneiform texts in a different column than the archeologist's names. :D

UPDATE now with empirical data!

I just ran a test to see what happens with case transformations. I created a string with the English word TEST in uppercase twice—first in Latin script, then in Deseret script. I applied a lower-case transformation to this string in .NET and in SQL Server.

The .NET version correctly lowercased all the letters in both scripts. The SQL Server version only lowercased the Latin characters and left the Deseret characters unchanged. This meets with expectations regarding the handling of UTF-16 verses UCS-2.

using System;
using System.Data.SqlClient;

class Program
{
    static void Main(string[] args)
    {
        string myDeseretText = "TEST\U00010413\U00010407\U0001041D\U00010413";
        string dotNetLower = myDeseretText.ToLower();
        string dbLower = LowercaseInDb(myDeseretText);

        Console.WriteLine("  Original: {0}", DisplayUtf16CodeUnits(myDeseretText));
        Console.WriteLine(".NET Lower: {0}", DisplayUtf16CodeUnits(dotNetLower));
        Console.WriteLine("  DB Lower: {0}", DisplayUtf16CodeUnits(dbLower));
        Console.ReadLine();
    }

    private static string LowercaseInDb(string value)
    {
        SqlConnectionStringBuilder connection = new SqlConnectionStringBuilder();
        connection.DataSource = "(local)";
        connection.IntegratedSecurity = true;
        using (SqlConnection conn = new SqlConnection(connection.ToString()))
        {
            conn.Open();
            string commandText = "SELECT LOWER(@myString) as LoweredString";
            using (SqlCommand comm = new SqlCommand(commandText, conn))
            {
                comm.CommandType = System.Data.CommandType.Text;
                comm.Parameters.Add("@myString", System.Data.SqlDbType.NVarChar, 100);
                comm.Parameters["@myString"].Value = value;
                using (SqlDataReader reader = comm.ExecuteReader())
                {
                    reader.Read();
                    return (string)reader["LoweredString"];
                }
            }
        }
    }

    private static string DisplayUtf16CodeUnits(string value)
    {
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        foreach (char c in value)
            sb.AppendFormat("{0:X4} ", (int)c);
        return sb.ToString();
    }
}

Output:

  Original: 0054 0045 0053 0054 D801 DC13 D801 DC07 D801 DC1D D801 DC13
.NET Lower: 0074 0065 0073 0074 D801 DC3B D801 DC2F D801 DC45 D801 DC3B
  DB Lower: 0074 0065 0073 0074 D801 DC13 D801 DC07 D801 DC1D D801 DC13

Just in case anyone has a Deseret font installed, here are the actual strings for your enjoyment:

  Original: TEST𐐓𐐇𐐝𐐓
.NET Lower: test𐐻𐐯𐑅𐐻
  DB Lower: test𐐓𐐇𐐝𐐓
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!