Best way to shorten UTF8 string based on byte length

感情败类 2020-12-10 12:14

A recent project called for importing data into an Oracle database. The program that will do this is a C# .NET 3.5 app and I'm using the Oracle.DataAccess connection libra

9 Answers
  •  渐次进展
    2020-12-10 12:34

    All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.

    public static string LimitByteLength(string message, int maxLength)
    {
        if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
        {
            return message;
        }
    
        var encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[maxLength];
        char[] messageChars = message.ToCharArray();
        encoder.Convert(
            chars: messageChars,
            charIndex: 0,
            charCount: messageChars.Length,
            bytes: buffer,
            byteIndex: 0,
            byteCount: buffer.Length,
            flush: false,
            charsUsed: out int charsUsed,
            bytesUsed: out int bytesUsed,
            completed: out bool completed);
    
        // I don't think we can return message.Substring(0, charsUsed)
        // as that's the number of UTF-16 chars, not the number of codepoints
        // (think about surrogate pairs). Therefore I think we need to
        // actually convert bytes back into a new string
        return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
    }
    

    If you're using .NET Standard 2.1+, you can simplify it a bit:

    public static string LimitByteLength(string message, int maxLength)
    {
        if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
        {
            return message;
        }
    
        var encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[maxLength];
        encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
        return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
    }
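
To see how the truncation behaves at a multi-byte boundary, here is a minimal, self-contained sketch. It repeats the first version of `LimitByteLength` from above inside a hypothetical `Utf8Truncation` class; the class name, the `Main` method, and the sample strings are assumptions added for illustration only. The point it tries to show is that the encoder stops before a character that would not fit completely, so the result can be shorter than `maxLength` bytes but should never end in a partial UTF-8 sequence.

```csharp
using System;
using System.Text;

// Hypothetical wrapper class, added only so the sketch compiles on its own.
public static class Utf8Truncation
{
    public static string LimitByteLength(string message, int maxLength)
    {
        // Nothing to do if the string is empty or already fits.
        if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
        {
            return message;
        }

        var encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[maxLength];
        char[] messageChars = message.ToCharArray();

        // Convert encodes as many whole characters as fit into the buffer.
        encoder.Convert(messageChars, 0, messageChars.Length,
                        buffer, 0, buffer.Length, false,
                        out int charsUsed, out int bytesUsed, out bool completed);

        // Decode only the bytes that were actually written.
        return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
    }

    public static void Main()
    {
        // "h\u00E9llo" is 6 bytes in UTF-8: U+00E9 takes 2 bytes, the rest 1 each.
        string sample = "h\u00E9llo";

        foreach (int limit in new[] { 1, 2, 3, 6 })
        {
            string cut = LimitByteLength(sample, limit);
            Console.WriteLine("limit={0} chars={1} bytes={2}",
                limit, cut.Length, Encoding.UTF8.GetByteCount(cut));
        }
    }
}
```

With a limit of 2, only one byte remains after the `h`, which is not enough for the two-byte `é`, so the result is the one-byte string `h` rather than a string ending in half a character. One caveat worth checking against the `Encoder.Convert` documentation: if `maxLength` is too small to hold even the first character, `Convert` is documented to throw an `ArgumentException`, so callers may want to guard against very small limits.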
    

    None of the other answers account for extended grapheme clusters, such as
