Best way to shorten UTF8 string based on byte length

感情败类 2020-12-10 12:14

A recent project called for importing data into an Oracle database. The program that will do this is a C# .NET 3.5 app and I'm using the Oracle.DataAccess connection library.

9 Answers
  • 2020-12-10 12:23

    Here are two possible solutions - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general recommendation. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).

    If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.

    public static String LimitByteLength(String input, Int32 maxLength)
    {
        // LINQ version: keep characters while the UTF-8 byte count of the
        // prefix stays within maxLength. Re-encodes the prefix on every
        // step, so it is O(n^2) in the worst case.
        return new String(input
            .TakeWhile((c, i) =>
                Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
            .ToArray());
    }
    
    public static String LimitByteLength2(String input, Int32 maxLength)
    {
        // Loop version: scan from the end and return the longest prefix
        // whose UTF-8 byte count fits into maxLength.
        for (Int32 i = input.Length - 1; i >= 0; i--)
        {
            if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
            {
                return input.Substring(0, i + 1);
            }
        }
    
        return String.Empty;
    }
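
    The accumulation idea floated above can be made to work, but only if you advance one code point at a time rather than one `char`: the halves of a UTF-16 surrogate pair, counted separately, each encode to the 3-byte replacement character instead of the 4 bytes the pair produces together. Here is a sketch of that approach (the method name `LimitByteLengthAccumulate` is mine, not from the answer):

    ```csharp
    using System;
    using System.Text;

    public static class Utf8Truncator
    {
        // Accumulates the UTF-8 byte length one code point at a time instead
        // of re-encoding the whole prefix on every iteration, making this
        // O(n) rather than O(n^2).
        public static string LimitByteLengthAccumulate(string input, int maxLength)
        {
            char[] chars = input.ToCharArray();
            int byteCount = 0;
            int i = 0;
            while (i < chars.Length)
            {
                // Take a full code point: 2 UTF-16 chars for a surrogate pair.
                int charLen = char.IsHighSurrogate(chars[i]) && i + 1 < chars.Length
                    ? 2 : 1;
                byteCount += Encoding.UTF8.GetByteCount(chars, i, charLen);
                if (byteCount > maxLength)
                {
                    break;
                }
                i += charLen;
            }
            return input.Substring(0, i);
        }
    }
    ```

    Because surrogate pairs are consumed whole, the result can never end in a dangling half of an astral-plane character.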
    
  • 2020-12-10 12:27

    This is another solution, based on binary search: each probe re-encodes a prefix, so the total cost is O(n log n) rather than the O(n²) of the linear scans above.

    public string LimitToUTF8ByteLength(string text, int size)
    {
        if (size <= 0)
        {
            return string.Empty;
        }
    
        int maxLength = text.Length;
        int minLength = 0;
        int length = maxLength;
    
        while (maxLength >= minLength)
        {
            length = (maxLength + minLength) / 2;
            int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));
    
            if (byteLength > size)
            {
                maxLength = length - 1;
            }
            else if (byteLength < size)
            {
                minLength = length + 1;
            }
            else
            {
                return text.Substring(0, length); 
            }
        }
    
        // Round down the result
        string result = text.Substring(0, length);
        if (size >= Encoding.UTF8.GetByteCount(result))
        {
            return result;
        }
        else
        {
            return text.Substring(0, length - 1);
        }
    }
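
    One caveat (my addition, not part of the answer above): any approach that cuts at a raw `char` index can land in the middle of a surrogate pair. A small post-truncation fixup avoids returning a string that ends in a lone high surrogate:

    ```csharp
    using System;

    public static class TruncationFixup
    {
        // Trims a trailing lone high surrogate left behind when a cut lands
        // in the middle of a UTF-16 surrogate pair.
        public static string TrimDanglingSurrogate(string s)
        {
            if (s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]))
            {
                return s.Substring(0, s.Length - 1);
            }
            return s;
        }
    }
    ```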
    
  • 2020-12-10 12:34

    All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.

    public static string LimitByteLength(string message, int maxLength)
    {
        if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
        {
            return message;
        }
    
        var encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[maxLength];
        char[] messageChars = message.ToCharArray();
        encoder.Convert(
            chars: messageChars,
            charIndex: 0,
            charCount: messageChars.Length,
            bytes: buffer,
            byteIndex: 0,
            byteCount: buffer.Length,
            flush: false,
            charsUsed: out int charsUsed,
            bytesUsed: out int bytesUsed,
            completed: out bool completed);
    
        // I don't think we can return message.Substring(0, charsUsed)
        // as that's the number of UTF-16 chars, not the number of codepoints
        // (think about surrogate pairs). Therefore I think we need to
        // actually convert bytes back into a new string
        return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
    }
    

    If you're using .NET Standard 2.1+, you can simplify it a bit:

    public static string LimitByteLength(string message, int maxLength)
    {
        if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
        {
            return message;
        }
    
        var encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[maxLength];
        encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
        return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
    }
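
    A quick usage sketch of the span-based overload above (the example strings are mine: "€" is U+20AC, which takes 3 bytes in UTF-8):

    ```csharp
    // "a€€" is 7 UTF-8 bytes in total. With a 4-byte limit the encoder stops
    // after "a" (1 byte) + "€" (3 bytes), because the second "€" needs 3 more
    // bytes and no longer fits in the buffer.
    string truncated = LimitByteLength("a\u20AC\u20AC", 4);
    Console.WriteLine(truncated); // a€
    ```

    Passing `flush: false` is what makes this safe at the boundary: the encoder buffers an incomplete trailing character instead of emitting replacement bytes for it.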
    

    None of the other answers account for extended grapheme clusters, such as emoji sequences built from several code points: cutting on a code-point boundary can still split what the user sees as a single character.
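
    For completeness, a grapheme-aware variant can be built on `System.Globalization.StringInfo`, which enumerates whole text elements (on .NET 5+ these follow Unicode extended grapheme cluster rules; older runtimes split some emoji sequences). This is a sketch under that assumption, not code from any of the answers:

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    public static class GraphemeTruncator
    {
        // Walks the string one text element (grapheme cluster) at a time, so
        // a multi-code-point emoji or a combining sequence is kept or dropped
        // as a whole rather than split in the middle.
        public static string LimitByteLengthByGrapheme(string input, int maxLength)
        {
            var builder = new StringBuilder();
            int byteCount = 0;
            TextElementEnumerator enumerator =
                StringInfo.GetTextElementEnumerator(input);
            while (enumerator.MoveNext())
            {
                string element = (string)enumerator.Current;
                byteCount += Encoding.UTF8.GetByteCount(element);
                if (byteCount > maxLength)
                {
                    break;
                }
                builder.Append(element);
            }
            return builder.ToString();
        }
    }
    ```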
