问题
How do I truncate a java String
so that I know it will fit in a given number of bytes storage once it is UTF-8 encoded?
回答1:
Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:
public static String truncateWhenUTF8(String s, int maxBytes) {
int b = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
// ranges from http://en.wikipedia.org/wiki/UTF-8
int skip = 0;
int more;
if (c <= 0x007f) {
more = 1;
}
else if (c <= 0x07FF) {
more = 2;
} else if (c <= 0xd7ff) {
more = 3;
} else if (c <= 0xDFFF) {
// surrogate area, consume next char as well
more = 4;
skip = 1;
} else {
more = 3;
}
if (b + more > maxBytes) {
return s.substring(0, i);
}
b += more;
i += skip;
}
return s;
}
This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs surrogate pairs as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8()
will return the longest truncated string it can. If you ignore surrogate pairs in the implementation then the truncated strings may be shorted than they needed to be.
I haven't done a lot of testing on that code, but here are some preliminary tests:
private static void test(String s, int maxBytes, int expectedBytes) {
String result = truncateWhenUTF8(s, maxBytes);
byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
if (utf8.length > maxBytes) {
System.out.println("BAD: our truncation of " + s + " was too big");
}
if (utf8.length != expectedBytes) {
System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
}
System.out.println(s + " truncated to " + result);
}
public static void main(String[] args) {
test("abcd", 0, 0);
test("abcd", 1, 1);
test("abcd", 2, 2);
test("abcd", 3, 3);
test("abcd", 4, 4);
test("abcd", 5, 4);
test("a\u0080b", 0, 0);
test("a\u0080b", 1, 1);
test("a\u0080b", 2, 1);
test("a\u0080b", 3, 3);
test("a\u0080b", 4, 4);
test("a\u0080b", 5, 4);
test("a\u0800b", 0, 0);
test("a\u0800b", 1, 1);
test("a\u0800b", 2, 1);
test("a\u0800b", 3, 1);
test("a\u0800b", 4, 4);
test("a\u0800b", 5, 5);
test("a\u0800b", 6, 5);
// surrogate pairs
test("\uD834\uDD1E", 0, 0);
test("\uD834\uDD1E", 1, 0);
test("\uD834\uDD1E", 2, 0);
test("\uD834\uDD1E", 3, 0);
test("\uD834\uDD1E", 4, 4);
test("\uD834\uDD1E", 5, 4);
}
Updated Modified code example, it now handles surrogate pairs.
回答2:
You should use CharsetEncoder, the simple getBytes()
+ copy as many as you can can cut UTF-8 charcters in half.
Something like this:
public static int truncateUtf8(String input, byte[] output) {
ByteBuffer outBuf = ByteBuffer.wrap(output);
CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
Charset utf8 = Charset.forName("UTF-8");
utf8.newEncoder().encode(inBuf, outBuf, true);
System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
return outBuf.position();
}
回答3:
Here's what I came up with, it uses standard Java APIs so should be safe and compatible with all the unicode weirdness and surrogate pairs etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string is fewer bytes than maxBytes.
/**
* Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
* half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
* character.
*
* Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
*/
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
if (s == null) {
return null;
}
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
byte[] sba = s.getBytes(charset);
if (sba.length <= maxBytes) {
return s;
}
// Ensure truncation by having byte buffer = maxBytes
ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
CharBuffer cb = CharBuffer.allocate(maxBytes);
// Ignore an incomplete character
decoder.onMalformedInput(CodingErrorAction.IGNORE)
decoder.decode(bb, cb, true);
decoder.flush(cb);
return new String(cb.array(), 0, cb.position());
}
回答4:
UTF-8 encoding has a neat trait that allows you to see where in a byte-set you are.
check the stream at the character limit you want.
- If its high bit is 0, it's a single-byte char, just replace it with 0 and you're fine.
- If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
- If the high bit is 1 but the next bit is 0, then you're in the middle of a character, travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.
Example: If your stream is: 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C1, which is the start of a multi-byte char.
回答5:
you can use -new String( data.getBytes("UTF-8") , 0, maxLen, "UTF-8");
回答6:
You can calculate the number of bytes without doing any conversion.
foreach character in the Java string
if 0 <= character <= 0x7f
count += 1
else if 0x80 <= character <= 0x7ff
count += 2
else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
count += 3
else if 0xdc00 <= character <= 0xffff
count += 3
else { // surrogate, a bit more complicated
count += 4
skip one extra character in the input stream
}
You would have to detect surrogate pairs (D800-DBFF and U+DC00–U+DFFF) and count 4 bytes for each valid surrogate pair. If you get the first value in the first range and the second in the second range, it's all ok, skip them and add 4. But if not, then it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to do right counting in that (unlikely) case.
来源:https://stackoverflow.com/questions/119328/how-do-i-truncate-a-java-string-to-fit-in-a-given-number-of-bytes-once-utf-8-en