Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

前端 未结 2 410
我寻月下人不归
我寻月下人不归 2021-01-12 05:56

In Rust it\'s possible to get UTF-8 from bytes by doing this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!(\"example {}\", s);
}
2条回答
  •  感情败类
    2021-01-12 06:11

    Yes, via String::from_utf8_lossy:

    fn main() {
        let text = [104, 101, 0xFF, 108, 111];
        let s = String::from_utf8_lossy(&text);
        println!("{}", s); // he�lo
    }
    

    If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.

    A quickly hacked-up example:

    use std::str;
    
    fn example(mut bytes: &[u8]) -> String {
        let mut output = String::new();
    
        loop {
            match str::from_utf8(bytes) {
                Ok(s) => {
                    // The entire rest of the string was valid UTF-8, we are done
                    output.push_str(s);
                    return output;
                }
                Err(e) => {
                    let (good, bad) = bytes.split_at(e.valid_up_to());
    
                    if !good.is_empty() {
                        let s = unsafe {
                            // This is safe because we have already validated this
                            // UTF-8 data via the call to `str::from_utf8`; there's
                            // no need to check it a second time
                            str::from_utf8_unchecked(good)
                        };
                        output.push_str(s);
                    }
    
                    if bad.is_empty() {
                        //  No more data left
                        return output;
                    }
    
                    // Do whatever type of recovery you need to here
                    output.push_str("");
    
                    // Skip the bad byte and try again
                    bytes = &bad[1..];
                }
            }
        }
    }
    
    fn main() {
        let r = example(&[104, 101, 0xFF, 108, 111]);
        println!("{}", r); // helo
    }
    

    You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:

    fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
        // ...    
                    handler(&mut output, bad);
        // ...
    }
    
    let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
        use std::fmt::Write;
        write!(output, "\\U{{{}}}", bytes[0]).unwrap()
    });
    println!("{}", r); // he\U{255}lo
    

    See also:

    • How do I convert a Vector of bytes (u8) to a string
    • How to print a u8 slice as text if I don't care about the particular encoding?.

提交回复
热议问题