What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?

孤城傲影 2020-11-28 03:02

I was solving a problem on Codeforces. Normally I first check whether the character is an upper- or lower-case English letter, then subtract or add 32 to convert it to the

10 Answers
  • 2020-11-28 03:38

    It's how ASCII works, that's all.

    But in exploiting this, you are giving up portability as C++ doesn't insist on ASCII as the encoding.

    This is why the functions std::toupper and std::tolower are implemented in the C++ standard library - you should use those instead.
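
    A minimal sketch of that portable route, assuming a std::string of single-byte characters. The helper name to_upper_copy_portable is made up for illustration. The cast through unsigned char is there because passing a negative char value to std::toupper is undefined behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: uppercases every character through the standard
// library instead of relying on ASCII bit patterns.
std::string to_upper_copy_portable(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        // std::toupper expects a value representable as unsigned char (or EOF)
        return static_cast<char>(std::toupper(c));
    });
    return s;
}
```

    Calling to_upper_copy_portable("abc") returns a new string and leaves the argument untouched. The result depends on the current C locale.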

  • 2020-11-28 03:41

    This uses the fact that ASCII values have been chosen by really smart people.

    foo ^= 32;
    

    This flips the 6th lowest bit¹ of foo (in effect ASCII's case flag), turning an ASCII upper-case letter into lower case and vice versa.

    +---+------------+------------+
    |   | Upper case | Lower case |  32 is 00100000
    +---+------------+------------+
    | A | 01000001   | 01100001   |
    | B | 01000010   | 01100010   |
    |            ...              |
    | Z | 01011010   | 01111010   |
    +---+------------+------------+
    

    Example

    'A' ^ 32
    
        01000001 'A'
    XOR 00100000 32
    ------------
        01100001 'a'
    

    And by property of XOR, 'a' ^ 32 == 'A'.
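
    The round trip can be checked at compile time. This is only a sketch: flip_ascii_case is a hypothetical helper, and the first static_assert states the assumption that the execution character set is ASCII-compatible.

```cpp
// Assumes an ASCII-compatible execution character set; this assertion makes
// the build fail loudly elsewhere (e.g. on EBCDIC, where 'A' ^ 'a' is 0x40).
static_assert(('A' ^ 'a') == 0x20, "case-bit trick requires ASCII");

// Hypothetical helper: toggles bit 5 (value 32).
// Only meaningful for the letters A-Z and a-z.
constexpr char flip_ascii_case(char c)
{
    return static_cast<char>(c ^ 0x20);
}

static_assert(flip_ascii_case('A') == 'a', "upper -> lower");
static_assert(flip_ascii_case('a') == 'A', "lower -> upper");
static_assert(flip_ascii_case(flip_ascii_case('Q')) == 'Q', "XOR is its own inverse");
```

    The helper does not check its input, so a non-letter gets mangled: '@' becomes '`', for example.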

    Notice

    C++ is not required to use ASCII to represent characters; EBCDIC is another possible encoding. This trick only works on ASCII platforms. A more portable solution is to use std::tolower and std::toupper, which have the added bonus of being locale-aware (they do not automagically solve all your problems though, see comments):

    #include <cassert>
    #include <locale>

    bool case_insensitive_equal(char lhs, char rhs)
    {
        // The std::locale{} argument selects the locale-aware overload from <locale>
        return std::tolower(lhs, std::locale{}) == std::tolower(rhs, std::locale{});
    }

    int main() { assert(case_insensitive_equal('A', 'a')); }
    

    1) As 32 is 1 << 5 (2 to the power 5), it flips the 6th bit (counting from 1).

  • 2020-11-28 03:41

    It works because, as it happens, the difference between 'a' and 'A' in ASCII and derived encodings is 32, and 32 is also the value of the sixth bit. Flipping the 6th bit with an exclusive OR thus converts between upper and lower case.

  • 2020-11-28 03:41

    The lower-case and upper-case alphabetic ranges don't cross a %32 "alignment" boundary in the ASCII coding system.

    This is why bit 0x20 is the only difference between the upper/lower case versions of the same letter.

    If this wasn't the case, you'd need to add or subtract 0x20, not just toggle, and for some letters there would be carry-out to flip other higher bits. (And there wouldn't be a single operation that could toggle, and checking for alphabetic characters in the first place would be harder because you couldn't |= 0x20 to force lcase.)
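
    The alignment claim can be verified at compile time. A sketch assuming ASCII, with helper names invented here:

```cpp
// Assumes ASCII. Each case's 26 letters sit inside a single 32-value block,
// so bit 0x20 can be set or cleared without disturbing any other bit.
static_assert('A' / 32 == 'Z' / 32, "A-Z fit in one 32-aligned block");
static_assert('a' / 32 == 'z' / 32, "a-z fit in one 32-aligned block");
static_assert(('A' & 31) == ('a' & 31), "same low five bits for both cases");

// Hypothetical helpers: force the case bit on or off.
constexpr char force_lower(char c) { return static_cast<char>(c | 0x20); }
constexpr char force_upper(char c) { return static_cast<char>(c & ~0x20); }

// Bit 0x20 is clear in every upper-case letter, so OR and +32 agree there,
// and it is set in every lower-case letter, so AND-NOT and -32 agree there.
static_assert(force_lower('Z') == 'Z' + 32, "OR matches +32 on upper case");
static_assert(force_lower('z') == 'z', "already lower case: unchanged");
static_assert(force_upper('m') == 'm' - 32, "AND-NOT matches -32 on lower case");
```

    Unlike the XOR toggle, force_lower and force_upper are idempotent: applying either one twice gives the same result as applying it once.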


    Related ASCII-only tricks: you can check for an alphabetic ASCII character by forcing lowercase with c |= 0x20 and then checking if (unsigned) c - 'a' <= ('z'-'a'). So just 3 operations: OR + SUB + CMP against a constant 25. Of course, compilers know how to optimize (c>='a' && c<='z') into asm like this for you, so at most you should do the c|=0x20 part yourself. It's rather inconvenient to do all the necessary casting yourself, especially to work around default integer promotions to signed int.

    unsigned char lcase = c|0x20;
    if (lcase - 'a' <= (unsigned)('z'-'a')) {   // lcase-'a' will wrap for characters below 'a'
        // c is alphabetic ASCII
    }
    // else it's not
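
    Building on that check, here is a sketch that flips case only for ASCII letters and leaves every other byte untouched. The function names are hypothetical and ASCII is assumed.

```cpp
#include <string>

// True for A-Z and a-z only (ASCII assumed).
inline bool is_ascii_alpha(unsigned char c)
{
    const unsigned char lcase = static_cast<unsigned char>(c | 0x20); // force lower case
    // After the cast, anything below 'a' wraps to a huge unsigned value.
    return static_cast<unsigned>(lcase - 'a') <= static_cast<unsigned>('z' - 'a');
}

// Toggles the case of ASCII letters in place; digits, punctuation and
// bytes >= 0x80 are left as they are.
inline void swap_ascii_case(std::string& s)
{
    for (char& ch : s) {
        if (is_ascii_alpha(static_cast<unsigned char>(ch)))
            ch = static_cast<char>(ch ^ 0x20);
    }
}
```

    The explicit cast to unsigned before the comparison is the part that is easy to get wrong: without it, lcase - 'a' is a signed int and is negative for characters below 'a'.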
    

    See also "Convert a String In C++ To Upper Case" (SIMD string toupper for ASCII only, masking the operand for XOR using that check).

    And also "How to access a char array and change lower case letters to upper case, and vice versa" (C with SIMD intrinsics, and scalar x86 asm case-flip for alphabetic ASCII characters, leaving others unmodified).


    These tricks are mostly only useful if hand-optimizing some text-processing with SIMD (e.g. SSE2 or NEON), after checking that none of the chars in a vector have their high bit set. (And thus none of the bytes are part of a multi-byte UTF-8 encoding for a single character, which might have different upper/lower-case inverses). If you find any, you can fall back to scalar for this chunk of 16 bytes, or for the rest of the string.
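
    A rough sketch of that chunked approach, not a tuned implementation. It assumes SSE2 is available through <emmintrin.h> (with a plain scalar loop otherwise) and the name ascii_toupper_prefix is made up here. It uppercases ASCII letters until it meets a byte with the high bit set, then returns how many bytes it handled so the caller can pass the rest to whatever locale-aware routine it prefers.

```cpp
#include <cstddef>
#include <string>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper. Returns the index of the first non-ASCII byte,
// or s.size() if the whole string was ASCII and has been uppercased.
std::size_t ascii_toupper_prefix(std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
#ifdef SKETCH_HAVE_SSE2
    const __m128i below_a  = _mm_set1_epi8('a' - 1);
    const __m128i above_z  = _mm_set1_epi8('z' + 1);
    const __m128i case_bit = _mm_set1_epi8(0x20);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
        // movemask collects the high bit of each byte: non-zero means non-ASCII.
        if (_mm_movemask_epi8(v) != 0)
            break; // let the scalar loop find the exact position
        // Bytes are all < 128 here, so signed compares are safe.
        const __m128i is_lower = _mm_and_si128(_mm_cmpgt_epi8(v, below_a),
                                               _mm_cmpgt_epi8(above_z, v));
        // XOR with 0x20 only where the byte is a lower-case letter.
        v = _mm_xor_si128(v, _mm_and_si128(is_lower, case_bit));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&s[i]), v);
    }
#endif
    for (; i < n; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c & 0x80)
            return i; // part of a multi-byte sequence: stop here
        if (c >= 'a' && c <= 'z')
            s[i] = static_cast<char>(c ^ 0x20);
    }
    return n;
}
```

    Stopping at the first non-ASCII byte corresponds to the "fall back for the rest of the string" option; falling back per 16-byte chunk would need a little more bookkeeping.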

    There are even some locales where toupper() or tolower() on some characters in the ASCII range produce characters outside that range, notably Turkish, where I ↔ ı and İ ↔ i. In those locales you'd need a more sophisticated check, or more likely you should not use this optimization at all.


    But in some cases, you're allowed to assume ASCII instead of UTF-8, e.g. Unix utilities with LANG=C (the POSIX locale), not en_CA.UTF-8 or whatever.

    But if you can verify it's safe, you can toupper medium-length strings much faster than calling toupper() in a loop (like 5x), and last I tested with Boost 1.58, much much faster than boost::to_upper_copy<char*, std::string>() which does a stupid dynamic_cast for every character.
