What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?

前端 未结 10 1793
孤城傲影
孤城傲影 2020-11-28 03:02

I was solving some problem on codeforces. Normally I first check if the character is upper or lower English letter then subtract or add 32 to convert it to the

相关标签:
10条回答
  • 2020-11-28 03:20

    Xoring with 32 (00100000 in binary) sets or resets the sixth bit (from the right). This is strictly equivalent to adding or subtracting 32.

    0 讨论(0)
  • 2020-11-28 03:26

    See the second table at http://www.catb.org/esr/faqs/things-every-hacker-once-knew/#_ascii, and following notes, reproduced below:

    The Control modifier on your keyboard basically clears the top three bits of whatever character you type, leaving the bottom five and mapping it to the 0..31 range. So, for example, Ctrl-SPACE, Ctrl-@, and Ctrl-` all mean the same thing: NUL.

    Very old keyboards used to do Shift just by toggling the 32 or 16 bit, depending on the key; this is why the relationship between small and capital letters in ASCII is so regular, and the relationship between numbers and symbols, and some pairs of symbols, is sort of regular if you squint at it. The ASR-33, which was an all-uppercase terminal, even let you generate some punctuation characters it didn’t have keys for by shifting the 16 bit; thus, for example, Shift-K (0x4B) became a [ (0x5B)

    ASCII was designed such that the shift and ctrl keyboard keys could be implemented without much (or perhaps any for ctrl) logic - shift probably required only a few gates. It probably made at least as much sense to store the wire protocol as any other character encoding (no software conversion required).

    The linked article also explains many strange hacker conventions such as And control H does a single character and is an old^H^H^H^H^H classic joke. (found here).

    0 讨论(0)
  • 2020-11-28 03:30

    Most likely your implementation of the character set will be ASCII. If we look at the table:

    We see that there's a difference of exactly 32 between the value of a lowercase and uppercase number. Therefore, if we do ^= 32 (which equates to toggling the 6th least significant bit), it changes between a lowercase and uppercase character.

    Note that it works with all the symbols, not just the letters. It toggles a character with the respective character where the 6th bit is different, resulting in a pair of characters that is toggled back and forth between. For the letters, the respective upper/lowercase characters form such a pair. A NUL will change into Space and the other way around, and the @ toggles with the backtick. Basically any character in the first column on this chart toggles with the character one column over, and the same applies to the third and fourth columns.

    I wouldn't use this hack though, as there's not guarantee that it's going to work on any system. Just use toupper and tolower instead, and queries such as isupper.

    0 讨论(0)
  • 2020-11-28 03:34

    Let's take a look at ASCII code table in binary.

    A 1000001    a 1100001
    B 1000010    b 1100010
    C 1000011    c 1100011
    ...
    Z 1011010    z 1111010
    

    And 32 is 0100000 which is the only difference between lowercase and uppercase letters. So toggling that bit toggles the case of a letter.

    0 讨论(0)
  • 2020-11-28 03:37

    Plenty of good answers here that describe how this works, but why it works this way is to improve performance. Bitwise operations are faster than most other operations within a processor. You can quickly do a case insensitive comparison by simply not looking at the bit that determines case or change case to upper/lower simply by flipping the bit (those guys that designed the ASCII table were pretty smart).

    Obviously, this isn't nearly as big of a deal today as it was back in 1960 (when work first began on ASCII) due to faster processors and Unicode, but there are still some low-cost processors that this could make a significant difference as long as you can guarantee only ASCII characters.

    https://en.wikipedia.org/wiki/Bitwise_operation

    On simple low-cost processors, typically, bitwise operations are substantially faster than division, several times faster than multiplication, and sometimes significantly faster than addition.

    NOTE: I would recommend using standard libraries for working with strings for a number of reasons (readability, correctness, portability, etc). Only use bit flipping if you have measured performance and this is your bottleneck.

    0 讨论(0)
  • 2020-11-28 03:38

    Allow me to say that this is -- although it seems smart -- a really, really stupid hack. If someone recommends this to you in 2019, hit him. Hit him as hard as you can.
    You can, of course, do it in your own software that you and nobody else uses if you know that you will never use any language but English anyway. Otherwise, no go.

    The hack was arguable "OK" some 30-35 years ago when computers didn't really do much but English in ASCII, and maybe one or two major European languages. But... no longer so.

    The hack works because US-Latin upper- and lowercases are exactly 0x20 apart from each other and appear in the same order, which is just one bit of difference. Which, in fact, this bit hack, toggles.

    Now, the people creating code pages for Western Europe, and later the Unicode consortium, were smart enough to keep this scheme for e.g. German Umlauts and French-accented Vowels. Not so for ß which (until someone convinced the Unicode consortium in 2017, and a large Fake News print magazine wrote about it, actually convincing the Duden -- no comment on that) don't even exist as a versal (transforms to SS). Now it does exist as versal, but the two are 0x1DBF positions apart, not 0x20.

    The implementors were, however, not considerate enough to keep this going. For example, if you apply your hack in some East European languages or the like (I wouldn't know about Cyrillic), you will get a nasty surprise. All those "hatchet" characters are examples of that, lowercase and uppercase are one apart. The hack thus does not work properly there.

    There's much more to consider, for example, some characters do not simply transform from lower- to uppercase at all (they're replaced with different sequences), or they may change form (requiring different code points).

    Do not even think about what this hack will do to stuff like Thai or Chinese (it'll just give you complete nonsense).

    Saving a couple of hundred CPU cycles may have been very worthwhile 30 years ago, but nowadays, there is really no excuse for converting a string properly. There are library functions for performing this non-trivial task.
    The time taken to convert several dozens kilobytes of text properly is negligible nowadays.

    0 讨论(0)
提交回复
热议问题