How can I check if a character is a letter in assembly?

Submitted by 这一生的挚爱 on 2020-12-04 11:57:08

Question


So, I have a block of code that sets the bounds to check whether a character is a letter (not a number, not a symbol), but I don't think it handles the characters that sit between the upper-case and lower-case ranges. Can you help? Thanks!

mov al, byte ptr[esi + ecx]; load the current character into al
cmp al, 0                  ; compare al with null, which marks the end of the string
je done                    ; if it is null, jump to done
cmp al, 0x41               ; compare al with "A" (lower bound)
jl next_char               ; jump to next character if less
cmp al, 0x7A               ; compare al with "z" (upper bound)
jg next_char               ; jump to next character if greater
; do something if it's a letter
next_char:
; do something different

Answer 1:


You need logic that combines multiple conditions, like this C statement: if((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))

You can do that like this:

...
je done                    ; if yes, jump to done
cmp al, 0x41               ; compare al with "A"
jl next_char               ; jump to next character if less
cmp al, 0x5A               ; compare al with "Z"
jle found_letter           ; if al is >= "A" && <= "Z" -> found a letter
cmp al, 0x61               ; compare al with "a"
jl next_char               ; jump to next character if less (since it's between "Z" & "a")
cmp al, 0x7A               ; compare al with "z"
jg next_char               ; above "z" -> not a letter
found_letter:
; ...
next_char:
; ...
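
For comparison, the two-range test from the C statement above can be written as a small function. This is only a minimal sketch: the name is_ascii_letter is made up for illustration, and it assumes plain 7-bit ASCII input.

```c
#include <stdbool.h>

/* Hypothetical helper: true only for 'A'..'Z' or 'a'..'z' (ASCII). */
bool is_ascii_letter(unsigned char c)
{
    /* Same order as the assembly: check the upper-case range first,
       then the lower-case range; everything else is rejected. */
    if (c >= 'A' && c <= 'Z')
        return true;
    if (c >= 'a' && c <= 'z')
        return true;
    return false;
}
```

The six characters between 'Z' (0x5A) and 'a' (0x61), such as '[' and '`', fail both range checks, which is exactly the gap the question asks about.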



Answer 2:


You can OR each character with 0x20; this turns upper-case letters into lower-case ones (and maps non-letter characters to other non-letter characters), so only the lower-case range needs to be checked:

...
je done       ; This is your existing code
or al, 0x20   ; <-- This line is new!
cmp al, 0x61  ; Your existing compare, but the lower bound is now "a" instead of "A"
...
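
The same idea in C, as a rough sketch (the function name is hypothetical, and it assumes ASCII). One detail to watch: once bit 0x20 is set, the lower bound has to be 'a' rather than 'A'. With 'A' as the lower bound, '@' (0x40) and '`' (0x60) would both end up as 0x60 and be accepted.

```c
#include <stdbool.h>

/* Hypothetical helper: fold to lower case, then do a single range check. */
bool is_letter_folded(unsigned char c)
{
    /* Setting bit 0x20 maps 'A'..'Z' onto 'a'..'z' and leaves
       'a'..'z' unchanged. */
    unsigned char folded = (unsigned char)(c | 0x20);

    /* Only the lower-case range has to be tested now. */
    return folded >= 'a' && folded <= 'z';
}
```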

Note: If your code needs to handle letters above 0x7F (like "Ä", "Ó", "Ñ"), it becomes much more complex. One problem is that the code of such a character depends on the character set in use: it differs between Windows console programs (for example, "Ä" = 0x8E) and Windows GUI programs ("Ä" = 0xC4), and may differ again on other operating systems.




Answer 3:


Correct, there's a gap of a few non-alphabetic characters between 'Z' and 'a'.

The most efficient way is to set the lower-case bit with an OR, then use the range-check trick of sub + unsigned compare. This of course only works for ASCII, not for extended character sets that have other ranges of alphabetic characters. Note that or al, 0x20 can never turn a non-letter into a lower-case letter, because the upper-case and lower-case ranges are "aligned" the same way relative to a mod 32 boundary of ASCII codes: 'A' is 0x41 and 'a' is 0x61, so the two ranges differ only in bit 0x20.

Arrange your loop structure with the conditional branch at the bottom. Either enter the loop with a jmp to the load and test, or peel that part of the first iteration. (See also: Why are loops always compiled into "do...while" style (tail jump)?)

Use movzx loads to avoid a false dependency on merging a low byte into EAX when writing AL.

 ; ESI = pointer to the string
    xor    ecx, ecx            ; index = 0
    movzx  eax, byte ptr[esi]  ; test first character
    test   eax, eax
    jz    .done                ; skip the loop on empty string
 ; alternative: jmp .next_char to enter the loop
.loop:                         ; do{
    inc    ecx

    mov    edx, eax               ; save a copy of the original if needed
;;;; THESE 4 INSTRUCTIONS ARE THE ALPHA / NON-ALPHA TEST
    or     al, 0x20               ; force lowercase
    sub    al, 'a'                ; AL = 0..25 if alphabetic
    cmp    al, 'z'-'a'
    ja    .non_alphabetic         ; unsigned compare rejects too high or too low (wrapping)

;; do something if it's a letter
    jmp   .next_char
.non_alphabetic:
;; do something different, then fall through

.next_char:
    movzx  eax, byte ptr[esi + ecx]
    test   eax, eax
    jnz    .loop                 ; }while((AL = str[i]) != 0);

.done:

If the input is before 'a', sub al, 'a' will be signed negative, or as unsigned will wrap to a high value, so cmp al, 'z'-'a' / ja will reject it.

If the input is after 'z', sub al, 'a' will leave a value higher than 25 ('z'-'a'), so the unsigned compare will reject it also.

Compilers use this unsigned compare trick when compiling a C expression like c <= 'z' && c >= 'a', so you can be sure it works the same as that expression for every possible input.
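
Here is a sketch of the same four-instruction test in C, to make the wrapping behaviour concrete. The function name is hypothetical; uint8_t stands in for AL so that the subtraction wraps modulo 256 the way the 8-bit register does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring:
   or al, 0x20 / sub al, 'a' / cmp al, 'z'-'a' / ja .non_alphabetic */
bool is_alpha_range_trick(uint8_t c)
{
    /* Force lower case, then shift the range so 'a' becomes 0.
       Inputs below 'a' wrap around to large unsigned values. */
    uint8_t idx = (uint8_t)((c | 0x20) - 'a');

    /* A single unsigned compare rejects both "too low" (wrapped)
       and "too high": letters end up as 0..25. */
    return idx <= 'z' - 'a';
}
```

For example, '@' (0x40) becomes 0x60 after the OR, and 0x60 - 0x61 wraps to 255, which is above 25 and therefore rejected.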

Other style notes: normally you'd just increment ESI, instead of having both a pointer and an index. Also, you may not need mov edx, eax if you can use the AL value (index into the alphabet). Making a copy and using this "destructive" test is usually better than 2 separate branches.
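
As an illustration of both style notes, here is a minimal C sketch that walks the pointer itself and uses the result of the "destructive" test directly as an index into the alphabet. The function letter_histogram and its counts array are invented for this example; they are not part of the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: count each letter case-insensitively.
   counts[0] holds 'a'/'A', counts[25] holds 'z'/'Z'. */
void letter_histogram(const char *s, size_t counts[26])
{
    for (size_t i = 0; i < 26; i++)
        counts[i] = 0;

    /* Increment the pointer instead of keeping a separate index. */
    for (; *s != '\0'; s++) {
        uint8_t idx = (uint8_t)(((uint8_t)*s | 0x20) - 'a');
        if (idx <= 'z' - 'a')
            counts[idx]++;   /* idx is already the position in the alphabet */
    }
}
```

Because the value left after the subtraction is useful on its own here, no copy of the original character is needed.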


NASM syntax allows character constants like C, so you can write 0x41 as 'A', or 0x7A as 'z', e.g. cmp al, 'a'. Then you don't even need to comment the line.

Writing the loop that way (with the load and null test under .next_char at the bottom) saves an unconditional jmp per iteration. Fewer instructions in the loop is generally better. The only point of writing asm these days is performance, so it makes sense to learn good techniques like this from the start, if it's not too confusing. No assembly answer would be complete without a link to http://agner.org/optimize/.

ASCII table (output of ascii(1); see also http://www.asciitable.com/):

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL



Answer 4:


This function (ARM assembly) walks a string and uses ASCII table values to decide whether each byte is a lower-case letter. The CMP/BLS and CMP/BHI pairs do that range check: bytes below 'a' or above 'z' are skipped. The code that follows capitalizes the byte by subtracting 32 and stores it back.

__asm void my_capitalize(char *str)
{
cap_loop
        LDRB r1, [r0] ; Load byte into r1 from memory pointed to by r0 (str pointer)
        CMP r1, #'a'-1 ; compare it with the character before 'a'
        BLS cap_skip ; If byte is lower or same, then skip this byte
        CMP r1, #'z' ; Compare it with the 'z' character
        BHI cap_skip ; If it is higher, then skip this byte
        SUBS r1,#32 ; Else subtract out difference to capitalize it
        STRB r1, [r0] ; Store the capitalized byte back in memory
cap_skip
        ADDS r0, r0, #1 ; Increment str pointer
        CMP r1, #0 ; Was the byte 0?
        BNE cap_loop ; If not, repeat the loop
        BX lr ; Else return from subroutine
}
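
For readers who don't know ARM, a C counterpart of that loop might look like the sketch below. The name capitalize_ascii is made up for illustration, and like the assembly it only handles ASCII 'a'..'z'.

```c
/* Hypothetical C counterpart of my_capitalize: upper-case a
   null-terminated string in place. */
void capitalize_ascii(char *str)
{
    for (; *str != '\0'; str++) {
        /* Same range check as the CMP/BLS and CMP/BHI pairs. */
        if (*str >= 'a' && *str <= 'z')
            *str = (char)(*str - 32);   /* 'a' - 'A' is 32 in ASCII */
    }
}
```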


Source: https://stackoverflow.com/questions/31824441/how-can-i-check-if-a-character-is-a-letter-in-assembly
