问题
I think this is a pretty common task and there should be some fast and concise solution. I have a quadword and I want to get the lowest byte position which is equal to 0x0A (line break in Linux). I wrote the following simple program:
SYS_exit equ 0x3C
section .text
global _start
_start:
mov rax, 0x0A
mov rbx, [dt]
mov rcx, 0x07
loop:
mov r13, rbx
and r13, 0xFF
cmp r13, 0x0A
jz ok
shr rbx, 8
dec rcx
jnz loop
jmp fail
ok:
mov rax, SYS_exit
mov rdi, 8
sub rdi, rcx
syscall
fail:
mov rax, SYS_exit
mov rdi, -1
syscall
section .data
dt: dq 0xAB97450A8733AA1F
And it works pretty much fine. strace ./bin prints
execve("./bin", ["./bin"], [/* 69 vars */]) = 0
exit(5) = ?
+++ exited with 5 +++
But the program looks ugly and actually I'm looking for a way to make this as fast as possible. Can you give some optimization advice?
回答1:
But the program looks ugly
Congratulations for noticing :P
I'm looking for a way to make this as fast as possible
SSE2 is baseline for x86-64, so you should use it. You can do this in a couple instructions, using pcmpeqb / pmovmskb to get a bitmap of byte-compare results, then use a bit-scan instruction like bsr (scan reverse gives you the index of the highest set bit).
default rel ; don't forget this: RIP-relative addressing is best for loading/storing global data
_start:
movq xmm0, [dt] ; movq xmm0, rdx or whatever works too.
pcmpeqb xmm0, [newline_mask] ; -1 for match, 0 for no-match
pmovmskb edi, xmm0
bsr edi, edi ; index of highest set bit
mov eax, SYS_exit
jz .not_found ; BSR sets ZF if the *input* was zero
; [dt+rdi] == 0xA
syscall ; exit(0..7)
.not_found:
mov edi, -1 ; exit only cares about the low byte of its arg; a 64-bit -1 is pointless.
syscall
section .rodata
align 16
newline_mask: times 16 db 0x0a
section .data
dt: dq 0xAB97450A8733AA1F
Obviously in a loop you'd keep newline_mask in a register (and then you can broadcast-load it with AVX vbroadcastss, or SSE3 movddup, instead of needing a whole 16 byte constant in memory).
And of course you can do this for 16 bytes at a time with a movdqu load, or 32 bytes at a time with AVX2. If you have a large buffer, you're basically implementing a backwards memcmp and should look at optimized library implementations. They might combine pcmpeqb results for a whole cache line with por, so they save 3/4 of the pmovmskb work until the end when they sort out which part of the cache line had the hit.
If you care about AMD CPUs (where bsr is slow), maybe separately test for input=0 with test edi,edi / jz before using tzcnt. (tzcnt(x) gives you 31-bsr(x), or 32 if the input was all-zero.) If you can depend on BMI2 being available...
If you wanted to do it with a scalar loop, you could use byte compares on the low byte of a register instead of copying and masking the value.
; we test byte 7 first, so start the counter there.
mov edi, 7 ; no idea why you were using a 64-bit counter register
; loop body runs with edi=7..0
.loop: ; do{
rol rbx, 8 ; high byte becomes low
cmp bl, 0xa ; check the low byte
je .found
dec edi
jge .loop ; } while(--edi>=0) signed compare
; not found falls through with edi=-1
.found:
mov eax, SYS_exit
syscall ; exit(7..0) for found, or exit(-1) for not-found
Depending on what you're doing with the result, you might arrange your loop counter differently.
来源:https://stackoverflow.com/questions/50496029/get-the-last-line-separator