Disassembling write(1,\"hi\",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in:
ba03000000 mov edx, 3 ; tha
On something like the original IBM PC, if AH was known to contain 0 and it was necessary to load AX with a value like 0x34, using "MOV AL,34h" would generally take 8 cycles rather than the 12 required for "MOV AX,0034h"--a pretty big speed improvement (either instruction could execute in 2 cycles if pre-fetched, but in practice the 8088 spends most of its time waiting for instructions to be fetched at a cost of four cycles per byte). On the processors used in today's general-purpose computers, however, the time required to fetch code is generally not a significant factor in overall execution speed, and code size is normally not a particular concern.
Further, processor vendors try to maximize the performance of the kinds of code people are likely to run, and 8-bit load instructions aren't likely to be used nearly as often nowadays as 32-bit load instructions. Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.