One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level
I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual, so the reason why assembly may be slower is that people who write such slower assembly didn't read the Optimization Manual.
In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles, but since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced 1993, there were U and V pipelines: dual pipe lines that could execute two simple instructions at one clock cycle if they didn't depend on one another; but this was nothing to compare of what is Out-of-Order Execution & Register Renaming appeared in Pentium Pro, and almost left unchanged nowadays.
To explain in a few words, fastest code is where instructions do not depend on previous results, e.g. you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may renamed internally by the CPU to allow instruction execute in parallel or in different order.
Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may with to use add eax, 1
instead or inc eax
to remove dependency on previous state of the flags.
You can read more on Out-of-Order Execution & Register Renaming if time permits, there is plenty information available in the Internet.
There are also other important issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the most important thing to consider is namely the Out-of-Order Execution.
Most people are simply not aware about the Out-of-Order Execution, so they write their assembly programs like for 80286, expecting their instruction will take a fixed time to execute regardless of context; while C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such unaware people is slower, but if you will become aware, your code will be faster.