So from my understanding of delay slots, they occur when a branch instruction is executed and the instruction following the branch also gets loaded from memory. What is the point of this?
In the textbook example of a pipelined implementation, a CPU fetches, decodes, executes, and writes back. These stages take place in different clock cycles, so each instruction takes 4 cycles to complete. However, while the first instruction is being decoded, the next one is already being fetched from memory. When the pipeline is full, parts of 4 different instructions are in flight simultaneously, and the throughput of the CPU is one instruction per clock cycle.
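A rough picture of that overlap (F = fetch, D = decode, E = execute, W = write back; one column per clock cycle):

    cycle:    1   2   3   4   5   6   7
    instr 1:  F   D   E   W
    instr 2:      F   D   E   W
    instr 3:          F   D   E   W
    instr 4:              F   D   E   W

From cycle 4 on, one instruction finishes every cycle even though each individual instruction still takes 4 cycles.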
When the machine code contains a sequence such as:
    loop:
        sub r0, #1   ; decrement the loop counter
        bne loop     ; branch back if the result was not zero
        xxx          ; the instruction after the branch
The processor can forward the result of sub r0, #1 from its write-back stage to the execute stage of bne loop, but by that time xxx has already been fetched. To avoid having to flush and refill the pipeline, the CPU designers chose a delay slot instead: the instruction after the branch is executed unconditionally, and by the time it has been fetched, the fetch unit has the proper address of the branch target. An optimizing compiler only rarely needs to put a NOP in the delay slot; usually it can move there an instruction that is needed on both paths, whether the branch is taken or not.
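A minimal sketch of that last point, keeping the ARM-like syntax above but assuming a hypothetical machine with a single branch delay slot (real ARM has no delay slots; MIPS and SPARC are the classic examples, and r1, r2, and the add are made up for illustration):

    ; unoptimized: the delay slot is wasted on a NOP
    loop:
        add r2, r2, r1   ; loop body: accumulate into r2
        sub r0, #1       ; decrement the loop counter
        bne loop
        nop              ; delay slot: executes but does nothing

    ; optimized: the body instruction is moved into the delay slot
    loop:
        sub r0, #1       ; decrement the loop counter
        bne loop
        add r2, r2, r1   ; delay slot: executes once per iteration,
                         ; whether or not the branch is taken

The add is a legal filler because it does not touch the flags that bne tests, and it has to run exactly once per iteration either way, including the last one.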