First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode and even with respect to the width of later parts of the pipeline. Fetching more bytes than the average requires can reduce the impact of variability in instruction length and of the effective fetch bandwidth lost to taken control flow instructions.
(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch, and they reduce effective fetch bandwidth when targets are less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high-performance x86—a taken branch that targets the 16th [last] byte in a chunk results in effectively only one byte of code being fetched, as the other 15 bytes are discarded.)
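To make the alignment penalty concrete, here is a toy calculation (in C, treating the 16-byte aligned fetch from the example above as an assumption) of how many useful bytes a fetch delivers for a given branch target:

```c
/* Toy model of aligned instruction fetch: with a 16-byte aligned fetch,
 * only the bytes from the branch target to the end of the aligned chunk
 * are useful; earlier bytes in the chunk are discarded. */
#include <stdint.h>
#include <stdio.h>

#define FETCH_BYTES 16u   /* aligned fetch width (example from the text) */

static unsigned useful_bytes(uint64_t target)
{
    return FETCH_BYTES - (unsigned)(target % FETCH_BYTES);
}

int main(void)
{
    /* A target at offset 15 within its chunk yields only one useful byte. */
    printf("target 0x100f -> %u useful bytes\n", useful_bytes(0x100f)); /* 1  */
    printf("target 0x1000 -> %u useful bytes\n", useful_bytes(0x1000)); /* 16 */
    return 0;
}
```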
Even for fixed-length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they were not aligned, as long as the group of instructions did not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow a fetch to cross a cache block—but not page—boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (POWER4 used 32-byte chunks containing 8 instructions, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles in which no fetch is performed and to give spare Icache cycles for cache block filling after a miss while having only one read/write port to the Icache.)
For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel, or wait for the length to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (the S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC-V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb-2, which have only two lengths determinable from the major opcode and in which the encodings of different-length instructions are substantially different, non-speculative decode may be preferred, especially given the likely narrow decode width and emphasis on energy efficiency, though with only two lengths some speculation may be reasonable at small decode widths.
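As a rough sketch of the non-speculative approach, the two length rules mentioned above can be computed from the first 16-bit parcel alone. The code below assumes the documented zArchitecture rule (the two most significant bits of the first halfword select 2, 4, or 6 bytes) and handles only the 16-bit/32-bit distinction for RISC-V, ignoring the longer encodings:

```c
/* Sketch of non-speculative length determination; both rules read only
 * the first 16-bit parcel of the instruction. */
#include <stdint.h>
#include <stdio.h>

/* zArchitecture: bits 0-1 (IBM numbering, i.e., the two most significant
 * bits) of the first halfword: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6. */
static int zarch_length_bytes(uint16_t first_halfword)
{
    switch (first_halfword >> 14) {
    case 0:  return 2;
    case 3:  return 6;
    default: return 4;
    }
}

/* RISC-V: the low two bits of the first parcel distinguish compressed
 * (16-bit) from base 32-bit instructions; longer encodings (48-bit and
 * beyond) also exist but are omitted in this sketch. */
static int riscv_length_bytes(uint16_t first_parcel)
{
    return ((first_parcel & 0x3) != 0x3) ? 2 : 4;
}

int main(void)
{
    printf("zArch 0x1800  -> %d bytes\n", zarch_length_bytes(0x1800));  /* 2 */
    printf("zArch 0xC000  -> %d bytes\n", zarch_length_bytes(0xC000));  /* 6 */
    printf("RISC-V 0x0001 -> %d bytes\n", riscv_length_bytes(0x0001));  /* 2 */
    printf("RISC-V 0x0013 -> %d bytes\n", riscv_length_bytes(0x0013));  /* 4 */
    return 0;
}
```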
For x86, one strategy AMD has used to avoid excessive decode energy is to store marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded before being installed in the cache), and it still requires the decoders to check that the marked lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
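A hypothetical sketch of how such marker bits could be used: with one end-of-instruction bit per cache byte, the start offsets within a fetched chunk fall out of a trivial scan (the chunk size and marker layout here are illustrative, not AMD's actual format):

```c
/* Hypothetical sketch of predecode marker bits: one bit per byte of the
 * Icache chunk marks the last byte of an instruction, so instruction start
 * positions come from a simple scan (no x86 length decoding needed here). */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 16   /* toy fetch-chunk size for illustration */

/* Given end-of-instruction markers and an entry offset (e.g., a branch
 * target), list the instruction start offsets within the chunk. */
static int find_starts(const uint8_t end_marker[LINE_BYTES], int entry,
                       int starts[LINE_BYTES])
{
    int n = 0;
    starts[n++] = entry;
    for (int i = entry; i < LINE_BYTES - 1; i++)
        if (end_marker[i])          /* next byte begins a new instruction */
            starts[n++] = i + 1;
    return n;
}

int main(void)
{
    /* Pretend predecode marked instruction ends at bytes 2, 5, and 11. */
    uint8_t ends[LINE_BYTES] = {0};
    ends[2] = ends[5] = ends[11] = 1;

    int starts[LINE_BYTES];
    int n = find_starts(ends, 0, starts);
    for (int i = 0; i < n; i++)
        printf("instruction starts at byte %d\n", starts[i]); /* 0, 3, 6, 12 */
    return 0;
}
```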
Intel seems to prefer the speculative parallel decode approach. Since the length of the previous instruction in a chunk being decoded is available after only a modest delay, the second and later decoders may not need to fully decode the instruction at every possible starting point.
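A rough software analogy of speculative parallel decode: compute a length at every possible start offset "in parallel", then chain through the offsets once the earlier lengths are known. The length_at() function below is a placeholder, not a real x86 length decoder:

```c
/* Rough sketch of speculative parallel decode: a length is computed for
 * every possible start offset, then the valid chain of start offsets is
 * selected using the already-computed lengths. */
#include <stdio.h>

#define CHUNK 16

/* Placeholder: pretend each byte value encodes that instruction's length. */
static int length_at(const unsigned char *bytes, int offset)
{
    return bytes[offset];   /* stand-in for an x86 length decoder */
}

int main(void)
{
    /* Toy chunk: instruction lengths 3, 2, 5, 6 laid out back to back. */
    unsigned char chunk[CHUNK] = {3,0,0, 2,0, 5,0,0,0,0, 6,0,0,0,0,0};

    int len[CHUNK];
    for (int i = 0; i < CHUNK; i++)      /* "parallel" speculative decode */
        len[i] = length_at(chunk, i);

    /* Serial selection of the valid chain using the precomputed lengths. */
    for (int start = 0; start < CHUNK; start += len[start])
        printf("instruction at offset %d, length %d\n", start, len[start]);
    return 0;
}
```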
Since x86 instructions can be relatively complex, there are also often decode template constraints, and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded per cycle to producing only one µop each, while the first instruction can decode into up to four µops (with longer µop sequences handled by a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.
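A small sketch of such a 4-1-1-1 style template (the per-instruction µop counts below are made up): only the first decode slot accepts a multi-µop instruction, so a complex instruction that would land in a later slot waits for the next decode group:

```c
/* Small sketch of a 4-1-1-1 decode template: only decoder 0 may produce
 * up to four uops; decoders 1-3 handle single-uop instructions. An
 * instruction that does not fit its slot is deferred to the next group.
 * The uop counts in the example are made up for illustration. */
#include <stdio.h>

#define DECODE_WIDTH 4

int main(void)
{
    int uops[] = {1, 1, 3, 1, 1, 2, 1, 1};      /* uops per instruction */
    int n = sizeof uops / sizeof uops[0];

    int i = 0, cycle = 0;
    while (i < n) {
        int slot = 0;
        while (slot < DECODE_WIDTH && i < n) {
            int limit = (slot == 0) ? 4 : 1;    /* 4-1-1-1 template */
            if (uops[i] > limit)
                break;                          /* defer to next group */
            printf("cycle %d slot %d: insn %d (%d uops)\n",
                   cycle, slot, i, uops[i]);
            slot++;
            i++;
        }
        cycle++;
    }
    return 0;
}
```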
In more recent performance-oriented x86 designs, Intel has used a µop cache, which stores instructions in decoded format, avoiding template and fetch-width constraints and reducing the energy used for decoding.