Is there an advantage of specifying “-mfpu=neon-vfpv3” over “-mfpu=neon” for ARMs with separate pipelines?

Question

My Zynq-7000 ARM Cortex-A9 Processor has both the NEON and the VFPv3 extension and the Zynq-7000-TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions".

So far I have compiled my programs with Linaro GCC 6.3-2017.05 and the -mfpu=neon option to make use of SIMD instructions. But when the compiler also has non-SIMD operations to issue, will it make a difference to use -mfpu=neon-vfpv3? Will GCC's instruction selection and scheduler emit instructions of both kinds, so that it can use both pipelines and increase utilization of the CPU?

Answer 1:

Technically, yes.

In practice, no.

NEON is optional on ARMv7.

Licensees can choose one of the following configurations:

none
VFP only
NEON plus VFP

Unlike NEON, there have been different VFP versions on ARMv7. The VFP-lite on the Cortex-A8 is the most notorious one: it is not pipelined and is therefore extremely slow.

Therefore, it technically makes sense to specify the CPU configuration and the architecture version via compiler options so that the compiler can generate the best machine code for that particular architecture and configuration.
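One way to see which configuration the compiler actually took from the options is to read the ACLE feature macros it predefines. The sketch below is only an illustration: the struct and function names are made up for this example, and on a non-ARM host every field simply reads as zero.

```c
#include <assert.h>
#include <stdio.h>

/* Snapshot of the FP/SIMD configuration the compiler was told to target.
 * A field is 0 when the corresponding macro is not defined, for example
 * when building for a non-ARM host. Names here are hypothetical. */
struct fpu_config {
    int arch;      /* __ARM_ARCH: architecture version, 7 for ARMv7-A */
    int has_neon;  /* 1 when Advanced SIMD (NEON) is enabled */
    int fp_mask;   /* __ARM_FP bitmask: 0x02 half, 0x04 single, 0x08 double */
};

struct fpu_config fpu_config_get(void)
{
    struct fpu_config cfg = {0, 0, 0};
#ifdef __ARM_ARCH
    cfg.arch = __ARM_ARCH;
#endif
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    cfg.has_neon = 1;
#endif
#ifdef __ARM_FP
    cfg.fp_mask = __ARM_FP;
#endif
    return cfg;
}

/* Print the snapshot; returns the number of characters written. */
int fpu_config_print(FILE *out)
{
    assert(out != NULL);
    struct fpu_config cfg = fpu_config_get();
    return fprintf(out, "arch=%d neon=%d fp_mask=0x%02x\n",
                   cfg.arch, cfg.has_neon, (unsigned)cfg.fp_mask);
}
```

Building this once with -mfpu=neon and once with -mfpu=neon-vfpv3 and comparing the output shows whether the two spellings select different feature sets on a given toolchain. Later GCC documentation describes neon as an alias for neon-vfpv3, so identical output would not be surprising, but that is worth confirming for the toolchain in use.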

In reality, however, compilers these days ignore most of these build options, and even directives.

And the fact that VFP and NEON instructions are assigned to different pipelines won't help much, if at all, since both share the same register bank.

Boosting NEON's performance by using as many registers as possible would gain much more than letting the VFP run in parallel.
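To illustrate what using more registers can look like, here is a rough sketch of a float sum that keeps four independent q-register accumulators, so that consecutive adds do not wait on each other. sum_f32 is a hypothetical name, the length is assumed to be a multiple of 16 to keep the sketch short, and a plain C fallback is used when NEON is not enabled. The additions happen in a different order than in a simple loop, so rounding can differ slightly.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sum n floats using four independent accumulators.
 * Assumption for this sketch: n is a multiple of 16. */
float sum_f32(const float *x, size_t n)
{
    assert(n % 16 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four q registers, each holding four partial sums. */
    float32x4_t a0 = vdupq_n_f32(0.0f), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n; i += 16) {
        a0 = vaddq_f32(a0, vld1q_f32(x + i));
        a1 = vaddq_f32(a1, vld1q_f32(x + i + 4));
        a2 = vaddq_f32(a2, vld1q_f32(x + i + 8));
        a3 = vaddq_f32(a3, vld1q_f32(x + i + 12));
    }
    /* Combine the accumulators, then reduce four lanes to one. */
    float32x4_t s = vaddq_f32(vaddq_f32(a0, a1), vaddq_f32(a2, a3));
    float32x2_t h = vadd_f32(vget_low_f32(s), vget_high_f32(s));
    h = vpadd_f32(h, h);
    return vget_lane_f32(h, 0);
#else
    /* Portable fallback: four scalar accumulators. */
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
#endif
}
```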

It puzzles me why and how so many people put so much trust in free compilers these days.

The best ARM compiler available is, hands down, ARM's own, which comes with the $6k+ DS-5 Ultimate Edition. Their support is excellent, but I'm not sure it justifies the price tag.

Answer 2:

ARM's Cortex-A9 NEON/VFP manual (Cortex™ -A9 NEON™ Media Processing Engine) says, in section 3.2 Writing optimal VFP and Advanced SIMD code:

The following guidelines can provide significant performance increases for VFP and Advanced SIMD code: Where possible avoid:

...

mixing Advanced SIMD only instructions with VFP only instructions.

It says it can execute NEON and VFP instructions in parallel with ARM or Thumb instructions (i.e. scalar integer code), "with the exception of simultaneous loads and stores".

It's not 100% clear if they mean avoid having them in flight at once at all, or if they mean avoid having data dependencies between VFP and NEON instructions. It's easy to imagine the latter being bad for reasons that don't apply to the former (e.g. maybe no bypass forwarding between execution units in different domains).

The cycle timings in the same document indicate that VFP scalar instructions take longer in the pipeline than NEON instructions (even if the latency appears to be the same), so probably using VFP is a win for code that doesn't vectorize, even with -ffast-math. Or if I'm reading this right, NEON has lower latency MUL, so may be a win for long dependency chains.
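As an illustration of code that can vectorize versus code that cannot, compare the two loops below (the function names are invented for this sketch). The first has independent iterations. The second carries a dependency from one iteration to the next, so it stays scalar whatever math flags are used. GCC's documentation also notes that on ARMv7 the auto-vectorizer only generates NEON floating-point code when -funsafe-math-optimizations (implied by -ffast-math) is given, because NEON treats denormals as zero.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise update: every iteration is independent, so a vectorizer
 * may turn this into NEON code (given suitable math flags). */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* First-order recurrence: y[i] depends on y[i-1]. This loop-carried
 * dependency chain keeps the code scalar, so its speed is bound by the
 * latency of the multiply and add rather than by throughput. */
void iir1(float *restrict y, const float *restrict x, float a, size_t n)
{
    assert(y != NULL && x != NULL);
    float prev = 0.0f;
    for (size_t i = 0; i < n; i++) {
        prev = a * prev + x[i];
        y[i] = prev;
    }
}
```

Compiling with -O3 -S and looking for q-register instructions in scale_add and s-register instructions in iir1 is a quick way to check what a particular compiler does.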

Cortex-A9, if it has VFP, has fully-pipelined VFP FPUs. e.g.

VADD/VSUB .F (Sn) or .D (Dn) (VFP): 1c throughput. Inputs needed on cycle 1, results ready on cycle 4. (So 4c latency?)
VADD/VSUB Dn (NEON): 1c throughput. Inputs needed on cycle 2, results ready on cycle 5 (write-back on cycle 6). (So 4c or 5c latency, depending on what consumes the result?)
VADD/VSUB Qn (NEON): one per 2c throughput. Inputs needed on cycle 2 then 3, results ready on cycle 5 then 6 (write-back 1c later than that). (So 4c or 5c latency?)
VMUL .F Sd,Sn,Sm (VFP): 1c throughput. Inputs needed on cycle 1, results ready on cycle 5. (So 5c latency?)
VMUL (VFP) with double-precision isn't listed, only VNMUL (2c throughput).
VMUL (NEON): same timings as VADD/VSUB. Maybe not handling denormals allows a shortcut? If I'm reading this right, it's actually lower latency than VFP, except for the instruction needing to issue earlier.
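To make the latency versus throughput distinction concrete, here is a back-of-the-envelope model. It is my own simplification, not something from the manual: a chain of dependent instructions costs about n times the latency, while independent instructions cost about one issue interval each plus the latency of the last one. It ignores dual-issue limits, loads and stores, and the forwarding special cases.

```c
#include <assert.h>

/* Cycles for n instructions where each one needs the previous result. */
unsigned long chain_cycles(unsigned long n, unsigned latency)
{
    return n * latency;
}

/* Cycles for n independent instructions on a fully pipelined unit:
 * one issues every issue_interval cycles, and the last result arrives
 * latency cycles after it issues. */
unsigned long independent_cycles(unsigned long n, unsigned latency,
                                 unsigned issue_interval)
{
    assert(issue_interval > 0);
    if (n == 0)
        return 0;
    return (n - 1) * issue_interval + latency;
}
```

Taking the figures listed above at face value (4c latency and 1c throughput for a VFP VADD), chain_cycles(100, 4) gives 400 cycles for 100 dependent adds, while independent_cycles(100, 4, 1) gives 103 for 100 independent ones. For the Qn NEON form (one per 2c), independent_cycles(100, 4, 2) gives 202 cycles, but each instruction processes four floats.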

There's also special result-forwarding for multiply-accumulate. See the PDF.

Answer 3:

The answer will depend on the version of GCC, and may change in the future. The current code in cortex-a9.md describes the NEON/VFP as a combined unit. The relevant line is:

(define_cpu_unit "ca9_issue_vfp_neon, cortex_a9_ls" "cortex_a9")

With comments,

;; The Cortex-A9 core is modelled as a dual issue pipeline that has
;; the following components.
;; 1. 1 Load Store Pipeline.
;; 2. P0 / main pipeline for data processing instructions.
;; 3. P1 / Dual pipeline for Data processing instructions.
;; 4. MAC pipeline for multiply as well as multiply
;;    and accumulate instructions.
;; 5. 1 VFP and an optional Neon unit.
;; The Load/Store, VFP and Neon issue pipeline are multiplexed.
;; The P0 / main pipeline and M1 stage of the MAC pipeline are
;;   multiplexed.
;; The P1 / dual pipeline and M2 stage of the MAC pipeline are
;;   multiplexed.
;; There are only 4 integer register read ports and hence at any point of
;; time we can't have issue down the E1 and the E2 ports unless
;; of course there are bypass paths that get exercised.
;; Both P0 and P1 have 2 stages E1 and E2.
;; Data processing instructions issue to E1 or E2 depending on
;; whether they have an early shift or not.

The ca9_issue_vfp_neon unit is used to describe both NEON and VFP instructions, so when the scheduler costs them it does not know that the two can execute in separate pipelines. It may still emit both, and with luck they will overlap at run time.

In 'arm.c', there are many instances where NEON is used to transfer data. If your code has floating point with many structures, the compiler may intermix NEON and VFP code where the NEON is used to move data.
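A small example of the kind of code where this could show up: a struct of floats is copied and then updated with scalar arithmetic. Whether GCC uses NEON loads and stores for the copy and VFP for the arithmetic depends on the version and options, so treat that as something to check with -S rather than a given. The struct and function here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 32-byte record made up only of floats. */
struct sample {
    float pos[4];
    float vel[4];
};

/* Copy src to dst, advance the position by dt, and return the sum of
 * squared velocities. */
float advance(struct sample *dst, const struct sample *src, float dt)
{
    assert(dst != NULL && src != NULL);

    /* Whole-struct copy: pure data movement, a candidate for NEON
     * loads/stores even though no SIMD arithmetic is involved. */
    *dst = *src;

    /* Floating-point arithmetic on the copied fields. */
    float energy = 0.0f;
    for (int i = 0; i < 4; i++) {
        dst->pos[i] += dst->vel[i] * dt;
        energy += dst->vel[i] * dst->vel[i];
    }
    return energy;
}
```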

Machines like the Exynos have some custom tuning, such as using NEON for string operations, that your Zynq CPU will not get, as it doesn't have a tuning description in arm.c.

Also, if you don't specify -mfpu=neon-vfpv3, any in-line assembler with 'vfpv3' instructions will be invalid.
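One way to test this on a particular toolchain is a tiny inline-assembly probe. The sketch below uses the immediate form of VMOV, which was introduced with VFPv3. The function name is made up, the "t" constraint asks GCC for a single-precision VFP register, and on anything other than 32-bit ARM with hardware floating point the plain C fallback is used.

```c
#include <assert.h>

/* Returns 1.0f. On 32-bit ARM with hardware FP the value is produced by
 * a VFPv3 immediate move, so the build fails if the assembler does not
 * accept VFPv3 instructions under the chosen -mfpu setting. */
float one_via_vfpv3(void)
{
    float r;
#if defined(__arm__) && defined(__ARM_FP)
    __asm__("vmov.f32 %0, #1.0" : "=t"(r));
#else
    r = 1.0f;  /* fallback for other targets */
#endif
    assert(r == 1.0f);
    return r;
}
```

If the assembler rejects the instruction under one -mfpu setting and accepts it under the other, the two settings really do differ for inline assembly on that toolchain.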

Things will change depending on the GCC version. However, you can look for the CPU description in 'cortex-a9.md' to see if the compiler can possibly schedule instructions differently. Also, the 'arm.c' file performs the costing for instructions; if a NEON cost is not implemented there, then the compiler will never emit the instructions.

Having struggled with the simpler ARMv5 DSP instructions, I would expect that even if this were to work, only 1-2% of instructions would change. In multi-megabyte images, an option like this will only change a few hundred op-codes, for the reasons that others have given (shared registers, 'C' semantics on floating point, etc).

However, if -mfpu=neon-vfpv3 does describe your CPU, why would you not use it for an embedded application? The generic options are meant to generate code that can run on more than one type of device.

Source: https://stackoverflow.com/questions/47768539/is-there-an-advantage-of-specifying-mfpu-neon-vfpv3-over-mfpu-neon-for-arm

Tags: gcc, assembly, arm, neon, armv7