[AArch64] Prefer to fold dup into fmul/fma as opposed to ld1r
There is a fold that creates LD1DUPpost from dup(load), which can be post-incremented. If the dup is used by a "by element" operation such as fmul or fma, it can be slightly better to fold the dup into the fmul instead, which produces slightly faster code:

  ld1r { v1.4s }, [x0], #4
  fmul v0.4s, v1.4s, v0.4s
vs
  ldr s1, [x0], #4
  fmul v0.4s, v0.4s, v1.s[0]

This could also be done with integer operations such as smull/umull, so long as the load/dup gets correctly combined into the mul operation. Currently this only operates on floating-point types.

Differential Revision: https://reviews.llvm.org/D145184
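For illustration only, the dup(load) feeding an fmul described above could plausibly arise from a loop of the following shape, where one float is loaded per iteration and applied to all four lanes of an accumulator. The function name `scale_by_stream` is hypothetical and not part of the patch, and whether a given compiler version vectorizes this source into either instruction sequence is an assumption, not something the patch states.

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from the patch: multiply a 4-lane
 * accumulator by each scalar read from a stream. */
void scale_by_stream(float acc[4], const float *scalars, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        /* One scalar load per iteration; the address advances by 4 bytes. */
        float s = scalars[i];
        /* The same scalar is applied to every lane, which is the
         * "broadcast then multiply" pattern, i.e. fmul by element. */
        for (int lane = 0; lane < 4; ++lane)
            acc[lane] *= s;
    }
}
```

Compiling such a function for an AArch64 target with optimizations enabled and inspecting the assembly is one way to check which of the two sequences is actually emitted.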