[SLP] Better estimate cost of no-op extracts on target vectors.
The motivation for this patch is to better estimate the cost of extracelement instructions in cases were they are going to be free, because the source vector can be used directly. A simple example is %v1.lane.0 = extractelement <2 x double> %v.1, i32 0 %v1.lane.1 = extractelement <2 x double> %v.1, i32 1 %a.lane.0 = fmul double %v1.lane.0, %x %a.lane.1 = fmul double %v1.lane.1, %y Currently we only consider the extracts free, if there are no other users. In this particular case, on AArch64 which can fit <2 x double> in a vector register, the extracts should be free, independently of other users, because the source vector of the extracts will be in a vector register directly, so it should be free to use the vector directly. The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on certain AArch64 CPUs. It looks like this does not impact any code in SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto. This originally regressed after D80773, so if there's a better alternative to explore, I'd be more than happy to do that. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D99719
Loading
Please register or sign in to comment