AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal safe 1ulp lowering which uses rcp after using frexp to perform input scaling. This saves 2 instructions compared to other implementations which performed an explicit denormal range change. This improves the OpenCL default, and requires a flag for HIP. I don't believe there's any flag wired up for OpenMP to emit the necessary fpmath metadata. This provides several improvements and changes that were hard to separate without regressing one case or another. Disturbingly the OpenCL conformance test seems to have the reciprocal test commented out. I locally hacked it back in to test this. Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like the rcp case, we could do this in codegen if !fpmath were preserved (although we would lose some computeKnownFPClass tricks). Start requiring contract flags to form rsq. The rsq fusion actually improves the result from ~2ulp to ~1ulp. We have some older fusion in codegen which only keys off unsafe math which should be refined. Expand rsq patterns by checking for denormal inputs and pre/post multiplying like the current library code does. We also take advantage of computeKnownFPClass to avoid the scaling when we can statically prove the input cannot be a denormal. We could do the same for the rcp case, but unlike rsq a large input can underflow to denormal. We need additional upper bound exponent checks on the input in order to do the same for rcp. This rsq handling also now starts handling the negated case. We introduce rsq with an fneg. In the case the fneg doesn't fold into its user, it's a neutral change but provides improvement if it is foldable as a source modifier. Also starts respecting the arcp attribute properly, and more strictly interprets afn. We were previously interpreting afn as implying you could do the reciprocal expansion of an fdiv. The codegen handling of these also needs to be revisited. This also effectively introduces the optimization combineRepeatedFPDivisors enables, just done in the IR instead (and only for f32). This is almost across the board better. The one minor regression is for gfx6/buggy frexp case where for multiple reciprocals, we could previously reuse rematerialized constants per instance (it's neutral for a single rcp). The fdiv.fast and sqrt handling need to be revisited next. https://reviews.llvm.org/D155593
Loading
Please sign in to comment