- Jun 05, 2022
-
-
Kazu Hirata authored
-
Christian Sigg authored
Recommit: "[MLIR][NVVM] Replace fdiv on fp16 with promoted (fp32) multiplication with reciprocal plus one (conditional) Newton iteration." This change rolls bcfc0a90 forward (i.e., reverting 369ce54b) with fixed CMakeLists.txt.
-
Fangrui Song authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
Kazu Hirata authored
-
LemonBoy authored
Differential Revision: https://reviews.llvm.org/D101694
-
- Jun 04, 2022
-
-
Florian Hahn authored
Instead of setting the successor to the exit using CFG.ExitBB, set it to nullptr initially. The successor to the exit block is later set either through createEmptyBasicBlock or after VPlan execution (because at the moment, no block is created by VPlan for the exit block, the existing one is reused). This also enables BranchOnCond to be used as terminator for the exiting block of the topmost vector region. Depends on D126618. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D126679
-
Jacques Pienaar authored
Avoids "pass state was never initialized" assertion failure.
-
Peter Klausler authored
Diagnose OPEN(FILE=f) when f is already connected by the same name to a distinct external I/O unit. Differential Revision: https://reviews.llvm.org/D127035
-
Peter Klausler authored
Add extra arguments and checks to the runtime support library so that a call to the intrinsic functions MOD and MODULO with "denominator" argument P of zero will cause a crash with a source location rather than an uninformative floating-point error or integer division by zero signal. Additional work is required in lowering to (1) pass source file path and source line number arguments and (2) actually call these runtime library APIs instead of emitting inline code for MOD &/or MODULO. Differential Revision: https://reviews.llvm.org/D127034
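A minimal sketch of the kind of check described above, assuming integer operands. The names `Crash`, `CheckedMod`, and `CheckedModulo` are hypothetical and are not the runtime library's actual entry points. A real runtime would terminate the program; the sketch throws instead so the behavior is easy to observe.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical reporting helper: prefixes the message with "file:line".
[[noreturn]] inline void Crash(const char *what, const char *file, int line) {
  throw std::domain_error(std::string(file) + ":" + std::to_string(line) +
                          ": " + what);
}

// MOD(a, p): the result takes the sign of a (truncating division).
inline std::int64_t CheckedMod(std::int64_t a, std::int64_t p,
                               const char *file, int line) {
  if (p == 0) Crash("MOD with P==0", file, line);
  return a % p;
}

// MODULO(a, p): the result takes the sign of p (flooring division).
inline std::int64_t CheckedModulo(std::int64_t a, std::int64_t p,
                                  const char *file, int line) {
  if (p == 0) Crash("MODULO with P==0", file, line);
  std::int64_t r = a % p;
  if (r != 0 && ((r < 0) != (p < 0))) r += p;  // move into the sign of p
  return r;
}
```

The source file and line are passed explicitly here because, as the message notes, lowering would have to supply them at each call site.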
-
Peter Klausler authored
When an external I/O statement is in a recoverable error state before any data transfers take place (for example, an unformatted transfer with ERR=/IOSTAT=/IOMSG= attempted on a formatted unit), ensure that the unit's mutex is still released at the end of the statement. Differential Revision: https://reviews.llvm.org/D127032
-
Peter Klausler authored
For example, FINDLOC(A,X) should convert both A and X to COMPLEX(8) if the operands are REAL(8) and COMPLEX(4), so that comparisons can be done without losing information. The current implementation unconditionally converts X to the type of the array A. Differential Revision: https://reviews.llvm.org/D127030
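To illustrate why the conversion direction matters, here is a small sketch using `std::complex<float>` as a stand-in for COMPLEX(4) and `double` for REAL(8). The helpers `FindNarrowed` and `FindWidened` are hypothetical and do not reflect flang's code; they only contrast converting X to the array's type with converting both sides to the wider common type.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Old behavior (sketch): narrow the REAL(8) target to the array's type.
inline std::ptrdiff_t FindNarrowed(const std::vector<std::complex<float>> &a,
                                   double x) {
  const std::complex<float> target(static_cast<float>(x), 0.0f);
  for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i] == target) return static_cast<std::ptrdiff_t>(i) + 1;  // 1-based
  return 0;  // not found
}

// New behavior (sketch): widen both sides to the COMPLEX(8) equivalent.
inline std::ptrdiff_t FindWidened(const std::vector<std::complex<float>> &a,
                                  double x) {
  const std::complex<double> target(x, 0.0);
  for (std::size_t i = 0; i < a.size(); ++i)
    if (std::complex<double>(a[i]) == target)
      return static_cast<std::ptrdiff_t>(i) + 1;  // 1-based
  return 0;  // not found
}
```

With an array holding the single-precision value 0.1 and a double-precision target 0.1, the narrowing search reports a match that the widening search correctly rejects, because the two values differ once both are held in double precision.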
-
Peter Klausler authored
The initial size of the file was not being captured as the file position on which the first output buffer should be framed. Differential Revision: https://reviews.llvm.org/D127029
-
Peter Klausler authored
The "engineering" ENw.d output editing descriptor has some difficult edge case behavior for values that might format into a bunch of 9's or round up to a 1 for a given scale factor. Fix the algorithm, and add tests to protect against regressions. Differential Revision: https://reviews.llvm.org/D127028
-
Peter Klausler authored
Avoid calls to memcpy with zero byte counts if their address argument calculations may not be valid expressions. Differential Revision: https://reviews.llvm.org/D127027
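A sketch of the guard pattern, assuming a byte-offset copy. `CopyTail` is a hypothetical helper, not a function from the runtime. The point is that the early return skips both the pointer arithmetic and the `memcpy` call when there is nothing to copy.

```cpp
#include <cstddef>
#include <cstring>

// Copies `bytes` bytes starting at base + offset into dst.
// When bytes == 0, base and dst may be null or otherwise unusable, so
// neither the address computation nor the memcpy call is attempted.
inline void CopyTail(char *dst, const char *base, std::size_t offset,
                     std::size_t bytes) {
  if (bytes == 0) return;
  std::memcpy(dst, base + offset, bytes);
}
```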
-
Peter Klausler authored
After the program has survived its attempt to overflow the output buffer with an internal WRITE using ERR=, IOSTAT=, &/or IOMSG=, don't crash by accidentally blank-filling the next record that usually doesn't exist. Differential Revision: https://reviews.llvm.org/D127024
-
Peter Klausler authored
When the current seed of the pseudo-random generator is queried with CALL RANDOM_SEED(GET=n), that query should not change the stream of pseudo-random numbers produced by CALL RANDOM_NUMBER(). Differential Revision: https://reviews.llvm.org/D127023
-
Mehdi Amini authored
Revert "[MLIR][GPU] Replace fdiv on fp16 with promoted (fp32) multiplication with reciprocal plus one (conditional) Newton iteration." This reverts commit bcfc0a90. The build is broken with shared library enabled.
-
Fangrui Song authored
Similar to 557efc9a. This commit handles options where cl::ZeroOrMore is more than one line below cl::opt.
-
Christian Sigg authored
[MLIR][GPU] Replace fdiv on fp16 with promoted (fp32) multiplication with reciprocal plus one (conditional) Newton iteration.

This is correct for all values, i.e. the same as promoting the division to fp32 in the NVPTX backend. But it is faster (~10% on average, sometimes more) because:
- it performs fewer Newton iterations
- it avoids the slow path for e.g. denormals
- it allows reuse of the reciprocal for multiple divisions by the same divisor

Test program:
```
#include <stdio.h>
#include "cuda_fp16.h"

// This is a variant of CUDA's own __hdiv which is faster than hdiv_promote below
// and doesn't suffer from the perf cliff of div.rn.fp32 with 'special' values.
__device__ half hdiv_newton(half a, half b) {
  float fa = __half2float(a);
  float fb = __half2float(b);

  float rcp;
  asm("{rcp.approx.ftz.f32 %0, %1;\n}" : "=f"(rcp) : "f"(fb));

  float result = fa * rcp;
  auto exponent = reinterpret_cast<const unsigned&>(result) & 0x7f800000;
  if (exponent != 0 && exponent != 0x7f800000) {
    float err = __fmaf_rn(-fb, result, fa);
    result = __fmaf_rn(rcp, err, result);
  }

  return __float2half(result);
}

// Surprisingly, this is faster than CUDA's own __hdiv.
__device__ half hdiv_promote(half a, half b) {
  return __float2half(__half2float(a) / __half2float(b));
}

// This is an approximation that is accurate up to 1 ulp.
__device__ half hdiv_approx(half a, half b) {
  float fa = __half2float(a);
  float fb = __half2float(b);

  float result;
  asm("{div.approx.ftz.f32 %0, %1, %2;\n}" : "=f"(result) : "f"(fa), "f"(fb));
  return __float2half(result);
}

__global__ void CheckCorrectness() {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  half x = reinterpret_cast<const half&>(i);
  for (int j = 0; j < 65536; ++j) {
    half y = reinterpret_cast<const half&>(j);
    half d1 = hdiv_newton(x, y);
    half d2 = hdiv_promote(x, y);
    auto s1 = reinterpret_cast<const short&>(d1);
    auto s2 = reinterpret_cast<const short&>(d2);
    if (s1 != s2) {
      printf("%f (%u) / %f (%u), got %f (%hu), expected: %f (%hu)\n",
             __half2float(x), i, __half2float(y), j, __half2float(d1), s1,
             __half2float(d2), s2);
      //__trap();
    }
  }
}

__device__ half dst;

__global__ void ProfileBuiltin(half x) {
#pragma unroll 1
  for (int i = 0; i < 10000000; ++i) {
    x = x / x;
  }
  dst = x;
}

__global__ void ProfilePromote(half x) {
#pragma unroll 1
  for (int i = 0; i < 10000000; ++i) {
    x = hdiv_promote(x, x);
  }
  dst = x;
}

__global__ void ProfileNewton(half x) {
#pragma unroll 1
  for (int i = 0; i < 10000000; ++i) {
    x = hdiv_newton(x, x);
  }
  dst = x;
}

__global__ void ProfileApprox(half x) {
#pragma unroll 1
  for (int i = 0; i < 10000000; ++i) {
    x = hdiv_approx(x, x);
  }
  dst = x;
}

int main() {
  CheckCorrectness<<<256, 256>>>();
  half one = __float2half(1.0f);
  ProfileBuiltin<<<1, 1>>>(one);  // 1.001s
  ProfilePromote<<<1, 1>>>(one);  // 0.560s
  ProfileNewton<<<1, 1>>>(one);   // 0.508s
  ProfileApprox<<<1, 1>>>(one);   // 0.304s
  auto status = cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(status));
}
```
Reviewed By: herhut
Differential Revision: https://reviews.llvm.org/D126158
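The refinement step in the test program can also be tried on the host. The sketch below is a plain C++ rendering of that one Newton step, with `std::fma` standing in for `__fmaf_rn` and the approximate reciprocal supplied by the caller; `RefineQuotient` is a hypothetical name, and the exponent check that makes the step conditional is omitted. If rcp = (1 + e) / b, the crude quotient is (a / b)(1 + e) and the refined one is (a / b)(1 - e^2), so a relative error of e shrinks to roughly e^2.

```cpp
#include <cmath>

// One Newton refinement of q ~= a / b, given rcp ~= 1 / b.
inline float RefineQuotient(float a, float b, float rcp) {
  float q = a * rcp;               // crude quotient
  float err = std::fma(-b, q, a);  // residual a - b * q, computed fused
  return std::fma(rcp, err, q);    // q + rcp * residual
}
```

For example, feeding in a reciprocal that is deliberately off by 0.1% yields a quotient whose error is on the order of one part in a million.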
-
Peter Klausler authored
Besides raising the IEEE floating-point overflow exception, treat a floating-point overflow on input as an I/O error catchable with ERR=, IOSTAT=, &/or IOMSG=. Differential Revision: https://reviews.llvm.org/D127022
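As a rough analogy in C++, assuming `std::strtod` as the parser: overflow on input is reported to the caller as a status rather than silently yielding infinity. `ReadStatus` and `ReadReal` are hypothetical names for this sketch and are unrelated to the Fortran runtime's interfaces.

```cpp
#include <cerrno>
#include <cmath>
#include <cstdlib>

// Hypothetical status codes, loosely analogous to an IOSTAT= value.
enum class ReadStatus { Ok, Overflow, Invalid };

// Parses a real number from text and reports overflow as an error.
inline ReadStatus ReadReal(const char *text, double &out) {
  errno = 0;
  char *end = nullptr;
  out = std::strtod(text, &end);
  if (end == text) return ReadStatus::Invalid;  // no digits consumed
  if (errno == ERANGE && std::isinf(out)) return ReadStatus::Overflow;
  return ReadStatus::Ok;
}
```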
-
Amir Ayupov authored
- Stash local changes before checkout.
- Print a message that the source repository revision has been changed, with instructions to switch back.
- Make the script executable.
- Print sample instructions on how to run bolt tests.
- Assume that llvm-bolt-wrapper script is in the same source directory.
Reviewed By: rafauler Differential Revision: https://reviews.llvm.org/D126941
-
Peter Klausler authored
F18 preserves lower bounds of explicit-shape named constant arrays, but failed to also do so for implicit-shape named constants. Fix. Differential Revision: https://reviews.llvm.org/D127021
-
Peter Klausler authored
It was possible for RANDOM_NUMBER() to return 1.0. Differential Revision: https://reviews.llvm.org/D127020
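The commit message does not say how the value 1.0 arose. One common way for a generator meant to cover [0, 1) to produce 1.0 is rounding during the integer-to-float conversion, sketched below under that assumption with hypothetical helpers `NaiveUnit` and `SafeUnit`.

```cpp
#include <cstdint>

// Scales all 32 random bits. Values near 2^32 - 1 round up to 2^32 when
// converted to float (24-bit significand), so the result can be 1.0f.
inline float NaiveUnit(std::uint32_t bits) {
  return static_cast<float>(bits) * (1.0f / 4294967296.0f);  // 2^-32
}

// Keeps only the top 24 bits, which float represents exactly, so the
// largest possible result is 1 - 2^-24.
inline float SafeUnit(std::uint32_t bits) {
  return static_cast<float>(bits >> 8) * (1.0f / 16777216.0f);  // 2^-24
}
```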
-
Fangrui Song authored
This reverts commit dcf3368e. It breaks -DLLVM_ENABLE_ASSERTIONS=on builds. In addition, the description is incorrect about ld.lld behavior. For wasm, there should be justification to add the new mode.
-
Peter Klausler authored
In the USE statements that f18 emits to module files, ensure that symbols from intrinsic modules are marked as such on their USE statements. And ensure that the current working directory (".") cannot override the intrinsic module search path when trying to locate an intrinsic module. Differential Revision: https://reviews.llvm.org/D127019
-
Fangrui Song authored
Similar to 557efc9a
-
Fangrui Song authored
Similar to 557efc9a
-
Fangrui Song authored
Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!` error. More were added due to cargo cult. Since the error has been removed, cl::ZeroOrMore is unneeded. Also remove cl::init(false) while touching the lines.
-
LiaoChunyu authored
D54205 handles fnmadd: -rs1 * rs2 - rs3. This patch adds fnmadd: -(rs1 * rs2 + rs3) (with the nsz flag on the FMA). Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D126852
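The nsz requirement can be seen with a host-side sketch. The two forms agree except in the sign of a zero result, which is exactly what nsz permits ignoring. `NegOfFma` and `FnmaddStyle` are hypothetical names that model the two expressions from the message with `std::fma`; they are not the backend's patterns.

```cpp
#include <cmath>

// -(a * b + c): negate the result of a fused multiply-add.
inline float NegOfFma(float a, float b, float c) {
  return -std::fma(a, b, c);
}

// -(a * b) - c: the form attributed to fnmadd above, as fma(-a, b, -c).
inline float FnmaddStyle(float a, float b, float c) {
  return std::fma(-a, b, -c);
}
```

With a = 1, b = 1, c = -1 under the default rounding mode, the first form gives -0.0 and the second gives +0.0; for nonzero results the two are identical.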
-
varconst authored
-
varconst authored
-
Yuta Saito authored
As the ELF linker does, retain all data segments named X referenced through `__start_X` or `__stop_X`. For example, `FOO_MD` should not be stripped in the case below, but it is currently mis-stripped:
```llvm
@FOO_MD = global [4 x i8] c"bar\00", section "foo_md", align 1

@__start_foo_md = external constant i8*
@__stop_foo_md = external constant i8*

@llvm.used = appending global [1 x i8*] [i8* bitcast (i32 ()* @foo_md_size to i8*)], section "llvm.metadata"

define i32 @foo_md_size() {
entry:
  ret i32 sub (
    i32 ptrtoint (i8** @__stop_foo_md to i32),
    i32 ptrtoint (i8** @__start_foo_md to i32)
  )
}
```
This fixes https://github.com/llvm/llvm-project/issues/55839
Reviewed By: sbc100 Differential Revision: https://reviews.llvm.org/D126950
-
Peter Klausler authored
The algorithm was wrong for higher dimensions, and so were the expected test results. Rework. Differential Revision: https://reviews.llvm.org/D127018
-
Fangrui Song authored
Array Operator new Cookies help lsan find allocations, while std::array can't.
-