Polly: Examples
Optimize Matrix Multiplication Manually
Polly does not yet focus on end users, but on research and the development of new optimizations. Users of Polly therefore often need to understand how it works internally. To give an overview of the steps taken during polyhedral compilation, this example walks step by step through the individual Polly passes, optimizing a simple matrix multiplication kernel. If you are looking for a more automated way of running Polly, see the pollycc tool in utils/pollycc.
The files used and created in this example are available here.
Create LLVM-IR from the C code
Polly works on LLVM-IR, so the source files first have to be translated into LLVM-IR. If more than one file should be optimized, the files can be combined into a single file with llvm-link.
clang -S -emit-llvm matmul.c -o matmul.s
Load Polly automatically when calling the 'opt' tool
Polly is not built into opt or bugpoint; it is a shared library that has to be loaded into these tools explicitly. The Polly library is called LLVMPolly.so. For a cmake build it is available in the build/lib/ directory; autoconf creates the same file in build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create an alias that automatically loads Polly whenever 'opt' is called.
export PATH_TO_POLLY_LIB="~/polly/build/lib/"
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"
Prepare the LLVM-IR for Polly
Polly can only work with code that matches a canonical form. To translate the LLVM-IR into this form we use a set of canonicalization passes. For this example only three passes are necessary; good coverage on a larger set of input files requires more passes. pollycc contains a set of passes that has proven beneficial.
opt -S -mem2reg -loop-simplify -indvars matmul.s > matmul.preopt.ll
Show the SCoPs detected by Polly (optional)
To check whether Polly was able to detect any SCoPs, we print the structure of the detected SCoPs. In our example two SCoPs were detected: one in 'init_array', the other in 'main'.
opt -basicaa -polly-cloog -analyze -q matmul.preopt.ll
init_array():
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    Stmt_5(c2,c4);
  }
}
main():
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6<=1023;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
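To read this output it can help to replay the loop nest by hand. The following Python sketch mirrors the main() nest printed above. The statement names are treated as opaque labels, and the `upper` parameter is an assumption introduced for illustration so the bound 1023 can be scaled down.

```python
def main_scop_instances(upper):
    """Yield the statement instances of the main() SCoP in execution order.

    Mirrors the printed loop nest; `upper` stands in for the bound 1023.
    """
    for c2 in range(upper + 1):
        for c4 in range(upper + 1):
            yield ("Stmt_4", c2, c4)
            for c6 in range(upper + 1):
                yield ("Stmt_6", c2, c4, c6)


def instance_counts(upper):
    """Closed-form number of instances per statement for bounds 0..upper."""
    n = upper + 1
    return {"Stmt_4": n * n, "Stmt_6": n * n * n}


if __name__ == "__main__":
    # Replay a tiny version of the nest, then count the full-size one.
    for instance in main_scop_instances(1):
        print(instance)
    print(instance_counts(1023))
```

With the full bounds, Stmt_4 runs 1024^2 times and Stmt_6 runs 1024^3 times, so Stmt_6 dominates the execution time of the kernel.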
Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)
Polly can use graphviz to graphically show a CFG in which the detected SCoPs are highlighted. It can also create '.dot' files that can be translated by the 'dot' utility into various graphic formats.
opt -basicaa -view-scops -disable-output matmul.preopt.ll
opt -basicaa -view-scops-only -disable-output matmul.preopt.ll
The output for the different functions
view-scops: main, init_array, print_array
view-scops-only: main, init_array, print_array
View the polyhedral representation of the SCoPs
opt -basicaa -polly-scops -analyze matmul.preopt.ll
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region: '%1 => %17' in function 'init_array':
  Context: { [] }
  Statements {
    Stmt_5
      Domain := { Stmt_5[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
      Scattering := { Stmt_5[i0, i1] -> scattering[0, i0, 0, i1, 0] };
      WriteAccess := { Stmt_5[i0, i1] -> MemRef_A[1037i0 + i1] };
      WriteAccess := { Stmt_5[i0, i1] -> MemRef_B[1047i0 + i1] };
    FinalRead
      Domain := { FinalRead[0] };
      Scattering := { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4] };
      ReadAccess := { FinalRead[i0] -> MemRef_A[o0] };
      ReadAccess := { FinalRead[i0] -> MemRef_B[o0] };
  }
Printing analysis 'Polly - Create polyhedral description of Scops' for region: '%0 => <Function Return>' in function 'init_array':
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region: '%1 => %17' in function 'main':
  Context: { [] }
  Statements {
    Stmt_4
      Domain := { Stmt_4[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
      Scattering := { Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0] };
      WriteAccess := { Stmt_4[i0, i1] -> MemRef_C[1067i0 + i1] };
    Stmt_6
      Domain := { Stmt_6[i0, i1, i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023 };
      Scattering := { Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0] };
      ReadAccess := { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
      ReadAccess := { Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] };
      ReadAccess := { Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] };
      WriteAccess := { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
    FinalRead
      Domain := { FinalRead[0] };
      Scattering := { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4, o5, o6] };
      ReadAccess := { FinalRead[i0] -> MemRef_C[o0] };
      ReadAccess := { FinalRead[i0] -> MemRef_A[o0] };
      ReadAccess := { FinalRead[i0] -> MemRef_B[o0] };
  }
Printing analysis 'Polly - Create polyhedral description of Scops' for region: '%0 => <Function Return>' in function 'main':
Invalid Scop!
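The Scattering entries can be read as timestamps: instances execute in lexicographic order of their scattering tuples. The Python sketch below copies the two scattering functions of the 'main' SCoP from the output above and sorts a deliberately shuffled list of instances by them. The bound `n` and the function names are assumptions for illustration, and FinalRead is left out.

```python
def scatter_stmt4(i0, i1):
    # Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0]
    return (0, i0, 0, i1, 0, 0, 0)


def scatter_stmt6(i0, i1, i2):
    # Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0]
    return (0, i0, 0, i1, 1, i2, 0)


def execution_order(n):
    """Order all instances for bounds 0..n-1 by their scattering tuple."""
    timed = []
    # Collect every Stmt_6 instance first and every Stmt_4 instance last,
    # so the final order comes from the sort, not from how the list was built.
    for i0 in range(n):
        for i1 in range(n):
            for i2 in range(n):
                timed.append((scatter_stmt6(i0, i1, i2), ("Stmt_6", i0, i1, i2)))
    for i0 in range(n):
        for i1 in range(n):
            timed.append((scatter_stmt4(i0, i1), ("Stmt_4", i0, i1)))
    # Python compares tuples lexicographically, which matches the
    # ordering the scattering space is meant to express.
    timed.sort(key=lambda pair: pair[0])
    return [instance for _, instance in timed]


if __name__ == "__main__":
    for instance in execution_order(2):
        print(instance)
```

The constant 0 or 1 in the fifth position is what places Stmt_4[i0, i1] before all Stmt_6[i0, i1, *] instances of the same outer iteration.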
Show the dependences for the SCoPs
opt -basicaa -polly-dependences -analyze matmul.preopt.ll
Printing analysis 'Polly - Calculate dependences for SCoP' for region: 'for.cond => for.end28' in function 'init_array':
  Must dependences: { }
  May dependences: { }
  Must no source: { }
  May no source: { }
Printing analysis 'Polly - Calculate dependences for SCoP' for region: 'for.cond => for.end48' in function 'main':
  Must dependences: {
    Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023;
    Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1022;
    Stmt_6[i0, i1, 1023] -> FinalRead[0] : i1 <= 1091540 - 1067i0 and i1 >= -1067i0 and i1 >= 0 and i1 <= 1023;
    Stmt_6[1023, i1, 1023] -> FinalRead[0] : i1 >= 0 and i1 <= 1023
  }
  May dependences: { }
  Must no source: {
    Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
    Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
    FinalRead[0] -> MemRef_A[o0];
    FinalRead[0] -> MemRef_B[o0]
    FinalRead[0] -> MemRef_C[o0] : o0 >= 1092565 or (exists (e0 = [(o0)/1067]: o0 <= 1091540 and o0 >= 0 and 1067e0 <= -1024 + o0 and 1067e0 >= -1066 + o0)) or o0 <= -1;
  }
  May no source: { }
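The first two must dependences of 'main' can be reproduced by brute force on a scaled-down instance. The Python sketch below is not how Polly computes dependences. It simply replays the statements in their original order, uses the access functions printed in the polyhedral description, and records for every read which instance last wrote that memory cell. The bound `n` and all function names are assumptions for illustration, and FinalRead is omitted.

```python
def flow_dependences(n):
    """Return (deps, no_source) for the main() SCoP with bounds 0..n-1.

    deps      : set of (writer instance, reader instance) pairs
    no_source : set of (reader instance, array, index) reads with no
                earlier write inside the SCoP
    """
    last_writer = {}  # (array name, linear index) -> statement instance
    deps, no_source = set(), set()

    def read(instance, array, index):
        writer = last_writer.get((array, index))
        if writer is None:
            no_source.add((instance, array, index))
        else:
            deps.add((writer, instance))

    def write(instance, array, index):
        last_writer[(array, index)] = instance

    for i0 in range(n):
        for i1 in range(n):
            s4 = ("Stmt_4", i0, i1)
            write(s4, "C", 1067 * i0 + i1)
            for i2 in range(n):
                s6 = ("Stmt_6", i0, i1, i2)
                # Stmt_6 reads C, A and B before it writes C.
                read(s6, "C", 1067 * i0 + i1)
                read(s6, "A", 1037 * i0 + i2)
                read(s6, "B", i1 + 1047 * i2)
                write(s6, "C", 1067 * i0 + i1)
    return deps, no_source


if __name__ == "__main__":
    deps, no_source = flow_dependences(2)
    for writer, reader in sorted(deps):
        print(writer, "->", reader)
```

For a small `n` the result matches the pattern in the output above: Stmt_4[i0, i1] feeds Stmt_6[i0, i1, 0], each Stmt_6[i0, i1, i2] feeds Stmt_6[i0, i1, i2 + 1], and the reads of A and B have no source inside the SCoP.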
Export jscop files
Polly can export the polyhedral representation in so-called jscop files. A jscop file stores the polyhedral representation as JSON.
opt -basicaa -polly-export-jscop matmul.preopt.ll
Writing SCoP 'for.cond => for.end28' in function 'init_array' to './init_array___%for.cond---%for.end28.jscop'.
Writing SCoP 'for.cond => for.end48' in function 'main' to './main___%for.cond---%for.end48.jscop'.
Import the changed jscop files and print the updated SCoP structure (optional)
Polly can import jscop files in which the schedules of the statements have been changed. These updated files let us import transformations into Polly. Different jscop files can be imported by specifying the postfix of the files to import.
The optimized jscop files for this example are handwritten. The schedule used was inspired by the optimizations PoCC performs. If PoCC is installed, Polly can often calculate such schedules fully automatically.
opt -basicaa -polly-import-jscop -polly-print -disable-output matmul.preopt.ll -polly-import-jscop-postfix=.opt
Cannot open file: ./init_array___%for.cond---%for.end28.jscop.opt
Skipping import.
In function: 'init_array' SCoP: for.cond => for.end28:
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    %for.body4(c2,c4);
  }
}
Reading SCoP 'for.cond => for.end48' in function 'main' from './main___%for.cond---%for.end48.scop.opt.opt'.
In function: 'main' SCoP: for.cond => for.end48:
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    %for.body4(c2,c4);
  }
}
for (c2=0;c2<=1023;c2++) {
  for (c3=0;c3<=1023;c3++) {
    for (c4=0;c4<=1023;c4++) {
      %for.body12(c2,c4,c3);
    }
  }
}
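The imported schedule for 'main' does two things: it moves the initialization statement into its own loop nest, and it interchanges the two inner loops of the update statement, so the third statement index (c3) now runs in the middle loop. The Python sketch below checks on a tiny input that both orders compute the same result. The statement bodies (C[i][j] = 0 and C[i][j] += A[i][k] * B[k][j]) are an assumption inferred from the read and write accesses, since matmul.c itself is not shown here, and the function names are hypothetical.

```python
def matmul_original(A, B, n):
    """Original order: initialize C[i][j], then accumulate over k."""
    C = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            C[i][j] = 0                        # assumed body of %for.body4
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]   # assumed body of %for.body12
    return C


def matmul_imported_schedule(A, B, n):
    """Order printed after the import: separate init nest, then i, k, j."""
    C = [[None] * n for _ in range(n)]
    for c2 in range(n):
        for c4 in range(n):
            C[c2][c4] = 0
    for c2 in range(n):
        for c3 in range(n):          # former innermost loop (k)
            for c4 in range(n):      # former middle loop (j)
                C[c2][c4] += A[c2][c3] * B[c3][c4]
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_original(A, B, 2))
    print(matmul_imported_schedule(A, B, 2))
```

Both versions still visit k in increasing order for any fixed (i, j), and every initialization still precedes the first update of the same element. That is why the new schedule respects the dependences computed earlier.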
Codegenerate the SCoPs
This generates new code for the SCoPs detected by Polly. If -polly-import is present, the transformations specified in the imported openscop files are applied.
opt -basicaa -polly-import -polly-import-postfix=.opt -polly-codegen matmul.preopt.ll | opt -O3 > matmul.pollyopt.ll
Cannot open file: ./init_array___%for.cond---%for.end28.scop.opt
Skipping import.
Reading SCoP 'for.cond => for.end48' in function 'main' from './main___%for.cond---%for.end48.scop.opt'.
opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll
Create the executables
We create one executable optimized with plain -O3 and a set of executables optimized with Polly in different ways: the first changes only the loop structure, the second adds tiling, the third adds vectorization, and the last additionally uses OpenMP parallelism.
llc matmul.normalopt.ll -o matmul.normalopt.s && \
    gcc matmul.normalopt.s -o matmul.normalopt.exe
llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && \
    gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && \
    gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && \
    gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && \
    gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe
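The access functions printed in the polyhedral description hint at why changing only the loop structure can matter. The Python sketch below computes how far each access of Stmt_6 moves in memory, in array elements, when a given loop index advances by one. It is an illustration under the assumption that the innermost loop dominates the access pattern; it is not a measurement, and the names are hypothetical.

```python
# Access functions of Stmt_6, copied from the polyhedral description.
ACCESSES = {
    "C": lambda i0, i1, i2: 1067 * i0 + i1,
    "A": lambda i0, i1, i2: 1037 * i0 + i2,
    "B": lambda i0, i1, i2: i1 + 1047 * i2,
}


def innermost_strides(innermost):
    """Element stride of each access when index `innermost` advances by one.

    `innermost` is 0, 1 or 2 and selects i0, i1 or i2. Because the access
    functions are affine, the difference between two neighbouring points
    equals the coefficient of that index.
    """
    base = [0, 0, 0]
    step = list(base)
    step[innermost] += 1
    return {name: f(*step) - f(*base) for name, f in ACCESSES.items()}


if __name__ == "__main__":
    print("i2 innermost (original order):   ", innermost_strides(2))
    print("i1 innermost (interchanged order):", innermost_strides(1))
```

With i2 innermost, B is traversed with a stride of 1047 elements. With i1 innermost, every access has stride 0 or 1, which is typically friendlier to caches and to vectorization.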
Compare the runtime of the executables
Comparing the runtimes of the different executables shows that, in this example, a simple loop interchange gives the largest performance boost. Adding vectorization and OpenMP parallelism improves the performance significantly further.
time ./matmul.normalopt.exe
42.68 real, 42.55 user, 0.00 sys
time ./matmul.polly.interchanged.exe
04.33 real, 4.30 user, 0.01 sys
time ./matmul.polly.interchanged+tiled.exe
04.11 real, 4.10 user, 0.00 sys
time ./matmul.polly.interchanged+tiled+vector.exe
01.39 real, 1.36 user, 0.01 sys
time ./matmul.polly.interchanged+tiled+vector+openmp.exe
00.66 real, 2.58 user, 0.02 sys
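The relative gains are easier to compare as speedups over the plain -O3 build. The Python sketch below derives them from the timings listed above. The numbers are copied verbatim, and the dictionary layout and function names are assumptions for illustration.

```python
# (real, user) seconds copied from the timings above.
TIMINGS = {
    "normalopt": (42.68, 42.55),
    "interchanged": (4.33, 4.30),
    "interchanged+tiled": (4.11, 4.10),
    "interchanged+tiled+vector": (1.39, 1.36),
    "interchanged+tiled+vector+openmp": (0.66, 2.58),
}


def speedups(timings, baseline="normalopt"):
    """Wall-clock speedup of every variant relative to `baseline`."""
    base_real = timings[baseline][0]
    return {name: base_real / real for name, (real, _user) in timings.items()}


def cpu_to_wall_ratio(timings, name):
    """User CPU time divided by wall-clock time for one variant."""
    real, user = timings[name]
    return user / real


if __name__ == "__main__":
    for name, factor in speedups(TIMINGS).items():
        print(f"{name}: {factor:.1f}x")
    ratio = cpu_to_wall_ratio(TIMINGS, "interchanged+tiled+vector+openmp")
    print(f"OpenMP user/real ratio: {ratio:.1f}")
```

Interchange alone accounts for roughly a 9.9x speedup, tiling adds little on top of it in this run, and the full pipeline reaches roughly 65x. A user/real ratio of about 3.9 for the OpenMP build suggests that several cores were busy, although the machine used is not described here.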