|
|
If you are looking for the legacy (old) version of this tutorial (which uses Vehave instead of RAVE), you can find it [here](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment/-/wikis/SDV-Vector-Analysis-Tutorial-(Legacy)).
|
|
|
|
|
|
|
|
|
# Porting applications to the EPAC chip and analyzing them using the SDV infrastructure
|
|
|
|
|
|
|
|
|
In this tutorial you will learn how to use the SDV environment to vectorize your applications, from porting them to RISC-V to doing performance analysis.
|
|
|
|
|
|
This tutorial is organized in three main steps (which follow the SDV methodology). You can quickly navigate the tutorial using these hyperlinks:
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
Additionally, you can download this tutorial in slide format (PDF) [here](https://ssh.hca.bsc.es/epi/ftp/Tutorial/SDV_Vector_Analysis_Tutorial.pdf).
|
|
|
|
|
|
## Prolog
|
|
|
|
|
|
Before we start, we present a list of key concepts used throughout this tutorial:
|
|
|
|
|
|
* **Traces:** Files that contain information about the execution of a program.
|
|
|
* **Extrae:** BSC’s instrumentation and trace generation library.
|
|
|
* **Instrumentation:** Manually adding directives in your code to add extra information in the traces.
|
|
|
* **Paraver:** BSC’s Trace visualization tool.
|
|
|
* **RAVE:** BSC’s software emulator for vector instructions.
|
|
|
* **Batched jobs:** Non-interactive (offload), queued usage of compute nodes in a cluster.
|
|
|
|
|
|
Additionally, you will use these machines throughout this tutorial:
|
|
|
|
|
|
* **Your laptop:** Where you will run Paraver to analyze traces.
|
|
|
* **hca-server:** Login node to the SDV cluster.
|
|
|
* **Arriesgado:** SiFive Unmatched RISC-V boards for initial compilation and scalar execution.
|
|
|
* **synth-hca:** x86 server for cross-compilation and QEMU emulation.
|
|
|
* **fpga-sdv:** FPGAs implementing the EPAC hardware.
|
|
|
|
|
|
|
|
|
## Tutorial code
|
|
|
|
|
|
In this tutorial, we provide you with a sample code specially designed to guide you in the vectorization analysis of an application.
|
|
|
We strongly encourage you to follow the tutorial using this code before testing your application.
|
|
|
|
|
|
The code has a main function containing the initialization of the 2D arrays used by the application, and a loop that iterates for 10 timesteps.
|
|
|
Each timestep calls the `Step` function and then the `ComputeDelta` function.
|
|
|
The `Step` function has three distinct parts that write on three different arrays: "pressures", "new_temperatures", and "volumes". The flow of the application can be seen on the following figure:
|
|
|
|
|
|

|
|
|
|
|
|
Keep in mind that the application does not compute physically meaningful results, even though it contains common operations such as stencils, element-wise matrix operations, and reductions.
|
|
|
|
|
|
## First step: Run scalar code on commercial RISC-V boards
|
|
|
|
|
|
This step is the first contact between your application and RISC-V.
|
|
|
It also serves as an introduction to our HPC system, modules, and nomenclature.
|
|
|
Finally, we also instrument and study the relevance of the different regions of the code.
|
|
|
|
|
|
### 1.1 Accessing the system and setting up the environment
|
|
|
|
|
|
The first step is to *ssh* into our log-in server, **HCA**:
|
|
|
```bash
|
|
|
your-machine$ ssh user@ssh.hca.bsc.es
|
|
|
```
|
|
|
Then, you can clone the tutorial's sources:
|
|
|
```bash
|
|
|
hca-server$ git clone https://gitlab.bsc.es/pvizcai1/SDV_Tutorial.git
|
|
|
```
|
|
|
Or download them from the FTP:
|
|
|
```bash
|
|
|
hca-server$ wget https://ssh.hca.bsc.es/epi/ftp/Tutorial/SDV_Tutorial.tar.gz
|
|
|
hca-server$ tar -xzvf SDV_Tutorial.tar.gz
|
|
|
```
|
|
|
|
|
|
From there, you can access a shared RISC-V quad-core *SiFive Unmatched* board (which we call **Arriesgado** nodes) with another *ssh*:
|
|
|
```bash
|
|
|
hca-server$ ssh riscv
|
|
|
```
|
|
|
The file system is shared between **HCA** and the **Arriesgado** nodes, so you do not need to transfer files between machines.
|
|
|
|
|
|
### 1.2 Compiling and running scalar code
|
|
|
|
|
|
There are currently two supported RISC-V vector specifications in our system, `RVV0.7` and `RVV1.0`, with the following limitations:
|
|
|
|
|
|
| Specification | Runs in emulation (RAVE) | Runs in the hardware (FPGA) | Compiles C/C++ | Compiles FORTRAN |
|
|
|
|--|--|--|--|--|
|
|
|
| RVV0.7 | ✅ | ✅ | ✅ | ❌ |
|
|
|
| RVV1.0 | ✅ | ❌ | ✅ | ✅ |
|
|
|
|
|
|
Given that the tutorial code is written in C and we plan to run on the FPGA, we will use RVV0.7.
|
|
|
You can load the compiler module with this command:
|
|
|
```bash
|
|
|
arriesgado-11$ module load llvm/EPI-0.7-development
|
|
|
```
|
|
|
(drop the "`-0.7`" part of the module name if you want to use RVV1.0)
|
|
|
|
|
|
Then, navigate to the Tutorial sources that you obtained in the previous step and compile the reference code using `make`:
|
|
|
```bash
|
|
|
arriesgado-11$ cd SDV_Tutorial
|
|
|
arriesgado-11$ make reference.x
|
|
|
```
|
|
|
|
|
|
And run it with no arguments:
|
|
|
```bash
|
|
|
arriesgado-11$ ./reference.x
|
|
|
Res: 5620.0032
|
|
|
Res: 1966.3102
|
|
|
Res: 1311.9151
|
|
|
Res: 1039.8516
|
|
|
Res: 881.6927
|
|
|
Res: 785.5846
|
|
|
Res: 715.3618
|
|
|
Res: 666.9991
|
|
|
Res: 627.6045
|
|
|
Res: 598.7502
|
|
|
Microseconds per step: 10947.90
|
|
|
```
|
|
|
### 1.3 Instrumenting the code
|
|
|
|
|
|
In order to facilitate the performance analysis study, we instrument the code to identify its different parts in the performance traces.
|
|
|
|
|
|
We will use different tracing tools in our analysis, mainly `RAVE`, `PAPI`, and `Extrae`. Each has its own API, so we suggest using our unified API `SDV_Tracing`, which allows tracing the same instrumented binary with multiple tools. You can read more on this API [here](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment/-/wikis/Tracing-and-Hardware-Counters).
|
|
|
|
|
|
Both RAVE and Extrae generate traces with tuples of events and values. There are as many *events* as objects that can be traced (e.g., one event for each hardware counter), and the *events* take a *value* at each moment in time.
|
|
|
|
|
|
Instrumenting a code means adding an event identifying the "Code Region", with a different *value* for each part of the code. This way, when we look at performance traces, we can consult the *value* of this *event* to locate ourselves in the execution.
|
|
|
|
|
|
In the `sdv_tracing.h` header file, we define 5 functions to control the instrumentation:
|
|
|
|
|
|
1. **trace_init()** → To be called at the start of the application
|
|
|
2. **trace_name_event_and_values(event, event_name, nvalues, values[], values_names[])** → Creates an event and names its values
|
|
|
3. **trace_event_and_value(x,y)** → Emits event=x with value=y at the current timestamp
|
|
|
4. **trace_enable()** → Record events and trace after this function call
|
|
|
5. **trace_disable()** → Ignore events and disable tracing after this function call
|
|
|
|
|
|
We use `event=1000` (as a convention) to identify the "Code Region". You can choose another number, but we strongly encourage you to use event=1000 to be compliant with the automatic analysis tools and configuration files we provide.
|
|
|
|
|
|
The generic usage of this instrumentation looks like this:
|
|
|
```c
|
|
|
#include "sdv_tracing.h"
|
|
|
int main(){
|
|
|
int values[] = {0,1,2};
|
|
|
const char * v_names[] = {"Other","zone1","zone2"}; //Names for values[0], values[1], and values[2]
|
|
|
trace_name_event_and_values(1000, "code_region", 3, values, v_names);
|
|
|
trace_init();
|
|
|
trace_disable();
|
|
|
/*...
|
|
|
non-important application work
|
|
|
...*/
|
|
|
trace_enable();
|
|
|
trace_event_and_value(1000,1);
|
|
|
/*...
|
|
|
First region of interest
|
|
|
...*/
|
|
|
trace_event_and_value(1000,0);
|
|
|
|
|
|
trace_event_and_value(1000,2);
|
|
|
/*...
|
|
|
Second region of interest
|
|
|
...*/
|
|
|
trace_event_and_value(1000,0);
|
|
|
}
|
|
|
```
|
|
|
|
|
|
The file `SDV_Tutorial/src/reference-i.c` uses this instrumentation to distinguish 4 main parts of the code: *Pressures*, *Temperatures*, *Volumes*, and *Delta* computations.
|
|
|
|
|
|
You can compile this binary like this:
|
|
|
|
|
|
```bash
|
|
|
arriesgado-11$ module load sdv_trace
|
|
|
arriesgado-11$ make reference-i.x
|
|
|
```
|
|
|
|
|
|
If this binary is run directly (`./reference-i.x`), no special behavior or overhead will be observed.
|
|
|
|
|
|
### 1.4 Running with Extrae
|
|
|
|
|
|
You can run the instrumented binary using Extrae tracing with the following command:
|
|
|
```bash
|
|
|
arriesgado-11$ trace_extrae ./reference-i.x
|
|
|
```
|
|
|
*:warning: If the command is not found, remember to load the `sdv_trace` module!*
|
|
|
|
|
|
|
|
|
This will create the folder `SDV_Tutorial/extrae_prv_traces` with the trace files inside.
|
|
|
|
|
|
From your computer, you can copy the traces and open them with Paraver ([Download here](https://tools.bsc.es/downloads))
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/extrae_prv_traces .
|
|
|
your-machine$ wxparaver ./extrae_prv_traces/arr-reference-i.x.prv
|
|
|
```
|
|
|
We have prepared Paraver configuration files (cfgs) to aid you in your analysis, which you can download from the FTP:
|
|
|
```bash
|
|
|
your-machine$ wget https://ssh.hca.bsc.es/epi/ftp/Tutorial/paraver_cfgs.tar.gz
|
|
|
your-machine$ tar -xzvf paraver_cfgs.tar.gz
|
|
|
```
|
|
|
Or copy from the hca server:
|
|
|
```bash
|
|
|
your-machine$ scp -r hca:/apps/riscv/sdv_trace/paraver_cfgs .
|
|
|
```
|
|
|
|
|
|
You can load a configuration file like this:
|
|
|
|
|
|

|
|
|
|
|
|
Once the `paraver_cfgs/extrae/time_distribution_per_region.cfg` configuration file is loaded, we can identify the code regions by color on the first timeline, where the iterative flow of the code is clearly apparent.
|
|
|
The bottom table shows the percentage of execution time spent in each code region. As you can see, all regions are in the same order of magnitude of time, with phase "Delta" being the least time-consuming.
|
|
|
|
|
|
You can also load the `paraver_cfgs/extrae/cpi_per_region.cfg`, to compute the average Cycles per Instruction (CPI) per each code region. A higher CPI normally indicates a more memory-intensive region.
|
|
|
|
|
|

|
|
|
|
|
|
## Second step: Vectorization and QEMU-emulation with RAVE
|
|
|
|
|
|
Once we have ensured that the application runs on scalar RISC-V, instrumented it, and gotten a general sense of its most time-consuming parts, we can proceed to vectorize it.
|
|
|
|
|
|
### 2.1 Compiler autovectorization
|
|
|
|
|
|
The LLVM compiler developed at BSC is capable of automatic vectorization for RVV0.7 and RVV1.0. We recommend using these flags for optimized vector code generation:
|
|
|
```
|
|
|
-O3 -ffast-math -mepi -mllvm -combiner-store-merging=0 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -mcpu=avispado -mllvm -vectorizer-use-vp-strided-load-store -mllvm -enable-mem-access-versioning=0 -mllvm -disable-loop-idiom-memcpy -fno-slp-vectorize
|
|
|
```
|
|
|
|
|
|
You can compile an autovectorized version of the reference code using this make target:
|
|
|
```bash
|
|
|
arriesgado-11$ make reference-vec.x
|
|
|
```
|
|
|
If you run this binary natively on Arriesgado, the program will crash due to an _Illegal instruction_, as the Unmatched boards do not support vector instructions:
|
|
|
```bash
|
|
|
arriesgado-11$ ./reference-vec.x
|
|
|
Illegal instruction (core dumped)
|
|
|
```
|
|
|
### 2.2 Running and tracing with RAVE
|
|
|
|
|
|
To validate the binary, you need to run it using the RAVE emulator. RAVE uses QEMU, which emulates a RISC-V system on an x86 machine. You can run RAVE on your laptop following [this](https://repo.hca.bsc.es/gitlab/pvizcaino/qemu-sdv) guide, but for this tutorial we recommend using the `synth-hca` server in our cluster:
|
|
|
|
|
|
```bash
|
|
|
hca-server$ ssh synth-hca
|
|
|
synth-hca$ module load rave/EPI-0.7
|
|
|
synth-hca$ cd SDV_Tutorial
|
|
|
synth-hca$ rave ./reference-vec.x
|
|
|
```
|
|
|
|
|
|
You can generate profiling traces with RAVE by loading the same `sdv_trace` module as before:
|
|
|
```bash
|
|
|
synth-hca$ module load sdv_trace
|
|
|
synth-hca$ trace_rave_0_7 ./reference-vec.x
|
|
|
```
|
|
|
|
|
|
This will print a profiling report at the end of your binary emulation, with information on each executed code region:
|
|
|
```
|
|
|
(...)
|
|
|
Region #38: Event 1000 (code_region), Value 2 (Temperatures)
|
|
|
Moved bytes (Total): 2054589
|
|
|
Moved bytes (scalar): 6589 (0.32 %)
|
|
|
Moved bytes (vector): 2048000 (99.68 %)
|
|
|
tot_instr: 210148
|
|
|
scalar_instr: 194148 (92.39 %)
|
|
|
vsetvl_instr: 1600 (0.76 %)
|
|
|
vector_instr: 14400 (6.85 %)
|
|
|
SEW 8 vector_instr: 0 (0.00 %)
|
|
|
SEW 16 vector_instr: 0 (0.00 %)
|
|
|
SEW 32 vector_instr: 0 (0.00 %)
|
|
|
SEW 64 vector_instr: 14400 (100.00 %)
|
|
|
avg_VL: 32.00 elements
|
|
|
Arith: 6400 (44.44 %)
|
|
|
FP: 6400 (100.00 %)
|
|
|
INT: 0 (0.00 %)
|
|
|
Mem: 8000 (55.56 %)
|
|
|
unit: 8000 (100.00 %)
|
|
|
strided: 0 (0.00 %)
|
|
|
indexed: 0 (0.00 %)
|
|
|
Mask: 0 (0.00 %)
|
|
|
Other: 0 (0.00 %)
|
|
|
(...)
|
|
|
```
|
|
|
|
|
|
The `trace_rave_0_7` command also creates Paraver trace files in the folder `rave_prv_traces`.
|
|
|
|
|
|
|
|
|
You can copy the traces back to your machine and open them with Paraver:
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/rave_prv_traces .
|
|
|
your-machine$ wxparaver ./rave_prv_traces/rave-reference-vec.x.prv
|
|
|
```
|
|
|
|
|
|
After you have loaded the RAVE trace, open the configuration files `paraver_cfgs/rave/per_phase_cfgs/event_1000_code_region.cfg` and `paraver_cfgs/rave/Instruction_timeline.cfg`:
|
|
|
|
|
|

|
|
|
|
|
|
RAVE traces are intrinsically different from Extrae traces, because the horizontal axis in the "timelines" represents vector instructions instead of time.
|
|
|
On the `Instruction` view, you can see the scalar instructions in one row and the vector ones in another. You can zoom into a region to see the exact sequence of instructions.
|
|
|
|
|
|
|
|
|
You can also observe that the blue region "Pressures" and the red region "Volumes" have no vector instructions.
|
|
|
|
|
|
Alternatively, you can load the `paraver_cfgs/rave/per_phase_cfgs/tables_vector_mix_per_phase.cfg`, which opens two tables counting the absolute number of scalar and vector instructions per region, and their relative numbers (also called Vector Mix). Here, you can confirm that Pressures and Volumes are not vectorized at all, and that the Temperatures region only has a Vector Mix of 0.07.
|
|
|
|
|
|

|
|
|
|
|
|
|
|
|
### 2.3 Increasing vectorization
|
|
|
|
|
|
In RAVE, we saw that only two of the four code regions were vectorized. We can look at the warning messages produced when compiling the autovectorized code to find out why some regions are not vectorized.
|
|
|
|
|
|
You can stay on synth-hca to cross-compile; there is no need to jump back to the Arriesgado nodes:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ module load llvm/cross/EPI-0.7-development
|
|
|
synth-hca$ make reference-vec.x
|
|
|
(...)
|
|
|
src/reference-i.c:26:43: remark: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis=loop-vectorize]
|
|
|
double length = (volumes[i*M+j]>1.0) ? cbrt(volumes[i*M+j]) : 0.5;
|
|
|
^
|
|
|
(...)
|
|
|
src/reference-i.c:49:3: remark: loop not vectorized: cannot identify array bounds [-Rpass-analysis=loop-vectorize]
|
|
|
for(int j=0; j<M; ++j){
|
|
|
^
|
|
|
(...)
|
|
|
```
|
|
|
|
|
|
We can see that the *Pressures* code region (line 26) is not being vectorized because it includes a function call inside the inner-most loop. One solution would be to *inline* the function in the loop, but since *cbrt()* is a library function, we cannot easily do that.
|
|
|
Since this loop has other vectorizable work besides the function call, an alternative solution is to separate vectorizable and non-vectorizable code into two different loops (solution in `SDV_Tutorial/src/increase-vec.c`).
|
|
|
|
|
|
Going from this old code:
|
|
|
```c
|
|
|
trace_event_and_value(1000,1);
|
|
|
for(int i=0; i<N; ++i){
|
|
|
for(int j=0; j<M; ++j){
|
|
|
double length = (volumes[i*M+j]>1.0) ? cbrt(volumes[i*M+j]) : 0.5;
|
|
|
pressures[i*M+j] = length + (temperatures[i*M+j]-new_temperatures[i*M+j]);
|
|
|
}
|
|
|
}
|
|
|
trace_event_and_value(1000,0);
|
|
|
```
|
|
|
|
|
|
To this new code:
|
|
|
```c
|
|
|
trace_event_and_value(1000,1);
|
|
|
for(int i=0; i<N; ++i){
|
|
|
for(int j=0; j<M; ++j){
|
|
|
pressures[i*M+j] = (volumes[i*M+j]>1.0) ? cbrt(volumes[i*M+j]) : 0.5;
|
|
|
}
|
|
|
}
|
|
|
trace_event_and_value(1000,5);
|
|
|
for(int i=0; i<N; ++i){
for(int j=0; j<M; ++j){
|
|
|
pressures[i*M+j] += (temperatures[i*M+j]-new_temperatures[i*M+j]);
|
|
|
}
|
|
|
}
|
|
|
trace_event_and_value(1000,0);
|
|
|
```
|
|
|
|
|
|
Note that we added a new value (number 5) to distinguish the vectorizable loop. In the main function, we updated the values' names:
|
|
|
```c
|
|
|
const char * v_names[] = {"Other","Pressures_cbrt","Temperatures","Volumes","Delta","Pressures_vec"};
|
|
|
int values[] = {0,1,2,3,4,5};
|
|
|
trace_name_event_and_values(1000, "code_region", sizeof(values)/sizeof(values[0]), values, v_names);
|
|
|
trace_init();
|
|
|
```
|
|
|
|
|
|
The second phase that does not vectorize is the *Volumes* computation. The compiler complains that it *"cannot identify array bounds"*. This message normally occurs when the code has indirect/indexed accesses and the compiler cannot guarantee that they will not produce out-of-bounds or aliased accesses.
|
|
|
|
|
|
In most cases, the compiler is being over-conservative by not vectorizing this code, so we can assert that the accesses are safe by either:
|
|
|
1. Adding a `#pragma clang loop vectorize(assume_safety)` on top of the loop
|
|
|
2. Declaring the data pointers as `restrict`.
|
|
|
|
|
|
In this case, we will follow the second option:
|
|
|
```c
|
|
|
void Step(int N, int M, double * restrict volumes, double * restrict pressures, double * restrict temperatures, double * restrict new_temperatures, int BLOCK_DIM_X, int BLOCK_DIM_Y, int * restrict bounds){
|
|
|
```
|
|
|
|
|
|
Looking at the compiler output, the other loops not being vectorized are the initialization of the arrays using the `rand()` function. Since we are not including them in the performance analysis, we will leave them as they are. For more information on compiler messages, check the related FAQ [here](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment/-/wikis/RVV-and-SDV-FAQ#the-compiler-reports-loop-not-vectorized-on-my-loops-why)
|
|
|
|
|
|
|
|
|
Additionally, we saw that the region "Temperatures" had a low Vector Mix of 0.07. The compiler does not give us additional insight, and looking at the source code, the loop body does not have indexed accesses or function calls.
|
|
|
When we encounter this kind of situation, it is helpful to make the loop more compiler-friendly with these three tricks:
|
|
|
1. Change the type of the induction variables from `int` to `long`
|
|
|
2. Add the `#pragma clang loop vectorize(assume_safety)` on top of the vectorizable loop or make pointers `restrict`
|
|
|
3. Move constant loop bounds known at compile time to `#define`s (e.g. block sizes)
|
|
|
|
|
|
```c
|
|
|
#define BLOCK_DIM_X 32
|
|
|
#define BLOCK_DIM_Y 32
|
|
|
void Step(int N, int M, double * restrict volumes, double * restrict pressures, double * restrict temperatures, double * restrict new_temperatures, int * restrict bounds){
|
|
|
//(...)
|
|
|
for(long block_i=1; block_i<N-BLOCK_DIM_Y; block_i+=BLOCK_DIM_Y){
|
|
|
for(long block_j=1; block_j<M-BLOCK_DIM_X; block_j+=BLOCK_DIM_X){
|
|
|
for(long i=block_i; i<block_i+BLOCK_DIM_Y; ++i){
|
|
|
#pragma clang loop vectorize(assume_safety)
|
|
|
for(long j=block_j; j<block_j+BLOCK_DIM_X; ++j){
|
|
|
new_temperatures[i*M + j] = 0.25*(temperatures[M*i + j + 1]
|
|
|
+ temperatures[M*(i+1) + j]
|
|
|
+ temperatures[M*i + (j-1)]
|
|
|
+ temperatures[M*(i-1) + j]);
|
|
|
}
|
|
|
}
|
|
|
}
|
|
|
}
|
|
|
//(...)
|
|
|
}
|
|
|
```
|
|
|
|
|
|
You can compile this improved version and generate a RAVE trace like this:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ make increase-vec.x
|
|
|
synth-hca$ trace_rave_0_7 ./increase-vec.x
|
|
|
```
|
|
|
|
|
|
Copy the traces back to your machine and open them with paraver:
|
|
|
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/rave_prv_traces .
|
|
|
your-machine$ wxparaver rave_prv_traces/rave-increase-vec.x.prv
|
|
|
```
|
|
|
|
|
|
And load once again the `paraver_cfgs/rave/per_phase_cfgs/tables_vector_mix_per_phase.cfg` configuration file. Now, all phases (except the non-vectorizable *Pressures_cbrt* region) have vector instructions:
|
|
|
|
|
|

|
|
|
|
|
|
### 2.4 Increasing the Vector Length
|
|
|
|
|
|
With the last trace `rave-increase-vec.x.prv` opened in Paraver, load the configuration file `paraver_cfgs/rave/per_phase_cfgs/table_average_vl_per_phase.cfg`. This opens a timeline with the evolution graph of the Vector Length (in Bytes per vector), and a table with the average Vector Length per code phase:
|
|
|
|
|
|

|
|
|
|
|
|
The EPAC chip supports vectors up to 2048 Bytes wide, and the efficiency of the instructions grows with the Vector Length. This means that when we vectorize code, we aim for a vector length as close to 2048 bytes as possible.
|
|
|
|
|
|
As seen in Paraver, this is not the case for the Tutorial's code. *Pressures_vec* has an average VL of 1296B, *Temperatures* 256B, *Volumes* 1228B, and *Delta* 1760B.
|
|
|
|
|
|
If we look into *Volumes* and *Delta*, we will see that the Vector Length is not constant. This is due to the compiler having to make some safety assumptions that force it to execute some non-arithmetic instructions with the maximum Vector Length of the machine. This also means that many useful instructions in these phases are being executed with an even smaller Vector Length than the region's average.
|
|
|
|
|
|
The Vector Length of the vectorized loops is limited by the loop bounds. In *Pressures*, *Volumes*, and *Delta*, the inner-most loop iterates from *j=0* to *j=M-1*, with *M=160*. Working with double-precision elements, EPAC's 2048-Byte vectors support up to 256 elements per vector, many more than 160.
|
|
|
|
|
|
Instead of computing one row at a time on the inner-most loop, we can compute multiple rows at once:
|
|
|
|
|
|
Before:
|
|
|
```c
|
|
|
for(int i=0; i<N; ++i){
|
|
|
for(int j=0; j<M; ++j){
|
|
|
// Matrix[i*M + j] ...
|
|
|
}
|
|
|
}
|
|
|
```
|
|
|
After:
|
|
|
```c
|
|
|
#define ROWS 2
|
|
|
for(int i=0; i<N; i+=ROWS){
|
|
|
for(int j=0; j<M*ROWS; ++j){
|
|
|
// Matrix[i*M + j] ...
|
|
|
}
|
|
|
}
|
|
|
```
|
|
|
If the body of the inner-most loop allows it, it is recommended to compute all rows of the matrices in the same loop, generating code that can take advantage of an even wider VPU (the loop bound is now *NxM*):
|
|
|
```c
|
|
|
for(int ij=0; ij<N*M; ij++){
|
|
|
// Matrix[ij]
|
|
|
}
|
|
|
```
|
|
|
|
|
|
We apply this change to phases *Pressures_vec*, *Volumes*, and *Delta* in file `SDV_Tutorial/src/increase-vl.c`.
|
|
|
|
|
|
Phase *Temperatures* presents a *Blocking* structure, which is normally intended to improve vectorization on shorter vector extensions (e.g. Intel's AVX) or to provide favorable cache usage. With long vectors, we want much larger blocks, even if this degrades cache performance (because long vectors are much more tolerant of memory latency).
|
|
|
|
|
|
In this case, we cannot easily collapse the loops of the traversal to increase the loop bounds to *NxM*, because the matrix is not contiguously accessed (the rows and columns on the edge of the matrix are not visited). Since we are only vectorizing the inner-most loop, we can only increase the width of the blocks:
|
|
|
```c
|
|
|
#define BLOCK_DIM_X /*32 -> */ 160
|
|
|
#define BLOCK_DIM_Y 32
|
|
|
```
|
|
|
|
|
|
You can compile this version and trace it with RAVE:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ make increase-vl.x
|
|
|
synth-hca$ trace_rave_0_7 ./increase-vl.x
|
|
|
```
|
|
|
|
|
|
And copy the traces back to your machine
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/rave_prv_traces .
|
|
|
your-machine$ wxparaver rave_prv_traces/rave-increase-vl.x.prv
|
|
|
```
|
|
|
Then, open the `paraver_cfgs/rave/per_phase_cfgs/table_average_vl_per_phase.cfg` configuration file again:
|
|
|
|
|
|

|
|
|
|
|
|
As you can see, now *Pressures_vec*, *Volumes*, and *Delta* are close to the maximum vector length of 2048 Bytes, and phase *Temperatures* increased from 256 to 1280 Bytes per instruction.
|
|
|
|
|
|
|
|
|
### 2.5 Avoid mixing datatypes
|
|
|
|
|
|
If we open the `paraver_cfgs/rave/per_phase_cfgs/table_instruction_type_count_per_phase.cfg` and we select the Phase *Volumes*, we can see two instructions that we should be concerned about: `vslidedown` and `vwadd`. Refer to [the annex](#annex-instruction-latency-table) for a list of high-latency instructions you should avoid.
|
|
|
|
|
|

|
|
|
|
|
|
`vslidedown`, `vslideup`, and instructions that start with `vw*` are normally associated with mixing datatypes in your code.
|
|
|
|
|
|
Phase *Volumes* uses an array of integers (called *bounds*) to index the *pressures* array of doubles:
|
|
|
|
|
|

|
|
|
|
|
|
Integers are 32 bits wide, and doubles are 64 bits wide. In RVV0.7, the width of the loaded data and the indexes must match. To guarantee this, the compiler introduces some bit-manipulation instructions, like `vslidedown` and `vwadd` in this case.
|
|
|
|
|
|
One approach to circumvent this issue is to promote the integers to 64 bits. To control this, we recommend always defining flexible datatypes in your application:
|
|
|
|
|
|
```c
|
|
|
typedef double T_FP; //64bits
|
|
|
typedef long long T_INT; //64bits
|
|
|
```
|
|
|
|
|
|
This way, you can easily experiment with the effects of using 32 or 64 bits for both your integer and floating-point data. You can find this version in `SDV_Tutorial/src/flex-datatype.c`.
|
|
|
|
|
|
Compile it and trace it with RAVE:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ make flex-datatype.x
|
|
|
synth-hca$ trace_rave_0_7 ./flex-datatype.x
|
|
|
```
|
|
|
|
|
|
And copy the traces back to your machine
|
|
|
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/rave_prv_traces .
|
|
|
your-machine$ wxparaver rave_prv_traces/rave-flex-datatype.x.prv
|
|
|
```
|
|
|
|
|
|
If you load the `paraver_cfgs/rave/per_phase_cfgs/table_instruction_type_count_per_phase.cfg` again, you will see *Volumes* does not use `vslidedown` and `vwadd` anymore:
|
|
|
|
|
|

|
|
|
|
|
|
[comment]: # (### 2.6 Fusing loops)
|
|
|
|
|
|
## Third step: Natively running vector code on the EPAC RTL (fpga-sdv)
|
|
|
|
|
|
:exclamation: **ONLY AVAILABLE FOR USERS IN THE EPI/EUPILOT CONSORTIUM** :exclamation:
|
|
|
|
|
|
The final step involves running the code on the EPAC hardware implemented in an FPGA to get real-time measurements and further optimize the application.
|
|
|
|
|
|
### 3.1 Sending jobs to the FPGA nodes
|
|
|
|
|
|
Once the RAVE study is completed, we can run the vectorized code in the EPAC RTL (inside an FPGA) to get time measurements.
|
|
|
|
|
|
You can send binaries to be executed in the FPGA node using SLURM. We have prepared the `run_all.sh` script, which runs all versions and parses the output. Send it to the FPGA like this:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ module load sdv_trace #If not already loaded
|
|
|
synth-hca$ sbatch fpga_job ./run_all.sh
|
|
|
Submitted batch job 189294
|
|
|
```
|
|
|
|
|
|
You can query the state of the FPGA job using `squeue`. If your job appears with an `R` state (fifth column), it is still running.
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ squeue
|
|
|
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
|
|
|
189294 fpga-sdv fpga_job user R 0:21 1 pickle-1
|
|
|
```
|
|
|
|
|
|
Once the job finishes, you can read its output file:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ cat slurm-189294.out
|
|
|
******************************
|
|
|
* x86 node: pickle-1
|
|
|
* SDV node: fpga-sdv-1
|
|
|
******************************
|
|
|
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
|
|
|
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
|
|
|
|
|
|
version time_per_iteration
|
|
|
reference.x 133090.00
|
|
|
reference-vec.x 71180.30
|
|
|
increase-vec.x 42096.10
|
|
|
increase-vl.x 32437.30
|
|
|
flex-datatype.x 31570.70
|
|
|
|
|
|
```
|
|
|
As you can see, initial vectorization yields a *1.87x* speedup (reference-vec.x), while the last optimization (flex-datatype.x) represents a *4.22x* speedup over the reference implementation.
|
|
|
|
|
|
### 3.2 Getting Extrae traces in the FPGA nodes
|
|
|
|
|
|
We recommend using Extrae to trace binaries running in the FPGA, but there are other methods (like PAPI), described [here](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment/-/wikis/Tracing-and-Hardware-Counters).
|
|
|
|
|
|
You can send Extrae jobs to the FPGA like this:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ sbatch fpga_job trace_extrae ./flex-datatype.x
|
|
|
```
|
|
|
|
|
|
When the job finishes, you will find the Extrae traces in the folder `extrae_prv_traces`. Copy them back to your machine and open one in Paraver:
|
|
|
|
|
|
```bash
|
|
|
your-machine$ rsync -a user@ssh.hca.bsc.es:~/SDV_Tutorial/extrae_prv_traces .
|
|
|
your-machine$ wxparaver extrae_prv_traces/fpga-flex-datatype.x.prv
|
|
|
```
|
|
|
|
|
|
You can then load the configuration file `paraver_cfgs/extrae/PerfMetrics.cfg`
|
|
|
|
|
|
Besides the cycles per phase, this configuration file loads three important vector performance metrics:
|
|
|
1. **Vector Mix:** Number of vector instructions divided by total instructions (0.0 → 1.0). The higher, the better (>0.2 is considered very good).
|
|
|
2. **Vector Activity:** Proportion of cycles where at least one vector instruction is being computed (0.0 → 1.0). The higher, the better (>0.95 is considered very good)
|
|
|
3. **Vector CPI:** Cycles taken on average per vector instruction. The lower, the better (for VL=256 and double-precision elements, vCPI<35 is very good, 35<vCPI<100 is reasonable, vCPI>100 means high-latency instructions are involved).
|
|
|
|
|
|

|
|
|
|
|
|
|
|
|
Extrae traces, unlike RAVE traces, show time in the horizontal axis. We can see that the *Pressures_cbrt* region takes the most time (0.96M cycles), followed by the *Volumes* region (415k cycles).
|
|
|
|
|
|
### 3.3 Vector Mix and Vector Activity
|
|
|
|
|
|
You can also see that the Vector Mix is slightly lower than in the RAVE trace, because Extrae's instrumentation adds overhead in the form of extra scalar instructions.
|
|
|
|
|
|
Nevertheless, a measured Vector Mix of around 0.2 already keeps the VPU busy 90% of the time (Vector Activity). Considering the Extrae overhead, achieving 100% is unfeasible, so 90% is already a good value.
|
|
|
|
|
|
|
|
|
### 3.4 Using huge pages
|
|
|
|
|
|
In the *VecCPI* table, you can see that vector instructions in the *Volumes* phase have an average latency higher than 300 cycles. Even for indirect/indexed accesses, this is a high value.
|
|
|
|
|
|
When this kind of memory access reports a large VecCPI, one solution is to use huge pages (2MB) instead of the default 4KB pages. To do so, add the `huge_pages` script right after the `fpga_job` one:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ sbatch fpga_job huge_pages run_extrae_fpga ./flex-datatype.x
|
|
|
```
|
|
|
|
|
|
Copy the traces back and open them in Paraver:
|
|
|
|
|
|
```bash
|
|
|
your-machine$ scp -r user@ssh.hca.bsc.es:~/SDV_Tutorial/extrae_prv_traces .
|
|
|
your-machine$ wxparaver extrae_prv_traces/flex-datatype.x-huge.prv
|
|
|
```
|
|
|
|
|
|
Load the `paraver_cfgs/extrae/PerfMetrics.cfg` configuration file:
|
|
|
|
|
|

|
|
|
|
|
|
Now, the VecCPI of the *Volumes* phase went down from >300 to 130, and its average duration from 410k cycles to 175k cycles.
|
|
|
|
|
|
Additionally, if you run the binary with huge pages and without instrumentation like this:
|
|
|
|
|
|
```bash
|
|
|
synth-hca$ sbatch fpga_job huge_pages ./flex-datatype.x
|
|
|
```
|
|
|
|
|
|
You will see that the execution time decreased from 31570.70 microseconds per step to 26603.70 when using huge pages.
|
|
|
|
|
|
## Conclusions
|
|
|
|
|
|
In these plots you can see how we improved the performance of the application following the SDV methodology:
|
|
|

|
|
|
|
|
|
Now, most of the execution time (>66%) is spent on non-vectorized work (the *Pressures_cbrt* region), so further improvements should focus on that part.
|
|
|
|
|
|
## Annex: Instruction latency table
|
|
|
|
|
|
| **Function** | **Assembly** | **Latency at 256 DP elements** |
|
|
|
|:-------------------:|:-------------------------:|:------------------------------:|
|
|
|
| Division, Sqrt | vfdiv, vfsqrt | +2000 cycles |
|
|
|
| Gather/Scatter | vlxe, vsxe | [256 : 2000] cycles |
|
|
|
| Strided memory | vlse,vsse | [128 : 512] cycles |
|
|
|
| Unit-strided memory | vle, vse | [32 : 128] cycles |
|
|
|
| Widening/Narrowing | vnsrl, vwadd, vfwcvt, ... | [128 : 300] cycles |
|
|
|
| Slides | vslidedown, vslideup, ... | [100 : 200] cycles |
|
|
|
| FP reductions | vfredsum, vfredmin, ... | 160 cycles |
|
|
|
| Integer reductions | vredsum, vredand, ... | 100 cycles |
|
|
|
| Arithmetic | vfadd,vsub,vfmadd,... | 35 cycles |
|
|
|
| Moves | vmv.v.x, vmv.v.v, ... | 50 cycles |
|
|
|
:warning: The Software Development Vehicles (SDV) guides have been extended and moved to [this other repository](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-software-development-vehicles/-/wikis/home)
|
|
|
|
|
|
Specifically, the **"SDV Vector Analysis Tutorial"** page has been moved to [this wiki page](https://repo.hca.bsc.es/gitlab/epi-public/risc-v-software-development-vehicles/-/wikis/SDV-Vector-Analysis-Tutorial)
|
|