RISC-V Vector Environment

SDV Workflow
⚠ Usage of compute resources ⚠
Detailed guides
Quickstart guide

The Software Development Vehicles (SDV) are a set of tools that the EPI partners develop to enable software development. They have two fundamental goals:

To perform a fast integration and preliminary validation of IPs developed by partners within the project
To provide an FPGA-based hardware platform where to validate software development and co-design ideas

This wiki aims to provide a collection of guides for the different tools in the SDV ecosystem.

SDV Workflow

Users of the SDV ecosystem are encouraged to follow the workflow depicted in the diagram below.

First of, they port their application to a commercial RISC-V platform. At this stage, the generated binary is RISC-V compatible but it contains only scalar instructions.
The next step is to vectorize the code, either with autovectorization flags and pragmas or with instrinsics, and run it through Vehave, the vector emulation software installed in all RISC-V commercial platforms. This step allows the user to validate the correctness of their code and requires the LLVM compiler developed by BSC and installed in all systems.
Lastly, users move on to the FPGA development platform, where the same vectorized binary runs on a RISC-V core with support for the "V" ISA extension.

The following quickstart guide will go through these three steps using a simple example code.

⚠ Usage of compute resources ⚠

Compute resources are limited, so we encourage users to conduct their experiments through batch jobs (sbatch) instead of interactive sessions (salloc or srun). The batch method allows users to submit multiple jobs to the SLURM queue which will be processed when compute resources are freed without further action from the user.

Please check out the 🔗 Submit jobs in batch mode guide for more info.

Detailed guides

For each workflow step, there is a complete and detailed guide since there are several RISC-V clusters, different Vehave utilities, and other uses for the FPGA platforms.

Users may also want to check the BSC RISC-V Vector Toolchain guide.

Please contact if you face any issues or need extra information about the SDV infrastructure.

Quickstart guide

This quickstart guide provides examples and commands for compiling and executing a code, starting from a scalar binary in the arriesgado-jammy cluster, then simulating with Vehave, and finally executing the vector binary in an FPGA development platform.

To proceed with the guide, you must download in a directory of the SDV systems the Quickstart code. To request access to the BSC SDV platforms please fill in this form. Use SLURM to allocate a commercial RISC-V platform node:

$ salloc -p arriesgado-jammy -N1 -n4 -t 01:30:00
salloc: Granted job allocation 126645
salloc: Waiting for resource configuration
salloc: Nodes arriesgado-3 are ready for job
$ ssh user@arriesgado-3

Scalar binary on RISC-V commercial platforms

The scalar code to be executed is an axpy, a code snippet can be seen below:

void axpy_ref(double a, double *dx, double *dy, int n) {
   int i;
   for (i=0; i<n; i++) {
      dy[i] += a*dx[i];
   }
}

Once logged in to a RISC-V commercial node, load the necessary modules, build the binary with make and run the scalar binary:

$ cd axpy-test
$ module load llvm/EPI-0.7-development
$ make
$ ./axpy_scalar 1024
doing reference axpy
done

The prompt message done means that you compiled and executed correctly.

Vector binary on RISC-V commercial platforms + Vehave

The intrinsics vectorized code to be executed can be seen below:

void axpy_vector(double a, double *dx, double *dy, int n) {
  int i;
  long gvl = __builtin_epi_vsetvl(n, __epi_e64, __epi_m1);
  __epi_1xf64 v_a = BROADCAST_f64(a, gvl);
  
  for (i = 0; i < n;) {
    gvl = __builtin_epi_vsetvl(n - i, __epi_e64, __epi_m1);
    __epi_1xf64 v_dx = __builtin_epi_vload_1xf64(dx, gvl);
    dx+=gvl;
    __epi_1xf64 v_dy = __builtin_epi_vload_1xf64(dy, gvl);
    __epi_1xf64 v_res = __builtin_epi_vfmacc_1xf64(v_dy, v_a, v_dx, gvl);
    __builtin_epi_vstore_1xf64(dy, v_res, gvl);
    dy+=gvl;
    i += gvl;
  }
}

Code can also be vectorized automatically via pragmas, a code snippet can be seen below:

void axpy_vector(double a, double *dx, double *dy, int n) {
   int i;
   #pragma clang loop vectorize(enable)
   for (i=0; i<n; i++) {
      dy[i] += a*dx[i];
   }
}

Once logged in to a RISC-V commercial node, load the necessary modules, build the binary with make and run the intrinsics binary. You need the Vehave tool in order to run these binaries since the vector instructions are emulated.

$ cd axpy-test
$ module load llvm/EPI-0.7-development
$ make
$ module load vehave/EPI-0.7-development
$ vehave ./axpy_intrinsics 1024
[VEHAVE,445114] Boot
[VEHAVE,445114] Vector length is: 4096 bits
[VEHAVE,445114] Maximum LMUL is: 8
[VEHAVE,445114] Boot finished
doing reference axpy
done
doing intrinsics vector axpy
. . .
[VEHAVE,445114] 0x1084c: Execution of '0x0205f127' complete. Resuming
done
Result ok !!!
[VEHAVE,445114] Shutdown
[VEHAVE,445114] Destroying MCInterface object 0x2ec00
[VEHAVE,445114] Destroying Tracing object 0x2c520
[VEHAVE,445114] Destroying VectorCPUEmulator object 0x3e4d0

The prompt message done means that you compiled and executed correctly.

You can execute with a less verbose mode with the following command, this time for the automatically vectorized binary:

$ VEHAVE_TRACE_SINGLE_THREAD=1 VEHAVE_DEBUG_LEVEL=0 VEHAVE_VECTOR_LENGTH=16384 vehave ./axpy_autovectorization 1024
doing reference axpy
done
doing intrinsics vector axpy
[VEHAVE,445257] Trace file of thread 0: 'vehave.445257.trace'
done
Result ok !!!

The prompt message done means that you compiled and executed correctly. This step generates an execution trace file called vehave.<ID>.trace that can be converted in .prv format and studied with Paraver. More details here.

Vector binary on FPGA-SDV

The FPGA-based platform is called fpga-sdv. It has a 1-core CPU, an ubuntu OS, and it runs at 50MHz. One of the benefits is that it runs native RISC-V vector instructions without simulating them. In order to access you also need to allocate it, although the process is different, since we allocate the machine that hosts the fpga, and then log into the fpga itself.

$ salloc -p fpga-sdv -N 1 -t 02:00:00
salloc: Granted job allocation 126653
salloc: Waiting for resource configuration
# This salloc process can take up to 10 minutes so please wait patiently :-)
salloc: Nodes pickle-1 are ready for job
$ ssh user@fpga-sdv-1

❗ Observe that we get a pickle-1 node but log into fpga-sdv-1 instead ❗

Once inside the fpga, execute the previously compiled binaries

$ cd axpy-test
$ ./axpy_scalar 1024
doing reference axpy
done
$ ./axpy_intrinsics 1024
doing reference axpy
done
doing intrinsics vector axpy
done
Result ok !!!
$ ./axpy_autovectorization 1024
doing reference axpy
done
doing intrinsics vector axpy
done
Result ok !!!

The prompt message done means that you compiled and executed correctly.

We highly recommend to compile in the arriesgado-jammy cluster (or ssh riscv to avoid SLURM) and do all the persistent tests, and then using the fpga-sdv cluster for the final tests, due to low number of resources (4 nodes) and low speed.