Log into the cluster

  1. Connect to the HCA server via SSH using your credentials.
  2. Connect to the Dibona login node (mb3-host) using the same credentials.
pc:~$ ssh
guest_xx@hca-server:~$ ssh guest_xx@

Warning: The login node mb3-host is an Intel machine, NOT Arm-based.

Accessing Arm-based nodes

The compute nodes are managed by the SLURM controller. You will need to request hardware resources from SLURM in order to access the compute nodes.

Basic SLURM commands

guest_xx@mb3-host:~$ sinfo              # List available queues/partitions
guest_xx@mb3-host:~$ squeue             # List submitted jobs
guest_xx@mb3-host:~$ sbatch <jobscript> # Submit job in batch mode
guest_xx@mb3-host:~$ srun <command>     # Submit job in interactive mode
guest_xx@mb3-host:~$ scancel <jobId>    # Cancel job by ID

Interactive session example

guest_xx@mb3-host:~$ srun --partition=<partitionName> -N 1 --time=00:30:00 --pty bash -i

Jobscript example

guest_xx@mb3-host:~$ cat
#SBATCH --job-name=my_first_job       # Job name
#SBATCH --partition=hackathon        # Queue
#SBATCH --ntasks=64                   # Number of MPI processes
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --cpus-per-task=1             # Number of OpenMP threads
#SBATCH --time=00:15:00               # Time limit hrs:min:sec
#SBATCH --output=%j.out               # Standard output
#SBATCH --error=%j.err                # Error output

# Print machine/job information
pwd; hostname; date
printf "\n" 
printf "Running test\n" 

# Prepare environment
module purge                            # Clean environment modules
module load arm/arm-hpc-compiler/18.4.2 # Load Arm HPC Compiler
module load openmpi2.0.2.14/arm18.4     # Load OpenMPI
module load arm/armie/18.4              # Load Arm Instruction Emulator

# Execute application
srun <binary>
guest_xx@mb3-host:~$ sbatch
Submitted batch job <jobId>
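On success, sbatch prints a line of the form shown above; the job ID is the fourth field, which is handy to capture for later squeue/scancel calls. A minimal sketch, using a made-up sample line (on the cluster you would pipe the real sbatch output instead):

```shell
# sbatch prints "Submitted batch job <jobId>"; grab the fourth field.
# The sample line below stands in for a real sbatch run (job ID is made up).
submit_line="Submitted batch job 424242"
jobid=$(echo "$submit_line" | awk '{print $4}')
echo "$jobid"
# The captured ID can then be used with, e.g., scancel "$jobid"
```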

Node allocation example

To allocate a node for later use, use the salloc command. There is currently no limit on allocations, so please use them carefully!

  1. Allocate one or more specific nodes
    $ salloc --nodelist=pm-nod057 --time=00:30:00
    salloc: Granted job allocation <jobId>

    The command will either return, indicating a successful allocation, or pause while it waits for the required nodes to become available.
  2. Before accessing the node, refresh the Kerberos ticket with the kinit command. It will ask for your password.
  3. The squeue command will show the allocation as a running job. It is now possible to access the node via SSH; it should not ask for a password (single sign-on).
    ssh pm-nod057.bullx

    Important: The suffix .bullx must be used! Otherwise the SSH command will not work
  4. When you have finished working with the node, release the allocation by typing exit at the login node. Otherwise, the node will remain allocated.
    [user@mb3-host ~]$ exit
    salloc: Relinquishing job allocation <jobId>
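The SSH target in step 3 is just the node name with the mandatory .bullx suffix appended. A tiny sketch of building that target, reusing the example node from step 1:

```shell
# Build the SSH target for an allocated node; the .bullx suffix is required
# (see step 3 above). pm-nod057 is the example node name from step 1.
node=pm-nod057
target="${node}.bullx"
echo "ssh $target"   # run this from mb3-host once salloc has granted the job
```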

Compiling a program (ThunderX2)

The preferred procedure for compiling a program is as follows (remember that the login node is not Arm-based):

  1. Prepare your files in your home directory at the login node.
  2. Use srun to gain access to a single Dibona node interactively; the command will automatically SSH into the node.
    srun -N 1 --time=00:30:00 --pty bash -i
  3. Compile your programs inside a Dibona node. The module environment is available, and the most up-to-date module versions are advertised in the message of the day (displayed at login).

Once you are ready, go back to the login node and do a job submission with srun/sbatch.

Compiler optimization flags

Arm HPC Compiler

armclang -mcpu=native -O3 -ffast-math

# Warning: For Cavium ThunderX2, -mcpu=native and -mcpu=thunderx2t99 yield different results
# We suggest using -mcpu=thunderx2t99
armclang -mcpu=thunderx2t99 -O3 -ffast-math
armclang -mcpu=thunderx2t99 -Ofast          # Same as above

If you are using Fortran, consider adding ''-fstack-arrays''.

Please refer to the "Porting and Tuning guides for various packages" page of the Arm developer portal for more information.

GNU Compiler

gcc -mcpu=thunderx2t99 -O3 -ffast-math -ffp-contract=fast

Arm Instruction Emulator

General Information

Arm Instruction Emulator supports emulation of all SVE instructions when running on Armv8-A compatible hardware. Note that the emulator does not support emulation of Armv8.x instructions, namely Armv8.1 and Armv8.2.

How to use it

NOTE: The following tutorial assumes applications that can be compiled with the Arm HPC Compiler (an LLVM-based compiler).

Prepare your binary

First of all, you need to compile your application with the Arm HPC Compiler, so load the required modules:

# Make sure you don't have any other module loaded
guest_xx@pm-nodxxx:~$ module purge

# Load the Arm HPC Compiler and the Arm Instruction Emulator modules
guest_xx@pm-nodxxx:~$ module load arm/arm-hpc-compiler/18.4.2 arm/armie/18.4

Now, edit your build configuration to use armclang/armclang++/armflang, the compilers shipped with the Arm HPC Compiler, and tell the compiler to emit SVE instructions. After these modifications, the compiler declaration and its flags should look something like this:


CXX = armclang++
CXXFLAGS = -O3 -mcpu=native -march=armv8-a+sve -ffp-contract=fast


Then compile your binary:

make -j8

At this point, you should have your binary which will use SVE instructions.
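A static way to confirm SVE code was emitted, without running the binary, is to look for the SVE z vector registers in the disassembly. On a node you would pipe real output, e.g. from `objdump -d` on your binary; the two sample lines below (one real SVE load, one plain scalar add) stand in for that output:

```shell
# SVE instructions operate on z registers, so grepping a disassembly for them
# is a quick sanity check. Sample lines replace real `objdump -d` output here.
disasm='ld1w {z0.s}, p0/z, [x0]
add x1, x1, #4'
echo "$disasm" | grep -c 'z[0-9]'   # counts lines touching a z register
```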

MPI Applications

If your code uses MPI, you will need to compile it against the Arm HPC Compiler build of your MPI library. To see which MPI flavors are available, use the //module avail// command.

# Load the Arm HPC Compiler, the MPI library and the Arm Instruction Emulator modules
guest_xx@pm-nodxxx:~$ module load arm/arm-hpc-compiler/18.4.2 openmpi2.0.2.14/arm18.4 arm/armie/18.4

The compiler might fail with an error such as //version `GLIBCXX_3.4.21' not found//. If so, you should also load the GCC 7 module:

# Load the Arm HPC Compiler, the MPI library and the Arm Instruction Emulator modules
guest_xx@pm-nodxxx:~$ module load arm/arm-hpc-compiler/18.4.2 openmpi2.0.2.14/arm18.4 arm/armie/18.4 gcc/7.2.1
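With these modules loaded, OpenMPI's compiler wrappers (mpicc/mpicxx/mpifort) should invoke the Arm compilers underneath, so the SVE flags pass through unchanged. A sketch of what a build line might look like (mpi_app.c is a placeholder source file, not from this document):

```shell
# Hypothetical MPI build line: the mpicc wrapper forwards these flags to the
# underlying armclang from the loaded module (mpi_app.c is a placeholder).
cmd="mpicc -O3 -march=armv8-a+sve -ffp-contract=fast mpi_app.c -o mpi_app"
echo "$cmd"
```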

Running your binary

The first thing to do is to double-check that your binary actually includes SVE instructions. The fastest and easiest way is simply to execute it. Since the SVE extension is not available on any Armv8-A SoC at this moment, you will see something like this:

guest_xx@pm-nodxxx:~$ <binary>
Illegal instruction

Once we know for sure our code has SVE instructions, we can proceed to execute it with the Arm Instruction Emulator. It will emulate the SVE instructions (therefore, the execution time will be longer).

The Arm Instruction Emulator accepts different options:

guest_xx@pm-nodxxx:~$ armie --help
Execute binaries containing SVE instructions on Armv8-A hardware

  armie [emulation parameters] -- <command to execute>

  armie -msve-vector-bits=256 -- ./sve_program
  armie -msve-vector-bits=2048 --iclient -- ./sve_program --opt foo
  armie -e -i -- ./sve_program

  -m<string>                    Architecture specific options. Supported options:
    -msve-vector-bits=<uint>    Vector length to use. Must be a multiple of 128 bits up to 2048 bits
    -mlist-vector-lengths       List all valid vector lengths
  -e, --eclient <client>        An emulation client based on the DynamoRIO API
                                If this is not specified, the default SVE client is used
  -i, --iclient <client>        An instrumentation client based on the DynamoRIO API
  -x, --unsafe-ldstex           Enables a workaround which avoids an exclusive load/store bug on certain AArch64 hardware
                                (See 'Known Issues' in RELEASE_NOTES.txt for details)
  -s, --show-drrun-cmd          Writes the full DynamoRIO drrun command used to execute ArmIE to stderr
                                This can be useful when debugging or developing clients
  -h, --help                    Prints this help message
  -V, --version                 Prints the version

Now, it is time to execute our application with the Arm Instruction Emulator.
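A common experiment is running the same binary at several emulated vector lengths, since ArmIE accepts any multiple of 128 bits up to 2048 (see -msve-vector-bits above). The sketch below only prints the commands it would run; ./sve_program is a placeholder for your binary, and on a node you would drop the echo to execute them:

```shell
# Sweep a few SVE vector lengths and print the armie invocation for each.
# ./sve_program is a placeholder binary name (an assumption, not from the doc).
cmds=$(for bits in 256 512 1024 2048; do
  echo "armie -msve-vector-bits=${bits} -- ./sve_program"
done)
echo "$cmds"
```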

Getting traces in Dibona 2.0

Please note that only tracing via LD_PRELOAD has been tested.

1. Load the modules for your compiler and MPI implementation

module load gcc/7.2.1 openmpi2.0.2.14/gnu7                      # for gcc 7.2.1
module load gcc/8.2.0 openmpi2.0.2.14/gnu8                      # for gcc 8.2.0
module load arm/arm-hpc-compiler/18.4.2 openmpi2.0.2.14/arm18.4 # for arm hpc compiler 18.4.2
module load arm/arm-hpc-compiler/19.0 openmpi3.1.2/arm19.0      # for arm hpc compiler 19.0.0

2. Create or copy the extrae.xml file; you can find an example at:

/dibona_home_nfs/bsc_shared/apps/extrae/gcc7.2.1_openmpi2.0.2.14/3.5.4/share/example/MPI/extrae.xml     # for gcc 7.2.1
/dibona_home_nfs/bsc_shared/apps/extrae/gcc8.2.0_openmpi2.0.2.14/3.5.4/share/example/MPI/extrae.xml     # for gcc 8.2.0
/dibona_home_nfs/bsc_shared/apps/extrae/armhpc18.4.2_openmpi2.0.2.14/3.5.4/share/example/MPI/extrae.xml # for arm hpc compiler 18.4.2
/dibona_home_nfs/bsc_shared/apps/extrae/armhpc19.0.0_openmpi3.1.2/3.5.4/share/example/MPI/extrae.xml    # for arm hpc compiler 19.0.0

3. Create or copy the file; you can find an example at:

/dibona_home_nfs/bsc_shared/apps/extrae/gcc7.2.1_openmpi2.0.2.14/3.5.4/share/example/MPI/ld-preload/     # for gcc 7.2.1
/dibona_home_nfs/bsc_shared/apps/extrae/gcc8.2.0_openmpi2.0.2.14/3.5.4/share/example/MPI/ld-preload/     # for gcc 8.2.0
/dibona_home_nfs/bsc_shared/apps/extrae/armhpc18.4.2_openmpi2.0.2.14/3.5.4/share/example/MPI/ld-preload/ # for arm hpc compiler 18.4.2
/dibona_home_nfs/bsc_shared/apps/extrae/armhpc19.0.0_openmpi3.1.2/3.5.4/share/example/MPI/ld-preload/    # for arm hpc compiler 19.0.0

4. Run your job using the file

Example (arm hpc compiler)


source /dibona_home_nfs/bsc_shared/apps/extrae/armhpc18.4.2_openmpi2.0.2.14/3.5.4/etc/

export EXTRAE_CONFIG_FILE=./extrae.xml
export LD_PRELOAD=${EXTRAE_HOME}/lib/ # For C apps
#export LD_PRELOAD=${EXTRAE_HOME}/lib/ # For Fortran apps

## Run the desired program


#SBATCH --job-name="mb3_wp6_d68.extrae" 
#SBATCH --time=00:30:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --output=%j.out

source /usr/share/Modules/init/bash   # For module command. Replace with whatever shell you use 
module load arm/arm-hpc-compiler/18.4.2 openmpi2.0.2.14/arm18.4

# Run benchmark
mpirun -np 2 ./ ../bin/xhpcg --rt=0 --nx=64

Power Monitoring tools

Job script that can retrieve power/energy data of a multi-node job.
Please note that it relies on the GPIO power monitoring method, which is still not completely documented by Bull.
For each compute node it will produce a file with raw power data (hard to read!) plus a human-readable summary.
Raw power data files are called "${SLURM_JOB_ID}_${NODE_NAME}.pwr", while the power summary is called "${SLURM_JOB_ID}.pwr".
So if you are monitoring the power of job 4646 running on pm-nod046 and pm-nod093, at the end of the execution you will find in your working directory:

4646_pm-nod046.pwr <--- raw power data of first node
4646_pm-nod093.pwr <--- raw power data of second node
4646.pwr <--- human readable power data of both nodes (one below the other)
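Since the raw file names follow the ${SLURM_JOB_ID}_${NODE_NAME}.pwr pattern, the job ID and node name can be recovered from them with plain parameter expansion. A small sketch using the example files above:

```shell
# Split a raw power file name (<jobId>_<node>.pwr) back into its parts,
# using one of the example file names from the listing above.
f="4646_pm-nod046.pwr"
job=${f%%_*}                 # everything before the first underscore
node=${f#*_}; node=${node%.pwr}   # everything after it, minus the extension
echo "$job $node"
```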

Production queue (ThunderX2)

Please note that you NEED an active Kerberos ticket, obtained with the kinit command (it has to be refreshed daily).

You also NEED your SSH public key in your authorized keys (this is needed only once):
cat ~/.ssh/ >> ~/.ssh/authorized_keys

Example jobscript:

#!/bin/bash -x
#SBATCH --job-name=pwr_test
#SBATCH --ntasks=128
#SBATCH --time=05:59:00
#SBATCH --output=%j.out
#SBATCH --exclusive
#SBATCH --partition=production

pwd; hostname; date

echo "**** WHO ****" 

#Please note that you NEED to have an active kerberos key, use the "kinit" command (this has to be refreshed daily)
#You also NEED to have your SSH key into your authorized keys with "cat ~/.ssh/ >> ~/.ssh/authorized_keys" (doing it once is enough)

# User prolog, executed on each of the compute nodes
# It contains the command to start the power monitoring
# User epilog, executed on each of the compute nodes
# It contains the command to stop the power monitoring
# Command to retrieve the power data from the FPGA
# NOTE: it needs to be executed from mb3-host!
# Command to convert FPGA raw data to human readable data
# Application that we want to power monitor

# SLURM command to run the application, including prolog (starting power monitor) and epilog (stopping power monitor)
srun --task-prolog=${GPIO_CMD_START} --task-epilog=${GPIO_CMD_STOP} $SCIENTIFIC_PROGRAM

# Bookkeeping of power data...
touch `pwd`/${SLURM_JOB_ID}.pwr
$HEADER_CMD > `pwd`/${SLURM_JOB_ID}.pwr
for N in `scontrol show hostname $SLURM_JOB_NODELIST` ; do
  touch `pwd`/${SLURM_JOB_ID}_${N}.pwr
  # Retrieving raw power data, one file per compute node
  ssh mb3-host $RETRIEVE_CMD $N > `pwd`/${SLURM_JOB_ID}_${N}.pwr
  # Converting raw power data to human readable format, appending into a single file
  printf "%s," "$N" >> `pwd`/${SLURM_JOB_ID}.pwr
  ssh mb3-host $CONVERT_CMD `pwd`/${SLURM_JOB_ID}_${N}.pwr >> `pwd`/${SLURM_JOB_ID}.pwr
done

Attachment: laptop_hca_mb3.png (42.1 KB), Fabio Banchelli, 02/19/2019 12:18 PM