HILA
Dependencies and Installation

Table of Contents

  1. Dependencies
  2. Installation

Dependencies

Hilapp

| Dependencies | Minimum Version | Required |
|--------------|-----------------|----------|
| Clang        | 8 -             | Yes      |

Installing dependencies for HILA preprocessor:

NOTE:

If one opts to use a singularity container, skip to the HILA applications dependencies.

If one opts to use a docker container, skip to the installation section.

For building hilapp, you need clang development tools (actually, only include files). These can be found in most Linux distribution repos, e.g. in Ubuntu 22.04:

export LLVM_VERSION=15
sudo apt-get -y install clang-$LLVM_VERSION \
libclang-$LLVM_VERSION-dev
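
After installation one can quickly sanity-check that the compiler and the libclang development files are present. This is only an illustrative check on Ubuntu; package and binary names may differ on other distributions:

clang-$LLVM_VERSION --version                          # the compiler itself
dpkg -L libclang-$LLVM_VERSION-dev | grep -m1 include  # confirm the clang include files are installed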

HILA applications

| Dependencies | Minimum Version | Required |
|--------------|-----------------|----------|
| Clang / GCC  | 8 - / x         | Yes      |
| FFTW3        | x               | Yes      |
| MPI          | x               | Yes      |
| OpenMP       | x               | No       |
| CUDA         | x               | No       |
| HIP          | x               | No       |

Installing dependencies for HILA applications:

NOTE: If one opts to use docker, skip directly to the installation section.

Installing non-GPU dependencies on Ubuntu:

sudo apt install build-essential \
libopenmpi-dev \
libfftw3-dev \
libomp-dev
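
A quick optional sanity check that the MPI and FFTW3 development packages are in place (Ubuntu package names assumed as above):

mpicc --version                      # MPI compiler wrapper from libopenmpi-dev
mpirun --version                     # MPI launcher
dpkg -s libfftw3-dev | grep Version  # FFTW3 development package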

CUDA:

See NVIDIA drivers and CUDA documentation: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
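
If the CUDA toolkit and driver are already installed, the following is a quick sanity check (assuming nvcc and the NVIDIA driver utilities are on the PATH):

nvcc --version   # CUDA compiler version
nvidia-smi       # driver version and visible GPUs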

HIP:

See ROCm and HIP documentation: https://docs.amd.com/, https://rocmdocs.amd.com/en/latest/Installation_Guide/HIP-Installation.html
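
Similarly, for an existing ROCm/HIP installation one can check (assuming the ROCm binaries, typically under /opt/rocm/bin, are on the PATH):

hipcc --version   # HIP compiler driver
rocminfo          # list agents/GPUs visible to ROCm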

Installation

Begin by cloning HILA repository:

git clone https://github.com/CFT-HY/HILA

The installation process is split into two parts: building the HILA preprocessor and compiling HILA applications. Both can be installed from source, and both steps have their respective containerization options available. The variety of options addresses the different platform-dependency issues that can arise.

When it comes to installing HILA applications, there are many avenues one can take depending on the platform. The available platforms and offered methods are listed below, with links to the relevant sections of the installation guide.

Platforms

LINUX

HILA was originally developed on Linux, hence all of the available options can be used. The HILA preprocessor can be built from source or with the use of a singularity container. Additionally, one can opt to use the docker container, which installs the HILA preprocessor directly.

NOTE: It is advised to use the docker container only for development purposes, since containerization can add computational overhead. This is especially evident in containerized MPI communication.

Containerization of hilapp, on the other hand, adds no computational overhead outside of the compilation process, so for production runs one can use the singularity container and reach maximal computational performance.

MAC

On Mac the installation of the HILA preprocessor dependencies and HILA application dependencies can be tedious, and in some cases impossible, since the availability of clang libtooling is open ended. For this reason the best option is to use the available docker container.

WINDOWS

On Windows the installation of the HILA preprocessor dependencies and HILA application dependencies is untested. For this reason the best option is to use the available docker container. One can also opt to use WSL; in this case see the LINUX installation instructions.

HPC

On supercomputing platforms the HILA application dependencies are most likely already available. The only issue is the availability of clang libtooling, which is needed for building the HILA preprocessor. Since singularity is commonly available on supercomputing platforms, the best solution is to use the singularity container.

After installing the HILA preprocessor with one of the above options, one can move on to the Building HILA applications section.

Containers

HILA comes with both a singularity and a docker container, for differing purposes. The aim is to make HILA easy to use on any platform, be it Linux, Mac, Windows or a supercomputer.

Docker

The docker container is meant for developing and producing HILA applications, libraries and hilapp with ease. One can produce HILA applications on a local machine and run them in a container without having to worry about dependencies. Note that there is overhead when running MPI communication in docker, so one will not get optimal simulation performance when running highly parallelized code in a container. This is a non-issue for small-scale simulations or testing.

Docker container instructions


All commands are run in the docker folder.

Docker image for HILA

Create docker image:

docker build -t hila -f Dockerfile .

Launch the image interactively with docker compose:

docker compose run --rm hila-applications

Developing with docker

The applications folder is automatically mounted from the local host into the docker image when launching the service hila-applications:

../applications:/HILA/applications

This allows one to develop HILA applications directly from source and launch them in the docker image with ease.
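
For illustration, the same mount can be reproduced without compose; this is only a sketch, as the hila-applications service already sets it up (run from the docker folder):

docker run --rm -it -v "$(pwd)/../applications:/HILA/applications" hila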

When developing HILA libraries and hilapp, one can also launch the service hila-source, which mounts the HILA/libraries and HILA/hilapp/src folders into the container:

docker compose run --rm hila-source

Singularity

The singularity container offers a more packaged approach, where one doesn't need to worry about clang libtooling support for compiling the HILA preprocessor. Hence, on HPC platforms where access to such compiler libraries can be tedious, one can simply opt to use the container version of hilapp. This approach is mainly meant for preprocessing applications on an HPC platform.

Singularity container instructions


One can download the singularity container hilapp.sif directly from this GitHub repository's release page. If downloaded, skip directly to the Using singularity container section:

wget https://github.com/CFT-HY/HILA/releases/download/Nightly/hilapp.sif

Installing singularity

The simplest way to install singularity is to download the latest .deb or .rpm from the GitHub release page and install it directly with one's package manager.

Ubuntu:

sudo dpkg -i singularity-ce_${SINGULARITY_VERSION}-${UBUNTU_VERSION}_amd64.deb
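
For example, with illustrative values only (check the singularity release page for the actual file name):

export SINGULARITY_VERSION=4.1.2   # example value only
export UBUNTU_VERSION=jammy        # example value only (Ubuntu 22.04 codename)
sudo dpkg -i singularity-ce_${SINGULARITY_VERSION}-${UBUNTU_VERSION}_amd64.deb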

Building singularity container

NOTE: sudo privileges are required for building a singularity container

For building the container we have two options: one can either build the container using the release version of hilapp from GitHub, or build it using the local hilapp source. In particular, if one is developing the HILA preprocessor and would like to test it on an HPC platform, building the singularity container from the local source is the preferred option. There is a separate singularity definition file for each case.

Building using release version:

sudo singularity build hilapp.sif hilapp_git.def

Building using local source:

sudo singularity build hilapp.sif hilapp_local.def

Using singularity container

The hilapp.sif file acts as a singularity container and, equivalently, as the hilapp binary, and can be used as such when preprocessing HILA code. Thus you can move it to your HILA project's bin folder:

mkdir HILA/hilapp/bin
mv hilapp.sif HILA/hilapp/bin/hilapp

Now one can simply move the singularity container to any given supercomputer.

Note that on supercomputers the default paths aren't the same as on standard Linux operating systems. Thus one will need to mount the HILA source folder into singularity using the APPTAINER_BIND environment variable. Simply navigate to the base of your HILA source directory and run:

export APPTAINER_BIND=$(pwd)
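
After this the container can be invoked exactly like the regular hilapp binary, for example (with a placeholder path to the HILA source tree):

cd /path/to/HILA                 # base of the HILA source tree (placeholder)
export APPTAINER_BIND=$(pwd)
./hilapp/bin/hilapp --help       # the .sif behaves like the normal hilapp binary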

Building HILA preprocessor

Before building the preprocessor one must first install the dependencies; see the dependencies section above.

Compile hilapp:

cd hila/hilapp
make [-j4]
make install

This builds hilapp in hila/hilapp/build, and make install moves it to hila/hilapp/bin, which is the default location for the program. Build takes 1-2 min.

NOTE: clang dev libraries are not installed on most supercomputer systems. However, if the system has x86_64 processors (by far the most common), you can use the make static command to build a statically linked hilapp. Copy hila/hilapp/build/hilapp to the directory hila/hilapp/bin on the target machine. A simpler approach for HPC platforms is to use the singularity container.
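
A minimal sketch of that static-build workflow, with a placeholder host name and path:

# on the local machine, build a statically linked hilapp
cd hila/hilapp
make static -j4
# copy it to the target system (host and path below are placeholders)
scp build/hilapp user@hpc.example.org:hila/hilapp/bin/hilapp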

Test that hilapp works:

./bin/hilapp --help
Expected output

$ ./bin/hilapp --help
USAGE: hilapp [options] <source files>
OPTIONS:
Generic Options:
--help - Display available options (--help-hidden for more)
--help-list - Display list of available options (--help-list-hidden for more)
--version - Display the version of this program
hilapp:
--AVXinfo=<int> - AVX vectorization information level 0-2. 0 quiet, 1 not vectorizable loops, 2 all loops
-D <macro[=value]> - Define name/macro for preprocessor
-I <directory> - Directory for include file search
--allow-func-globals - Allow using global or extern variables in functions called from site loops.
This will not work in kernelized code (for example GPUs)
--check-init - Insert checks that Field variables are appropriately initialized before use
--comment-pragmas - Comment out '#pragma hila' -pragmas in output
--dump-ast - Dump AST tree
--function-spec-no-inline - Do not mark generated function specializations "inline"
--gpu-slow-reduce - Use slow (but memory economical) reduction on gpus
--ident-functions - Comment function call types in output
--insert-includes - Insert all project #include files in .cpt -files (portable)
--method-spec-no-inline - Do not mark generated method specializations "inline"
--no-include - Do not insert any '#include'-files (for debug, may not compile)
--no-interleave - Do not interleave communications with computation
--no-output - No output file, for syntax check
-o <filename> - Output file (default: <file>.cpt, write to stdout: -o -
--syntax-only - Same as no-output
--target:AVX - Generate AVX vectorized loops
--target:AVX512 - Generate AVX512 vectorized loops
--target:CUDA - Generate CUDA kernels
--target:HIP - Generate HIP kernels
--target:openacc - Offload to GPU using openACC
--target:openmp - Hybrid OpenMP - MPI
--target:vanilla - Generate loops in place
--target:vectorize=<int> - Generate vectorized loops with given vector size
For example -target:vectorize=32 is equivalent to -target:AVX
--verbosity=<int> - Verbosity level 0-2. Default 0 (quiet)
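
In practice the application Makefiles invoke hilapp automatically, but as a usage example one could preprocess a single source file for the AVX target by hand (file name and include path here are placeholders):

./bin/hilapp --target:AVX -I ../libraries my_app.cpp -o my_app.cpt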

Building HILA applications

First we will try to build and run a health check test application with the default computing platform, which is CPU with MPI enabled. To do so, navigate to the application directory and try:

cd hila/applications/hila_healthcheck
make -j4
./build/hila_healthcheck
Expected output

$ ./build/hila_healthcheck
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA d0222bca
Compiled Jun 1 2023 at 11:13:10
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Thu Jun 1 11:13:28 2023 run time 8.328e-05s
No runtime limit given
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 1 nodes
Sites on node: 256 x 256 x 256 = 16777216
Processor layout: 1 x 1 x 1 = 1 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 1 1 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Thu Jun 1 11:13:31 2023 run time 3.11s
------------------------------------------------------------
Random seed from time: 3871436182438
Using node random numbers, seed for node 0: 3871436182438
--- Complex reduction value ( -2.7647453e-17, 5.5294928e-17 ) passed
--- Vector reduction, sum ( -7.1331829e-15, -1.4328816e-15 ) passed
--- Setting and reading a value at [ 37 211 27 ] passed
--- Setting and reading a value at [ 251 220 47 ] passed
--- Setting and reading a value at [ 250 249 134 ] passed
--- Maxloc is [ 112 117 164 ] passed
--- Max value 2 passed
--- Minloc is [ 192 135 27 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 132 159 243 ] passed
--- FFT of wave vector [ 167 161 208 ] passed
--- FFT of wave vector [ 152 87 255 ] passed
--- FFT of wave vector [ 156 86 229 ] passed
--- FFT of wave vector [ 78 246 141 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44434.862 and FFT = 44434.862 passed
--- Norm of binned FFT = 44434.862 passed
--- Binning test at vector [ 100 220 7 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 193 10 49 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 235 241 96 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.000 40 0.263 μs 0.0000
MPI reduction : 0.000 34 2.003 μs 0.0000
FFT total time : 44.544 14 3.182 s 0.6449
copy pencils : 3.261 15 0.217 s 0.0472
MPI for pencils : 0.000 90 1.298 μs 0.0000
FFT plan : 0.003 42 73.150 μs 0.0000
copy fft buffers : 2.412 5505024 0.438 μs 0.0349
FFT execute : 2.356 2752512 0.856 μs 0.0341
pencil reshuffle : 12.967 30 0.432 s 0.1878
save pencils : 26.043 15 1.736 s 0.3771
bin field time : 9.014 7 1.288 s 0.1305
---------------------------------------------------------------------------
No communications done from node 0
Finishing -- date Thu Jun 1 11:14:37 2023 run time 69.07s
------------------------------------------------------------

NOTE: Naturally, the run time depends on your system.

And for running with multiple processes:

mpirun -n 4 ./build/hila_healthcheck
Expected output

$ mpirun -n 4 ./build/hila_healthcheck
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA d0222bca
Compiled Jun 1 2023 at 11:13:10
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Thu Jun 1 11:18:22 2023 run time 0.0001745s
No runtime limit given
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 4 nodes
Sites on node: 256 x 128 x 128 = 4194304
Processor layout: 1 x 2 x 2 = 4 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 2 2 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Thu Jun 1 11:18:23 2023 run time 1.046s
------------------------------------------------------------
Random seed from time: 4184648360436
Using node random numbers, seed for node 0: 4184648360436
--- Complex reduction value ( -2.7539926e-17, 5.5079939e-17 ) passed
--- Vector reduction, sum ( 1.4328816e-15, -7.4627804e-15 ) passed
--- Setting and reading a value at [ 139 215 41 ] passed
--- Setting and reading a value at [ 231 44 102 ] passed
--- Setting and reading a value at [ 238 201 150 ] passed
--- Maxloc is [ 80 69 74 ] passed
--- Max value 2 passed
--- Minloc is [ 219 105 178 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 239 139 86 ] passed
--- FFT of wave vector [ 218 12 247 ] passed
--- FFT of wave vector [ 94 206 99 ] passed
--- FFT of wave vector [ 34 78 96 ] passed
--- FFT of wave vector [ 221 224 199 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44418.915 and FFT = 44418.915 passed
--- Norm of binned FFT = 44418.915 passed
--- Binning test at vector [ 106 69 123 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 240 142 174 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 226 28 118 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.002 40 49.358 μs 0.0001
MPI reduction : 0.289 34 8.508 ms 0.0120
MPI post receive : 0.000 4 1.782 μs 0.0000
MPI start send : 0.000 4 3.923 μs 0.0000
MPI wait receive : 0.001 4 0.277 ms 0.0000
MPI wait send : 0.002 4 0.404 ms 0.0001
MPI send field : 0.001 15 67.812 μs 0.0000
FFT total time : 14.922 14 1.066 s 0.6182
copy pencils : 1.941 15 0.129 s 0.0804
MPI for pencils : 1.644 90 18.263 ms 0.0681
FFT plan : 0.006 42 0.140 ms 0.0002
copy fft buffers : 1.164 1376256 0.846 μs 0.0482
FFT execute : 0.933 688128 1.355 μs 0.0386
pencil reshuffle : 7.246 30 0.242 s 0.3002
save pencils : 2.994 15 0.200 s 0.1240
bin field time : 2.792 7 0.399 s 0.1157
---------------------------------------------------------------------------
COMMS from node 0: 4 done, 0(0%) optimized away
Finishing -- date Thu Jun 1 11:18:46 2023 run time 24.14s
------------------------------------------------------------

NOTE: Naturally, the run time depends on your system.

Now we can try to perform the same health check targeting a different computing platform with:

make ARCH=<platform>

where ARCH can take the following values:

| ARCH=   | Description                                            |
|---------|--------------------------------------------------------|
| vanilla | default CPU implementation                             |
| AVX2    | AVX vectorization optimized program using vectorclass  |
| openmp  | OpenMP parallelized program                            |
| cuda    | Parallel CUDA program                                  |
| hip     | Parallel HIP program                                   |
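
For example, to rebuild the health check for the OpenMP target and run it again:

# a clean rebuild may be needed when switching targets
make ARCH=openmp -j4
./build/hila_healthcheck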

For CUDA compilation one needs to define the CUDA version and architecture, either as environment variables or during the make process:

export CUDA_VERSION=11.6
export CUDA_ARCH=61
make ARCH=cuda
or
make ARCH=cuda CUDA_VERSION=11.6 CUDA_ARCH=61

NOTE: The default CUDA version is 11.6 and the default compute architecture is sm_61.

Now if we execute the CUDA version, one should expect the following output:

Expected output

$ ./build/hila_healthcheck
GPU devices accessible from node 0: 1
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA df945bff
Compiled Jun 2 2023 at 12:57:25
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Fri Jun 2 12:58:32 2023 run time 0.08375s
No runtime limit given
Using thread blocks of size 256 threads
Using GPU_AWARE_MPI
ReductionVector with atomic operations (GPU_VECTOR_REDUCTION_THREAD_BLOCKS=0)
CUDA driver version: 12010, runtime 12010
CUDART_VERSION 12010
Device on node rank 0 device 0:
NVIDIA GeForce GTX 1080 capability: 6.1
Global memory: 8113MB
Shared memory: 48kB
Constant memory: 64kB
Block registers: 65536
Warp size: 32
Threads per block: 1024
Max block dimensions: [ 1024, 1024, 64 ]
Max grid dimensions: [ 2147483647, 65535, 65535 ]
Threads in use: 256
OpenMPI library does not support CUDA-Aware MPI
GPU_AWARE_MPI is defined -- THIS MAY CRASH IN MPI
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 1 nodes
Sites on node: 256 x 256 x 256 = 16777216
Processor layout: 1 x 1 x 1 = 1 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 1 1 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Fri Jun 2 12:58:34 2023 run time 1.827s
------------------------------------------------------------
Random seed from time: 7145975945297229
Using node random numbers, seed for node 0: 7145975945297229
GPU random number generator initialized
GPU random number thread blocks: 32 of size 256 threads
--- Complex reduction value ( -3.9112593e-17, -3.2797116e-18 ) passed
--- Vector reduction, sum ( -7.1331829e-15, -1.3218593e-15 ) passed
--- Setting and reading a value at [ 187 200 25 ] passed
--- Setting and reading a value at [ 70 161 70 ] passed
--- Setting and reading a value at [ 197 191 182 ] passed
--- Maxloc is [ 13 45 107 ] passed
--- Max value 2 passed
--- Minloc is [ 33 24 224 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 202 153 38 ] passed
--- FFT of wave vector [ 185 196 66 ] passed
--- FFT of wave vector [ 222 252 82 ] passed
--- FFT of wave vector [ 214 47 8 ] passed
--- FFT of wave vector [ 108 142 205 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44465.9 and FFT = 44465.9 passed
--- Norm of binned FFT = 44465.9 passed
--- Binning test at vector [ 175 117 16 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 129 107 153 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 237 7 157 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.000 40 0.095 μs 0.0000
MPI reduction : 0.000 34 0.767 μs 0.0000
FFT total time : 4.149 14 0.296 s 0.6021
copy pencils : 0.000 15 2.560 μs 0.0000
MPI for pencils : 4.490 90 49.885 ms 0.6515
FFT plan : 0.016 1 16.047 ms 0.0023
copy fft buffers : 0.001 84 6.129 μs 0.0001
FFT execute : 0.018 42 0.433 ms 0.0026
pencil reshuffle : 0.000 30 3.821 μs 0.0000
save pencils : 0.000 15 3.851 μs 0.0000
bin field time : 0.208 7 29.777 ms 0.0302
---------------------------------------------------------------------------
No communications done from node 0
GPU Memory pool statistics from node 0:
Total pool size 3459.87 MB
# of allocations 268 real allocs 17%
Average free list search 6.3 steps
Average free list size 16 items
Finishing -- date Fri Jun 2 12:58:39 2023 run time 6.891s
------------------------------------------------------------

NOTE: Naturally, the run time depends on your system.

Additionally we have some ARCH values tuned for specific HPC platforms:

| ARCH       | Description                                               |
|------------|-----------------------------------------------------------|
| lumi       | CPU-MPI implementation for LUMI supercomputer             |
| lumi-hip   | GPU-MPI implementation for LUMI supercomputer using HIP   |
| mahti      | CPU-MPI implementation for MAHTI supercomputer            |
| mahti-cuda | GPU-MPI implementation for MAHTI supercomputer using CUDA |

We will discuss the computing platforms more in the creating a HILA application guide.