Table of Contents
- Dependencies
- Building and Installation
Dependencies
Hilapp
| Dependencies | Minimum Version | Required |
|--------------|-----------------|----------|
| Clang        | 8 -             | Yes      |
Installing dependencies for HILA preprocessor:
NOTE:
- If one opts to use a singularity container, skip to the HILA applications dependencies.
- If one opts to use a docker container, skip to the installation section.
For building hilapp, you need clang development tools (actually, only include files). These can be found in most Linux distribution repos, e.g. in Ubuntu 22.04:
export LLVM_VERSION=15
sudo apt-get -y install clang-$LLVM_VERSION \
libclang-$LLVM_VERSION-dev
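As a quick sanity check (a sketch only, assuming LLVM_VERSION is still set as above), you can verify that the compiler and the libclang development headers are in place:
clang-$LLVM_VERSION --version
dpkg -L libclang-$LLVM_VERSION-dev | grep include | head -n 3   # should list clang header directories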
HILA applications
| Dependencies | Minimum Version | Required |
|--------------|-----------------|----------|
| Clang / GCC  | 8 - / x         | Yes      |
| FFTW3        | x               | Yes      |
| MPI          | x               | Yes      |
| OpenMP       | x               | No       |
| CUDA         | x               | No       |
| HIP          | x               | No       |
Installing dependencies for HILA applications:
NOTE: If one opts to use docker, skip directly to the installation section.
Installing non-GPU dependencies on Ubuntu:
sudo apt install build-essential \
libopenmpi-dev \
libfftw3-dev \
libomp-dev
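As an optional sanity check (a sketch only, not part of the official instructions), you can verify that the MPI compiler wrapper and the FFTW3 headers are visible by compiling a trivial program:
printf '#include <mpi.h>\n#include <fftw3.h>\nint main(void){return 0;}\n' > /tmp/deps_check.c
mpicc /tmp/deps_check.c -lfftw3 -o /tmp/deps_check && echo "MPI and FFTW3 found"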
CUDA:
See NVIDIA drivers and CUDA documentation: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
HIP:
See ROCm and HIP documentation: https://docs.amd.com/, https://rocmdocs.amd.com/en/latest/Installation_Guide/HIP-Installation.html
Building and Installation
Begin by cloning the HILA repository:
git clone https://github.com/CFT-HY/HILA
The installation process is split into two parts: building the HILA preprocessor and compiling HILA applications. Both can be built from source if the proper dependencies are available, and both steps have containerized options. In certain situations, users may run into problems building the HILA preprocessor, usually caused by missing LLVM/Clang support. There are a few tested strategies for dealing with this; see the HILA home page for details. The variety of options exists to address the different platform-dependent issues that can arise.
When it comes to installing HILA applications, there are many avenues one can take depending on the platform. The available platforms and offered methods are listed below, with links to the relevant sections of the installation guide.
Platforms
LINUX Workstation
HILA was originally developed on Linux, hence all of the available options can be used. Most mainstream Linux distributions provide the dependencies needed for building from source directly through their package managers. Alternatively, the user can set up the pre-built singularity container of the HILA preprocessor to avoid dependency problems. Additionally, one can opt to use the docker container, which ships the HILA preprocessor directly.
If the user wishes to regularly use or switch between different versions of LLVM/Clang for new features, a reliable and efficient way is to build LLVM through Spack. This is particularly helpful for building and managing LLVM/Clang on HPC platforms. For the details of building with Spack, please jump to the HPC section.
NOTE: It is advised to use the docker container only for development purposes, since containerization can add computational overhead. This is especially evident in containerized MPI communication.
Containerization of hilapp, on the other hand, adds no computational overhead outside of the compilation process, so for production runs one can use the singularity container and reach maximal computational performance.
MAC
On Mac, installing the HILA preprocessor dependencies and HILA application dependencies can be tedious, and in some cases impossible, since the availability of the clang libtoolbox is uncertain. For this reason the best option is to use the available docker container.
On the other hand, the HPC library-building system Spack has quite good support on macOS, so the user may follow the same process as on HPC platforms to build LLVM/Clang once Spack is properly installed.
WINDOWS
On Windows, installing the HILA preprocessor dependencies and HILA application dependencies is untested. For this reason the best option is to use the available docker container. One can also opt to use WSL; in that case, see the Linux installation instructions.
HPC
On supercomputing platforms the HILA application dependencies are most likely available. However, the dependencies of the HILA preprocessor frequently turn out to be an issue, since out-of-the-box LLVM/Clang support is often lacking on HPC platforms.
As discussed on the HILA home page, there are a few tested strategies to deal with this problem. These strategies make use of features available on the given hardware architecture, carefully designed library-building systems, or containerization. Among them, the suggested solutions are a library building/management system and containerization of the HILA preprocessor. The former includes the Spack project as well as EasyBuild, while the best example of the latter on HPC systems is the singularity container.
Most properly maintained computing clusters have a pre-installed and pre-configured Spack system. If the user has experience with Spack, it works as-is, and different versions of LLVM/Clang can be built with a few commands once the Spack module is loaded. Moreover, every LLVM/Clang build appears as a loadable module which can be used in the same way as the system modules.
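As an illustration (a sketch only; the exact spec, version, and module integration depend on the site configuration), building and activating a given LLVM/Clang version with Spack looks roughly like:
spack install llvm@15      # build LLVM/Clang 15; this can take a while
spack load llvm@15         # or load the generated module, if module integration is enabled
clang --version            # verify the active compiler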
For a step-by-step guide to building LLVM/Clang with Spack, please read the article Use Spack Build and Install LLVM on Workstation or Supercomputer. On computing platforms without a module system, such as a Linux workstation, the command
spack find --path llvm
will return the installation location, which is needed for building the HILA preprocessor. In this case, the user may pass the absolute path through a shell variable or a GNU Make flag, as sketched below.
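For example (a hypothetical sketch; the exact variable consumed by the hilapp Makefile may differ), the Spack-installed prefix can be exported and made visible to the build:
export LLVM_PREFIX=$(spack location -i llvm)   # same path as reported by 'spack find --path llvm'
export PATH=$LLVM_PREFIX/bin:$PATH             # make clang and llvm-config visible to make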
After installing the HILA preprocessor with one of the above options, one can move on to the Building HILA applications section.
Containers
HILA comes with both a singularity and a docker container for different purposes. The aim is to make HILA easy to use on any platform, be it Linux, Mac, Windows, or a supercomputer.
Docker
The docker container is meant for developing and producing HILA applications, libraries, and hilapp with ease. One can develop HILA applications on a local machine and run them in a container without having to worry about dependencies. Note that there is overhead in MPI communication under docker, so one will not get optimal simulation performance when running highly parallelized code in a container. This is a non-issue for small-scale simulations or testing.
Docker container instructions
All commands are run in the docker folder.
Docker image for HILA
Create docker image:
docker build -t hila -f Dockerfile .
Launch the image interactively with docker compose:
docker compose run --rm hila-applications
Developing with docker
The applications folder is automatically mounted from the local host into the container when launching the service hila-applications:
../applications:/HILA/applications
This allows one to develop HILA applications directly from source and launch them in the docker image with ease.
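For instance (a sketch, assuming the mount above and that the container provides the full toolchain), one can build and run an application from inside the container:
docker compose run --rm hila-applications
# inside the container:
cd /HILA/applications/hila_healthcheck
make -j4
./build/hila_healthcheck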
When developing HILA libraries and hilapp, one can also launch the service hila-source, which mounts the HILA/libraries and HILA/hilapp/src folders into the container:
docker compose run --rm hila-source
Singularity
The singularity container offers a more packaged approach where one doesn't need to worry about clang libtoolbox support for compiling the HILA preprocessor. Hence, for HPC platforms where access to such compiler libraries can be tedious, one can simply opt to use the container version of hilapp. This approach is mainly meant for preprocessing applications on an HPC platform.
Singularity container instructions
One can download the singularity container hilapp.sif directly from this github repository's release page. If downloaded, skip directly to the Using singularity container section.
Installing singularity
The simplest way to install singularity is to download the latest .deb or .rpm from the github release page and install it directly with one's package manager.
Ubuntu:
dpkg -i singularity-ce_$(SINGULARITY_VERSION)-$(UBUNTU_VERSION)_amd64.deb
Building singularity container
NOTE: sudo privileges are required for building a singularity container
For building the container there are two options: one can either build using the release version of hilapp from github, or build using the local hilapp source. Especially when one is developing the HILA preprocessor and would like to test it on an HPC platform, building the singularity container from the local source is the preferred option. There is a separate singularity definition file for each case.
Building using release version:
sudo singularity build hilapp.sif hilapp_git.def
Building using local source
sudo singularity build hilapp.sif hilapp_local.def
Using singularity container
The hilapp.sif file acts as a singularity container and, equivalently, as the hilapp binary, and can be used as such when preprocessing HILA code. Thus you can move it to your HILA project's bin folder:
mkdir HILA/hilapp/bin
mv hilapp.sif HILA/hilapp/bin/hilapp
Now one can simply move the singularity container to any given supercomputer.
Note that on supercomputers the default paths aren't the same as on a standard Linux system, so one will need to mount the HILA source folder into singularity using the APPTAINER_BIND environment variable. Simply navigate to the base of your HILA source directory and run:
export APPTAINER_BIND=$(pwd)
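As a quick check (assuming hilapp.sif was renamed and moved to hilapp/bin/hilapp as above), the container should now respond like the native binary:
./hilapp/bin/hilapp --help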
Building HILA preprocessor
Before building the preprocessor one must first install the dependencies; see the Dependencies section above.
Compile hilapp:
cd hila/hilapp
make [-j4]
make install
This builds hilapp in hila/hilapp/build, and make install moves it to hila/hilapp/bin, which is the default location for the program. The build takes 1-2 minutes.
NOTE: clang dev libraries are not installed on most supercomputer systems. However, if the system has x86_64 processors (by far the most common), you can use the make static command to build a statically linked hilapp. Copy hila/hilapp/build/hilapp to the directory hila/hilapp/bin on the target machine. A simpler approach for HPC platforms is to use the singularity container.
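A minimal sketch of that static-build workflow (the hostname and destination path are placeholders):
cd hila/hilapp
make static [-j4]                                              # build a statically linked hilapp
scp build/hilapp user@hpc.example.org:hila/hilapp/bin/hilapp   # copy to the target machine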
Test that hilapp works:
./bin/hilapp --help
Expected output
$ ./bin/hilapp --help
USAGE: hilapp [options] <source files>
OPTIONS:
Generic Options:
--help - Display available options (--help-hidden for more)
--help-list - Display list of available options (--help-list-hidden for more)
--version - Display the version of this program
hilapp:
--AVXinfo=<int> - AVX vectorization information level 0-2. 0 quiet, 1 not vectorizable loops, 2 all loops
-D <macro[=value]> - Define name/macro for preprocessor
-I <directory> - Directory for include file search
--allow-func-globals - Allow using global or extern variables in functions called from site loops.
This will not work in kernelized code (for example GPUs)
--check-init - Insert checks that Field variables are appropriately initialized before use
--comment-pragmas - Comment out '#pragma hila' -pragmas in output
--dump-ast - Dump AST tree
--function-spec-no-inline - Do not mark generated function specializations "inline"
--gpu-slow-reduce - Use slow (but memory economical) reduction on gpus
--ident-functions - Comment function call types in output
--insert-includes - Insert all project #include files in .cpt -files (portable)
--method-spec-no-inline - Do not mark generated method specializations "inline"
--no-include - Do not insert any '#include'-files (for debug, may not compile)
--no-interleave - Do not interleave communications with computation
--no-output - No output file, for syntax check
-o <filename> - Output file (default: <file>.cpt, write to stdout: -o -
--syntax-only - Same as no-output
--target:AVX - Generate AVX vectorized loops
--target:AVX512 - Generate AVX512 vectorized loops
--target:CUDA - Generate CUDA kernels
--target:HIP - Generate HIP kernels
--target:openacc - Offload to GPU using openACC
--target:openmp - Hybrid OpenMP - MPI
--target:vanilla - Generate loops in place
--target:vectorize=<int> - Generate vectorized loops with given vector size
For example -target:vectorize=32 is equivalent to -target:AVX
--verbosity=<int> - Verbosity level 0-2. Default 0 (quiet)
Building HILA applications
First, we will try to build and run a health-check test application on the default computing platform, which is CPU with MPI enabled. To do so, navigate to the applications folder and try:
cd hila/applications/hila_healthcheck
make -j4
./build/hila_healthcheck
Expected output
$ ./build/hila_healthcheck
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA d0222bca
Compiled Jun 1 2023 at 11:13:10
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Thu Jun 1 11:13:28 2023 run time 8.328e-05s
No runtime limit given
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 1 nodes
Sites on node: 256 x 256 x 256 = 16777216
Processor layout: 1 x 1 x 1 = 1 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 1 1 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Thu Jun 1 11:13:31 2023 run time 3.11s
------------------------------------------------------------
Random seed from time: 3871436182438
Using node random numbers, seed for node 0: 3871436182438
--- Complex reduction value ( -2.7647453e-17, 5.5294928e-17 ) passed
--- Vector reduction, sum ( -7.1331829e-15, -1.4328816e-15 ) passed
--- Setting and reading a value at [ 37 211 27 ] passed
--- Setting and reading a value at [ 251 220 47 ] passed
--- Setting and reading a value at [ 250 249 134 ] passed
--- Maxloc is [ 112 117 164 ] passed
--- Max value 2 passed
--- Minloc is [ 192 135 27 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 132 159 243 ] passed
--- FFT of wave vector [ 167 161 208 ] passed
--- FFT of wave vector [ 152 87 255 ] passed
--- FFT of wave vector [ 156 86 229 ] passed
--- FFT of wave vector [ 78 246 141 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44434.862 and FFT = 44434.862 passed
--- Norm of binned FFT = 44434.862 passed
--- Binning test at vector [ 100 220 7 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 193 10 49 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 235 241 96 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.000 40 0.263 μs 0.0000
MPI reduction : 0.000 34 2.003 μs 0.0000
FFT total time : 44.544 14 3.182 s 0.6449
copy pencils : 3.261 15 0.217 s 0.0472
MPI for pencils : 0.000 90 1.298 μs 0.0000
FFT plan : 0.003 42 73.150 μs 0.0000
copy fft buffers : 2.412 5505024 0.438 μs 0.0349
FFT execute : 2.356 2752512 0.856 μs 0.0341
pencil reshuffle : 12.967 30 0.432 s 0.1878
save pencils : 26.043 15 1.736 s 0.3771
bin field time : 9.014 7 1.288 s 0.1305
---------------------------------------------------------------------------
No communications done from node 0
Finishing -- date Thu Jun 1 11:14:37 2023 run time 69.07s
------------------------------------------------------------
NOTE: Naturally, the run time depends on your system.
And for running with multiple processes:
mpirun -n 4 ./build/hila_healthcheck
Expected output
$ mpirun -n 4 ./build/hila_healthcheck
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA d0222bca
Compiled Jun 1 2023 at 11:13:10
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Thu Jun 1 11:18:22 2023 run time 0.0001745s
No runtime limit given
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 4 nodes
Sites on node: 256 x 128 x 128 = 4194304
Processor layout: 1 x 2 x 2 = 4 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 2 2 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Thu Jun 1 11:18:23 2023 run time 1.046s
------------------------------------------------------------
Random seed from time: 4184648360436
Using node random numbers, seed for node 0: 4184648360436
--- Complex reduction value ( -2.7539926e-17, 5.5079939e-17 ) passed
--- Vector reduction, sum ( 1.4328816e-15, -7.4627804e-15 ) passed
--- Setting and reading a value at [ 139 215 41 ] passed
--- Setting and reading a value at [ 231 44 102 ] passed
--- Setting and reading a value at [ 238 201 150 ] passed
--- Maxloc is [ 80 69 74 ] passed
--- Max value 2 passed
--- Minloc is [ 219 105 178 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 239 139 86 ] passed
--- FFT of wave vector [ 218 12 247 ] passed
--- FFT of wave vector [ 94 206 99 ] passed
--- FFT of wave vector [ 34 78 96 ] passed
--- FFT of wave vector [ 221 224 199 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44418.915 and FFT = 44418.915 passed
--- Norm of binned FFT = 44418.915 passed
--- Binning test at vector [ 106 69 123 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 240 142 174 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 226 28 118 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.002 40 49.358 μs 0.0001
MPI reduction : 0.289 34 8.508 ms 0.0120
MPI post receive : 0.000 4 1.782 μs 0.0000
MPI start send : 0.000 4 3.923 μs 0.0000
MPI wait receive : 0.001 4 0.277 ms 0.0000
MPI wait send : 0.002 4 0.404 ms 0.0001
MPI send field : 0.001 15 67.812 μs 0.0000
FFT total time : 14.922 14 1.066 s 0.6182
copy pencils : 1.941 15 0.129 s 0.0804
MPI for pencils : 1.644 90 18.263 ms 0.0681
FFT plan : 0.006 42 0.140 ms 0.0002
copy fft buffers : 1.164 1376256 0.846 μs 0.0482
FFT execute : 0.933 688128 1.355 μs 0.0386
pencil reshuffle : 7.246 30 0.242 s 0.3002
save pencils : 2.994 15 0.200 s 0.1240
bin field time : 2.792 7 0.399 s 0.1157
---------------------------------------------------------------------------
COMMS from node 0: 4 done, 0(0%) optimized away
Finishing -- date Thu Jun 1 11:18:46 2023 run time 24.14s
------------------------------------------------------------
NOTE: Naturally, the run time depends on your system.
Now we can try to perform the same health check targeting a different computing platform with:
make ARCH=<platform>
where ARCH can take the following values (an example follows the table):
| ARCH=   | Description |
|---------|-------------|
| vanilla | default CPU implementation |
| AVX2    | AVX vectorization optimized program using vectorclass |
| openmp  | OpenMP parallelized program |
| cuda    | Parallel CUDA program |
| hip     | Parallel HIP program |
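For example (a sketch; a clean rebuild is assumed to be needed when switching platforms, and the binary is assumed to land in build/ as in the default build):
make clean
make ARCH=AVX2 -j4
./build/hila_healthcheck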
For CUDA compilation one needs to define the CUDA version and architecture either as environment variables or during the make process:
export CUDA_VERSION=11.6
export CUDA_ARCH=61
make ARCH=cuda
or
make ARCH=cuda CUDA_VERSION=11.6 CUDA_ARCH=61
NOTE: The default CUDA version is 11.6 and the default compute architecture is sm_61.
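To find the compute capability of the local GPU (a sketch; this query field requires a reasonably recent NVIDIA driver), one can ask nvidia-smi:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. 6.1 corresponds to CUDA_ARCH=61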
Now, if we execute the CUDA version, one should expect the following output:
Expected output
$ ./build/hila_healthcheck
GPU devices accessible from node 0: 1
----- HILA ⩩ lattice framework ---------------------------
Running program ./build/hila_healthcheck
with command line arguments ''
Code version: git SHA df945bff
Compiled Jun 2 2023 at 12:57:25
with options: EVEN_SITES_FIRST SPECIAL_BOUNDARY_CONDITIONS
Starting -- date Fri Jun 2 12:58:32 2023 run time 0.08375s
No runtime limit given
Using thread blocks of size 256 threads
Using GPU_AWARE_MPI
ReductionVector with atomic operations (GPU_VECTOR_REDUCTION_THREAD_BLOCKS=0)
CUDA driver version: 12010, runtime 12010
CUDART_VERSION 12010
Device on node rank 0 device 0:
NVIDIA GeForce GTX 1080 capability: 6.1
Global memory: 8113MB
Shared memory: 48kB
Constant memory: 64kB
Block registers: 65536
Warp size: 32
Threads per block: 1024
Max block dimensions: [ 1024, 1024, 64 ]
Max grid dimensions: [ 2147483647, 65535, 65535 ]
Threads in use: 256
OpenMPI library does not support CUDA-Aware MPI
GPU_AWARE_MPI is defined -- THIS MAY CRASH IN MPI
GNU c-library performance: not returning allocated memory
----- Reading file parameters ------------------------------
lattice size 256,256,256
random seed 0
------------------------------------------------------------
------------------------------------------------------------
LAYOUT: lattice size 256 x 256 x 256 = 16777216 sites
Dividing to 1 nodes
Sites on node: 256 x 256 x 256 = 16777216
Processor layout: 1 x 1 x 1 = 1 nodes
Node remapping: NODE_LAYOUT_BLOCK with blocksize 4
Node block size 1 1 1 block division 1 1 1
------------------------------------------------------------
Communication tests done -- date Fri Jun 2 12:58:34 2023 run time 1.827s
------------------------------------------------------------
Random seed from time: 7145975945297229
Using node random numbers, seed for node 0: 7145975945297229
GPU random number generator initialized
GPU random number thread blocks: 32 of size 256 threads
--- Complex reduction value ( -3.9112593e-17, -3.2797116e-18 ) passed
--- Vector reduction, sum ( -7.1331829e-15, -1.3218593e-15 ) passed
--- Setting and reading a value at [ 187 200 25 ] passed
--- Setting and reading a value at [ 70 161 70 ] passed
--- Setting and reading a value at [ 197 191 182 ] passed
--- Maxloc is [ 13 45 107 ] passed
--- Max value 2 passed
--- Minloc is [ 33 24 224 ] passed
--- Min value -1 passed
--- Field set_elements and get_elements with 51 coordinates passed
--- SiteSelect size 51 passed
--- SiteValueSelect size 51 passed
--- SiteSelect content passed
--- SiteValueSelect content passed
--- SiteIndex passed
--- 2-dimensional slice size 65536 passed
--- slice content passed
--- 1-dimensional slice size 256 passed
--- slice content passed
--- FFT constant field passed
--- FFT inverse transform passed
--- FFT of wave vector [ 202 153 38 ] passed
--- FFT of wave vector [ 185 196 66 ] passed
--- FFT of wave vector [ 222 252 82 ] passed
--- FFT of wave vector [ 214 47 8 ] passed
--- FFT of wave vector [ 108 142 205 ] passed
--- FFT real to complex passed
--- FFT complex to real passed
--- Norm of field = 44465.9 and FFT = 44465.9 passed
--- Norm of binned FFT = 44465.9 passed
--- Binning test at vector [ 175 117 16 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 129 107 153 ] passed
--- Spectral density test with above vector passed
--- Binning test at vector [ 237 7 157 ] passed
--- Spectral density test with above vector passed
TIMER REPORT: total(sec) calls time/call fraction
---------------------------------------------------------------------------
MPI broadcast : 0.000 40 0.095 μs 0.0000
MPI reduction : 0.000 34 0.767 μs 0.0000
FFT total time : 4.149 14 0.296 s 0.6021
copy pencils : 0.000 15 2.560 μs 0.0000
MPI for pencils : 4.490 90 49.885 ms 0.6515
FFT plan : 0.016 1 16.047 ms 0.0023
copy fft buffers : 0.001 84 6.129 μs 0.0001
FFT execute : 0.018 42 0.433 ms 0.0026
pencil reshuffle : 0.000 30 3.821 μs 0.0000
save pencils : 0.000 15 3.851 μs 0.0000
bin field time : 0.208 7 29.777 ms 0.0302
---------------------------------------------------------------------------
No communications done from node 0
GPU Memory pool statistics from node 0:
Total pool size 3459.87 MB
# of allocations 268 real allocs 17%
Average free list search 6.3 steps
Average free list size 16 items
Finishing -- date Fri Jun 2 12:58:39 2023 run time 6.891s
------------------------------------------------------------
NOTE: Naturally, the run time depends on your system.
Additionally we have some ARCH values tuned for specific HPC platforms:
| ARCH       | Description |
|------------|-------------|
| lumi       | CPU-MPI implementation for LUMI supercomputer |
| lumi-hip   | GPU-MPI implementation for LUMI supercomputer using HIP |
| mahti      | CPU-MPI implementation for MAHTI supercomputer |
| mahti-cuda | GPU-MPI implementation for MAHTI supercomputer using CUDA |
We will discuss the computing platforms more in the creating a HILA application guide.