HILA: params.h File Reference

This file contains #defined constants.
Macros

#define NDEBUG
    Turn off asserts, which are on by default.
#define NDIM 4
    HILA system dimensionality.
#define DEFAULT_OUTPUT_NAME "output"
    Default output file name.
#define EVEN_SITES_FIRST
    EVEN_SITES_FIRST is the default; to traverse odd sites first, set -DEVEN_SITES_FIRST=0.
#define NODE_LAYOUT_BLOCK 4
#define WRITE_BUFFER_SIZE 2000000
#define GPU_MEMORY_POOL
#define GPU_AWARE_MPI 1
#define GPU_RNG_THREAD_BLOCKS 32
#define GPU_VECTOR_REDUCTION_THREAD_BLOCKS 32
#define GPUFFT_BATCH_SIZE 256
#define GPU_GLOBAL_ARG_MAX_SIZE 2048
#define N_threads 256
    General number of threads in a thread block.
Detailed Description

This file contains #defined constants. These can be overridden in the application Makefile with APP_OPTS := -DPARAMETER=value.

There are two types of #define variables: true/false switches and parameter variables. True/false switches are set with either 0 (false) or 1 (true), as in -DPARAMETER=0. Parameter variables are set similarly with -DPARAMETER=value, where value is the chosen value.
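As an illustration (a sketch, not part of params.h), application code can react to both kinds of settings at compile time. The checks below assume only the standard C++ preprocessor and that params.h has already been pulled in through the usual HILA headers:

    // Sketch only: compile-time checks of params.h values in application code.
    // NDIM is a parameter variable, GPU_AWARE_MPI a true/false switch; both may
    // have been overridden in the application Makefile via APP_OPTS.
    static_assert(NDIM == 3 || NDIM == 4,
                  "this example application supports only 3 or 4 dimensions");

    #if defined(GPU_AWARE_MPI) && GPU_AWARE_MPI
    // MPI may read communication buffers directly from device memory
    #endif

For example, an application Makefile line such as APP_OPTS := -DNDIM=3 -DGPU_AWARE_MPI=0 would change both values for that build.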
Definition in file params.h.
Macro Definition Documentation

#define GPU_AWARE_MPI 1

#define GPU_GLOBAL_ARG_MAX_SIZE 2048
GPU_SYNCHRONIZE_TIMERS : if set and != 0, synchronize the GPU on timer calls in order to obtain meaningful timer values.

Because GPU kernel launches are asynchronous, by default the timers may not measure the actual time spent in GPU kernel execution. Defining GPU_SYNCHRONIZE_TIMERS inserts GPU synchronization calls into the timers. This is off by default, because it may slow down GPU code; turn it on to measure more accurately the time spent in different parts of the code.
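A minimal sketch of why this matters, assuming HILA's hila::timer interface with start()/stop() (the timer and function names here are illustrative):

    static hila::timer update_timer("site update");

    void update(Field<double> &f) {
        update_timer.start();
        onsites(ALL) {
            f[X] = 2 * f[X];   // on GPU backends this kernel launch is asynchronous
        }
        // Without GPU_SYNCHRONIZE_TIMERS, stop() may record little more than the
        // launch overhead; with it, a GPU synchronization is inserted so the timer
        // reflects the actual kernel execution time.
        update_timer.stop();
    }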
GPU_GLOBAL_ARG_MAX_SIZE : the maximum size of a variable passed directly as an argument in a __global__ kernel call. Larger values are passed with gpuMemcopy() and a pointer instead. CUDA < 12.1 limits the total kernel parameter size to 4K; for CUDA >= 12.1 the limit is 32K. The default here is 2K. For HIP/ROCm the corresponding limit has not been determined. Passing as an argument is faster, but the size limit is relevant only for big "matrices".
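A hypothetical illustration (the sizes and names are only examples): the limit applies to loop constants captured by an onsites() loop.

    void apply(const Field<Vector<16, Complex<double>>> &v,
               Field<Vector<16, Complex<double>>> &w) {
        // Sketch: sizeof(m) = 16*16*sizeof(Complex<double>) = 4096 B, which exceeds
        // the 2048 B default, so m would be delivered to the kernel with gpuMemcopy()
        // and a pointer rather than as a direct kernel argument.
        Matrix<16, 16, Complex<double>> m;
        // ... fill m with the desired transformation ...
        onsites(ALL) {
            w[X] = m * v[X];
        }
    }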
#define GPU_MEMORY_POOL

#define GPU_RNG_THREAD_BLOCKS 32
GPU_RNG_THREAD_BLOCKS : number of thread blocks (of N_threads threads each) to use in onsites()-loops containing random numbers. GPU_RNG_THREAD_BLOCKS=0 or undefined means one RNG on each lattice site, and the thread block number is not restricted. An RNG takes about 48 B per generator (with XORWOW). When GPU_RNG_THREAD_BLOCKS > 0, only (N_threads * GPU_RNG_THREAD_BLOCKS) generators are in use, which reduces the memory footprint (and bandwidth demand) substantially. Too small a value slows down onsites()-loops containing RNGs, because fewer threads are active. Example:

    Field<Vector<4, double>> vfield;
    onsites(ALL) {
        // There is an RNG call here, so this onsites() loop is handled by
        // GPU_RNG_THREAD_BLOCKS thread blocks.
        vfield[X].gaussian_random();
    }

GPU_RNG_THREAD_BLOCKS < 0 disables GPU random numbers entirely, and loops like the one above will crash if executed. hilapp emits a warning, but the program is compiled.

Default: 32 seems to be an OK compromise. Can be set to 0 if memory is not a problem.
#define GPU_VECTOR_REDUCTION_THREAD_BLOCKS 32

GPU_VECTOR_REDUCTION_THREAD_BLOCKS : a value > 0 means that an onsites-loop performing a vector (histogram) reduction is handled by GPU_VECTOR_REDUCTION_THREAD_BLOCKS thread blocks of N_threads threads each. Each thread handles its own histogram, so there are (GPU_VECTOR_REDUCTION_THREAD_BLOCKS * N_threads) working copies of the histogram, which are combined at the end. Too small a value slows down the loop where the reduction happens; too large a value uses (temporarily) more memory. Example:

    ReductionVector<double> rv(100);
    Field<int> index;
    ...   // set index to values 0 .. 99
    onsites(ALL) {
        rv[index[X]] += .. ..
    }

GPU_VECTOR_REDUCTION_THREAD_BLOCKS = 0 or undefined means that the thread block number is not restricted and only a single histogram is used, updated with atomic operations (atomicAdd). This can be slower, but the performance is GPU hardware/driver dependent; in some cases GPU_VECTOR_REDUCTION_THREAD_BLOCKS = 0 turns out to be faster.

Default: 32 is currently an OK compromise (32 thread blocks).
#define GPUFFT_BATCH_SIZE 256

#define NDEBUG

#define NDIM 4

#define NODE_LAYOUT_BLOCK 4
NODE_LAYOUT_TRIVIAL or NODE_LAYOUT_BLOCK determines how MPI ranks are laid out on the logical lattice. NODE_LAYOUT_TRIVIAL lays out the ranks in logical order, where the x-direction runs fastest, etc. If NODE_LAYOUT_BLOCK is defined, NODE_LAYOUT_BLOCK consecutive MPI ranks are laid out so that they form a compact "block" of ranks that are logically close together. Define NODE_LAYOUT_BLOCK to be the number of MPI processes within one compute node; this tries to maximize the use of fast intra-node communications. One of these two must be defined.