This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines.
5.1 Measuring performance

Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are.
The best way to do this is to run your system (the actual number of atoms you intend to simulate) for a modest number of timesteps (say 100, or a few hundred at most) on several different processor counts, including a single processor if possible. Do this for an equilibrated version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for 1000s of timesteps to get accurate timings; you can simply extrapolate from short runs.
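For example, a minimal set of short benchmark runs might look like the following (a sketch; the executable name lmp_machine and the script name in.myscript are placeholders, and the script would contain something like a "run 100" command):

mpirun -np 1 lmp_machine -in in.myscript     # serial baseline
mpirun -np 4 lmp_machine -in in.myscript     # 4 processors
mpirun -np 16 lmp_machine -in in.myscript    # 16 processors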
For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. This section of the manual has an overview.
Running on one (or a few) processors should give a good estimate of the serial performance and of which portions of the timestep take the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability. I.e. if the simulation runs 16x faster on 16 processors, it is 100% parallel efficient; if it runs 8x faster on 16 processors, it is 50% efficient.
The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling. Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive.
Another important detail in the timing info is the pair of histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait when communication occurs for other processors to catch up. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp and recompile LAMMPS to obtain synchronized timings.
NOTE: this sub-section is still a work in progress
Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.
2-FFT PPPM, also called analytic differentiation or ad PPPM, uses 2 FFTs instead of the 4 FFTs used by the default ik differentiation PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
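For example, a minimal sketch of switching to 2-FFT PPPM via the kspace_modify diff keyword (the accuracy value shown is illustrative):

kspace_style pppm 1.0e-4
kspace_modify diff ad      # analytic differentiation (2-FFT) instead of the default ik (4-FFT)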
Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to half the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful when using a relative accuracy of slightly greater than 1e-5 and above.
IMPORTANT NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat.
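As a sketch, staggered PPPM can be selected with the pppm/stagger kspace style (the accuracy value shown is illustrative):

kspace_style pppm/stagger 1.0e-5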
Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS. These will typically run faster than the standard non-accelerated versions, if you have the appropriate hardware on your system.
The accelerated styles have the same name as the standard styles, except that a suffix is appended. Otherwise, the syntax for the command is identical, their functionality is the same, and the numerical results they produce should also be identical, except for precision and round-off issues.
For example, all of these styles are variants of the basic Lennard-Jones pair style pair_style lj/cut:
Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input script. Or you can use the -suffix command-line switch to invoke the accelerated versions automatically, without changing your input script. The suffix command allows you to set a suffix explicitly, and to turn the command-line switch setting off and back on, from within your input script.
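For example, a minimal sketch of toggling the suffix setting from within an input script (the styles and cutoff shown are illustrative):

suffix omp                # subsequent styles use their omp variants, if available
pair_style lj/cut 2.5     # becomes pair_style lj/cut/omp
suffix off                # styles defined after this use the plain versions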
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU packages, and can be run on NVIDIA GPUs associated with your CPUs. The speed-up due to GPU usage depends on a variety of factors, as discussed below.
Styles with a "kk" suffix are part of the KOKKOS package, and can be run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow a pair-style to be run in multi-threaded mode using OpenMP. This can be useful on nodes with high-core counts when using less MPI processes than cores is advantageous, e.g. when running with PPPM so that FFTs are run on fewer MPI processors or when the many MPI tasks would overload the available bandwidth for communication.
Styles with an "opt" suffix are part of the OPT package and typically speed-up the pairwise calculations of your simulation by 5-25%.
To see what styles are currently available in each of the accelerated packages, see Section_commands 5 of the manual. A list of accelerated styles is included in the pair, fix, compute, and kspace sections. The doc page for each individual style (e.g. pair lj/cut or fix nve) also lists any accelerated variants available for that style.
The following sections explain how to build and run LAMMPS with each of these accelerator packages: OPT, USER-OMP, GPU, USER-CUDA, and KOKKOS.
The final section compares and contrasts the GPU and USER-CUDA packages, since they are both designed to use NVIDIA hardware.
The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It is the same as for any other package which has no additional library dependencies:
make yes-opt
make machine
If your input script uses one of the OPT pair styles, you can run it as follows:
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
You should see a reduction in the "Pair time" printed out at the end of the run. On most machines and problems, this will typically be a 5 to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, all dihedral styles, and a few fixes in LAMMPS. The package currently uses the OpenMP interface, which requires using a specific compiler flag in the makefile to enable multiple threads; without this flag the corresponding styles will still compile and work, but will not support multi-threading.
Building LAMMPS with the USER-OMP package:
The procedure for building LAMMPS with the USER-OMP package is simple. You have to edit your machine specific makefile to add the flag to enable OpenMP support to both the CCFLAGS and LINKFLAGS variables. For the GNU compilers and Intel compilers, this flag is called -fopenmp. Check your compiler documentation to find out which flag you need to add. The rest of the compilation is the same as for any other package which has no additional library dependencies:
make yes-user-omp
make machine
Please note that this will only install accelerated versions of styles that are already installed, so you should install this package last, or else you may be missing some accelerated styles. If you plan to uninstall some other package, you should first uninstall the USER-OMP package, then the other package, and then re-install USER-OMP, to make sure that no orphaned omp style files remain, which would lead to compilation errors.
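A minimal sketch of this ordering, assuming for illustration that the MOLECULE package is the one being removed:

make no-user-omp      # remove USER-OMP first
make no-molecule      # then remove the other package (illustrative choice)
make yes-user-omp     # re-install USER-OMP so no orphaned omp files remain
make machine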
If your input script uses one of the regular styles that also exist as an OpenMP version in the USER-OMP package, you can run it as follows:
env OMP_NUM_THREADS=4 lmp_serial -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
The value of the environment variable OMP_NUM_THREADS determines how many threads per MPI task are launched. All three examples above use a total of 4 CPU cores. The method for passing the OMP_NUM_THREADS environment variable to all processes differs between MPI implementations; two variants, for MPICH and OpenMPI respectively, are shown above. Please check the documentation of your MPI installation for additional details. Alternatively, the value provided by OMP_NUM_THREADS can be overridden with the package omp command. Depending on which styles are accelerated in your input, you should see a reduction in the "Pair time" and/or "Bond time" and "Loop time" printed at the end of the run. The optimal ratio of MPI tasks to OpenMP threads can vary a lot and should always be confirmed through benchmark runs for the current system on the current machine.
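For example, a minimal sketch of setting the thread count from the input script instead of the environment (the value 4 is illustrative):

package omp 4       # use 4 OpenMP threads per MPI task, overriding OMP_NUM_THREADS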
Restrictions:
None of the pair styles in the USER-OMP package support the "inner", "middle", "outer" options for r-RESPA integration, only the "pair" option is supported.
Parallel efficiency and performance tips:
In most simple cases the MPI parallelization in LAMMPS is more efficient than the multi-threading implemented in the USER-OMP package, and the parallel efficiency also varies between individual styles. On the other hand, in many cases you still want to use the omp versions - even when compiling or running without OpenMP support - since they all contain optimizations similar to those in the OPT package, which can result in a serial speedup.
Using multi-threading is most effective under the following circumstances:
The best parallel efficiency from omp styles is typically achieved when there is at least one MPI task per physical processor, i.e. socket or die.
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth requirements is not offset by the gain in CPU utilization through hyper-threading.
A description of the multi-threading strategy and some performance examples are presented here
The GPU package was developed by Mike Brown at ORNL and his collaborators. It provides GPU versions of several pair styles, including the 3-body Stillinger-Weber pair style, and for long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system:
Building LAMMPS with the GPU package:
As with other packages that include a separately compiled library, you need to first build the GPU library, before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, use a Makefile in lib/gpu appropriate for your system.
Before building the library, you can set the precision it will use by editing the CUDA_PREC setting in the Makefile you are using, as follows:
CUDA_PREC = -D_SINGLE_SINGLE # Single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE # Double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE # Accumulation of forces, etc, in double
The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings.
To build the library, then type:
cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README)
If you are successful, you will produce the file lib/libgpu.a.
Now you are ready to build LAMMPS with the GPU package installed:
cd lammps/src
make yes-gpu
make machine
Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH. These need to be set appropriately to include the paths and settings for the CUDA system software on your machine. See src/MAKE/Makefile.g++ for an example.
Also note that if you change the GPU library precision, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", then re-build the library. You must then also re-build LAMMPS, so that it re-links with the new library.
Running an input script:
The examples/gpu and bench/GPU directories have scripts that can be run with the GPU package, as well as detailed instructions on how to run them.
The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of the GPU package.
When using the GPU package, you cannot assign more than one physical GPU to an MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way.
Input script requirements to run using pair or PPPM styles with a gpu suffix are as follows:
The default for the package gpu command is to have all the MPI tasks on the compute node use a single GPU. If you have multiple GPUs per node, then be sure to create one or more MPI tasks per GPU, and use the first/last settings in the package gpu command to include all the GPU IDs on the node. E.g. first = 0, last = 1, for 2 GPUs. For example, on an 8-core 2-GPU compute node, if you assign 8 MPI tasks to the node, the following command in the input script
package gpu force/neigh 0 1 -1
would specify that each GPU is shared by 4 MPI tasks. The final -1 lets LAMMPS dynamically balance force calculations across the CPU cores and GPUs, i.e. each CPU core performs force calculations for some small fraction of the particles, while the GPUs perform force calculations for the majority of the particles.
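Alternatively (a hedged sketch), a fixed split can be given in place of -1; e.g. a value of 0.7 assigns the force computation for roughly 70% of the particles to the GPUs and the remainder to the CPU cores:

package gpu force/neigh 0 1 0.7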
Timing output:
As described by the package gpu command, GPU accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with bond, angle, dihedral, improper, and long-range calculations will not be included in the "Pair" time.
When the mode setting for the package gpu command is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc) is output only with the LAMMPS screen output (not in the log file) at the end of each run. These timings represent the total time spent on the GPU for each routine, regardless of asynchronous CPU calculations.
The output section "GPU Time Info (average)" reports "Max Mem / Proc". This is the maximum memory used at one time on the GPU for data storage by a single MPI process.
Performance tips:
You should experiment with how many MPI tasks per GPU to use to see what gives the best performance for your problem. This is a function of your problem size and what pair style you are using. Likewise, you should also experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster.
Using multiple MPI tasks per GPU will often give the best performance, as allowed by most multi-core CPU/GPU configurations.
If the number of particles per MPI task is small (e.g. 100s of particles), it can be more efficient to run with fewer MPI tasks per GPU, even if you do not use all the cores on the compute node.
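For example (a sketch; the task counts are illustrative for an 8-core node with 2 GPUs, and in.script is assumed to contain an appropriate package gpu command), you might compare runs such as:

mpirun -np 8 lmp_machine -sf gpu -in in.script    # 4 MPI tasks per GPU
mpirun -np 4 lmp_machine -sf gpu -in in.script    # 2 MPI tasks per GPU
mpirun -np 2 lmp_machine -sf gpu -in in.script    # 1 MPI task per GPU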
The Benchmark page of the LAMMPS web site gives GPU performance on a desktop machine and the Titan HPC platform at ORNL for several of the LAMMPS benchmarks, as a function of problem size and number of compute nodes.
The USER-CUDA package was developed by Christian Trott at the Technical University of Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and for long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you need to have specific NVIDIA hardware and install specific NVIDIA CUDA software on your system.
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the corresponding GPU drivers. The NVIDIA CUDA SDK is not required by the USER-CUDA package, but we recommend installing it so you can verify that its sample projects compile and run without problems.
Building LAMMPS with the USER-CUDA package:
As with other packages that include a separately compiled library, you need to first build the USER-CUDA library, before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, do the following, using settings in the lib/cuda Makefiles appropriate for your system:
precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set GPU compute capability
  M = 20 for CC2.0 (GF100/110, e.g. C2050, GTX580, GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
prec_timer=0/1 to use hi-precision timers
  0 = do not use them (default)
  1 = use these timers (usually only useful on Mac machines)
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode (only useful for developers)
cufft=1 to determine usage of CUDA FFT library
  0 = no CUFFT support (default)
  (in the future other CUDA-enabled FFT libraries might be supported)
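For example, one possible invocation, assuming these options are passed as make variables as described in lib/cuda/README (the chosen values are illustrative):

cd lammps/lib/cuda
make precision=2 arch=20     # double precision, CC 2.0 hardware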
Now you are ready to build LAMMPS with the USER-CUDA package installed:
cd lammps/src
make yes-user-cuda
make machine
Note that the LAMMPS build references the lib/cuda/Makefile.common file to extract the CUDA settings specific to your system. So it is important that you have first built the cuda library (in lib/cuda) using settings appropriate to your system.
Input script requirements:
Additional input script requirements to run styles with a cuda suffix are as follows:
Performance tips:
The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of 1000s.
As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation will run.
The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and methods and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.
Kokkos is a C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware (GPU, Intel Phi, many-core chip).
Second, it provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store units. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this operation is hidden from the developer, and does not affect how the single implementation of the kernel is coded.
These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. This is done by selecting a "host" and "device" to build for, compatible with the compute nodes in your machine. Note that if you are running on a desktop machine, you typically have one compute node. On a cluster or supercomputer there may be dozens or 1000s of compute nodes. The procedure for building and running with the Kokkos library is the same, no matter how many nodes you run on.
All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos.
Kokkos provides support for one or two modes of execution per MPI task. This means that some computational tasks (pairwise interactions, neighbor list builds, time integration, etc) are parallelized in one or the other of the two modes. The first mode is called the "host" and is one or more threads running on one or more physical CPUs (within the node). Currently, both multi-core CPUs and an Intel Phi processor (running in native mode) are supported. The second mode is called the "device" and is an accelerator chip of some kind. Currently only an NVIDIA GPU is supported. If your compute node does not have a GPU, then there is only one mode of execution, i.e. the host and device are the same.
IMPORTANT NOTE: Currently, if using GPUs, you should set the number of MPI tasks per compute node equal to the number of GPUs per compute node. In the future Kokkos will support assigning one GPU to multiple MPI tasks or using multiple GPUs per MPI task. Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models (in particular, relatively extensive C++ support is required for the kernel language). This is expected to change in the future.
Here are several examples of how to build LAMMPS and run a simulation using the KOKKOS package for typical compute node configurations. Note that the -np setting for the mpirun command in these examples is for a run on a single node. To scale these examples up to run on a system with N compute nodes, simply multiply the -np setting by N.
All the build steps are performed from within the src directory. All the run steps are performed in the bench directory using the in.lj input script. It is assumed the LAMMPS executable has been copied to that directory or whatever directory the runs are being performed in. Details of the various options are discussed below.
Compute node(s) = dual hex-core CPUs and no GPU:
make yes-kokkos      # install the KOKKOS package
make g++ OMP=yes     # build with OpenMP, no CUDA
mpirun -np 12 lmp_g++ < in.lj                      # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk < in.lj         # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk < in.lj     # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj      # two MPI tasks, 6 threads/task
Compute node(s) = Intel Phi with 61 cores:
make yes-kokkos
make g++ OMP=yes MIC=yes     # build with OpenMP for Phi
mpirun -np 12 lmp_g++ -k on t 20 -sf kk < in.lj    # 12*20 = 240 total cores
mpirun -np 15 lmp_g++ -k on t 16 -sf kk < in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk < in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk < in.lj
Compute node(s) = dual hex-core CPUs and a single GPU:
make yes-kokkos
make cuda CUDA=yes     # build for GPU, use src/MAKE/Makefile.cuda
mpirun -np 1 lmp_cuda -k on t 6 -sf kk < in.lj
Compute node(s) = dual 8-core CPUs and 2 GPUs:
make yes-kokkos
make cuda CUDA=yes
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk < in.lj # use both GPUs, one per MPI task
Building LAMMPS with the KOKKOS package:
A summary of the build process is given here. More details and all the available make variable options are given in this section of the manual.
From the src directory, type
make yes-kokkos
to include the KOKKOS package. Then perform a normal LAMMPS build, with additional make variable specifications to choose the host and device you will run the resulting executable on, e.g.
make g++ OMP=yes
make cuda CUDA=yes
As illustrated above, the most important variables to set are OMP, CUDA, and MIC. The default settings are OMP=yes, CUDA=no, MIC=no. Setting OMP to yes will use OpenMP for threading on the host, as well as on the device (if no GPU is present). Setting CUDA to yes will use one or more GPUs as the device. Setting MIC=yes is necessary when building for an Intel Phi processor.
Note that to use a GPU, you must use a low-level Makefile, e.g. src/MAKE/Makefile.cuda as included in the LAMMPS distro, which uses the NVIDIA "nvcc" compiler. You must check that the CCFLAGS -arch setting is appropriate for your NVIDIA hardware and installed software. Typical values for -arch are given in this section of the manual, as well as other settings that must be included in the low-level Makefile, if you create your own.
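As an illustrative sketch only (the exact flags belong in your low-level Makefile and depend on your hardware and toolkit), the -arch setting might look like:

CCFLAGS = -O3 -arch=sm_35     # e.g. for a Kepler-class Tesla K20/K40; adjust for your GPU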
Input scripts and use of command-line switches -kokkos and -suffix:
To use any Kokkos-enabled style provided in the KOKKOS package, you must use a Kokkos-enabled atom style. LAMMPS will give an error if you do not do this.
There are two command-line switches relevant to using Kokkos, -k or -kokkos, and -sf or -suffix. They are described in detail in this section of the manual.
Here are common options to use:
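For example (illustrative combinations drawn from the examples earlier in this section; see the command-line switch documentation for the full set of options; lmp_machine and in.script are placeholders):

lmp_machine -k on t 8 -sf kk -in in.script                      # host threads only: 8 threads per MPI task
mpirun -np 2 lmp_machine -k on t 8 g 2 -sf kk -in in.script     # 2 MPI tasks, one GPU each, 8 threads/task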
Use of package command options:
Using the package kokkos command in an input script allows choice of options for neighbor lists and communication. See the package command doc page for details and default settings.
Experimenting with different styles of neighbor lists or inter-node communication can provide a speed-up for specific calculations.
Running on a multi-core CPU:
Build with OMP=yes (the default) and CUDA=no (the default).
If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the -k command-line switch. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).
You can compare the performance running in different modes:
Examples of mpirun commands in these modes, for nodes with dual hex-core CPUs and no GPU, are shown above.
Running on GPUs:
Build with CUDA=yes, using src/MAKE/Makefile.cuda. Ensure the setting for CUDA_PATH in lib/kokkos/Makefile.lammps is correct for your CUDA software installation. Ensure the -arch setting in src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see this section of the manual for details).
The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.
Use the -kokkos command-line switch to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use fewer than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.
Examples of mpirun commands that follow these rules, for nodes with dual hex-core CPUs and one or two GPUs, are shown above.
Running on an Intel Phi:
Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.
Build with OMP=yes (the default) and MIC=yes. The latter ensures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.
Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded to run the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.
The -np setting of the mpirun command sets the number of MPI tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.
Examples of mpirun commands that follow these rules, for Intel Phi nodes with 61 cores, are shown above.
Examples and benchmarks:
The examples/kokkos and bench/KOKKOS directories have scripts that can be run with the KOKKOS package, as well as detailed instructions on how to run them.
IMPORTANT NOTE: the bench/KOKKOS directory does not yet exist. It will be added later.
Additional performance issues:
When using threads (OpenMP or pthreads), it is important for performance to bind the threads to physical cores, so they do not migrate during a simulation. The same is true for MPI tasks, but the default binding rules of various MPI implementations do not account for thread binding.
Thus if you use more than one thread per MPI task, you should ensure MPI tasks are bound to CPU sockets. Furthermore, use thread-affinity environment variables from the OpenMP runtime when using OpenMP, and compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc 4.7 or later, intel 12 or later) setting the environment variable OMP_PROC_BIND=true should be sufficient. A typical mpirun command should set these flags:
OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
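For example, a sketch combining socket binding with the OpenMP affinity setting mentioned above (OpenMPI syntax; the thread count and script name are illustrative):

export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi -k on t 6 -sf kk -in in.script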
When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for thermo or dump output will cause data to be copied back to the CPU.
You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.
You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.
Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.
As a consequence, for a particular simulation on specific hardware, one package may be faster than the other. We give guidelines below, but the best way to determine which package is faster for your input script is to try both of them on your machine. See the benchmarking section below for examples where this has been done.
Guidelines for using each package optimally:
Differences between the two packages:
Examples:
The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.
These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.