
5. Accelerating LAMMPS performance

This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines.

5.1 Measuring performance
5.2 General strategies
5.3 Packages with optimized styles
5.4 OPT package
5.5 USER-OMP package
5.6 GPU package
5.7 USER-CUDA package
5.8 KOKKOS package
5.9 USER-INTEL package
5.10 Comparison of USER-CUDA, GPU, and KOKKOS packages

The Benchmark page of the LAMMPS web site gives performance results for the various accelerator packages discussed in this section for several of the standard LAMMPS benchmarks, as a function of problem size and number of compute nodes, on different hardware platforms.



5.1 Measuring performance

Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are.

The best way to do this is to run your system (actual number of atoms) for a modest number of timesteps (say 100, or a few hundred at most) on several different processor counts, including a single processor if possible. Do this for an equilibrated version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for thousands of timesteps to get accurate timings; you can simply extrapolate from short runs.
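
For example, assuming an executable named lmp_machine and an input script in.script that ends with a short run (e.g. "run 100"), a set of scaling runs might look like this sketch:

mpirun -np 1 lmp_machine -in in.script
mpirun -np 4 lmp_machine -in in.script
mpirun -np 16 lmp_machine -in in.script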

For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. This section of the manual has an overview.

Running on one (or a few) processors should give a good estimate of the serial performance and of which portions of the timestep take the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability, i.e. if the simulation runs 16x faster on 16 processors, it is 100% parallel efficient; if it runs 8x faster on 16 processors, it is 50% efficient.

The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling. Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive.

Another important detail in the timing info is the histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait for other processors to catch up when communication occurs. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp and recompile LAMMPS to obtain synchronized timings.


5.2 General strategies

NOTE: this section 5.2 is still a work in progress

Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.

2-FFT PPPM, also called analytic differentiation or ad PPPM, uses 2 FFTs instead of the 4 FFTs used by the default ik differentiation PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
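
For example, a minimal input-script sketch for switching from the default 4-FFT PPPM to 2-FFT PPPM (assuming the "diff" keyword documented on the kspace_modify doc page) is:

kspace_style pppm 1.0e-4       # long-range solver with relative accuracy 1e-4
kspace_modify diff ad          # analytic differentiation = 2-FFT PPPM (default is "diff ik")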

Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but it also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to halve the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful when using a relative accuracy of slightly greater than 1e-5 and above.

IMPORTANT NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat.
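
As a sketch, staggered PPPM can be requested via the kspace_modify "stagger" keyword (assuming your LAMMPS version supports it; see the kspace_modify doc page):

kspace_style pppm 1.0e-5       # high relative accuracy, where staggering can pay off
kspace_modify stagger yes      # use two shifted meshes per timestep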


5.3 Packages with optimized styles

Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS. These will typically run faster than the standard non-accelerated versions, if you have the appropriate hardware on your system.

All of these commands are in packages. Currently, there are 6 such accelerator packages in LAMMPS, either as standard or user packages:

USER-CUDA for NVIDIA GPUs
GPU for NVIDIA GPUs as well as OpenCL support
USER-INTEL for Intel CPUs and Intel Xeon Phi
KOKKOS for GPUs, Intel Xeon Phi, and OpenMP threading
USER-OMP for OpenMP threading
OPT generic CPU optimizations

Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that specifies the style is identical, its functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects.

For example, all of these styles are accelerated variants of the basic Lennard-Jones pair_style lj/cut:

pair_style lj/cut/cuda
pair_style lj/cut/gpu
pair_style lj/cut/intel
pair_style lj/cut/kk
pair_style lj/cut/omp
pair_style lj/cut/opt

Assuming LAMMPS was built with the appropriate package, these styles can be invoked by specifying them explicitly in your input script. Or the -suffix command-line switch can be used to automatically invoke the accelerated versions, without changing the input script. Use of the suffix command allows a suffix to be set explicitly and to be turned off and back on at various points within an input script.
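
For example, the following input-script fragment (a sketch) shows the suffix command being set and then turned off, so that only some styles use their accelerated variants:

suffix omp                # subsequent styles use their "omp" variants if available
pair_style lj/cut 2.5     # instantiated as pair_style lj/cut/omp
suffix off                # subsequent styles use the plain, un-accelerated versions
fix 1 all nve             # plain fix nve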

To see what styles are currently available in each of the accelerated packages, see Section_commands 5 of the manual. The doc page for each individual style (e.g. pair lj/cut or fix nve) also lists any accelerated variants available for that style.

Here is a brief summary of what the various packages provide. Details are in individual sections below.

The following sections explain, for each package: the hardware and software it requires, how to build LAMMPS with it, how to run with it, the speed-ups to expect, guidelines for best performance, and restrictions.

The final section compares and contrasts the USER-CUDA, GPU, and KOKKOS packages, since they all enable use of NVIDIA GPUs.


5.4 OPT package

The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.

Required hardware/software:

None.

Building LAMMPS with the OPT package:

Include the package and build LAMMPS.

make yes-opt
make machine 

No additional compile/link flags are needed in your low-level src/MAKE/Makefile.machine.

Running with the OPT package:

You can explicitly add an "opt" suffix to the pair_style command in your input script:

pair_style lj/cut/opt 2.5 

Or you can run with the -sf command-line switch, which will automatically append "opt" to styles that support it.

lmp_machine -sf opt -in in.script
mpirun -np 4 lmp_machine -sf opt -in in.script 

Speed-ups to expect:

You should see a reduction in the "Pair time" value printed at the end of a run. On most machines for reasonable problem sizes, it will be a 5 to 20% savings.

Guidelines for best performance:

None. Just try out an OPT pair style to see how it performs.

Restrictions:

None.


5.5 USER-OMP package

The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.

Required hardware/software:

Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by an MPI task running on a CPU.

Building LAMMPS with the USER-OMP package:

Include the package and build LAMMPS.

cd lammps/src
make yes-user-omp
make machine 

Your low-level src/MAKE/Makefile.machine needs a flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables. For GNU and Intel compilers, this flag is -fopenmp. Without this flag the USER-OMP styles will still be compiled and work, but will not support multi-threading.
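
For example, with the GNU compiler the relevant makefile lines might look like the following sketch (other flags, such as optimization levels, depend on your existing Makefile.machine):

CCFLAGS =    -g -O3 -fopenmp
LINKFLAGS =  -g -O3 -fopenmp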

Running with the USER-OMP package:

There are 3 issues (a,b,c) to address:

(a) Specify how many threads per MPI task to use

Note that the product of MPI tasks * threads/task should not exceed the physical number of cores, otherwise performance will suffer.

By default LAMMPS uses 1 thread per MPI task. If the environment variable OMP_NUM_THREADS is set to a valid value, this value is used. You can set this environment variable when you launch LAMMPS, e.g.

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script 

or you can set it permanently in your shell's start-up script. All three of these examples use a total of 4 CPU cores.

Note that different MPI implementations have different ways of passing the OMP_NUM_THREADS environment variable to all MPI processes. The 2nd line above is for MPICH; the 3rd line with -x is for OpenMPI. Check your MPI documentation for additional details.

You can also set the number of threads per MPI task via the package omp command, which will override any OMP_NUM_THREADS setting.
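
For example, placing a line like this near the top of your input script requests 4 threads per MPI task, regardless of OMP_NUM_THREADS (a sketch; see the package doc page for additional keywords):

package omp 4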

(b) Enable the USER-OMP package

This can be done in one of two ways. Use a package omp command near the top of your input script.

Or use the "-sf omp" command-line switch, which will automatically invoke the command package omp *.

(c) Use OMP-accelerated styles

This can be done by explicitly adding an "omp" suffix to any supported style in your input script:

pair_style lj/cut/omp 2.5
fix nve/omp 

Or you can run with the "-sf omp" command-line switch, which will automatically append "omp" to styles that support it.

lmp_machine -sf omp -in in.script
mpirun -np 4 lmp_machine -sf omp -in in.script 

Using the "suffix omp" command in your input script does the same thing.

Speed-ups to expect:

Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its standard (un-accelerated) styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, as described above.

With multiple threads/task, the optimal choice of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.

A description of the multi-threading strategy used in the USER-OMP package and some performance examples are presented here.

Guidelines for best performance:

For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following circumstances:

Other performance tips are as follows:

Restrictions:

None.


5.6 GPU package

The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and for kspace_style pppm for long-range Coulombics. It has the following general features:

Required hardware/software:

To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system:

Building LAMMPS with the GPU package:

This requires two steps (a,b): build the GPU library, then build LAMMPS.

(a) Build the GPU library

The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile.

See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings:

CUDA_PREC = -D_SINGLE_SINGLE  # Single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE  # Double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE  # Accumulation of forces, etc, in double 

The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings.

To build the library, type:

make -f Makefile.machine 

If successful, it will produce the files libgpu.a and Makefile.lammps.

The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed.
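
As a rough sketch, for a Linux system with the CUDA toolkit in /usr/local/cuda, the settings in Makefile.lammps are typically along these lines (the variable values and paths below are illustrative assumptions; use the values appropriate for your machine):

gpu_SYSINC =                             # extra include paths, often empty
gpu_SYSLIB = -lcudart -lcuda             # CUDA runtime and driver libraries
gpu_SYSPATH = -L/usr/local/cuda/lib64    # path to the CUDA libraries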

Note that to change the precision of the GPU library, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", followed by the make command above.

(b) Build LAMMPS

cd lammps/src
make yes-gpu
make machine 

Note that if you change the GPU library precision (discussed above), you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library.

Running with the GPU package:

The examples/gpu and bench/GPU directories have scripts that can be run with the GPU package, as well as detailed instructions on how to run them.

To run with the GPU package, there are 3 basic issues (a,b,c) to address:

(a) Use one or more MPI tasks per GPU

The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of the GPU package.

When using the GPU package, you cannot assign more than one physical GPU to a single MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way.

The default is to have all MPI tasks on a compute node use a single GPU. To use multiple GPUs per node, be sure to create one or more MPI tasks per GPU, and use the first/last settings in the package gpu command to include all the GPU IDs on the node. E.g. first = 0, last = 1, for 2 GPUs. On a node with 8 CPU cores and 2 GPUs, this would specify that each GPU is shared by 4 MPI tasks.
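
For example, on the 2-GPU node just described, a package command like the following (a sketch that follows the form of the default command quoted in (b) below) makes both GPUs available:

package gpu force/neigh 0 1 1     # mode, first GPU ID = 0, last GPU ID = 1, split as in the default command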

(b) Enable the GPU package

This can be done in one of two ways. Use a package gpu command near the top of your input script.

Or use the "-sf gpu" command-line switch, which will automatically invoke the command package gpu force/neigh 0 0 1. Note that this specifies use of a single GPU (per node), so you must specify the package command in your input script explicitly if you want to use multiple GPUs per node.

(c) Use GPU-accelerated styles

This can be done by explicitly adding a "gpu" suffix to any supported style in your input script:

pair_style lj/cut/gpu 2.5 

Or you can run with the "-sf gpu" command-line switch, which will automatically append "gpu" to styles that support it.

lmp_machine -sf gpu -in in.script
mpirun -np 4 lmp_machine -sf gpu -in in.script 

Using the "suffix gpu" command in your input script does the same thing.

IMPORTANT NOTE: The input script must also use the newton command with a pairwise setting of off, since on is the default.
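
A minimal sketch of the corresponding input-script line is:

newton off on     # Newton's 3rd law off for pairwise interactions, on for bonded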

Speed-ups to expect:

The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).

See the Benchmark page of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL.

You should also experiment with how many MPI tasks per GPU to use, to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used. Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster.

Guidelines for best performance:

Restrictions:

None.


5.7 USER-CUDA package

The USER-CUDA package was developed by Christian Trott (Sandia) while at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and for long-range Coulombics via the PPPM command. It has the following general features:

Required hardware/software:

To use this package, you need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system:

Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the corresponding GPU drivers. The Nvidia Cuda SDK is not required, but we recommend it also be installed. You can then make sure its sample projects can be compiled without problems.

Building LAMMPS with the USER-CUDA package:

This requires two steps (a,b): build the USER-CUDA library, then build LAMMPS.

(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda. If your CUDA toolkit is not installed in the default system directory /usr/local/cuda, edit the file lib/cuda/Makefile.common accordingly.

To set options for the library build, type "make OPTIONS", where OPTIONS are one or more of the following; an example command is shown after the list. The settings will be written to lib/cuda/Makefile.defaults and used when the library is built.

precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
prec_timer=0/1 to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  this is usually only useful for Mac machines 
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
cufft=1 for use of the CUDA FFT library
  0 = no CUFFT support (default)
  in the future other CUDA-enabled FFT libraries might be supported 
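
For example, to select full double precision for CC 2.0 hardware, one might type (a sketch using the options listed above):

make precision=2 arch=20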

To build the library, simply type:

make 

If successful, it will produce the files libcuda.a and Makefile.lammps.

Note that if you change any of the options (like precision), you need to re-build the entire library. Do a "make clean" first, followed by "make".

(b) Build LAMMPS

cd lammps/src
make yes-user-cuda
make machine 

Note that if you change the USER-CUDA library precision (discussed above), you also need to re-install the USER-CUDA package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new USER-CUDA library.

Running with the USER-CUDA package:

The bench/CUDA directory has scripts that can be run with the USER-CUDA package, as well as detailed instructions on how to run them.

To run with the USER-CUDA package, there are 3 basic issues (a,b,c) to address:

(a) Use one MPI task per GPU

This is a requirement of the USER-CUDA package, i.e. you cannot use multiple MPI tasks per physical GPU. So if you are running on nodes with 1 or 2 GPUs, use the mpirun or mpiexec command to specify 1 or 2 MPI tasks per node.

If the nodes have more than 1 GPU, you must use the package cuda command near the top of your input script to specify that more than 1 GPU will be used (the default = 1).

(b) Enable the USER-CUDA package

The "-c on" or "-cuda on" command-line switch must be used when launching LAMMPS.

(c) Use USER-CUDA-accelerated styles

This can be done by explicitly adding a "cuda" suffix to any supported style in your input script:

pair_style lj/cut/cuda 2.5 

Or you can run with the "-sf cuda" command-line switch, which will automatically append "cuda" to styles that support it.

lmp_machine -sf cuda -in in.script
mpirun -np 4 lmp_machine -sf cuda -in in.script 

Using the "suffix cuda" command in your input script does the same thing.

Speed-ups to expect:

The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).

See the Benchmark page of the LAMMPS web site for performance of the USER-CUDA package on different hardware.

Guidelines for best performance:

Restrictions:

None.


5.8 KOKKOS package

The KOKKOS package was developed primarily by Christian Trott (Sandia), with contributions of various styles by others, including Sikandar Mashayak (UIUC). The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.

The Kokkos library is part of Trilinos and is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core chip.

The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this functionality is hidden from the developer, and does not affect how the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. This is done by selecting a "host" and "device" to build for, compatible with the compute nodes in your machine (one on a desktop machine or 1000s on a supercomputer).

All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos.

Kokkos provides support for two different modes of execution per MPI task. This means that computational tasks (pairwise interactions, neighbor list builds, time integration, etc) can be parallelized for one or the other of the two modes. The first mode is called the "host" and is one or more threads running on one or more physical CPUs (within the node). Currently, both multi-core CPUs and an Intel Phi processor (running in native mode) are supported. The second mode is called the "device" and is an accelerator chip of some kind. Currently only an NVIDIA GPU is supported. If your compute node does not have a GPU, then there is only one mode of execution, i.e. the host and device are the same.

Required hardware/software:

The KOKKOS package can be used to build and run LAMMPS on the following kinds of hardware configurations:

Intel Xeon Phi coprocessors are supported in "native" mode only.

Only NVIDIA GPUs are currently supported.

IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later). The Kokkos library exploits texture cache options not supported by Tesla generation GPUs (or older).

To build the KOKKOS package for GPUs, NVIDIA Cuda software must be installed on your system. See the discussion above for the USER-CUDA and GPU packages for details of how to check and do this.

Building LAMMPS with the KOKKOS package:

Unlike other acceleration packages discussed in this section, the Kokkos library in lib/kokkos does not have to be pre-built before building LAMMPS itself. Instead, options for the Kokkos library are specified at compile time, when LAMMPS itself is built. This can be done in one of two ways, as discussed below.

Here are examples of how to build LAMMPS for the different compute-node configurations listed above.

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make g++ OMP=yes 

Intel Xeon Phi:

cd lammps/src
make yes-kokkos
make g++ OMP=yes MIC=yes 

CPUs and GPUs:

cd lammps/src
make yes-kokkos
make cuda CUDA=yes 

These examples set the KOKKOS-specific OMP, MIC, and CUDA variables on the make command line, which requires a GNU-compatible make command. Try "gmake" if your system's standard make complains.

IMPORTANT NOTE: If you build using make line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a "make clean-all" or "make clean-machine" before each build. This is to force all the KOKKOS-dependent files to be re-compiled with the new options.

You can also hardwire these variables in the specified machine makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above, with a line like:

MIC = yes 

Note that if you build LAMMPS multiple times in this manner, using different KOKKOS options (defined in different machine makefiles), you do not have to worry about doing a "clean" in between. This is because the targets will be different.

IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different machine makefile, in this case src/MAKE/Makefile.cuda, which is included in the LAMMPS distribution. To build the KOKKOS package for a GPU, this makefile must use the NVIDIA "nvcc" compiler. It must also have a CCFLAGS -arch setting that is appropriate for your NVIDIA hardware and installed software. Typical values for -arch are given in Section 2.3.4 of the manual, as well as other settings that must be included in the machine makefile, if you create your own.

There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in the machine makefile in the src/MAKE directory. See Section 2.3.4 of the manual for details.

IMPORTANT NOTE: Currently, there are no precision options with the KOKKOS package. All compilation and computation is performed in double precision.

Running with the KOKKOS package:

The examples/kokkos and bench/KOKKOS directories have scripts that can be run with the KOKKOS package, as well as detailed instructions on how to run them.

There are 3 issues (a,b,c) to address:

(a) Launching LAMMPS in different KOKKOS modes

Here are examples of how to run LAMMPS for the different compute-node configurations listed above.

Note that the -np setting for the mpirun command in these examples is for runs on a single node. To scale these examples up to run on a system with N compute nodes, simply multiply the -np setting by N.

CPU-only, dual hex-core CPUs:

mpirun -np 12 lmp_g++ -in in.lj      # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj      # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj     # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj      # two MPI tasks, 6 threads/task 

Intel Phi with 61 cores (240 total usable cores, with 4x hardware threading):

mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj      # 12*20 = 240
mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj 

Dual hex-core CPUs and a single GPU:

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj       # one MPI task, 6 threads on CPU 

Dual 8-core CPUs and 2 GPUs:

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # two MPI tasks, 8 threads per CPU 

(b) Enable the KOKKOS package

As illustrated above, the "-k on" or "-kokkos on" command-line switch must be used when launching LAMMPS.

As documented here, the command-line switch allows for several options. Commonly used ones, as illustrated above, are "t Nt" to set the number of threads per MPI task and "g Ng" to set the number of GPUs per compute node.

(c) Use KOKKOS-accelerated styles

This can be done by explicitly adding a "kk" suffix to any supported style in your input script:

pair_style lj/cut/kk 2.5 

Or you can run with the "-sf kk" command-line switch, which will automatically append "kk" to styles that support it.

lmp_machine -sf kk -in in.script
mpirun -np 4 lmp_machine -sf kk -in in.script 

Using the "suffix kk" command in your input script does the same thing.

Speed-ups to expect:

The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enabled styles are used, and the problem size.

Generally speaking, the following rules of thumb apply:

See the Benchmark page of the LAMMPS web site for performance of the KOKKOS package on different hardware.

Guidelines for best performance:

Here are guidelines for using the KOKKOS package on the different hardware configurations listed above.

Many of the guidelines use the package kokkos command. See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations.

Running on a multi-core CPU:

If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the -k command-line switch. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).

You can compare the performance of running in different modes: all-MPI with one task per core, a single MPI task with one thread per core, or intermediate combinations of MPI tasks and threads per task.

Examples of mpirun commands in these modes, for nodes with dual hex-core CPUs and no GPU, are shown above.

When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), it can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... 

For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, Intel 12 or later), setting the environment variable OMP_PROC_BIND=true should be sufficient. For binding threads with the KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option, as discussed in Section 2.3.4 of the manual.
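
Putting these settings together for a dual hex-core node, a launch command might look like the following sketch (OpenMPI shown; the executable name and thread count are illustrative):

export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj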

Running on GPUs:

Ensure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see this section of the manual for details).

The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.

Use the -kokkos command-line switch to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use less than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.

Examples of mpirun commands that follow these rules, for nodes with dual hex-core CPUs and one or two GPUs, are shown above.

When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for thermo or dump output will cause data to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.

Running on an Intel Phi:

Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes. The latter ensures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded for running the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.

Examples of mpirun commands that follow these rules, for Intel Phi nodes with 61 cores, are shown above.

Restrictions:

As noted above, if using GPUs, the number of MPI tasks per compute node should be equal to the number of GPUs per compute node. In the future Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models. Specifically, Kokkos requires extensive C++ support from the Kernel language. This is expected to change in the future.


5.9 USER-INTEL package

The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel(R) Xeon Phi(TM) coprocessors. Additionally, it supports running simulations in single, mixed, or double precision with vectorization, even if a coprocessor is not present, i.e. on an Intel(R) CPU. The same C++ code is used for both cases. When offloading to a coprocessor, the routine is run twice, once with an offload flag.

The USER-INTEL package can be used in tandem with the USER-OMP package. This is useful when offloading pair style computations to coprocessors, so that other styles not supported by the USER-INTEL package, e.g. bond, angle, dihedral, improper, and long-range electrostatics, can be run simultaneously in threaded mode on CPU cores. Since fewer MPI tasks than CPU cores will typically be invoked when running with coprocessors, this enables the extra cores to be utilized for useful computation.

If LAMMPS is built with both the USER-INTEL and USER-OMP packages installed, this mode of operation is made easier to use, because the "-suffix intel" command-line switch and the suffix intel command will both set a second-choice suffix to "omp", so that styles from the USER-OMP package will be used if available, after first testing whether a style from the USER-INTEL package is available.

Required hardware/software:

To use the offload option, you must have one or more Intel(R) Xeon Phi(TM) coprocessors.

Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in vectorization or give poor performance.

Use of an Intel C++ compiler is recommended, but not required. The compiler must support the OpenMP interface.

Building LAMMPS with the USER-INTEL package:

Include the package and build LAMMPS.

cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine 

If the USER-OMP package is also installed, you can use styles from both packages, as described below.

The low-level src/MAKE/Makefile.machine needs a flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables, which is -openmp for Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.

If you are compiling on the same architecture that will be used for the runs, adding the flag -xHost to CCFLAGS will enable vectorization with the Intel(R) compiler.

In order to build with support for an Intel(R) coprocessor, the flag -offload should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
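
Collecting the flags above, the relevant lines of a machine makefile for the Intel compiler with coprocessor offload support might look like this sketch (optimization flags and other settings come from your existing makefile):

CCFLAGS =   -g -O3 -openmp -restrict -xHost -DLAMMPS_MEMALIGN=64 -DLMP_INTEL_OFFLOAD
LINKFLAGS = -g -O3 -openmp -offload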

Note that the machine makefiles Makefile.intel and Makefile.intel_offload are included in the src/MAKE directory with options that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.

If using an Intel compiler, it is recommended that Intel(R) Compiler 2013 SP1 update 1 be used. Newer versions have some performance issues that are being addressed. If using Intel(R) MPI, version 5 or higher is recommended.

Running with the USER-INTEL package:

The examples/intel directory has scripts that can be run with the USER-INTEL package, as well as detailed instructions on how to run them.

Note that the total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of the USER-INTEL package.

To run with the USER-INTEL package, there are 3 basic issues (a,b,c) to address:

(a) Specify how many threads per MPI task to use on the CPU.

Whether using the USER-INTEL package to offload computations to Intel(R) Xeon Phi(TM) coprocessors or not, work performed on the CPU can be multi-threaded via the USER-OMP package, assuming the USER-OMP package was also installed when LAMMPS was built.

In this case, the instructions above for the USER-OMP package, in its "Running with the USER-OMP package" sub-section apply here as well.

You can specify the number of threads per MPI task via the OMP_NUM_THREADS environment variable or the package omp command. The product of MPI tasks * threads/task should not exceed the physical number of cores on the CPU (per node), otherwise performance will suffer.

Note that the threads per MPI task setting is completely independent of the number of threads used on the coprocessor. Only the package intel command can be used to control thread counts on the coprocessor.

(b) Enable the USER-INTEL package

This can be done in one of two ways. Use a package intel command near the top of your input script.

Or use the "-sf intel" command-line switch, which will automatically invoke the command "package intel * mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240". Note that this specifies mixed precision and use of a single Xeon Phi(TM) coprocessor (per node), so you must specify the package command in your input script explicitly if you want a different precision or to use multiple Phi coprocessor per node. Also note that the balance and offload keywords are ignored if you did not build LAMMPS with offload support for a coprocessor, as descibed above.

(c) Use USER-INTEL-accelerated styles

This can be done by explicitly adding an "intel" suffix to any supported style in your input script:

pair_style lj/cut/intel 2.5 

Or you can run with the "-sf intel" command-line switch, which will automatically append "intel" to styles that support it.

lmp_machine -sf intel -in in.script
mpirun -np 4 lmp_machine -sf intel -in in.script 

Using the "suffix intel" command in your input script does the same thing.

IMPORTANT NOTE: Using an "intel" suffix in any of the above modes, actually invokes two suffixes, "intel" and "omp". "Intel" is tried first, and if the style does not support it, "omp" is tried next. If neither is supported, the default non-suffix style is used.

Speed-ups to expect:

If LAMMPS was not built with coprocessor support when including the USER-INTEL package, then accelerated styles will run on the CPU using vectorization optimizations and the specified precision. This may give a substantial speed-up for a pair style, particularly if mixed or single precision is used.

If LAMMPS was built with coprocessor support, the pair styles will run on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The performance of a Xeon Phi versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/coprocessor, and the precision used on the coprocessor (double, single, mixed).

See the Benchmark page of the LAMMPS web site for performance of the USER-INTEL package on different hardware.

Guidelines for best performance on an Intel(R) Xeon Phi(TM) coprocessor:

Restrictions:

When offloading to a coprocessor, hybrid styles that require skip lists for neighbor builds cannot be offloaded. Using hybrid/overlay is allowed. Only one intel accelerated style may be used with hybrid styles. Special_bonds exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the run_style respa command; only the "pair" option is supported.


5.10 Comparison of USER-CUDA, GPU, and KOKKOS packages

All 3 of these packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.

NOTE: this section still needs to be re-worked with additional KOKKOS information.

As a consequence, for a particular simulation on specific hardware, one package may be faster than the others. We give guidelines below, but the best way to determine which package is faster for your input script is to try them on your machine. See the benchmarking section below for examples where this has been done.

Guidelines for using each package optimally:

Differences between the packages:

Examples:

The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.

These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.