Syntax:
package style args
style = cuda or gpu or intel or kokkos or omp
  cuda args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = gpu/node or gpu/node/special or timing or test or override/bpa
      gpu/node value = N
        N = number of GPUs to be used per node
      gpu/node/special values = N gpu1 .. gpuN
        N = number of GPUs to be used per node
        gpu1 .. gpuN = N IDs of the GPUs to use
      timing values = none
      test values = id
        id = atom-ID of a test particle
      override/bpa values = flag
        flag = 0 for TpA algorithm, 1 for BpA algorithm
  gpu args = mode first last split keyword value ...
    mode = force or force/neigh
    first = ID of first GPU to be used on each node
    last = ID of last GPU to be used on each node
    split = fraction of particles assigned to the GPU
    zero or more keyword/value pairs may be appended
    keywords = threads_per_atom or cellsize or device
      threads_per_atom value = Nthreads
        Nthreads = # of GPU threads used per atom
      cellsize value = dist
        dist = length (distance units) in each dimension for neighbor bins
      device value = device_type
        device_type = kepler or fermi or cypress or generic
  intel args = Nthreads precision keyword value ...
    Nthreads = # of OpenMP threads to associate with each MPI process on host
    precision = single or mixed or double
    keywords = balance or offload_cards or offload_ghost or offload_tpc or offload_threads
      balance value = split
        split = fraction of work to offload to coprocessor, -1 for dynamic
      offload_cards value = ncops
        ncops = number of coprocessors to use on each node
      offload_ghost value = offload_type
        offload_type = 1 to include ghost atoms for offload, 0 for local only
      offload_tpc value = tpc
        tpc = number of threads to use on each core of coprocessor
      offload_threads value = tptask
        tptask = max number of threads to use on coprocessor for each MPI task
  kokkos args = keyword value ...
    one or more keyword/value pairs may be appended
    keywords = neigh or comm/exchange or comm/forward
      neigh value = full or half/thread or half or n2 or full/cluster
      comm/exchange value = no or host or device
      comm/forward value = no or host or device
  omp args = Nthreads mode
    Nthreads = # of OpenMP threads to associate with each MPI process
    mode = force or force/neigh (optional)
Examples:
package gpu force 0 0 1.0
package gpu force 0 0 0.75
package gpu force/neigh 0 0 1.0
package gpu force/neigh 0 1 -1.0
package cuda gpu/node/special 2 0 2
package cuda test 3948
package kokkos neigh half/thread comm/forward device
package omp * force/neigh
package omp 4 force
package intel * mixed balance -1
Description:
This command invokes package-specific settings. Currently the following packages use it: USER-CUDA, GPU, USER-INTEL, KOKKOS, and USER-OMP.
To use the accelerated GPU and USER-OMP styles, the package command is required. However, as described in the "Defaults" section below, if you use the "-sf gpu" or "-sf omp" command-line options to enable these styles, default package settings are applied. In that case you only need to use the package command if you want to change the defaults.
To use the accelerated USER-CUDA and KOKKOS styles, the package command is not required as defaults are assigned internally. You only need to use the package command if you want to change the defaults.
See Section_accelerate of the manual for more details about using these various packages for accelerating LAMMPS calculations.
The cuda style invokes options associated with the use of the USER-CUDA package.
The gpu/node keyword specifies the number N of GPUs to be used on each node. An MPI process with rank K will use the GPU (K mod N). This implies that processes should be assigned with successive ranks on each node, which is the default with most (or even all) MPI implementations. The default value for N is 2.
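For example, on a hypothetical node with four GPUs dedicated to computation, one might specify:

package cuda gpu/node 4    # MPI rank K on the node uses GPU (K mod 4)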
The gpu/node/special keyword also specifies the number (N) of GPUs to be used on each node, but allows more control over which GPUs are used. An MPI process with rank K will use the GPU gpuI with I = (K mod N) + 1. This implies that processes should be assigned with successive ranks on each node, which is the default with most (or even all) MPI implementations. For example, if you have three GPUs on a machine, one of which is used for the X-Server (the GPU with ID 1) while the others (with IDs 0 and 2) are used for computations, you would specify:
package cuda gpu/node/special 2 0 2
A main purpose of the gpu/node/special option is to allow two (or more) simulations to be run on one workstation. In that case one would set the first simulation to use GPU 0 and the second to use GPU 1. This is not necessary, though, if the GPUs are in what is called compute exclusive mode; with that setting, every process automatically gets its own GPU. Compute exclusive mode can be set as root using the nvidia-smi tool, which is part of the CUDA installation.
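For instance, two independent simulations sharing a hypothetical two-GPU workstation might each pin themselves to one GPU in their respective input scripts:

package cuda gpu/node/special 1 0    # first simulation: use only GPU 0
package cuda gpu/node/special 1 1    # second simulation: use only GPU 1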
Note that if the gpu/node/special keyword is not used, the USER-CUDA package sorts the existing GPUs on each node according to their number of multiprocessors. This way, compute GPUs will be prioritized over X-Server GPUs.
Use of the timing keyword will output detailed timing information for various subroutines.
The test keyword will output info for the specified atom at several points during each time step. This is mainly useful for debugging purposes. Note that the simulation will be severely slowed down if this option is used.
The override/bpa keyword can be used to specify which mode is used for pair-force evaluation. TpA = one thread per atom; BpA = one block per atom. If this keyword is not used, a short test at the beginning of each run will determine which method is more effective (the result of this test is part of the LAMMPS output). Therefore it is usually not necessary to use this keyword.
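For example, to force the BpA algorithm instead of relying on the automatic test, one might specify:

package cuda override/bpa 1    # 1 = BpA (one block per atom), 0 = TpA (one thread per atom)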
The gpu style invokes options associated with the use of the GPU package.
The mode setting specifies where neighbor list calculations will be performed. If mode is force, neighbor list calculation is performed on the CPU. If mode is force/neigh, neighbor list calculation is performed on the GPU. GPU neighbor list calculation currently cannot be used with a triclinic box. GPU neighbor list calculation currently cannot be used with hybrid pair styles. GPU neighbor lists are not compatible with styles that are not GPU-enabled. When a non-GPU enabled style requires a neighbor list, it will also be built using CPU routines. In these cases, it will typically be more efficient to only use CPU neighbor list builds.
The first and last settings specify the GPUs that will be used for simulation. On each node, the GPU IDs in the inclusive range from first to last will be used.
The split setting can be used for load balancing force calculation work between CPU and GPU cores in GPU-enabled pair styles. If 0 < split < 1.0, a fixed fraction of particles is offloaded to the GPU while force calculation for the remaining particles occurs simultaneously on the CPU. If split < 0, the optimal fraction (based on CPU and GPU timings) is calculated every 25 timesteps. If split = 1.0, all force calculations for GPU-accelerated pair styles are performed on the GPU. In this case, hybrid, bond, angle, dihedral, improper, and long-range calculations can be performed on the CPU while the GPU is performing force calculations for the GPU-enabled pair style. If all CPU force computations complete before the GPU, LAMMPS will block until the GPU has finished before continuing the timestep.
As an example, if you have two GPUs per node and 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of force calculation across CPU and GPU cores, you could specify
package gpu force/neigh 0 1 -1
In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
The threads_per_atom keyword allows control of the number of GPU threads used per atom to perform the short range force calculation. By default, the value is chosen based on the pair style; however, it can be set with this keyword to fine-tune performance. For large cutoffs or with a small number of particles per GPU, increasing the value can improve performance. The number of threads per atom must be a power of 2 and currently cannot be greater than 32.
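For example, with a large cutoff one might try a larger thread count per atom (the value 8 here is purely illustrative):

package gpu force/neigh 0 0 1.0 threads_per_atom 8    # must be a power of 2, at most 32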
The cellsize keyword can be used to control the size of the cells used for binning atoms in neighbor list calculations. Setting this value is normally not needed; the optimal value is close to the default (equal to the cutoff distance for the short range interactions plus the neighbor skin). GPUs can perform efficiently with much larger cutoffs than CPUs and this can be used to reduce the time required for long-range calculations or in some cases to eliminate them with models such as coul/wolf or coul/dsf. For very large cutoffs, it can be more efficient to use smaller values for cellsize in parallel simulations. For example, with a cutoff of 20*sigma and a neighbor skin of sigma, a cellsize of 5.25*sigma can be efficient for parallel simulations.
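As a sketch of the parallel case described above, assuming LJ reduced units so that sigma = 1 distance unit, one might use:

package gpu force/neigh 0 0 1.0 cellsize 5.25    # bin size in distance units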
The device keyword can be used to tune parameters to optimize for a specific accelerator when using OpenCL. For CUDA, the device keyword is ignored. Currently, the device type is limited to NVIDIA Kepler, NVIDIA Fermi, AMD Cypress, or a generic device. More devices will be added soon. The default device type can be specified when building LAMMPS with the GPU library.
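For example, an OpenCL build targeting a Fermi-class card might use the following (the keyword is simply ignored in a CUDA build):

package gpu force 0 0 1.0 device fermi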
The intel style invokes options associated with the use of the USER-INTEL package.
The Nthreads argument allows one to explicitly set the number of OpenMP threads to be allocated for each MPI process. An Nthreads value of '*' instructs LAMMPS to use whatever is the default for the given OpenMP environment. This is usually determined via the OMP_NUM_THREADS environment variable or the compiler runtime.
The precision argument determines the precision mode to use and can take values of single (intel styles use single precision for all calculations), mixed (intel styles use double precision for accumulation and storage of forces, torques, energies, and virial terms and single precision for everything else), or double (intel styles use double precision for all calculations).
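For example, to run the intel styles in full double precision with 2 OpenMP threads per MPI task (an illustrative thread count), one might use:

package intel 2 double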
Additional keyword-value pairs are available that are used to determine how work is offloaded to an Intel(R) coprocessor. If LAMMPS is built without offload support, these values are ignored. The additional settings are as follows:
The balance setting is used to set the fraction of work offloaded to the coprocessor for an intel style (in the inclusive range 0.0 to 1.0). While this fraction of work is running on the coprocessor, other calculations will run on the host, including neighbor and pair calculations that are not offloaded, angle, bond, dihedral, kspace, and some MPI communications. If the balance is set to -1, the fraction of work is dynamically adjusted automatically throughout the run. This can typically give performance within 5 to 10 percent of the optimal fixed fraction.
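For example, to offload a fixed 60 percent of the work (an arbitrary illustrative fraction) rather than balance dynamically, one might use:

package intel * mixed balance 0.6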
The offload_cards setting determines the number of coprocessors to use on each node.
Additional options for fine tuning performance with offload are as follows:
The offload_ghost setting determines whether or not ghost atoms, atoms at the borders between MPI tasks, are offloaded for neighbor and force calculations. When set to "0", ghost atoms are not offloaded. This option can reduce the amount of data transfer with the coprocessor and also can overlap MPI communication of forces with computation on the coprocessor when the newton pair setting is "on". When set to "1", ghost atoms are offloaded. In some cases this can provide better performance, especially if the offload fraction is high.
The offload_tpc option sets the maximum number of threads that will run on each core of the coprocessor.
The offload_threads option sets the maximum number of threads that will be used on the coprocessor for each MPI task. This setting and the offload_tpc setting are the only methods for changing the number of threads on the coprocessor. The OMP_NUM_THREADS environment variable and the Nthreads argument only affect threads on the host.
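As a combined, purely illustrative sketch (the best values depend on the coprocessor model and the simulation), the offload fine-tuning keywords might be used together like this:

package intel * mixed balance -1 offload_ghost 0 offload_tpc 2 offload_threads 120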
The kokkos style invokes options associated with the use of the KOKKOS package.
The neigh keyword determines what kinds of neighbor lists are built. A value of half uses half neighbor lists, the same as used by most pair styles in LAMMPS. A value of half/thread uses a thread-safe variant of the half neighbor list. It should be used instead of half when running with threads on a CPU. A value of full uses a full neighbor list, i.e. f_ij and f_ji are both calculated. This performs twice as much computation as the half option; however, it can be a win because it is thread-safe and doesn't require atomic operations. A value of full/cluster is an experimental neighbor style, where particles interact with all particles within a small cluster, if at least one of the cluster's particles is within the neighbor cutoff range. This potentially allows for better vectorization on architectures such as the Intel Phi. It also reduces the size of the neighbor list by roughly a factor of the cluster size, thus reducing the total memory footprint considerably.
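For example, when running KOKKOS styles with OpenMP threads on a CPU, one might select the thread-safe half list:

package kokkos neigh half/thread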
The comm/exchange and comm/forward keywords determine whether the host or device performs the packing and unpacking of data when communicating information between processors. "Exchange" communication happens only on timesteps that neighbor lists are rebuilt. The data is only for atoms that migrate to new processors. "Forward" communication happens every timestep. The data is for atom coordinates and any other atom properties that need to be updated for ghost atoms owned by each processor.
The value options for these keywords are no or host or device. A value of no means to use the standard non-KOKKOS method of packing/unpacking data for the communication. A value of host means to use the host, typically a multi-core CPU, and perform the packing/unpacking in parallel with threads. A value of device means to use the device, typically a GPU, to perform the packing/unpacking operation.
The optimal choice for these keywords depends on the input script and the hardware used. The no value is useful for verifying that the Kokkos code is working correctly. It may also be the fastest choice when using Kokkos styles in MPI-only mode (i.e. with a thread count of 1). When running on CPUs or Xeon Phi, the host and device values work identically. When using GPUs, the device value will typically be optimal if all of the styles used in your input script are supported by the KOKKOS package. In this case data can stay on the GPU for many timesteps without being moved between the host and GPU, if you use the device value. This requires that your MPI is able to access GPU memory directly. Currently that is true for OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses styles (e.g. fixes) which are not yet supported by the KOKKOS package, then data has to be moved between the host and device anyway, so it is typically faster to let the host handle communication, by using the host value. Using host instead of no will enable use of multiple threads to pack/unpack communicated data.
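For example, on a GPU with GPU-aware MPI and an input script that uses only KOKKOS-supported styles, one might keep the packing/unpacking on the device:

package kokkos neigh full comm/exchange device comm/forward device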
The omp style invokes options associated with the use of the USER-OMP package.
The first argument allows you to explicitly set the number of OpenMP threads to be allocated for each MPI process. For example, if your system has nodes with dual quad-core processors, it has a total of 8 cores per node. You could run MPI with 2 processes on each node (e.g. using options of the mpirun command) and set the Nthreads setting to 4. This would effectively use all 8 cores on each node, since each MPI process would spawn 4 threads (one of which runs as part of the MPI process itself).
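As a minimal sketch of such a layout on a single 8-core node (the executable name lmp_omp is only a placeholder, and the mpirun options for spreading tasks across several nodes depend on your MPI implementation), one might launch

mpirun -np 2 lmp_omp -sf omp -in in.script

with the input script containing:

package omp 4    # 4 OpenMP threads per MPI task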
For performance reasons, you should not set Nthreads to more threads than there are physical cores (per MPI task), but LAMMPS cannot check for this.
An Nthreads value of '*' instructs LAMMPS to use whatever is the default for the given OpenMP environment. This is usually determined via the OMP_NUM_THREADS environment variable or the compiler runtime. Please note that in most cases the default for OpenMP capable compilers is to use one thread for each available CPU core when OMP_NUM_THREADS is not set, which can lead to extremely bad performance.
Which combination of threads and MPI tasks gives the best performance is difficult to predict and can depend on many components of your input. Not all features of LAMMPS support OpenMP, and the parallel efficiency can vary considerably between styles.
The mode setting specifies whether neighbor list builds are multi-threaded as well. If mode is force, neighbor list calculation is performed in serial. If mode is force/neigh, a multi-threaded neighbor list build is used. Using the force/neigh setting is almost always faster and should produce identical neighbor lists at the expense of using some more memory (neighbor list pages are always allocated for all threads at the same time and each thread works on its own pages).
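For example, to combine 4 OpenMP threads per MPI task (an illustrative count) with threaded neighbor list builds, one might use:

package omp 4 force/neigh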
Restrictions:
This command cannot be used after the simulation box is defined by a read_data or create_box command.
The cuda style of this command can only be invoked if LAMMPS was built with the USER-CUDA package. See the Making LAMMPS section for more info.
The gpu style of this command can only be invoked if LAMMPS was built with the GPU package. See the Making LAMMPS section for more info.
The kokkos style of this command can only be invoked if LAMMPS was built with the KOKKOS package. See the Making LAMMPS section for more info.
The omp style of this command can only be invoked if LAMMPS was built with the USER-OMP package. See the Making LAMMPS section for more info.
Related commands:
Default:
The default settings for the USER-CUDA package are "package cuda gpu/node 2". This is the case whether the "-sf cuda" command-line switch is used or not.
If the "-sf gpu" command-line switch is used then it is as if the command "package gpu force/neigh 0 0 1" were invoked, to specify default settings for the GPU package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script.
The default settings for the USER-INTEL package are "package intel * mixed balance -1 offload_cards 1 offload_tpc 4 offload_threads 240". The offload_ghost default setting is determined by the intel style being used. The value used is output to the screen in the offload report at the end of each run.
The default settings for the KOKKOS package are "package kokkos neigh full comm/exchange host comm/forward host". This is the case whether the "-sf kk" command-line switch is used or not.
If the "-sf omp" command-line switch is used then it is as if the command "package omp *" were invoked, to specify default settings for the USER-OMP package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script.