Commit d3ddd645 authored by incardon's avatar incardon

Adding missing examples

parent 543b79c9
include ../../example.mk
CUDA_CC=
ifeq (, $(shell which nvcc))
CUDA_CC=mpic++ -x c++ $(INCLUDE_PATH)
INCLUDE_PATH_NVCC=
else
CUDA_CC=nvcc
endif
CC=mpic++
OBJ = main.o
gpu_fstep:
%.o: %.cu
$(CUDA_CC) -O3 -g -c --std=c++11 -o $@ $< $(INCLUDE_PATH_NVCC)
%.o: %.cpp
$(CC) -O3 $(OPT) -g -c --std=c++11 -o $@ $< $(INCLUDE_PATH)
gpu_fstep: $(OBJ)
$(CC) -o $@ $^ $(CFLAGS) $(LIBS_PATH) $(LIBS)
all: gpu_fstep
run: gpu_fstep
mpirun -np 2 ./gpu_fstep
.PHONY: clean all run
clean:
rm -f *.o *~ core gpu_fstep
[pack]
files = main.cu Makefile
/*! \page Vector_1_gpu_first_step Vector 1 GPU first step
*
*
* [TOC]
*
*
* # GPU first steps # {#GPU_first_steps}
*
*
* This example shows step by step how to use GPUs in OpenFPM. It is a good starting point for
* using GPUs with vector_dist. We suggest reading the example \ref simple_vector_example before this one.
*
* ## Data structures {#GPU_data_structures}
*
* While a CPU data-structure is created with vector_dist, a GPU data-structure is created with vector_dist_gpu.
* vector_dist_gpu exposes the same interface as the CPU vector_dist, plus additional functionality. This means that a vector_dist can be
* changed to vector_dist_gpu without changing a single line of code. This is an important feature, because it gives us the possibility
* to move our code from CPU to GPU incrementally, step by step. A small section of code can be moved to the GPU leaving the rest
* unchanged. The documentation of vector_dist_gpu is the same as that of vector_dist; the extended functionality is documented in
* vector_dist_gpu. Every file containing vector_dist_gpu must be compiled with the nvcc compiler, so we suggest using the extension
* *.cu for such files and adding a rule to compile .cu files with nvcc. An example of how to do this can be seen in the Makefile
* of this example.
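*
* As a minimal sketch (the aggregate here is illustrative; the full constructor is shown later in
* this example), the change is a single type swap:
*
* \code{.cpp}
* // CPU particle set
* vector_dist<2,float,aggregate<float>> vd_cpu(100,domain,bc,g);
* // GPU-ready particle set: same constructor, same interface
* vector_dist_gpu<2,float,aggregate<float>> vd_gpu(100,domain,bc,g);
* \endcode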
*
* Changing from vector_dist to vector_dist_gpu seems to produce no visible effect, but some underlying changes do take
* place:
*
* * The internal memory used for allocation is no longer plain heap memory, but CUDA pinned host memory.
* * Buffers are allocated for each property.
* * Vector properties like float[3] and float[3][3] have a different layout in memory from the standard one (it will become
* clear how and why later in the example)
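*
* As an illustration only (not the exact internal representation), for a float[3] property the
* standard layout keeps the three components of one particle together, while the GPU layout keeps
* each component in its own contiguous buffer:
*
* \code{.unparsed}
* standard (array of structures): x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | ...
* GPU (structure of arrays): x0 x1 x2 ... | y0 y1 y2 ... | z0 z1 z2 ...
* \endcode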
*
* The code snippet below shows how vector_dist_gpu can be used like vector_dist in \ref simple_vector_example.
* In short, we create a set of 100 particles (vector_dist_gpu) in 2D in the box from {0.0,0.0} to {1.0,1.0}. Particles are
* randomly placed in this space. The final map redistributes the particles according to the decomposition.
*
* \snippet Vector/1_gpu_first_step/main.cu cpu_like_gpu
*
* ### Offload data to GPU ###
*
* To offload data to the GPU you can use the function hostToDevicePos to offload the particle positions and hostToDeviceProp to offload
* property data. The latter function takes template parameters that specify which properties to offload. Here we offload all
* the properties (scalar, vector, tensor)
*
* \snippet Vector/1_gpu_first_step/main.cu offload_pos_prop
*
* Once the data has been offloaded we can launch a kernel to do some computation. Every GPU-ready data-structure has a function
* called toKernel that makes it possible to pass the data-structure to a kernel and use it there like it is
* used on the CPU. Launching a CUDA kernel requires subdividing a loop into workgroups and threads within each workgroup.
* OpenFPM provides the function getDomainIteratorGPU to automatically split the domain particle loop. The members wthr and
* thr can be used in the <<<...>>> brackets to launch a CUDA kernel.
*
* \snippet Vector/1_gpu_first_step/main.cu launch_domain_it
*
* The kernel is defined like a normal CUDA kernel. We use a template parameter for every argument that is passed with toKernel()
*
* \note Defining the arguments of toKernel() as template parameters gives us the possibility to use the template engine
* for type deduction and to avoid specifying the actual type returned by toKernel()
*
* The kernel simply shifts the particles by 0.01. It sets the scalar property to the sum of x and y of the "old" particle position,
* sets the vector property to the old particle position, and sets the tensor to several combinations of x and y of the "old"
* particle position
*
* \snippet Vector/1_gpu_first_step/main.cu kernel_translate_fill_prop
*
* Once the computation is completed we can move the data back from device to host and write the results to a file.
*
* \note Every file writer requires the data to be offloaded to host memory
*
* \snippet Vector/1_gpu_first_step/main.cu device_to_host_write
*
* \htmlonly
* <img src="http://openfpm.mpi-cbg.de/web/images/examples/1_gpu_first_step/output.png"/>
* \endhtmlonly
*
* ## map and ghost_get for multi GPU
*
* Up to this point we saw how to move data from host to device and device to host, and how to launch a CUDA kernel on off-loaded data.
* As previously mentioned, vector_dist_gpu has the same interface as the CPU version, and so it provides the standard functions map and ghost_get, which work
* on host pinned memory. Because we want to avoid moving data from GPU to host memory, we can instead use map with the option
* RUN_ON_DEVICE to redistribute the particles directly on the GPU, and ghost_get with RUN_ON_DEVICE to fill the ghost particles directly on the GPU.
* In the loop below we see how we can use map on a particle set that is already on the GPU. In particular we never offload particles to the CPU
* to do map or ghost_get. We use the kernel translate_fill_prop to translate the particles and update the properties. The only offload
* happens every 10 time-steps, to write to a file.
*
* \snippet Vector/1_gpu_first_step/main.cu map_and_ghost_get_on_gpu
*
* ## RDMA on MPI with CUDA
*
* Modern MPI implementations are able to do RDMA on GPU memory. In practice this means that an InfiniBand card can directly read
* GPU memory, transfer it over InfiniBand, and write it directly into GPU memory on the other node, without moving the data to system memory.
* It also means that MPI calls can work directly on CUDA device pointers. OpenFPM can exploit this feature if MPI is compiled
* with CUDA support. To check whether MPI is compiled with CUDA support, use the function \b is_mpi_rdma_cuda_active() \b
*
* \snippet Vector/1_gpu_first_step/main.cu performance_rdma
*
* It is good to note that in order for this to work (return true), some conditions must be met.
*
* * Because at the moment OpenFPM only detects the OpenMPI CUDA-aware implementation, we must define the \b OPENMPI \b macro
* \snippet Vector/1_gpu_first_step/main.cu using_openmpi
*
* * MPI must be compiled with CUDA support (in general, installing OpenFPM with -g should attempt to install OpenMPI with CUDA support)
*
* ## Full code ## {#code_e0_sim}
*
* \include Vector/1_gpu_first_step/main.cu
*
*/
#ifdef __NVCC__
//! \cond [using_openmpi] \endcond
#define OPENMPI
//! \cond [using_openmpi] \endcond
#include "Vector/vector_dist.hpp"
//! \cond [kernel_translate_fill_prop] \endcond
template<typename vector_type>
__global__ void translate_fill_prop(vector_type vd)
{
auto p = GET_PARTICLE(vd);
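// the scalar property is set to the sum of x and y of the old particle position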
vd.template getProp<0>(p) = vd.getPos(p)[0] + vd.getPos(p)[1];
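// the vector property stores the old particle position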
vd.template getProp<1>(p)[0] = vd.getPos(p)[0];
vd.template getProp<1>(p)[1] = vd.getPos(p)[1];
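// the tensor property stores several combinations of x and y of the old position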
vd.template getProp<2>(p)[0][0] = vd.getPos(p)[0];
vd.template getProp<2>(p)[0][1] = vd.getPos(p)[1];
vd.template getProp<2>(p)[1][0] = vd.getPos(p)[0] + vd.getPos(p)[1];
vd.template getProp<2>(p)[1][1] = vd.getPos(p)[1] - vd.getPos(p)[0];
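// finally the particle is shifted by 0.01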
vd.getPos(p)[0] += 0.01f;
vd.getPos(p)[1] += 0.01f;
}
//! \cond [kernel_translate_fill_prop] \endcond
int main(int argc, char* argv[])
{
//! \cond [cpu_like_gpu] \endcond
// initialize the library
openfpm_init(&argc,&argv);
// Here we define our domain: a 2D box with intervals from 0 to 1.0 for x and y
Box<2,float> domain({0.0,0.0},{1.0,1.0});
// Here we define the boundary conditions of our problem
size_t bc[2]={PERIODIC,PERIODIC};
// extended boundary around the domain, and the processor domain
Ghost<2,float> g(0.05);
vector_dist_gpu<2,float, aggregate<float,float[2],float[2][2]> > vd(100,domain,bc,g);
// the scalar is the element at position 0 in the aggregate
const int scalar = 0;
// the vector is the element at position 1 in the aggregate
const int vector = 1;
// the tensor is the element at position 2 in the aggregate
const int tensor = 2;
auto it = vd.getDomainIterator();
while (it.isNext())
{
auto key = it.get();
// we define x, assign a random position between 0.0 and 1.0
vd.getPos(key)[0] = (float)rand() / RAND_MAX;
// we define y, assign a random position between 0.0 and 1.0
vd.getPos(key)[1] = (float)rand() / RAND_MAX;
// next particle
++it;
}
vd.map();
//! \cond [cpu_like_gpu] \endcond
//! \cond [offload_pos_prop] \endcond
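// copy the particle positions and the selected properties from host memory to GPU memory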
vd.hostToDevicePos();
vd.template hostToDeviceProp<scalar,vector,tensor>();
//! \cond [offload_pos_prop] \endcond
//! \cond [launch_domain_it] \endcond
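// wthr and thr give a workgroup/thread decomposition that covers all domain particles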
auto ite = vd.getDomainIteratorGPU();
translate_fill_prop<<<ite.wthr,ite.thr>>>(vd.toKernel());
//! \cond [launch_domain_it] \endcond
//! \cond [device_to_host_write] \endcond
vd.deviceToHostPos();
vd.deviceToHostProp<0,1,2>();
// We write on a file
vd.write("output");
//! \cond [device_to_host_write] \endcond
//! \cond [map_and_ghost_get_on_gpu] \endcond
for (int j = 0 ; j < 100 ; j++)
{
auto ite = vd.getDomainIteratorGPU();
translate_fill_prop<<<ite.wthr,ite.thr>>>(vd.toKernel());
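// redistribute the particles and fill the ghost layer directly on the GPU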
vd.map(RUN_ON_DEVICE);
vd.template ghost_get<0,1,2>(RUN_ON_DEVICE);
if ( j % 10 == 0)
{
// offload to host
vd.deviceToHostPos();
vd.template deviceToHostProp<0,1,2>();
// write
vd.write_frame("output_f",j);
}
}
//! \cond [map_and_ghost_get_on_gpu] \endcond
//! \cond [performance_rdma] \endcond
bool active = is_mpi_rdma_cuda_active();
std::cout << "Is MPI rdma active on CUDA " << active << std::endl;
//! \cond [performance_rdma] \endcond
openfpm_finalize();
}
#else
int main(int argc, char* argv[])
{
return 0;
}
#endif
include ../../example.mk
### This is a trick to avoid "Command not found" if you do not have the NVCC compiler. In practice the normal C++ compiler is used;
### internally, the example disables its code with the preprocessor when it is not compiled with nvcc.
CUDA_CC=
CUDA_CC_LINK=
ifeq (, $(shell which nvcc))
CUDA_CC=mpic++ -x c++ $(INCLUDE_PATH)
INCLUDE_PATH_NVCC=
CUDA_CC_LINK=mpic++
else
CUDA_CC=nvcc
CUDA_CC_LINK=nvcc
endif
CC=mpic++
LDIR=
OPT=
OBJ = main.o
all: md_dyn
test: md_dyn_test
md_dyn_test: OPT += -DTEST_RUN
md_dyn_test: md_dyn
%.o: %.cu
$(CUDA_CC) $(OPT) -O3 -g -c --std=c++11 -o $@ $< $(INCLUDE_PATH_NVCC)
md_dyn: $(OBJ)
$(CUDA_CC_LINK) -o $@ $^ $(LIBS_PATH) $(LIBS)
run: md_dyn_test
mpirun -np 3 ./md_dyn;
.PHONY: clean all run
clean:
rm -f *.o *~ core md_dyn md_dyn_expr md_dyn_vl
/*!
* \page Vector_3_md_dyn_gpu Vector 3 molecular dynamic on GPU
*
* [TOC]
*
* # Molecular Dynamic with Lennard-Jones potential on GPU {#e3_md_gpu}
*
* This molecular dynamic simulation is exactly the same as \ref e3_md , with the difference that it works on an Nvidia CUDA GPU and
* for this reason we use one million particles.
* Before starting with this example, we suggest to start from the examples \ref GPU_first_steps and \ref e3_md .
*
* \htmlonly
* <img src="http://openfpm.mpi-cbg.de/web/images/examples/3_md_gpu/md_4_GPU.png"/>
* \endhtmlonly
*
* # CUDA Kernels {#e3_cuda_ker_gpu}
*
* In the CPU example we have 4 loops
*
* * One loop to calculate forces
* * One loop to do position and velocity integration as first step of the Verlet integration
* * One loop to do velocity integration as second step of the Verlet integration
* * One loop to calculate the total energy of the system
*
* All these loops must be converted into CUDA kernels, and the kernels must be launched, as sketched below.
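*
* The conversion always follows the same pattern. As a minimal sketch (the kernel name and body are
* illustrative placeholders), a CPU particle loop like
*
* \code{.cpp}
* auto it = vd.getDomainIterator();
* while (it.isNext())
* {
*     auto p = it.get();
*     // ... per-particle body ...
*     ++it;
* }
* \endcode
*
* becomes a kernel plus a launch:
*
* \code{.cpp}
* template<typename vector_type>
* __global__ void per_particle_kernel(vector_type vd)
* {
*     auto p = GET_PARTICLE(vd);
*     // ... per-particle body ...
* }
*
* auto ite = vd.getDomainIteratorGPU();
* per_particle_kernel<<<ite.wthr,ite.thr>>>(vd.toKernel());
* \endcode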
*
* ## Calculate forces {#e3_calc_force_gpu}
*
* The calculate forces function now contains a call to the function \b getDomainIteratorGPU() \b, which is the equivalent
* of getDomainIterator(). What it returns is a struct with a convenient division into workgroups (\b wthr \b) and threads
* (\b thr \b) within a workgroup, to iterate across all the domain particles. The kernel launch is a standard NVIDIA
* CUDA kernel launch, where, if we want to pass the distributed data-structure to the kernel, we have to remember to use the
* function toKernel().
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu calc_forces
*
* The kernel is templated on all the parameters passed with toKernel(). This is not strictly necessary but is convenient because
* we let the compiler infer the types returned by toKernel() without writing the argument types explicitly. To get the particle index
* we use the macro GET_PARTICLE. This macro simply retrieves the particle id from the CUDA block id and thread id, and it guards
* against the particle index going out of range (the macro is simple to follow and is located in the file src/Vector/vector_dist_kernel.hpp).
* The rest of the kernel is more or less a copy-paste of the CPU iteration code
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu calculate_force_kernel
*
* For the two kernels that compose the two-step Verlet integration, we simply followed the steps described above. More interesting
* is the calculation of the total energy of the system. In this case we have to do a full reduction, i.e. accumulate the
* energy of all the particles. In the CPU-based code we used an external accumulator and looped over all the particles, summing
* kinetic energy and potential energy. On the GPU this is not possible, because such an operation is serial and would not be fast on a GPU.
* To bypass this problem we first calculate the kinetic plus potential energy of each particle in a kernel. This
* kernel is very similar to the CPU code, with the difference that we do not accumulate on an external variable, but store
* the value in a per-particle property.
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu calculate_energy_kernel
*
* After that, we can use the function \b reduce_local<energy,\_add\_>(vd) \b to perform a parallel reduction on the GPU of the property energy
* on the vector_dist_gpu vd (the standard accumulation operation is the sum).
*
* \note reduce_local does only a local (within one GPU) reduction
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu calc_energy_red
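*
* Because the reduction is local, the per-processor results must still be summed across processors;
* in the full code below this is done with the cluster interface:
*
* \code{.cpp}
* auto & vcl = create_vcluster();
* vcl.sum(energy);
* vcl.execute();
* \endcode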
*
* The rest remains mostly the same as the CPU code, with the exception that in the main loop, where particle redistribution and ghost
* fill happen directly on the GPU, we use the option \b RUN_ON_DEVICE \b.
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu run_on_device
*
* Every time we want to write a file we have to offload the data from GPU memory to host memory
*
* \snippet Vector/3_molecular_dynamic_gpu/main.cu move_to_host
*
*/
#ifdef __NVCC__
#include "Vector/vector_dist.hpp"
#include "Plot/GoogleChart.hpp"
#include "Plot/util.hpp"
#include "timer.hpp"
#ifdef TEST_RUN
size_t nstep = 100;
#else
size_t nstep = 1000;
#endif
typedef float real_number;
//! \cond [constants] \endcond
constexpr int velocity = 0;
constexpr int force = 1;
constexpr int energy = 2;
//! \cond [calculate_force_kernel] \endcond
template<typename vector_dist_type,typename NN_type>
__global__ void calc_force_gpu(vector_dist_type vd, NN_type NN, real_number sigma12, real_number sigma6, real_number r_cut2)
{
auto p = GET_PARTICLE(vd);
// Get the position xp of the particle
Point<3,real_number> xp = vd.getPos(p);
// Reset the force counter
vd.template getProp<force>(p)[0] = 0.0;
vd.template getProp<force>(p)[1] = 0.0;
vd.template getProp<force>(p)[2] = 0.0;
// Get an iterator over the neighborhood particles of p
auto Np = NN.getNNIterator(NN.getCell(vd.getPos(p)));
// For each neighborhood particle ...
while (Np.isNext())
{
// ... q
auto q = Np.get();
// if (p == q) skip this particle
if (q == p) {++Np; continue;};
// Get the position of the particle q
Point<3,real_number> xq = vd.getPos(q);
// Get the distance between p and q
Point<3,real_number> r = xp - xq;
// take the squared norm of this vector
real_number rn = norm2(r);
if (rn > r_cut2)
{++Np; continue;};
// Calculate the force, using pow is slower
Point<3,real_number> f = 24.0*(2.0 *sigma12 / (rn*rn*rn*rn*rn*rn*rn) - sigma6 / (rn*rn*rn*rn)) * r;
// we sum the force produced by q on p
vd.template getProp<force>(p)[0] += f.get(0);
vd.template getProp<force>(p)[1] += f.get(1);
vd.template getProp<force>(p)[2] += f.get(2);
// Next neighborhood
++Np;
}
}
//! \cond [calculate_force_kernel] \endcond
template<typename vector_dist_type>
__global__ void update_velocity_position(vector_dist_type vd, real_number dt)
{
auto p = GET_PARTICLE(vd);
// here we calculate v(tn + 0.5)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
// here we calculate x(tn + 1)
vd.getPos(p)[0] += vd.template getProp<velocity>(p)[0]*dt;
vd.getPos(p)[1] += vd.template getProp<velocity>(p)[1]*dt;
vd.getPos(p)[2] += vd.template getProp<velocity>(p)[2]*dt;
}
template<typename vector_dist_type>
__global__ void update_velocity(vector_dist_type vd, real_number dt)
{
auto p = GET_PARTICLE(vd);
// here we calculate v(tn + 1)
vd.template getProp<velocity>(p)[0] += 0.5*dt*vd.template getProp<force>(p)[0];
vd.template getProp<velocity>(p)[1] += 0.5*dt*vd.template getProp<force>(p)[1];
vd.template getProp<velocity>(p)[2] += 0.5*dt*vd.template getProp<force>(p)[2];
}
//! \cond [calculate_energy_kernel] \endcond
template<typename vector_dist_type,typename NN_type>
__global__ void particle_energy(vector_dist_type vd, NN_type NN, real_number sigma12, real_number sigma6, real_number shift, real_number r_cut2)
{
auto p = GET_PARTICLE(vd);
// Get the position of the particle p
Point<3,real_number> xp = vd.getPos(p);
// Get an iterator over the neighborhood of the particle p
auto Np = NN.getNNIterator(NN.getCell(vd.getPos(p)));
real_number E = 0;
// For each neighborhood particle of p
while (Np.isNext())
{
// Neighborhood particle q
auto q = Np.get();
// if p == q skip this particle
if (q == p) {++Np; continue;};
// Get position of the particle q
Point<3,real_number> xq = vd.getPos(q);
// take the squared distance between p and q
real_number rn = norm2(xp - xq);
if (rn > r_cut2)
{++Np;continue;}
// potential energy (using pow is slower)
E += 2.0 * ( sigma12 / (rn*rn*rn*rn*rn*rn) - sigma6 / ( rn*rn*rn) ) - shift;
// Next neighborhood
++Np;
}
// store the potential energy plus the kinetic energy (from the current velocity) of the particle
vd.template getProp<energy>(p) = E + (vd.template getProp<velocity>(p)[0]*vd.template getProp<velocity>(p)[0] +
vd.template getProp<velocity>(p)[1]*vd.template getProp<velocity>(p)[1] +
vd.template getProp<velocity>(p)[2]*vd.template getProp<velocity>(p)[2]) / 2;
}
//! \cond [calculate_energy_kernel] \endcond
//! \cond [calc_forces] \endcond
template<typename CellList> void calc_forces(vector_dist_gpu<3,real_number, aggregate<real_number[3],real_number[3],real_number> > & vd, CellList & NN, real_number sigma12, real_number sigma6, real_number r_cut2)
{
vd.updateCellList(NN);
// Get an iterator over particles
auto it2 = vd.getDomainIteratorGPU();
calc_force_gpu<<<it2.wthr,it2.thr>>>(vd.toKernel(),NN.toKernel(),sigma12,sigma6,r_cut2);
}
//! \cond [calc_forces] \endcond
template<typename CellList> real_number calc_energy(vector_dist_gpu<3,real_number, aggregate<real_number[3],real_number[3],real_number> > & vd, CellList & NN, real_number sigma12, real_number sigma6, real_number r_cut2)
{
real_number rc = r_cut2;
real_number shift = 2.0 * ( sigma12 / (rc*rc*rc*rc*rc*rc) - sigma6 / ( rc*rc*rc) );
vd.updateCellList(NN);
auto it2 = vd.getDomainIteratorGPU();
particle_energy<<<it2.wthr,it2.thr>>>(vd.toKernel(),NN.toKernel(),sigma12,sigma6,shift,r_cut2);
//! \cond [calc_energy_red] \endcond
// Calculated energy
return reduce_local<energy,_add_>(vd);
//! \cond [calc_energy_red] \endcond
}
int main(int argc, char* argv[])
{
openfpm_init(&argc,&argv);
real_number sigma = 0.01;
real_number r_cut = 3.0*sigma;
// we will use it to place particles on a 100x100x100 grid-like arrangement
size_t sz[3] = {100,100,100};
// domain
Box<3,float> box({0.0,0.0,0.0},{1.0,1.0,1.0});
// Boundary conditions
size_t bc[3]={PERIODIC,PERIODIC,PERIODIC};
// ghost, big enough to contain the interaction radius
Ghost<3,float> ghost(r_cut);
real_number dt = 0.00005;
real_number sigma12 = pow(sigma,12);
real_number sigma6 = pow(sigma,6);
openfpm::vector<real_number> x;
openfpm::vector<openfpm::vector<real_number>> y;
vector_dist_gpu<3,real_number, aggregate<real_number[3],real_number[3],real_number> > vd(0,box,bc,ghost);
// We create the grid iterator
auto it = vd.getGridIterator(sz);
while (it.isNext())
{
// Create a new particle
vd.add();
// key contain (i,j,k) index of the grid
auto key = it.get();
// The index of the grid can be accessed with key.get(0) == i, key.get(1) == j ...
// We use getLastPos to set the position of the last particle added
vd.getLastPos()[0] = key.get(0) * it.getSpacing(0);
vd.getLastPos()[1] = key.get(1) * it.getSpacing(1);
vd.getLastPos()[2] = key.get(2) * it.getSpacing(2);
// We use getLastProp to set the property value of the last particle we added
vd.template getLastProp<velocity>()[0] = 0.0;
vd.template getLastProp<velocity>()[1] = 0.0;
vd.template getLastProp<velocity>()[2] = 0.0;
vd.template getLastProp<force>()[0] = 0.0;
vd.template getLastProp<force>()[1] = 0.0;
vd.template getLastProp<force>()[2] = 0.0;
++it;
}
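// offload the initial configuration to the GPU, then redistribute the particles
// and fill the ghost layer directly on the device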
vd.hostToDevicePos();
vd.hostToDeviceProp<velocity,force>();
vd.map(RUN_ON_DEVICE);
vd.ghost_get<>(RUN_ON_DEVICE);
timer tsim;
tsim.start();
//! \cond [md steps] \endcond
// Get the Cell list structure
auto NN = vd.getCellListGPU(r_cut);
// The standard CPU cell-list would be:
// auto NN = vd.getCellList(r_cut);
// calculate forces
calc_forces(vd,NN,sigma12,sigma6,r_cut*r_cut);
unsigned long int f = 0;
// MD time stepping
for (size_t i = 0; i < nstep ; i++)
{
// Get the iterator
auto it3 = vd.getDomainIteratorGPU();
update_velocity_position<<<it3.wthr,it3.thr>>>(vd.toKernel(),dt);
//! \cond [run_on_device] \endcond
// Because we moved the particles in space we have to map them and re-sync the ghost
vd.map(RUN_ON_DEVICE);
vd.template ghost_get<>(RUN_ON_DEVICE);
//! \cond [run_on_device] \endcond
// calculate forces or a(tn + 1) Step 2
calc_forces(vd,NN,sigma12,sigma6,r_cut*r_cut);
// Integrate the velocity Step 3
auto it4 = vd.getDomainIteratorGPU();
update_velocity<<<it4.wthr,it4.thr>>>(vd.toKernel(),dt);
// Every 100 iterations collect some statistics about the configuration
if (i % 100 == 0)
{
//! \cond [move_to_host] \endcond
vd.deviceToHostPos();
vd.deviceToHostProp<0,1,2>();
// We write the particle position for visualization (Without ghost)
vd.deleteGhost();
vd.write_frame("particles_",f);
//! \cond [move_to_host] \endcond
// we resync the ghost
vd.ghost_get<>(RUN_ON_DEVICE);
// We calculate the energy
real_number energy = calc_energy(vd,NN,sigma12,sigma6,r_cut*r_cut);
auto & vcl = create_vcluster();
vcl.sum(energy);
vcl.execute();
// we save the energy calculated at time step i; x contains the time-step and y the energy
x.add(i);
y.add({energy});
// We also print on terminal the value of the energy
// only one processor (master) writes on terminal
if (vcl.getProcessUnitID() == 0)
std::cout << "Energy: " << energy << std::endl;
f++;
}
}
//! \cond [md steps] \endcond
tsim.stop();
std::cout << "Time: " << tsim.getwct() << std::endl;
// Google charts options, it store the options to draw the X Y graph
GCoptions options;
// Title of the graph
options.title = std::string("Energy with time");
// Y axis name
options.yAxis = std::string("Energy");