/*! \page Vector_9_gpu_cuda_interop Vector 9 GPU CUDA interoperability
 *
 *
 * [TOC]
 *
 *
 * # GPU CUDA inter-operability with plain arrays # {#GPU_9_cuda_interop}
 *
 * OpenFPM provides the possibility to operate with CUDA using plain arrays. In particular, we can ask the distributed
 * data-structure to return a CUDA device pointer to its data. Before operating with such a pointer we must understand
 * how vector_dist_gpu stores data internally, in order to read the data through it correctly
 *
 * ## Array striding {#e9_array_stride}
 *
 * To understand how vector_dist_gpu stores data, we will print the address in memory of each element. Let us start by printing
 *  the address of the first particle for all the properties
 *
 * \snippet Vector/9_gpu_cuda_interop/main.cu first_particle_prop_zero_and_one_two
 *
 * Running the program on one process, we get
 *
 * \code
First particle property 0, address: 0x7f8d63c00400
First particle property 1, address: 0x7f8d63c00600
First particle property 2, address: 0x7f8d83400000
 * \endcode
 *
 * As we can see, the scalar, vector and tensor properties are not even nearly contiguous. The reason is that every property uses
 * its own CUDA buffer, and each property can be off-loaded separately.
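 *
 * For instance (a minimal sketch, using the same hostToDeviceProp call that appears later in this example), a single
 * property can be moved to the device on its own
 *
 * \code
 *
 * // copy only property 1 (the float[2] vector property) from host to device,
 * // leaving the buffers of the other properties untouched
 * vd.template hostToDeviceProp<1>();
 *
 * \endcode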
 *
 * Now we check how the components of the vector and of the tensor property are stored in memory, printing the address of each component.
 *
 * \snippet Vector/9_gpu_cuda_interop/main.cu first_particle_vector_tensor_layout
 *
 * The output we obtain looks like this
 *
 * \code
Capacity internal vector: 128
First particle property 1 component 0, address: 0x7f8d63c00600
First particle property 1 component 1, address: 0x7f8d63c00800
First particle property 2 component 00, address: 0x7f8d83400000
First particle property 2 component 01, address: 0x7f8d83400200
First particle property 2 component 10, address: 0x7f8d83400400
First particle property 2 component 11, address: 0x7f8d83400600
Second particle property 1 component 0, address: 0x7f8d63c00604
Second particle property 1 component 1, address: 0x7f8d63c00804
Second particle property 2 component 00, address: 0x7f8d83400004
Second particle property 2 component 01, address: 0x7f8d83400204
Second particle property 2 component 10, address: 0x7f8d83400404
Second particle property 2 component 11, address: 0x7f8d83400604
 * \endcode
 *
 * As we can see, for the first particle the y component of the vector property is not contiguous to the x component, but sits at an
 * offset of 0x200 = 4 bytes * 128 from it. What is contiguous to particle 0 component x is particle 1 component x
 *
 * \note This is in general hidden and transparent to the user. In fact, in the examples shown so far we were able to create a distributed vector and
 *       compute on it without knowing how vector_dist stores data. It only becomes necessary if you want to use CUDA with plain primitive arrays
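 *
 * As a minimal sketch (plain pointer arithmetic, not part of the OpenFPM API), the address of component k of particle p
 * for a float[2] property stored with this layout can be reconstructed from the base pointer and the capacity; with
 * p = 1, k = 1 and capacity = 128 this gives an offset of 129 floats = 0x204 bytes, matching the addresses printed above
 *
 * \code
 *
 * // hypothetical helper: component k of particle p lives at base + k*capacity + p
 * inline float * component_address(float * base, size_t p, size_t k, size_t capacity)
 * {
 * 	return base + k*capacity + p;
 * }
 *
 * \endcode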
 *
 * vector_dist_gpu uses this layout because of memory-coalesced access. Suppose you want to access
 * a vector property in a GPU kernel like this
 *
 * \code
 *
 * vd.template getProp<vector>(p)[0]
 *
 * \endcode
 *
 * In general we map the particle index p to the GPU thread that handles that particle. With this in mind, let us see what happens
 * when one SM hits that instruction with the standard interleaved layout (case A) and with the strided layout used by vector_dist_gpu (case B).
 *
 * \verbatim
                          Memory                                                              Memory

   particle 0  [0]x        0x000      <------- Access thread 0         particle 0  [0]x        0x000      <------- Access thread 0
               [1]y        0x004                                       particle 1  [0]x        0x004      <------- Access thread 1
   particle 1  [0]x        0x008      <------- Access thread 1         particle 2  [0]x        0x008      <------- Access thread 2
               [1]y          .                                         particle 3  [0]x        0x00C      <------- Access thread 3
   particle 2  [0]x          .                                                  .
               [1]y          .                                                  .
               .             .                                                  .
               .             .                                                  .
               .             .                                                  .
   particle N  [0]x          .       <-------- Access thread N                  .
               [1]y          .                                         particle 0  [1]y

                  Case A                                                            Case B
 * \endverbatim
 *
 * As we can see from the diagram, in case A consecutive threads access addresses 8 bytes apart (the 4-byte y component is skipped between
 *  accesses), while in case B the accesses are contiguous. This means that in case A the instruction reads twice as many cache lines as in case B.
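 *
 * As a minimal sketch (standalone CUDA kernels, not OpenFPM code), the two access patterns correspond to the following
 * raw kernels: in case A consecutive threads read addresses 8 bytes apart, in case B they read consecutive floats, so
 * a warp touches half as many cache lines
 *
 * \code
 *
 * // Case A: x and y interleaved, thread p reads element 2*p (strided access)
 * __global__ void read_x_interleaved(const float * data, float * out, int n)
 * {
 * 	int p = threadIdx.x + blockIdx.x * blockDim.x;
 * 	if (p < n)	{out[p] = data[2*p];}
 * }
 *
 * // Case B: all x components packed together, thread p reads element p (coalesced access)
 * __global__ void read_x_packed(const float * data, float * out, int n)
 * {
 * 	int p = threadIdx.x + blockIdx.x * blockDim.x;
 * 	if (p < n)	{out[p] = data[p];}
 * }
 *
 * \endcode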
 *
 *  It remains to understand why, with 100 particles, the y component sits at an offset of 4 * 128 = 512 bytes instead of 4 * 100 = 400 bytes.
 *  Two factors are involved: one is power-of-two alignment, the other is the internal pre-allocated vector buffer. Suppose we have a vector with
 *  4 elements and we want to add one element at the end. Since there is no free space, in principle we would have to allocate a vector of 5 elements,
 *  copy the 4 existing elements into it and then append the new one. This becomes expensive when the vector grows large: copying the full vector to
 *  add just one element would not make sense. By default OpenFPM expands the vector by a growth factor (default = 2), which guarantees that building
 *  a vector of N elements starting from an empty vector costs O(N).
 *
 * \note By default OpenFPM does not attempt to expand the virtual address space of the structure in order to avoid the copy
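 *
 * As a small sketch of the growth policy described above (assuming the default growth factor of 2; the numbers come from
 * the sketch itself, not from the OpenFPM internals), pushing 100 elements one by one into an initially empty vector ends
 * with a capacity of 128 and a total of 127 copied elements, i.e. O(N) work overall
 *
 * \code
 *
 * size_t size = 0, capacity = 0, copies = 0;
 *
 * for (int i = 0 ; i < 100 ; i++)
 * {
 * 	if (size == capacity)
 * 	{
 * 		capacity = (capacity == 0)?1:2*capacity;	// grow geometrically
 * 		copies += size;					// elements copied on reallocation
 * 	}
 * 	size++;
 * }
 *
 * // here capacity = 128 and copies = 127
 *
 * \endcode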
 *
 * ## Interoperability with CUDA {#e9_interop_cuda}
 *
 * Now that we understand the structure behind the device pointer, we can see how to use the internal device pointers in a CUDA kernel.
 * We launch a kernel that simply prints the information stored in the buffers. To get the CUDA device pointer we combine the function
 * \b getPropVector() \b, which returns the internal property vector, with \b getDeviceBuffer<0>() \b, which returns the CUDA device
 * buffer for property 0
 *
 * \snippet Vector/9_gpu_cuda_interop/main.cu print_50
 *
 * The kernel prints the information of particle 50. Note how we pass primitive arrays to the kernel and use the capacity to
 * access the components of the vector and of the tensor, according to what we explained in the array-striding section
 *
 * \snippet Vector/9_gpu_cuda_interop/main.cu print_data_kernel
 *
 *
 *
 * ## Full code ## {#code_e9_sim}
 *
 * \include Vector/9_gpu_cuda_interop/main.cu
 *
 */

#ifdef __NVCC__

#include "Vector/vector_dist.hpp"

//! [print_data_kernel]

__global__ void print_data_particle_50(float * scalar, float * vector, float * tensor, int capacity)
{
	int p = threadIdx.x + blockIdx.x * blockDim.x;

	if (p == 50)
	{
		// the scalar property is a plain array indexed by the particle index
		printf("Scalar particle %d = %f\n",p,scalar[p]);

		// component k of the vector property is at offset p + k*capacity
		printf("Vector particle %d = %f\n",p,vector[p]);
		printf("Vector particle %d = %f\n",p,vector[p + capacity]);

		// component (i,j) of the tensor property is at offset p + (i*2 + j)*capacity
		printf("Tensor particle %d = %f\n",p,tensor[p + (0*2 + 0)*capacity]);
		printf("Tensor particle %d = %f\n",p,tensor[p + (0*2 + 1)*capacity]);
		printf("Tensor particle %d = %f\n",p,tensor[p + (1*2 + 0)*capacity]);
		printf("Tensor particle %d = %f\n",p,tensor[p + (1*2 + 1)*capacity]);
	}
}

//! [print_data_kernel]

int main(int argc, char* argv[])
{
    // initialize the library
	openfpm_init(&argc,&argv);

	// Here we define our domain a 2D box with internals from 0 to 1.0 for x and y
	Box<2,float> domain({0.0,0.0},{1.0,1.0});

	// Here we define the boundary conditions of our problem
    size_t bc[2]={PERIODIC,PERIODIC};

	// extended boundary around the domain, and the processor domain
	Ghost<2,float> g(0.05);

    // a GPU-enabled distributed vector of 100 particles carrying a scalar, a float[2] vector and a float[2][2] tensor property
    vector_dist_gpu<2,float, aggregate<float,float[2],float[2][2]> > vd(100,domain,bc,g);

	auto it = vd.getDomainIterator();

	while (it.isNext())
	{
		auto p = it.get();

		// we define x, assign a random position between 0.0 and 1.0
		vd.getPos(p)[0] = (float)rand() / RAND_MAX;

		// we define y, assign a random position between 0.0 and 1.0
		vd.getPos(p)[1] = (float)rand() / RAND_MAX;

		// scalar property = x + y
		vd.template getProp<0>(p) = vd.getPos(p)[0] + vd.getPos(p)[1];

		// vector property = the particle position
		vd.template getProp<1>(p)[0] = vd.getPos(p)[0];
		vd.template getProp<1>(p)[1] = vd.getPos(p)[1];

		// tensor property = combinations of the position components
		vd.template getProp<2>(p)[0][0] = vd.getPos(p)[0];
		vd.template getProp<2>(p)[0][1] = vd.getPos(p)[1];
		vd.template getProp<2>(p)[1][0] = vd.getPos(p)[0] + vd.getPos(p)[1];
		vd.template getProp<2>(p)[1][1] = vd.getPos(p)[1] - vd.getPos(p)[0];

		// next particle
		++it;
	}

	// redistribute the particles across processors according to their position
	vd.map();

	//! \cond [map_and_ghost_get_on_gpu] \endcond

	//! \cond [first_particle_prop_zero_and_one_two] \endcond

	std::cout << "First particle property 0, address: " << &vd.template getProp<0>(0) << std::endl;
	std::cout << "First particle property 1, address: " << &vd.template getProp<1>(0)[0] << std::endl;
	std::cout << "First particle property 2, address: " << &vd.template getProp<2>(0)[0][0] << std::endl;

	//! \cond [first_particle_prop_zero_and_one_two] \endcond

	//! \cond [first_particle_vector_tensor_layout] \endcond

	std::cout << "Capacity internal vector: " << vd.getPropVector().capacity() << std::endl;

	std::cout << "First particle property 1 component 0, address: " << &vd.template getProp<1>(0)[0] << std::endl;
	std::cout << "First particle property 1 component 1, address: " << &vd.template getProp<1>(0)[1] << std::endl;

	std::cout << "First particle property 2 component 00, address: " << &vd.template getProp<2>(0)[0][0] << std::endl;
	std::cout << "First particle property 2 component 01, address: " << &vd.template getProp<2>(0)[0][1] << std::endl;
	std::cout << "First particle property 2 component 10, address: " << &vd.template getProp<2>(0)[1][0] << std::endl;
	std::cout << "First particle property 2 component 11, address: " << &vd.template getProp<2>(0)[1][1] << std::endl;

	std::cout << "Second particle property 1 component 0, address: " << &vd.template getProp<1>(1)[0] << std::endl;
	std::cout << "Second particle property 1 component 1, address: " << &vd.template getProp<1>(1)[1] << std::endl;

	std::cout << "Second particle property 2 component 00, address: " << &vd.template getProp<2>(1)[0][0] << std::endl;
	std::cout << "Second particle property 2 component 01, address: " << &vd.template getProp<2>(1)[0][1] << std::endl;
	std::cout << "Second particle property 2 component 10, address: " << &vd.template getProp<2>(1)[1][0] << std::endl;
	std::cout << "Second particle property 2 component 11, address: " << &vd.template getProp<2>(1)[1][1] << std::endl;

	//! \cond [first_particle_vector_tensor_layout] \endcond

	std::cout << std::endl;

	//! \cond [print_50] \endcond

	// copy properties 0,1 and 2 from host memory to the corresponding device buffers before launching the kernel
	vd.template hostToDeviceProp<0,1,2>();

	// launch 100 blocks of 1 thread each, passing the raw device pointers of the three property buffers
	// together with the capacity needed to compute the strided offsets
	print_data_particle_50<<<100,1>>>((float *)vd.getPropVector().template getDeviceBuffer<0>(),
			               (float *)vd.getPropVector().template getDeviceBuffer<1>(),
			               (float *)vd.getPropVector().template getDeviceBuffer<2>(),
			               vd.getPropVector().capacity());

	//! \cond [print_50] \endcond

	openfpm_finalize();
}

#else

int main(int argc, char* argv[])
{
        return 0;
}

#endif