
Vector Addition in CUDA - Parallel Programming


In this article we cover:

1) Serial code with a brief description

2) Parallel code with a brief description

3) Comparison of the serial and parallel code times

4) Graph comparison of the serial and parallel code times





Brief description of the problem :


  • We use three arrays a, b, and c: the first two are added element-wise and the result is stored in the third. The problem size ranges from 2^8 up to 2^29, and we measure the time for each size. The reported time includes both computation time and memory access time.
  • The goal of this problem statement is to compare the serial code time against the parallel code time. For the parallel version we use the CUDA C/C++ compiler.
  • The time complexity of the problem is O(n).
1) Serial code with a brief description

First, we start with the serial code for vector addition:

#include<stdio.h>
#include<math.h>
#include<time.h>
#include<stdlib.h>

void initialization(int *a, int *b, int size);
void calculation(int *a, int *b, int *c, int size);

int main(){
    int min = pow(2, 8);
    int max = pow(2, 29);
    int size = 0;
    int *a, *b, *c;
    clock_t start, end;
    double walltime, th;

    for(size = min; size < max; size = size*2){
        a = (int *)malloc(size * sizeof(int));
        b = (int *)malloc(size * sizeof(int));
        c = (int *)malloc(size * sizeof(int));

        initialization(a, b, size);

        start = clock();
        calculation(a, b, c, size);
        end = clock();

        walltime = (end - start)/(double)CLOCKS_PER_SEC;
        // throughput in GB/s (numerator fixed at the largest size, matching the output below)
        th = (max * sizeof(int)) / walltime;
        th = th / pow(10, 9);
        printf("Number : %d Walltime :%lf Throughput: %lf\n", size, walltime, th);

        free(a);
        free(b);
        free(c);
    }
    return 0;
}

void initialization(int *a, int *b, int size){
    for(int i = 0; i < size; i++){
        a[i] = 1;
        b[i] = 1;
    }
}

void calculation(int *a, int *b, int *c, int size){
    for(int i = 0; i < size; i++){
        c[i] = a[i] + b[i];
    }
}





Explanation:
  • The program above has two functions: the initialization function, which fills the arrays for the current problem size, and the calculation function, which adds array a and array b and stores the result in array c.
  • To begin with, we initialize min as 2^8 and max as 2^29, and use a for loop to double the problem size on each iteration. For each size we allocate memory with malloc, initialize the arrays, and run the vector addition while measuring the walltime and throughput. In the end, we free the memory.

Note: to run a C/C++ (and CUDA) program in Google Colab, follow this article: https://www.geeksforgeeks.org/how-to-run-cuda-c-c-on-jupyter-notebook-in-google-colaboratory/
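Once that environment is set up (or on any machine with the CUDA toolkit installed), the compile-and-run step typically looks like this; the source file name here is only an assumption:

```shell
# compile the CUDA source with nvcc (the NVIDIA CUDA compiler)
nvcc vector_add.cu -o vector_add
# run the resulting binary
./vector_add
```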


2) Parallel code with a brief description

  • Now we start parallel programming with the CUDA compiler. The code does the same thing as the serial version above; the only change is that it uses CUDA.
  • Here we use one additional header, #include<cuda.h>.
Program :

#include<stdio.h>
#include<math.h>
#include<cuda.h>
#include<time.h>
#include<stdlib.h>

void initialization(int *a, int *b, int length);

// kernel: each thread adds one pair of elements
__global__ void add(int *a, int *b, int *c, int n){
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if(id < n){
        c[id] = a[id] + b[id];
    }
}

int main(){
    int *h_a, *h_b, *h_c;
    int *dev_a, *dev_b, *dev_c;
    int max = pow(2, 29);
    clock_t start_t, end_t;
    int size = 0;
    double walltime, th;

    for(int j = 256; j < max; j = j*2){
        size = j * sizeof(int);

        // allocate memory for the host variables
        h_a = (int*)malloc(size);
        h_b = (int*)malloc(size);
        h_c = (int*)malloc(size);

        // initialize the host arrays
        initialization(h_a, h_b, j);

        start_t = clock();
        // allocate memory for the device variables
        cudaMalloc((void **)&dev_a, size);
        cudaMalloc((void **)&dev_b, size);
        cudaMalloc((void **)&dev_c, size);

        // copy host to device
        cudaMemcpy(dev_a, h_a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, h_b, size, cudaMemcpyHostToDevice);

        // launch the kernel: a fixed block size of 256 threads stays under the
        // 1024-threads-per-block limit; the grid is sized to cover all j elements
        int threads = 256;
        int blocks = (j + threads - 1) / threads;
        add<<<blocks, threads>>>(dev_a, dev_b, dev_c, j);

        // copy device to host (cudaMemcpy also waits for the kernel to finish)
        cudaMemcpy(h_c, dev_c, size, cudaMemcpyDeviceToHost);
        end_t = clock();

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);

        // time and throughput (same normalization as in the serial code)
        walltime = (end_t - start_t)/(double)CLOCKS_PER_SEC;
        th = (max * sizeof(int)) / walltime;
        th = th / pow(10, 9);
        printf("Number : %d walltime : %lf, Throughput : %lf\n", j, walltime, th);

        free(h_a);
        free(h_b);
        free(h_c);
    }
    return 0;
}

void initialization(int *a, int *b, int length){
    for(int i = 0; i < length; i++){
        a[i] = 1;
        b[i] = i + 1;
    }
}



Explanation :
  • The parallel program has two main functions: the initialization function, which initializes the host (CPU) arrays, and the add function, which runs on the GPU and performs the vector addition. A function that runs on the GPU is called a kernel.
  • First, we include all the headers the program needs; all of them are familiar except cuda.h. We declare three arrays for the host and three pointers for the device (GPU), and use the same for loop as in the serial code to increase the problem size. Inside the loop, we allocate the host memory with malloc and fill it with the initialization function. The new part is cudaMalloc, which allocates memory on the device; the host arrays are then copied into that device memory with cudaMemcpy.
  • Next, we call the add (kernel) function, which runs on the GPU. After that, we do the reverse transfer, copying the result from device memory back to host memory. In the end, we free both the host (CPU) and device (GPU) memory.
  • The program also measures the walltime and throughput for every problem size.

3) Compare the times of the serial and parallel vector addition code :-


  • The times of the serial and parallel code depend strictly on the hardware architecture, so when you run this code on your own system the times will vary slightly. A more powerful CPU and GPU will run it faster.
  • Here we use Google Colab to run these programs. On a Linux system, you can see the CPU architecture information by running the command " lscpu ".
  • This is the Google Colab CPU information :
    • Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 2
      On-line CPU(s) list: 0,1
      Thread(s) per core: 2
      Core(s) per socket: 1
      Socket(s): 1
      NUMA node(s): 1
      Vendor ID: GenuineIntel
      CPU family: 6
      Model: 63
      Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
      Stepping: 0
      CPU MHz: 2299.998
      BogoMIPS: 4599.99
      Hypervisor vendor: KVM
      Virtualization type: full
      L1d cache: 32K
      L1i cache: 32K
      L2 cache: 256K
      L3 cache: 46080K
      NUMA node0 CPU(s): 0,1
  • This is the Google Colab GPU information :
    • CUDA Device Query...
      There are 1 CUDA devices.
      CUDA Device #0
      Major revision number: 3
      Minor revision number: 7
      Name: Tesla K80
      Total global memory: 11996954624
      Total shared memory per block: 49152
      Total registers per block: 65536
      Warp size: 32
      Maximum memory pitch: 2147483647
      Maximum threads per block: 1024
      Maximum dimension 0 of block: 1024
      Maximum dimension 1 of block: 1024
      Maximum dimension 2 of block: 64
      Maximum dimension 0 of grid: 2147483647
      Maximum dimension 1 of grid: 65535
      Maximum dimension 2 of grid: 65535
      Clock rate: 823500
      Total constant memory: 65536
      Texture alignment: 512
      Concurrent copy and execution: Yes
      Number of multiprocessors: 13
      Kernel execution timeout: No

  • Serial code output :
    • Number : 256 Walltime :0.135601 Throughput: 15.836783
      Number : 512 Walltime :0.000219 Throughput: 9805.861406
      Number : 1024 Walltime :0.000189 Throughput: 11362.347344
      Number : 2048 Walltime :0.000238 Throughput: 9023.040538
      Number : 4096 Walltime :0.000201 Throughput: 10683.998249
      Number : 8192 Walltime :0.000228 Throughput: 9418.787930
      Number : 16384 Walltime :0.000281 Throughput: 7642.290562
      Number : 32768 Walltime :0.000368 Throughput: 5835.553391
      Number : 65536 Walltime :0.000599 Throughput: 3585.114604
      Number : 131072 Walltime :0.001024 Throughput: 2097.152000
      Number : 262144 Walltime :0.001873 Throughput: 1146.547596
      Number : 524288 Walltime :0.002493 Throughput: 861.405394
      Number : 1048576 Walltime :0.004421 Throughput: 485.746132
      Number : 2097152 Walltime :0.008099 Throughput: 265.154173
      Number : 4194304 Walltime :0.015666 Throughput: 137.079258
      Number : 8388608 Walltime :0.032623 Throughput: 65.827289
      Number : 16777216 Walltime :0.060658 Throughput: 35.403140
      Number : 33554432 Walltime :0.116946 Throughput: 18.363036
      Number : 67108864 Walltime :0.234380 Throughput: 9.162401
      Number : 134217728 Walltime :0.466315 Throughput: 4.605221
      Number : 268435456 Walltime :0.925140 Throughput: 2.321253
  • Parallel code output :
    • Number : 65536 walltime : 0.000371, Throughput : 5788.365628
      Number : 131072 walltime : 0.000701, Throughput : 3063.457415
      Number : 262144 walltime : 0.001706, Throughput : 1258.782912
      Number : 524288 walltime : 0.002982, Throughput : 720.148775
      Number : 1048576 walltime : 0.006064, Throughput : 354.136485
      Number : 2097152 walltime : 0.012341, Throughput : 174.012126
      Number : 4194304 walltime : 0.024580, Throughput : 87.367113
      Number : 8388608 walltime : 0.049522, Throughput : 43.364235
      Number : 16777216 walltime : 0.100103, Throughput : 21.452740
      Number : 33554432 walltime : 0.199526, Throughput : 10.762926
      Number : 67108864 walltime : 0.390404, Throughput : 5.500670
      Number : 134217728 walltime : 0.777626, Throughput : 2.761589
      Number : 268435456 walltime : 1.519407, Throughput : 1.413370
 
 
4) Compare the serial code time and parallel code time with graphs.
  • Serial code : 
    • Problem size vs walltime :


      •     Note: the x-axis represents all values as powers of 2.
             


  • Parallel code :
    • Problem Size vs walltime : 


      •     Note: the x-axis represents all values as powers of 2.
 

Summary : 

  If the problem size is big, a GPU can give better performance; otherwise the CPU gives better performance. We can see this in the graphs, keeping in mind that the parallel times here include the host-device memory transfers.
