
Matrix multiplication in CUDA

There are many branches of mathematics; today we discuss one of them: linear algebra. It is widely applied in machine learning, particularly in image processing algorithms. A matrix is a rectangular array of numbers built from vectors, and matrix multiplication is a binary operation on two matrices. The one requirement is that the number of columns of the first matrix equals the number of rows of the second matrix.


Example  :


The matrices look like this:

      [ A00  A01  A02 ]          [ B00  B01  B02 ]
A =   [ A10  A11  A12 ]    B =   [ B10  B11  B12 ]
      [ A20  A21  A22 ]          [ B20  B21  B22 ]

                  [ C00  C01  C02 ]
C = A . B =       [ C10  C11  C12 ]
                  [ C20  C21  C22 ]
First, we take the first row of matrix A and the first column of matrix B, multiply the corresponding elements, and add the products.

For instance:

            we calculate C00 = A00 * B00 + A01 * B10 + A02 * B20

In general, for the 3x3 case, Cij = Ai0 * B0j + Ai1 * B1j + Ai2 * B2j.
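
As a quick check with small 2x2 matrices (the values here are just for illustration):

A = [ 1  2 ]    B = [ 5  6 ]
    [ 3  4 ]        [ 7  8 ]

C00 = 1*5 + 2*7 = 19    C01 = 1*6 + 2*8 = 22
C10 = 3*5 + 4*7 = 43    C11 = 3*6 + 4*8 = 50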




If we write serial code for matrix multiplication, the time complexity is O(n^3) and the memory complexity is O(n^2). (If you want to see the serial code of matrix multiplication, then Get code.)
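
For reference, a minimal serial version might look like the sketch below (the small fixed size n = 4 and the test values 2 and 3 are just illustrative choices, not part of the linked code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 4;                                     // small size just for illustration
    int *a = (int *) malloc(sizeof(int) * n * n);
    int *b = (int *) malloc(sizeof(int) * n * n);
    int *c = (int *) malloc(sizeof(int) * n * n);

    // initialize A with 2s and B with 3s, as in the parallel version below
    for (int i = 0; i < n * n; i++) { a[i] = 2; b[i] = 3; }

    // three nested loops give O(n^3) time; the three n x n arrays give O(n^2) memory
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }

    printf("c[0][0] = %d\n", c[0][0]);             // expect 2 * 3 * n = 24 here
    free(a); free(b); free(c);
    return 0;
}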

Now we write the parallel code for matrix multiplication. Here we don't take input from a user; we just initialize the matrices with fixed values and do the calculation.


Note: If you don't have an Nvidia graphics card, you can use Google Colab to run the program. If you are new to Google Colab, set up the nvcc compiler in Colab by following the steps here: Get Info.

Parallel code for matrix multiplication : 

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

#define BLOCK_SIZE 16


// each thread computes one element of C at position (row, col)
__global__ void gpu_matrix_mult(int *a, int *b, int *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    if (col < n && row < n)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * n + col];
        }
        c[row * n + col] = sum;
    }
}

 


int main(int argc, char const *argv[])
{
    int n;
    
    for(n = 64; n <=8192 ; n *= 2) {
        // allocate pinned memory in host RAM; h_cc receives the result copied back from the GPU
        int *h_a, *h_b, *h_c, *h_cc;
        cudaMallocHost((void **) &h_a, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_b, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_c, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_cc, sizeof(int)*n*n);
        

        // initialize matrix A
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_a[i * n + j] = 2;
            }
        }

        // initialize matrix B
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_b[i * n + j] = 3;
            }
        }

        float  naive_gpu_elapsed_time_ms;

        // some events to count the execution time
        //clock_t st, end;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Allocate memory space on the device 
        int *d_a, *d_b, *d_c;
        cudaMalloc((void **) &d_a, sizeof(int)*n*n);
        cudaMalloc((void **) &d_b, sizeof(int)*n*n);
        cudaMalloc((void **) &d_c, sizeof(int)*n*n);
        
        
        
        // copy matrix A and B from host to device memory
        cudaMemcpy(d_a, h_a, sizeof(int)*n*n, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, sizeof(int)*n*n, cudaMemcpyHostToDevice);
        
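        // launch one thread per output element of C; the grid size is rounded up
        // so matrix sizes that are not multiples of BLOCK_SIZE are still covered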
        unsigned int grid_rows = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        unsigned int grid_cols = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        dim3 dimGrid(grid_cols, grid_rows);
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        
        
        
        cudaEventRecord(start, 0);
        gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();
       
        // time counting terminate
       
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        // Transfer results from device to host 
        cudaMemcpy(h_cc, d_c, sizeof(int)*n*n, cudaMemcpyDeviceToHost);

        // compute time elapsed on GPU computing
        cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
        printf("Time elapsed on naive GPU matrix multiplication of %dx%d . %dx%d : %f ms.\n\n", n, n, n, n, naive_gpu_elapsed_time_ms);
        
        
        // free memory
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        cudaFreeHost(h_a);
        cudaFreeHost(h_b);
        cudaFreeHost(h_c);
        cudaFreeHost(h_cc);
    }
    return 0;
}
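
To compile and run the program (assuming the source is saved as matrix_mul.cu, a name chosen here just as an example, and the CUDA toolkit is installed):

nvcc matrix_mul.cu -o matrix_mul
./matrix_mul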

Note: if you are interested in learning about profiling, then click here. Profiling gives statistical information about the execution time of each function and other details of the run.
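For example, assuming the compiled binary from above is named matrix_mul, a basic profile can be collected with:

nvprof ./matrix_mul

(on newer CUDA toolkits, use nsys profile ./matrix_mul instead).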

Summary :

- We learned the basic format of a matrix.
- We learned matrix multiplication.
- We learned its time and memory complexity.
- We learned the parallel (CUDA) code for matrix multiplication.


If you want to optimize this parallel code, we can use another method called tiling. With tiling, we use another kind of memory called shared memory. Above, every access goes to global memory, which takes around 100 cycles or more, while shared memory takes only a few cycles to fetch data. However, shared memory has limited capacity, so we need to think about how to use it efficiently and thereby increase the program's scalability.
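
Here is a minimal sketch of a tiled kernel that uses shared memory. The kernel name is illustrative, it reuses BLOCK_SIZE from the program above, and for simplicity it assumes n is a multiple of BLOCK_SIZE; a complete version would need boundary checks like the naive kernel.

// Tiled matrix multiplication: each block loads BLOCK_SIZE x BLOCK_SIZE tiles of A and B
// into shared memory, so each loaded value is reused by BLOCK_SIZE threads instead of
// being fetched from global memory again and again.
// Sketch only: assumes n is a multiple of BLOCK_SIZE.
__global__ void gpu_matrix_mult_tiled(int *a, int *b, int *c, int n)
{
    __shared__ int tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int sum = 0;

    for (int t = 0; t < n / BLOCK_SIZE; t++)
    {
        // each thread loads one element of each tile from global memory
        tile_a[threadIdx.y][threadIdx.x] = a[row * n + t * BLOCK_SIZE + threadIdx.x];
        tile_b[threadIdx.y][threadIdx.x] = b[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                        // wait until the whole tile is loaded

        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        __syncthreads();                        // wait before overwriting the tiles
    }

    c[row * n + col] = sum;
}

Launched with the same dimGrid and dimBlock as before, each value loaded into shared memory is reused by BLOCK_SIZE threads, which cuts global-memory traffic by roughly a factor of BLOCK_SIZE.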

