
Matrix multiplication in CUDA

There are many branches of mathematics; today we discuss one of them: linear algebra. It is widely applied in machine learning, particularly in image processing algorithms. A matrix is a rectangular array of numbers built from vectors, and matrix multiplication is a binary operation on two matrices. The one requirement is that the number of columns of the first matrix equals the number of rows of the second matrix.


Example  :


The matrices look like this:

      [ A00  A01  A02 ]          [ B00  B01  B02 ]
A =   [ A10  A11  A12 ]    B =   [ B10  B11  B12 ]
      [ A20  A21  A22 ]          [ B20  B21  B22 ]

                  [ C00  C01  C02 ]
C = A . B =       [ C10  C11  C12 ]
                  [ C20  C21  C22 ]
First, we take the first row of matrix A and the first column of matrix B, multiply the corresponding elements, and add the products.

For instance:

            we calculate C00 = A00 * B00 + A01 * B10 + A02 * B20

In general, for the 3x3 case, Cij = Ai0 * B0j + Ai1 * B1j + Ai2 * B2j.
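
As a quick check with small 2x2 matrices (the values here are just for illustration):

A = [ 1  2 ]    B = [ 5  6 ]
    [ 3  4 ]        [ 7  8 ]

C00 = 1*5 + 2*7 = 19    C01 = 1*6 + 2*8 = 22
C10 = 3*5 + 4*7 = 43    C11 = 3*6 + 4*8 = 50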




If we write serial code for matrix multiplication, the time complexity is O(n^3) and the memory complexity is O(n^2). (If you want to see the serial code of matrix multiplication, then Get code.)
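
For reference, a minimal serial version might look like the sketch below (the small fixed size n = 4 and the test values 2 and 3 are just illustrative choices, not part of the linked code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 4;                                     // small size just for illustration
    int *a = (int *) malloc(sizeof(int) * n * n);
    int *b = (int *) malloc(sizeof(int) * n * n);
    int *c = (int *) malloc(sizeof(int) * n * n);

    // initialize A with 2s and B with 3s, as in the parallel version below
    for (int i = 0; i < n * n; i++) { a[i] = 2; b[i] = 3; }

    // three nested loops give O(n^3) time; the three n x n arrays give O(n^2) memory
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }

    printf("c[0][0] = %d\n", c[0][0]);             // expect 2 * 3 * n = 24 here
    free(a); free(b); free(c);
    return 0;
}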

Now we write the parallel code for matrix multiplication. Here we don't take input from a user; we just initialize the matrices with fixed values and do the calculation.


Note: If you don't have an Nvidia graphics card, you can use Google Colab to run the program. If you are new to Google Colab, set up the nvcc compiler in Colab by following the steps here: Get Info.

Parallel code for matrix multiplication : 

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

#define BLOCK_SIZE 16


// each thread computes one element of C at position (row, col)
__global__ void gpu_matrix_mult(int *a, int *b, int *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    if (col < n && row < n)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * n + col];
        }
        c[row * n + col] = sum;
    }
}

 


int main(int argc, char const *argv[])
{
    int n;
    
    for(n = 64; n <=8192 ; n *= 2) {
        // allocate pinned memory in host RAM; h_cc receives the result copied back from the GPU
        int *h_a, *h_b, *h_c, *h_cc;
        cudaMallocHost((void **) &h_a, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_b, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_c, sizeof(int)*n*n);
        cudaMallocHost((void **) &h_cc, sizeof(int)*n*n);
        

        // initialize matrix A
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_a[i * n + j] = 2;
            }
        }

        // initialize matrix B
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_b[i * n + j] = 3;
            }
        }

        float  naive_gpu_elapsed_time_ms;

        // some events to count the execution time
        //clock_t st, end;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Allocate memory space on the device 
        int *d_a, *d_b, *d_c;
        cudaMalloc((void **) &d_a, sizeof(int)*n*n);
        cudaMalloc((void **) &d_b, sizeof(int)*n*n);
        cudaMalloc((void **) &d_c, sizeof(int)*n*n);
        
        
        
        // copy matrix A and B from host to device memory
        cudaMemcpy(d_a, h_a, sizeof(int)*n*n, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, sizeof(int)*n*n, cudaMemcpyHostToDevice);
        
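        // launch one thread per output element of C; the grid size is rounded up
        // so matrix sizes that are not multiples of BLOCK_SIZE are still covered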
        unsigned int grid_rows = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        unsigned int grid_cols = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        dim3 dimGrid(grid_cols, grid_rows);
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        
        
        
        cudaEventRecord(start, 0);
        gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();
       
        // time counting terminate
       
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        // Transfer results from device to host 
        cudaMemcpy(h_cc, d_c, sizeof(int)*n*n, cudaMemcpyDeviceToHost);

        // compute time elapsed on GPU computing
        cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
        printf("Time elapsed on naive GPU matrix multiplication of %dx%d . %dx%d : %f ms.\n\n", n, n, n, n, naive_gpu_elapsed_time_ms);
        
        
        // free memory
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        cudaFreeHost(h_a);
        cudaFreeHost(h_b);
        cudaFreeHost(h_c);
        cudaFreeHost(h_cc);
    }
    return 0;
}
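
To compile and run the program (assuming the source is saved as matrix_mul.cu, a name chosen here just as an example, and the CUDA toolkit is installed):

nvcc matrix_mul.cu -o matrix_mul
./matrix_mul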

Note: if you are interested in learning about profiling, then click here. Profiling gives statistical information about the execution time of each function and other details of the run.
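For example, assuming the compiled binary from above is named matrix_mul, a basic profile can be collected with:

nvprof ./matrix_mul

(on newer CUDA toolkits, use nsys profile ./matrix_mul instead).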

Summary :

- We learned the basic format of a matrix.
- We learned matrix multiplication.
- We learned its time and memory complexity.
- We learned the parallel (CUDA) code for matrix multiplication.


If you want to optimize this parallel code, we can use another method called tiling. With tiling, we use another kind of memory called shared memory. Above, every access goes to global memory, which takes around 100 cycles or more, while shared memory takes only a few cycles to fetch data. However, shared memory has limited capacity, so we need to think about how to use it efficiently and thereby increase the program's scalability.
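
Here is a minimal sketch of a tiled kernel that uses shared memory. The kernel name is illustrative, it reuses BLOCK_SIZE from the program above, and for simplicity it assumes n is a multiple of BLOCK_SIZE; a complete version would need boundary checks like the naive kernel.

// Tiled matrix multiplication: each block loads BLOCK_SIZE x BLOCK_SIZE tiles of A and B
// into shared memory, so each loaded value is reused by BLOCK_SIZE threads instead of
// being fetched from global memory again and again.
// Sketch only: assumes n is a multiple of BLOCK_SIZE.
__global__ void gpu_matrix_mult_tiled(int *a, int *b, int *c, int n)
{
    __shared__ int tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int sum = 0;

    for (int t = 0; t < n / BLOCK_SIZE; t++)
    {
        // each thread loads one element of each tile from global memory
        tile_a[threadIdx.y][threadIdx.x] = a[row * n + t * BLOCK_SIZE + threadIdx.x];
        tile_b[threadIdx.y][threadIdx.x] = b[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                        // wait until the whole tile is loaded

        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        __syncthreads();                        // wait before overwriting the tiles
    }

    c[row * n + col] = sum;
}

Launched with the same dimGrid and dimBlock as before, each value loaded into shared memory is reused by BLOCK_SIZE threads, which cuts global-memory traffic by roughly a factor of BLOCK_SIZE.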

