Mathematics has many branches; today we discuss one of them, linear algebra. It is widely applied in machine learning, particularly in image processing algorithms. A matrix can be viewed as a collection of vectors, and matrix multiplication is a binary operation on two matrices, with one requirement: the number of columns of the first matrix must equal the number of rows of the second matrix.
Example:
For two 3x3 matrices A and B:

    [ A00 A01 A02 ]       [ B00 B01 B02 ]
A = [ A10 A11 A12 ]   B = [ B10 B11 B12 ]
    [ A20 A21 A22 ]       [ B20 B21 B22 ]

their product C is:

            [ C00 C01 C02 ]
C = A . B = [ C10 C11 C12 ]
            [ C20 C21 C22 ]
To compute one element of C, we take the corresponding row of matrix A and column of matrix B, multiply them element-wise, and add up the products. In general, Cij = Ai0 * B0j + Ai1 * B1j + Ai2 * B2j.
For instance:
C00 = A00 * B00 + A01 * B10 + A02 * B20
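A quick numeric check with 2x2 matrices (the values are just illustrative):

A = [ 1 2 ]   B = [ 5 6 ]
    [ 3 4 ]       [ 7 8 ]

C00 = 1*5 + 2*7 = 19    C01 = 1*6 + 2*8 = 22
C10 = 3*5 + 4*7 = 43    C11 = 3*6 + 4*8 = 50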
A serial implementation of matrix multiplication takes O(n^3) time and O(n^2) memory.
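For reference, a minimal serial version looks like the sketch below (the fixed 3x3 size and the matrix values are illustrative, not from the originally linked code):

#include <stdio.h>

#define N 3

// Serial matrix multiplication: three nested loops, hence O(n^3) time
void matrix_mult(int a[N][N], int b[N][N], int c[N][N])
{
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += a[row][k] * b[k][col];
            c[row][col] = sum;
        }
    }
}

int main(void)
{
    int a[N][N] = {{2, 2, 2}, {2, 2, 2}, {2, 2, 2}};
    int b[N][N] = {{3, 3, 3}, {3, 3, 3}, {3, 3, 3}};
    int c[N][N];
    matrix_mult(a, b, c);
    printf("c[0][0] = %d\n", c[0][0]); // 2*3 + 2*3 + 2*3 = 18
    return 0;
}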
Now let us write the parallel code for matrix multiplication. Here we don't take input from the user; we simply initialize the matrices with fixed values and do the calculation.
Note: If you don't have an NVIDIA graphics card, you can use Google Colab to run the program. If you are new to Google Colab, select a GPU runtime first; the nvcc compiler is already available there.
Parallel code for matrix multiplication:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#define BLOCK_SIZE 16 // 16x16 = 256 threads per block
__global__ void gpu_matrix_mult(int *a, int *b, int *c, int n)
{
    // each thread computes one element of C at (row, col)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    if (col < n && row < n)
    {
        // dot product of a row of A with a column of B, both in global memory
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * n + col];
        }
        c[row * n + col] = sum;
    }
}
int main(int argc, char const *argv[])
{
    int n;
    for (n = 64; n <= 8192; n *= 2) {
        // allocate pinned memory on the host; h_c receives the GPU result
        int *h_a, *h_b, *h_c;
        cudaMallocHost((void **) &h_a, sizeof(int) * n * n);
        cudaMallocHost((void **) &h_b, sizeof(int) * n * n);
        cudaMallocHost((void **) &h_c, sizeof(int) * n * n);
        // initialize matrix A
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_a[i * n + j] = 2;
            }
        }
        // initialize matrix B
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                h_b[i * n + j] = 3;
            }
        }
        float naive_gpu_elapsed_time_ms;
        // CUDA events to measure the kernel execution time
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        // Allocate memory space on the device
        int *d_a, *d_b, *d_c;
        cudaMalloc((void **) &d_a, sizeof(int) * n * n);
        cudaMalloc((void **) &d_b, sizeof(int) * n * n);
        cudaMalloc((void **) &d_c, sizeof(int) * n * n);
        // copy matrices A and B from host to device memory
        cudaMemcpy(d_a, h_a, sizeof(int) * n * n, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, sizeof(int) * n * n, cudaMemcpyHostToDevice);
        // one thread per element of C, rounding the grid up to whole blocks
        unsigned int grid_rows = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        unsigned int grid_cols = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        dim3 dimGrid(grid_cols, grid_rows);
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        cudaEventRecord(start, 0);
        gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
        // stop the timer
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        // Transfer the result from device to host
        cudaMemcpy(h_c, d_c, sizeof(int) * n * n, cudaMemcpyDeviceToHost);
        // compute time elapsed on the GPU
        cudaEventElapsedTime(&naive_gpu_elapsed_time_ms, start, stop);
        printf("Time elapsed on naive GPU matrix multiplication of %dx%d . %dx%d: %f ms.\n\n", n, n, n, n, naive_gpu_elapsed_time_ms);
        // free device and host memory and the timing events
        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        cudaFreeHost(h_a); // memory from cudaMallocHost needs cudaFreeHost, not free()
        cudaFreeHost(h_b);
        cudaFreeHost(h_c);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    return 0;
}
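To compile and run the program locally (assuming the file is saved as matrix_mul.cu; the file name is just an example):

nvcc matrix_mul.cu -o matrix_mul
./matrix_mul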
Note: Profiling is also worth learning. A profiler reports statistics about the program, such as how much time each kernel and function takes to execute, which is the starting point for any optimization.
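For example, the CUDA toolkit ships its own profilers; assuming the matrix_mul binary from the compile step above, either of these works depending on your toolkit version:

nvprof ./matrix_mul
nsys profile --stats=true ./matrix_mul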
Summary:
- We learned the basic layout of a matrix.
- We learned how matrix multiplication works.
- We learned its time and memory complexity.
- We learned a parallel CUDA code for matrix multiplication.
If you want to optimize this parallel code, you can use another method called tiling. Tiling brings in another kind of memory, shared memory. The kernel above reads everything from global memory, where an access takes 100 cycles or more, while shared memory needs only a few cycles to fetch data. However, shared memory has very limited capacity, so we need to think about how to use it efficiently; doing so increases the program's scalability. A sketch of such a tiled kernel follows.
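The kernel below is a minimal sketch of the tiling idea, not a tuned implementation: it reuses the BLOCK_SIZE define and the same grid/block launch configuration as gpu_matrix_mult above, staging one BLOCK_SIZE x BLOCK_SIZE tile of A and one of B in shared memory at a time so the whole thread block can reuse them.

__global__ void gpu_matrix_mult_tiled(int *a, int *b, int *c, int n)
{
    // one tile of A and one tile of B live in fast shared memory
    __shared__ int tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int tile_b[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int sum = 0;
    // walk across A's row of tiles and down B's column of tiles
    for (int t = 0; t < (n + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t)
    {
        // each thread loads one element of each tile (0 when out of range)
        int a_col = t * BLOCK_SIZE + threadIdx.x;
        int b_row = t * BLOCK_SIZE + threadIdx.y;
        tile_a[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? a[row * n + a_col] : 0;
        tile_b[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? b[b_row * n + col] : 0;
        __syncthreads(); // wait until the whole tile is loaded
        // partial dot product using only the shared-memory copies
        for (int i = 0; i < BLOCK_SIZE; ++i)
        {
            sum += tile_a[threadIdx.y][i] * tile_b[i][threadIdx.x];
        }
        __syncthreads(); // wait before the tiles are overwritten
    }
    if (row < n && col < n)
        c[row * n + col] = sum;
}

With a 16x16 tile, each value loaded from global memory is reused 16 times from shared memory, which is where the speedup comes from.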