-
What is the difference between SIMD and SPMD?
- Answer: In an SPMD system, the parallel processing units execute the same program on multiple parts of the data, but they do not need to be executing the same instruction at the same time. In a SIMD system, all processing units execute the same instruction at any given instant.
-
What is CUDA C, and how does it differ from traditional programming languages?
- Answer: CUDA C is an extension of the C programming language designed by NVIDIA for parallel programming on GPUs. It includes features for writing parallel code and exploiting the parallel architecture of NVIDIA GPUs. Unlike traditional C, CUDA C lets developers explicitly express parallelism by writing kernel functions that run across thousands of GPU threads, using language extensions such as function qualifiers and the kernel launch syntax.
-
Explain the key components of a CUDA C program.
- Answer: A CUDA C program consists of both host code (executed on the CPU) and device code (executed on the GPU). Kernel functions, which are executed on the GPU, are a crucial component. CUDA C programs also include memory allocation and data transfer between the CPU and GPU.
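As an illustration, here is a minimal sketch that puts these components together; the kernel name scale, the array size, and the block size of 256 are arbitrary choices, not taken from the text above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Device code: a kernel executed in parallel by many GPU threads. */
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            data[i] *= factor;
    }

    /* Host code: allocates device memory, copies data, launches the kernel. */
    int main(void) {
        const int n = 1024;
        size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc((void **)&d_data, bytes);                /* allocate on the GPU */
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);   /* launch: grid, block */

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        printf("h_data[10] = %f\n", h_data[10]);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }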
-
What is a CUDA kernel, and how is it executed on the GPU?
- Answer: A CUDA kernel is a function that runs on the GPU. It is invoked by the host code and executed in parallel by numerous threads on the GPU. Each thread executes the same kernel code but processes different data, and these threads are organized into thread blocks and grids.
-
Explain the terms "host" and "device" in the context of CUDA C.
- Answer: In CUDA C, the host refers to the CPU and its memory, and the device refers to the GPU and its memory. The host code is executed on the CPU, while the device code (kernels) runs on the GPU. Data is transferred between host and device memory as needed.
-
How does CUDA C handle memory management on the GPU?
- Answer: CUDA C provides functions for allocating and freeing memory on the GPU, such as cudaMalloc and cudaFree. Additionally, there are specific memory spaces, such as global memory, shared memory, and local memory, which are used for data storage during GPU computations.
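A sketch of how these memory spaces appear in device code (the names are illustrative, and a block size of at most 256 threads is assumed):

    __device__ float d_table[256];      /* global memory: visible to all threads, persists across kernels */

    __global__ void memorySpaces(const float *in, float *out) {
        __shared__ float tile[256];     /* shared memory: one copy per thread block, on-chip */
        float tmp;                      /* local per-thread variable, usually kept in a register */

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];      /* stage a value from global memory into shared memory */
        __syncthreads();

        tmp = tile[threadIdx.x] + d_table[threadIdx.x];
        out[i] = tmp;                   /* write the result back to global memory */
    }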
-
Explain the concept of threads, blocks, and grids in CUDA C.
- Answer: Threads are individual units of execution, organized into blocks. Blocks are groups of threads that can communicate and synchronize, and grids are collections of blocks. The organization of threads, blocks, and grids enables parallel execution on the GPU.
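A hedged sketch of the hierarchy on both sides of the launch; the kernel and function names are placeholders:

    __global__ void whoAmI(int *out, int n) {
        /* Each thread derives a unique global index from its block and thread coordinates. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = i;
    }

    /* Host side: a grid is a collection of blocks; a block is a group of threads. */
    void launchWhoAmI(int *d_out, int n) {
        dim3 block(256);                          /* 256 threads per block */
        dim3 grid((n + block.x - 1) / block.x);   /* enough blocks to cover n elements */
        whoAmI<<<grid, block>>>(d_out, n);
    }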
-
How does CUDA C handle data parallelism?
- Answer: CUDA C achieves data parallelism by launching a large number of threads to process different elements of data simultaneously. Each thread executes the same kernel code but processes different data, enabling parallel computation.
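One common way to express this data parallelism is a grid-stride loop, sketched below (saxpy is just an illustrative kernel name):

    /* Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element,
       so a single launch covers an array of any size. */
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }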
-
What is a warp in the context of CUDA C, and why is it important for performance?
- Answer: A warp is a group of threads (32 on current NVIDIA GPUs) that execute the same instruction simultaneously on the GPU. Warps allow for efficient parallelism because all threads in a warp execute in lockstep. Performance can be optimized by ensuring that threads within a warp follow similar execution paths.
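A small sketch showing how a thread can find its position within a warp; the output arrays are illustrative:

    __global__ void warpInfo(int *warpIdOut, int *laneIdOut) {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int lane   = threadIdx.x % warpSize;   /* position within the warp (warpSize is 32) */
        int warpId = threadIdx.x / warpSize;   /* which warp of the block this thread belongs to */
        warpIdOut[tid] = warpId;
        laneIdOut[tid] = lane;
    }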
-
Explain how CUDA C handles synchronization among threads within a block.
- Answer: CUDA C provides synchronization mechanisms such as __syncthreads() to synchronize threads within a block. This ensures that all threads reach a specific point in the kernel before proceeding, enabling coordinated execution.
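A minimal sketch of a block-wide barrier in use, assuming blockDim.x == 256 and an array length that is a multiple of 256:

    /* Reverse the elements handled by one block using shared memory.
       __syncthreads() guarantees that every thread has written its element
       before any thread reads a different element back out. */
    __global__ void blockReverse(float *data) {
        __shared__ float buf[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        buf[threadIdx.x] = data[i];
        __syncthreads();                 /* barrier: all writes to buf are complete */

        data[i] = buf[blockDim.x - 1 - threadIdx.x];
    }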
-
What is the role of shared memory in CUDA C, and how does it differ from global memory?
- Answer: Shared memory is a fast, low-latency, on-chip memory space shared among the threads of a block. It is used for communication and data sharing between those threads. Global memory, by contrast, resides in device DRAM, is visible to all threads, and has much higher latency, so shared memory is the better choice for data that a block accesses repeatedly.
-
Explain the process of launching a CUDA kernel from the host code.
- Answer: To launch a CUDA kernel, the host code uses the execution configuration syntax (<<<gridDim, blockDim>>>) to specify the organization of threads into blocks and grids. The kernel is then called like a regular C function, and the specified grid and block dimensions determine how threads are organized for parallel execution.
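For illustration, a sketch of the full execution configuration, which takes up to four arguments; myKernel stands for a kernel defined elsewhere:

    __global__ void myKernel(float *data, int n);   /* illustrative kernel, defined elsewhere */

    /* <<<grid, block, dynamicSharedMemBytes, stream>>>: the last two arguments
       are optional and default to 0 and the default stream. */
    void launchMyKernel(float *d_data, int n, cudaStream_t stream) {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        size_t shmemBytes = block.x * sizeof(float);   /* dynamic shared memory per block */
        myKernel<<<grid, block, shmemBytes, stream>>>(d_data, n);
    }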
-
How does CUDA C handle error checking, and what are common error-handling practices?
- Answer: CUDA C provides error-checking facilities such as cudaGetLastError and the cudaError_t codes returned by runtime API calls to identify and handle errors. It is common practice to check for errors after kernel launches and memory allocations, and to print error messages or take corrective action when errors occur.
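A common pattern is a checking macro like the sketch below (CUDA_CHECK is an illustrative name, not a library macro):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Wrap runtime API calls; print the error string and abort on failure. */
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    /* Kernels do not return an error code, so check after the launch:
       CUDA_CHECK(cudaGetLastError());        catches launch-configuration errors
       CUDA_CHECK(cudaDeviceSynchronize());   surfaces errors from kernel execution */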
-
What is the significance of the warp divergence problem in CUDA C, and how can it be mitigated?
- Answer: Warp divergence occurs when threads within a warp take different execution paths, forcing the warp to execute each path serially and reducing performance. Mitigation strategies include restructuring code so that threads in the same warp follow the same path, replacing branches with branchless (predicated) arithmetic, and arranging data so that neighboring threads take the same branch.
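A hedged sketch contrasting a divergent branch with a branchless alternative (both kernels are illustrative):

    /* Divergent version: even and odd lanes of the same warp take different
       branches, so the warp executes both paths one after the other. */
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0) x[i] = x[i] * 2.0f;
            else            x[i] = x[i] + 1.0f;
        }
    }

    /* Branchless alternative: express the choice as a select, which the
       compiler can predicate so all lanes run the same instructions. */
    __global__ void branchless(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = (i % 2 == 0) ? x[i] * 2.0f : x[i] + 1.0f;
    }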
-
Explain how constant memory is used in CUDA C and its advantages.
- Answer: Constant memory is a small, read-only memory space in CUDA C that is cached on chip. Access is fastest when all threads in a warp read the same address, since the value is broadcast to the whole warp. It is well suited to parameters and constants used uniformly by all threads in a kernel, reducing pressure on global memory.
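A short sketch of declaring, filling, and reading constant memory; the array coeffs and its size are illustrative:

    __constant__ float coeffs[16];     /* read-only on the device, cached on chip */

    __global__ void applyFilter(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coeffs[0] + coeffs[1];   /* every thread reads the same addresses */
    }

    void setup(const float *h_coeffs) {
        /* The host fills constant memory before the kernel runs. */
        cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
    }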
-
How does CUDA C support asynchronous execution, and why is it beneficial?
- Answer: CUDA C supports asynchronous execution through streams, asynchronous memory transfers (cudaMemcpyAsync), and kernel launches, which return control to the host immediately. Asynchronous execution enables overlapping computation with data transfers and with host work, improving overall performance.
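A sketch of issuing work into a stream; process stands for a kernel defined elsewhere, and the host buffer is assumed to be pinned (allocated with cudaMallocHost) so the copies can be truly asynchronous:

    __global__ void process(float *data, int n);   /* illustrative kernel, defined elsewhere */

    void pipelined(float *h_data, float *d_data, int n) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* These calls return immediately; work in the same stream still runs in order. */
        cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        /* The CPU is free to do other work here, then wait for the stream. */
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }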
-
Explain the concept of streaming multiprocessors (SMs) in CUDA architecture and their role in parallel computation.
- Answer: Streaming Multiprocessors (SMs) are processing units within the GPU that execute parallel threads. Each SM contains multiple CUDA cores that perform parallel computation. SMs play a crucial role in achieving parallelism in CUDA C programs.
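For reference, a small sketch that queries how many SMs the current device has:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
        printf("%s: %d SMs, %d max threads per SM, warp size %d\n",
               prop.name, prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor, prop.warpSize);
        return 0;
    }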
-
What is the purpose of warp-synchronous programming in CUDA C?
- Answer: Warp-synchronous programming in CUDA C coordinates the threads within a single warp using warp-level primitives such as __shfl_sync and __ballot_sync, exploiting the fact that a warp is scheduled together. It avoids the cost of block-wide barriers and shared-memory staging for warp-local communication; on recent architectures the _sync variants with an explicit participation mask must be used, since lockstep execution within a warp is no longer guaranteed.
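A typical example is a warp-level reduction using shuffle intrinsics; a sketch, assuming all 32 lanes of the warp are active (hence the full 0xffffffff mask):

    /* Sum a value across the 32 threads of a warp with no shared memory
       and no __syncthreads(); after the loop, lane 0 holds the warp-wide sum. */
    __inline__ __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }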
-
How does CUDA C handle dynamic parallelism, and in what scenarios is it useful?
- Answer: Dynamic parallelism in CUDA C allows kernels to launch other kernels. It is useful when the workload is not known at compile time or when hierarchical parallelism is required. It enables more flexible and adaptive GPU programming.
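A minimal sketch of a parent kernel launching a child kernel; this requires a device of compute capability 3.5 or higher and compiling with relocatable device code (e.g. nvcc -rdc=true -lcudadevrt):

    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float *data, int n) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            /* The work size is decided at run time, on the device itself. */
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
        }
    }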