Parallel Programming and CUDA

Transcript

  1. Parallel Programming & CUDA (icysword@nate.com)
  2. 2014-04-23 (icysword@nate.com) 1. 2. 2.1. Core 2.2. Thread 2.3. Process 2.4. GPGPU 3. What is CUDA? 4. CUDA 5. 6.
  3. Moore's Law (Gordon Moore, co-founder, Intel): 18 months, 2x. Intel Core i3 / i5 / i7 clock speeds: 2010 1 Clarkdale - i3 : 3.3 GHz - i5 : 3.6 GHz - i7 : 3.2 GHz; 2014 2 Haswell - i3 : 2.4 GHz - i5 : 2.9 GHz - i7 : 3.3 GHz
  4. Multicore Processor: i7 - Cores 4, Threads 8; i5 - Cores 4, Threads 4; i3 - Cores 2, Threads 4
  5. Loop (Fibonacci Sequence)
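
The formula on this slide did not survive extraction. Assuming it uses the Fibonacci sequence as an example of a loop whose iterations depend on earlier iterations, the contrast with a parallelizable loop might be sketched as follows. The function names `fibonacci` and `squares` are hypothetical and chosen only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Loop-carried dependency: iteration i needs the results of iterations
// i-1 and i-2, so the iterations have to run in order.
std::vector<unsigned long long> fibonacci(std::size_t n) {
    std::vector<unsigned long long> f(n);
    for (std::size_t i = 0; i < n; ++i) {
        f[i] = (i < 2) ? i : f[i - 1] + f[i - 2];
    }
    return f;
}

// Independent iterations: each output depends only on its own index,
// so the iterations could be split across cores or GPU threads.
std::vector<unsigned long long> squares(std::size_t n) {
    std::vector<unsigned long long> s(n);
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = static_cast<unsigned long long>(i) * i;
    }
    return s;
}
```

Only the second loop is a candidate for the Thread and GPGPU techniques that the later slides cover.
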
  6. Core, Thread, Process, GPGPU
  7. Core: SIMD (Single Instruction Multiple Data). Code: int a = 1 + 2; int b = 3 + 4; int c = 5 + 6; int d = 7 + 8; SISD (four separate additions): a = 1 + 2, b = 3 + 4, c = 5 + 6, d = 7 + 8. SIMD (one vector addition): (a, b, c, d) = (1, 3, 5, 7) + (2, 4, 6, 8). (diagram labels: Memory, Register)
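
A minimal sketch of the two forms on this slide, written in portable C++. The function names are hypothetical. The second function only expresses the packed, element-wise form; whether it compiles to real SIMD instructions depends on the compiler and its optimization settings, which is an assumption here and not something the slide states.

```cpp
#include <array>
#include <cstddef>

// SISD style: four scalar additions, one result per statement.
std::array<int, 4> add_sisd() {
    int a = 1 + 2;
    int b = 3 + 4;
    int c = 5 + 6;
    int d = 7 + 8;
    return {a, b, c, d};
}

// SIMD style: the operands are packed into two vectors and added
// element by element, which a vectorizing compiler may map to a
// single vector instruction.
std::array<int, 4> add_simd_style(const std::array<int, 4>& x,
                                  const std::array<int, 4>& y) {
    std::array<int, 4> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = x[i] + y[i];
    }
    return out;
}
```

Calling `add_simd_style` with (1, 3, 5, 7) and (2, 4, 6, 8) gives the same four values as `add_sisd`.
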
  8. Thread: OpenMP, pthreads, Parallel Pattern Library; Loop, Section
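
The slide names OpenMP together with "Loop" and "Section". Assuming these refer to OpenMP's `parallel for` and `sections` constructs, a small sketch could look like this. The function names are hypothetical. Compiled without OpenMP support (for example without `-fopenmp` on GCC), the pragmas are ignored and the code runs sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Loop-level parallelism: iterations are divided among threads and the
// per-thread partial sums are combined by the reduction clause.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        const double x = v[static_cast<std::size_t>(i)];
        sum += x * x;
    }
    return sum;
}

// Section-level parallelism: two unrelated tasks may run on different
// threads. Assumes v is not empty.
void min_and_max(const std::vector<double>& v, double& lo, double& hi) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            lo = v[0];
            for (double x : v) if (x < lo) lo = x;
        }
        #pragma omp section
        {
            hi = v[0];
            for (double x : v) if (x > hi) hi = x;
        }
    }
}
```
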
  9. Process: MPI, HPF, PVM
  10. GPGPU
  11. GPGPU: Floating-Point Operations per Second for the CPU and GPU. GTX 770 : 3.2 TFLOPS; i7 : 141 GFLOPS; ratio : 22.6x
  12. GPGPU: Memory Bandwidth for the CPU and GPU. GTX 770 : 224.3 GB/s; i7 : 25.6 GB/s; ratio : 8.8x
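
The last number on each of the two slides above appears to be the GPU-to-CPU ratio of the quoted figures. A short check of that arithmetic, with hypothetical names:

```cpp
// Ratio of a GPU figure to the matching CPU figure.
constexpr double ratio(double gpu, double cpu) { return gpu / cpu; }

// GTX 770 vs i7 peak throughput, both expressed in GFLOPS.
constexpr double kFlopsRatio = ratio(3200.0, 141.0);

// GTX 770 vs i7 memory bandwidth, both in GB/s.
constexpr double kBandwidthRatio = ratio(224.3, 25.6);
```

The throughput ratio comes out just under 22.7 and the bandwidth ratio just under 8.8, which is consistent with the 22.6 and 8.8 shown on the slides.
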
  13. What is CUDA? (Compute Unified Device Architecture). November 2006: GeForce 8800 GTX GPU. GeForce : Graphic Card; Quadro : Graphic; Tesla : Graphic
  14. CPU vs GPU: The GPU Devotes More Transistors to Data Processing
  15. CUDA Hardware Architecture: SM (Streaming Multiprocessor); CUDA Core (Streaming Processor)
  16. CUDA Data Parallel Threading Model. Block : (SM); Warp : SM Thread; Grid : Block; Thread : Block (CUDA Core)
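
How the Grid, Block, Warp and Thread levels relate can be modeled on the host for a one-dimensional launch. This is only a sketch: the struct and function names are hypothetical, and on the device the same values come from the built-in variables `blockIdx.x`, `blockDim.x` and `threadIdx.x`. The warp size of 32 is the value given in the compute capability table later in the deck.

```cpp
// Position of one thread in a 1-D launch.
struct ThreadCoord {
    int block;   // index of the block within the grid (blockIdx.x)
    int thread;  // index of the thread within its block (threadIdx.x)
};

constexpr int kWarpSize = 32;

// Unique index of a thread across the whole grid.
constexpr int global_index(ThreadCoord c, int block_dim) {
    return c.block * block_dim + c.thread;
}

// Which warp of its block a thread belongs to.
constexpr int warp_in_block(ThreadCoord c) {
    return c.thread / kWarpSize;
}

// Number of warps needed for one block (rounded up).
constexpr int warps_per_block(int block_dim) {
    return (block_dim + kWarpSize - 1) / kWarpSize;
}
```

For example, thread 5 of block 2 in a launch with 256 threads per block has global index 517, and a block of 1000 threads still occupies 32 warps because the last warp is only partly filled.
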
  17. Structure of CUDA Memory
      Register
      - On-chip processor memory
      - Local ( Global)
      Shared Memory
      - On-chip processor memory
      - SM Thread
      - L1
      Constant Memory
      - Write (from DRAM) : 400 ~ 600 cycles
      - Read : Register
      Global Memory
      - Video card DRAM
      - Read/Write : 400 ~ 600 cycles
      Texture Memory
      - Global Memory
      (diagram labels: Register, Shared Memory, Constant Memory, Global Memory, Texture Memory)
  18. CUDA Streaming (diagram labels: Host DRAM, GPU DRAM, Data)
  19. CUDA Library. cuRAND : CUDA Random Number Generation library; CUFFT : FFT Library (CUDA Fast Fourier Transform library); CUBLAS : CUDA Basic Linear Algebra Subroutines library
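  20. CUDA Programming

      Terms: Host : CPU; Device : GPU; Kernel : function launched from the Host and executed on the Device; Atomic : Mutual Exclusion

      CPU version:

          float fResult[1024][1000];
          float fData[1024][1000];

          for (int i = 0; i < 1024; i++) {
              for (int j = 0; j < 1000; j++) {
                  for (int k = 0; k < 33; k++) {
                      fResult[i][j] += Calc(fData[i][j], k);
                  }
              }
          }

      GPU version:

          // Calc() is not defined on the slide; it has to be a __device__
          // function to be callable from the kernel.
          __device__ float Calc(float fData, int k);

          static float fData[1024][1000];  // input data on the Host

          // Kernel: each thread computes one element and accumulates in a register.
          __global__ void KernelFunc(float* i_fResult, float* i_fData)
          {
              float fResult = 0.0f;
              // Global index = block offset + thread offset within the block.
              // (The slide multiplies by gridDim; blockDim.x is the block size.)
              int index = blockIdx.x * blockDim.x + threadIdx.x;
              float fData = i_fData[index];
              for (int k = 0; k < 33; k++) {
                  fResult += Calc(fData, k);
              }
              i_fResult[index] = fResult;
          }

          int main()
          {
              float *dev_fResult, *dev_fData;
              int iSizeData = 1024 * 1000 * sizeof(float);

              // Allocate Device memory and copy the input from Host to Device.
              cudaMalloc((void**)&dev_fResult, iSizeData);
              cudaMemset(dev_fResult, 0, iSizeData);
              cudaMalloc((void**)&dev_fData, iSizeData);
              cudaMemcpy(dev_fData, fData, iSizeData, cudaMemcpyHostToDevice);

              // The launch configuration is not shown on the slide. Assumed here:
              // 1024 blocks x 1000 threads, one thread per element.
              KernelFunc<<<1024, 1000>>>(dev_fResult, dev_fData);

              // Copy the result back; this copy waits for the kernel to finish.
              float* fResult = new float[1024 * 1000];
              cudaMemcpy(fResult, dev_fResult, iSizeData, cudaMemcpyDeviceToHost);

              cudaFree(dev_fResult);
              cudaFree(dev_fData);
              delete[] fResult;
          }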
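
One way to sanity-check the decomposition on this slide is to emulate it on the host: give every (block, thread) pair the index block x block size + thread, let it process one element, and compare the result with the nested-loop version. The sketch below does that in plain C++. `Calc` is not defined on the slide, so the body used here is a stand-in, and the names `reference` and `emulate_kernel` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the slide's undefined Calc(); purely illustrative.
inline float Calc(float data, int k) { return data * static_cast<float>(k); }

// Nested-loop reference over a rows x cols array stored row-major.
std::vector<float> reference(const std::vector<float>& data, int rows, int cols) {
    std::vector<float> out(data.size(), 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            for (int k = 0; k < 33; ++k)
                out[i * cols + j] += Calc(data[i * cols + j], k);
    return out;
}

// Emulates a launch of `blocks` blocks with `threads` threads each.
// Every (block, thread) pair handles one element, as one GPU thread would.
std::vector<float> emulate_kernel(const std::vector<float>& data,
                                  int blocks, int threads) {
    std::vector<float> out(data.size(), 0.0f);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads; ++t) {
            const std::size_t index = static_cast<std::size_t>(b) * threads + t;
            if (index >= data.size()) continue;  // guard for a partial last block
            float acc = 0.0f;
            for (int k = 0; k < 33; ++k) acc += Calc(data[index], k);
            out[index] = acc;
        }
    }
    return out;
}
```

The bounds guard matters only when the number of launched threads exceeds the number of elements; a launch that matches the data size exactly never triggers it.
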
  21. CUDA Compute capability (version). Source: http://en.wikipedia.org/wiki/CUDA (Technical specifications, excerpt)
      Compute capability (example GPU): 1.0 (8800 GTX), 1.1 (8400M GT), 1.2 (GTS 350M), 1.3 (GTX 280), 2.x (GTX 550), 3.0 (GTX 770), 3.5 (GTX TITAN), 5.0 (GTX 750)
      - Maximum dimensionality of grid of thread blocks: 2 (1.x); 3 (2.x and later)
      - Maximum x-, y-, or z-dimension of a grid of thread blocks: 65535 (1.x, 2.x); 2^31 - 1 (3.0 and later)
      - Maximum dimensionality of thread block: 3
      - Maximum x- or y-dimension of a block: 512 (1.x); 1024 (2.x and later)
      - Maximum z-dimension of a block: 64
      - Maximum number of threads per block: 512 (1.x); 1024 (2.x and later)
      - Warp size: 32
      - Maximum number of resident blocks per multiprocessor: 8 (1.x, 2.x); 16 (3.0, 3.5); 32 (5.0)
      - Maximum number of resident warps per multiprocessor: 24 (1.0, 1.1); 32 (1.2, 1.3); 48 (2.x); 64 (3.0 and later)
      - Maximum number of resident threads per multiprocessor: 768 (1.0, 1.1); 1024 (1.2, 1.3); 1536 (2.x); 2048 (3.0 and later)
      - Number of 32-bit registers per multiprocessor: 8 K (1.0, 1.1); 16 K (1.2, 1.3); 32 K (2.x); 64 K (3.0 and later)
      - Maximum number of 32-bit registers per thread: 128 (1.x); 63 (2.x, 3.0); 255 (3.5, 5.0)
      - Maximum amount of shared memory per multiprocessor: 16 KB (1.x); 48 KB (2.x, 3.x); 64 KB (5.0)
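
The resident-block and resident-warp limits determine how many blocks of a given size fit on one multiprocessor at a time. The sketch below works that out for the compute capability 3.0 limits (16 resident blocks, 64 resident warps, warp size 32). Pairing these particular values with 3.0 follows NVIDIA's published specifications and is an assumption with respect to the flattened table. Register and shared memory limits are ignored, and all names are hypothetical.

```cpp
#include <algorithm>

// Per-multiprocessor limits; the defaults are assumed to be the
// compute capability 3.0 values.
struct SmLimits {
    int max_blocks = 16;
    int max_warps  = 64;
    int warp_size  = 32;
};

// Resident blocks per multiprocessor when only the block-count and
// warp-count limits apply.
int resident_blocks(int threads_per_block, SmLimits lim = SmLimits{}) {
    const int warps_per_block =
        (threads_per_block + lim.warp_size - 1) / lim.warp_size;
    return std::min(lim.max_blocks, lim.max_warps / warps_per_block);
}

// Threads that can be resident at once for that block size.
int resident_threads(int threads_per_block, SmLimits lim = SmLimits{}) {
    return resident_blocks(threads_per_block, lim) * threads_per_block;
}
```

Under these assumptions a block size of 256 reaches the full 2048 resident threads, while a block size of 64 is capped at 1024 by the 16-block limit.
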
  22. CUDA 6.0 : Unified Memory http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6
  23. CUDA : 1. Beam Pattern - 1000 x 1000 Bin - 100 Sensor ( BeamPattern ) - Data. CPU : 32 s (OpenMP); CUDA : 0.26 s; speedup : 123x
  24. CUDA : 2. Beam Pattern - 192 x 1000 Bin - 16 - Data Read/Write : 192 x 1000 x floatComplex x 2 (LOFAR/DEMON). CPU : 900 ms (OpenMP); CUDA : 14 ms; speedup : 64x
  25. CUDA : Cubic Spline Interpolation. CPU: Xi(14) Xi(15) Xi(16) Xo(n) Xo(n+1) - : Sequential Operation - : Sequential
  26. Raspberry-Pi - 700 MHz ARM11 CPU - Broadcom VideoCore IV GPU - 256 MB RAM - 10.13 GFLOPS - i7 : 141 GFLOPS - : 107 TFLOPS http://www.raspberrypi.org
  27. Multi-GPU Motherboard http://prod.danawa.com/info/?pcode=2466508&cate1=861&cate2=875&cate3=968&cate4=0