Parallel Programming and CUDA

  • Published on
    20-Aug-2015


Transcript

1. Parallel Programming & CUDA (icysword@nate.com, 2014-04-23)

2. Contents
   1. …
   2. … (parallel processing)
      2.1. Core   2.2. Thread   2.3. Process   2.4. GPGPU
   3. What is CUDA?
   4. CUDA …
   5. …
   6. …

3. Moore's Law: transistor counts double roughly every 18 months (Gordon Moore, co-founder, Intel).
   Clock speeds, however, have stalled:
   - 2010, Clarkdale: i3 3.3 GHz / i5 3.6 GHz / i7 3.2 GHz
   - 2014, Haswell:   i3 2.4 GHz / i5 2.9 GHz / i7 3.3 GHz

4. Multicore Processor
              i7   i5   i3
   Cores       4    4    2
   Threads     8    4    4

5. A loop that cannot be parallelized: the Fibonacci sequence
   F(0) = 0, F(1) = 1, F(n) = F(n-1) + F(n-2)
   Each iteration of the loop depends on the results of the previous iterations
   (a loop-carried dependence), so the iterations cannot run in parallel.

6. Levels of parallelism: Core / Thread / Process / GPGPU

7. Core level: SIMD (Single Instruction, Multiple Data)

       int a = 1 + 2;
       int b = 3 + 4;
       int c = 5 + 6;
       int d = 7 + 8;

   SISD performs the four additions one at a time:
       a = 1 + 2;  b = 3 + 4;  c = 5 + 6;  d = 7 + 8
   SIMD performs all four in a single vector instruction:
       [a b c d] = [1 3 5 7] + [2 4 6 8]
   (operands move between memory and vector registers)

8. Thread level: OpenMP, pthreads, Parallel Patterns Library — loop and section parallelism

9. Process level: MPI, HPF, PVM

10. GPGPU (General-Purpose computing on GPUs)

11. GPGPU: Floating-Point Operations per Second for the CPU and GPU
    GTX 770: 3.2 TFLOPs vs. i7: 141 GFLOPs — about 22.6x

12. GPGPU: Memory Bandwidth for the CPU and GPU
    GTX 770: 224.3 GB/s vs. i7: 25.6 GB/s — about 8.8x

13. What is CUDA? (Compute Unified Device Architecture)
    Introduced in November 2006 alongside the GeForce 8800 GTX GPU.
    NVIDIA product lines:
    - GeForce: consumer graphics cards
    - Quadro: professional graphics
    - Tesla: dedicated compute (no graphics output)

14. CPU vs. GPU
    [Figure: "The GPU Devotes More Transistors to Data Processing"]

15. CUDA Hardware Architecture
    An SM (Streaming Multiprocessor) contains multiple CUDA Cores (Streaming Processors).

16. CUDA Data-Parallel Threading Model
    Grid:   the set of all Blocks launched by a kernel
    Block:  a group of threads assigned to one SM
    Warp:   the unit of threads an SM schedules together
    Thread: runs on a single CUDA Core

17. Structure of CUDA Memory
    Register        - on-chip processor memory; the fastest
    Local memory    - per-thread memory (actually resides in Global memory)
    Shared memory   - on-chip processor memory; shared by the threads on an SM; L1-cache speed
    Constant memory - read-only on the device; write (from DRAM): 400-600 cycles,
                      cached reads approach register speed
    Global memory   - the video card's DRAM; read/write: 400-600 cycles
    Texture memory  - a cached, read-only view of Global memory

18. CUDA Streaming
    Transfers between host DRAM and GPU DRAM are split into data chunks and
    overlapped with computation.

19. CUDA Libraries
    cuRAND: CUDA Random Number Generation library
    cuFFT:  CUDA Fast Fourier Transform library
    cuBLAS: CUDA Basic Linear Algebra Subroutines library

20. CUDA Programming
    Terminology: Host = CPU, Device = GPU; a Kernel is a function that runs on
    the device; Atomic = mutual-exclusion operations.

    CPU version:

        float fResult[1024][1000];
        float fData[1024][1000];

        for (int i = 0; i < 1024; i++) {
            for (int j = 0; j < 1000; j++) {
                for (int k = 0; k < 33; k++) {
                    fResult[i][j] += Calc(fData[i][j], k);
                }
            }
        }

    CUDA version (Calc becomes a __device__ function; one thread per element):

        // Kernel: executed on the device, launched from the host
        __global__ void KernelFunc(float* i_fResult, float* i_fData)
        {
            float fResult = 0.0f;
            int index = blockIdx.x * blockDim.x + threadIdx.x;
            float fData = i_fData[index];
            for (int k = 0; k < 33; k++) {
                fResult += Calc(fData, k);
            }
            i_fResult[index] = fResult;
        }

        int main()
        {
            float *dev_fResult, *dev_fData;
            int iSizeData = 1024 * 1000 * sizeof(float);

            cudaMalloc((void**)&dev_fResult, iSizeData);
            cudaMemset(dev_fResult, 0, iSizeData);
            cudaMalloc((void**)&dev_fData, iSizeData);
            cudaMemcpy(dev_fData, fData, iSizeData, cudaMemcpyHostToDevice);

            KernelFunc<<<1024, 1000>>>(dev_fResult, dev_fData);  // 1024 blocks x 1000 threads

            float* fResult = new float[1024 * 1000];
            cudaMemcpy(fResult, dev_fResult, iSizeData, cudaMemcpyDeviceToHost);

            cudaFree(dev_fResult);
            cudaFree(dev_fData);
        }

21. CUDA Compute Capability (version) — http://en.wikipedia.org/wiki/CUDA
    Example GPUs: 1.0: 8800 GTX / 1.1: 8400M GT / 1.2: GTS 350M / 1.3: GTX 280 /
    2.x: GTX 550 / 3.0: GTX 770 / 3.5: GTX TITAN / 5.0: GTX 750

    Technical specification                    1.0    1.1    1.2    1.3    2.x    3.0    3.5    5.0
    Max dimensionality of a grid of blocks       2      2      2      2      3      3      3      3
    Max x-, y-, or z-dimension of a grid     65535  65535  65535  65535  65535  2^31-1 2^31-1 2^31-1
    Max dimensionality of a thread block         3      3      3      3      3      3      3      3
    Max x- or y-dimension of a block           512    512    512    512   1024   1024   1024   1024
    Max z-dimension of a block                  64     64     64     64     64     64     64     64
    Max threads per block                      512    512    512    512   1024   1024   1024   1024
    Warp size                                   32     32     32     32     32     32     32     32
    Max resident blocks per SM                   8      8      8      8      8     16     16     32
    Max resident warps per SM                   24     24     32     32     48     64     64     64
    Max resident threads per SM                768    768   1024   1024   1536   2048   2048   2048
    32-bit registers per SM                    8 K    8 K   16 K   16 K   32 K   64 K   64 K   64 K
    Max 32-bit registers per thread            128    128    128    128     63     63    255    255
    Shared memory per SM                     16 KB  16 KB  16 KB  16 KB  48 KB  48 KB  48 KB  64 KB

22. CUDA 6.0: Unified Memory
    http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6

23. CUDA application example 1: Beam Pattern
    - 1000 x 1000 bins, 100 sensors
    - CPU: 32 s (OpenMP) vs. CUDA: 0.26 s — about 123x faster

24. CUDA application example 2: Beam Pattern
    - 192 x 1000 bins, 16 …
    - Data read/write: 192 x 1000 x sizeof(floatComplex) x 2 (LOFAR/DEMON)
    - CPU: 900 ms (OpenMP) vs. CUDA: 14 ms — about 64x faster

25. Kept on the CPU: Cubic Spline Interpolation
    Computing Xo(n), Xo(n+1), … from Xi(14), Xi(15), Xi(16), … is a sequential
    operation: each output depends on the previous one, so it parallelizes poorly.

26. Raspberry Pi
    - 700 MHz ARM11 CPU
    - Broadcom VideoCore IV GPU
    - 256 MB RAM
    - 10.13 GFLOPs (vs. i7: 141 GFLOPs; …: 107 TFLOPs)
    http://www.raspberrypi.org

27. Multi-GPU Motherboard
    http://prod.danawa.com/info/?pcode=2466508&cate1=861&cate2=875&cate3=968&cate4=0