Artigo Fpga vs Cuda

  • Published on

  • View

  • Download

Embed Size (px)




<ul><li><p>Image Convolution Processing: a GPU versus FPGA Comparison </p><p>Lucas M. Russo, Emerson C. Pedrino, Edilson Kato Federal University of Sao Carlos - DC </p><p>Rodovia Washington Lus, km 235 - SP-310 13565-905 So Carlos - So Paulo - Brazil </p><p>; emerson, </p><p>Valentin Obac Roda Federal University of Rio Grande do Norte - DEE </p><p>Campus Universitrio Lagoa Nova 59072-970 Natal Rio Grande do Norte Brazil </p><p> </p><p>AbstractConvolution is one of the most important operators used in image processing. With the constant need to increase the performance in high-end applications and the rise and popularity of parallel architectures, such as GPUs and the ones implemented in FPGAs, comes the necessity to compare these architectures in order to determine which of them performs better and in what scenario. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were also implemented in MATLAB, using predefined operations and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and commented in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to FPGA. </p><p>Keywords- Image processing; Convolution; GPU; CUDA; FPGA </p><p>I. INTRODUCTION In 2006 Nvidia Corporation announced a new general </p><p>purpose parallel computing architecture based on the GPGPU paradigm (General-Purpose Computing on Graphics Processing Units): CUDA (Compute Unified Device Architecture) [1]. CUDA is an architecture classified as GPGPU, and it is a category of the SPMD (single process, multiple data; or single program, multiple data) parallel programming, the model is based on the execution of the same program by different processors, supplied with different input data, without the strict coordination requirement among them that the SIMD (single instruction, multiple data) model imposes. As a central point to the model are the so called kernels: C-style functions that are parallel executed through multiple threads and, when called from the application, dynamically allocate a hierarchy processing structure specified by the user. Interchangeably with the execution of the kernels, portions of sequential code are usually inserted in a CUDA program flow. For this reason, it constitutes a heterogeneous programming model. </p><p>The CUDA model was conceived to implement the so called transparent scalability effectively, i.e., the ability of the programming model to adapt itself in the available hardware in such a way that more processors can be scalable without altering the algorithm and, at the same time, reduce the development time of parallel or heterogeneous solutions. All aforementioned model abstractions are particularly suitable and </p><p>easily adapted to the field of digital image processing, given that many applications in this area operate in independent pixel by pixel or pixel window approach. </p><p>Many years before the advent of the CUDA architecture, Xilinx in 1985 made available to the market the first FPGA chip [2]. The FPGA is basically, a highly customizable integrated chip that has been used in a variety of science fields, such as: digital signal processing, voice recognition, bioinformatics, computer vision, digital image processing and other applications that require high performance: real time systems and high performance computing. </p><p>The comparison between CUDA and FPGA has been documented in various works in different applications domains. Asano et al [3] compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al [4] compared their use in three applications algorithms: Gaussian Elimination, Data Encryption Standard (DES), and Needleman-Wunsch; Kestur et al [5] developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al [6] analyzed the performance of integer and floating-point algorithms and Weber et al [7] compared the architectures using a Quantum Monte Carlo Application. </p><p>In this work, CUDA and a FPGA dedicated architecture will be used and compared on the implementation of the convolution, an operation often used for image processing. </p><p>II. METHODOLOGY All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) </p><p>execution times were obtained from the following configuration: </p><p>Component Description </p><p>Hardware </p><p>Processor: Intel Core i5 750 (8MB cache L2), Motherboard: ASUS P7P55DE-PRO; RAM Memory: 2 x 2 GB Corsair (DDR2-800); Graphics Board: XFX Nvidia GTX 295, 896MB </p><p>Software Windows 7 Professional 64-bit; Visual Studio 2008 SP1 </p><p>Drivers Nvidia driver video version: 190.38; Nvidia CUDA toolkit version: 2.3 </p><p>FPGA </p><p>Cyclone II EP2C35F672 on Terasic DE2 board; Quartus II 10.1 Software with SOPC Builder, NIOS II EDS 10.1 and ModelSim 6.6d Simulation Tool, for the implementation of the algorithms. </p><p>Sponsors: FAPESP grants number 2010/04675-4 and 2009/17736-4; DC/UFSCAR; DEE UFRN </p><p>978-1-4673-0186-2/12/$31.00 2012 IEEE </p></li><li><p>The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. In order to obtain that, different approaches were used according to the architecture profiled. </p><p>On C, the Performance Counters were used through the functions: QueryPerformanceCounter() and QueryPerformance Frequency(). The former is used to extract the value of the counter until the function call. </p><p>On CUDA, the Event Management provides functionality to create, destroy and record an event. Hence, it is possible to measure the amount of time it took to execute a specific part of code, such as a kernel call, in the manner described in [1]. Concerning the clock cycles, the clock() function was used within the kernel to obtain the measure. </p><p>On Matlab, a simple approach is provided through the usage of a built-in stopwatch. It is possible to control it with the tic and toc syntax. The first starts the timer and the second stops it, displaying the time, in seconds, to execute the statements between tic and toc. The Matlab number of clock cycles was not measured since it was not found a simple way to do it. </p><p>At last, on the FPGA, it is possible to infer the execution time directly from the architecture implemented on it. With the knowledge of the clock rate, explicitly defined by the designer, and the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used: execution time = number of clock cycles/clock frequency</p><p> (1) </p><p>III. CONVOLUTION Mathematically, convolution can be expressed as a linear </p><p>combination or sum of products of the mask coefficients with the input function. </p><p>(2) </p><p>Where f denotes the input function and w the mask. It is implicit that equation (2) is applied for every point in the input function. </p><p>It is possible to extend the convolution operation to a 2-D dimension as follows: </p><p>(3) </p><p> There is, in convolution, a limitation in what refers to the boundaries of an input image, since the mask is positioned in such way that there are mask values which do not overlap with the input image. Thus, two approaches are commonly used in the context of image processing: padding the edges of the input image with zeros or clamping the edges of the input </p><p>image with the closest border pixel. In this work the first choice is used as in GONZALES [8]. </p><p>Considering an image of size of MxN pixels, a mask of size SxT the multiplication is the more costly operation. Hence, (MN)(ST) operations are performed and, consequently, the algorithm belongs to O(MNST). </p><p>A mask w(x,y) can be decomposed in w1(x) and w2(y) in such a way that w(x,y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2D convolution can be performed as two 1D convolutions. In this way, it is said that the convolution is separable and the algorithmic complexity decays allowing for a more flexible implementation. Hence, the separable convolution formula can be expressed as in equation 4. </p><p> (4) </p><p>IV. IMPLEMENTATION The separable convolution was implemented in C, CUDA </p><p>and Matlab (built-in function) and the regular convolution was implemented in FPGA [Eq. 3]. The reason to implement the regular convolution in FPGA was due to performance limitations. The separable algorithm, although reducing the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, one for lines and one for columns. Consequently only the column filter itself would take as much time as the regular convolution to process the entire image. The reason for that is due the time required to fill the shift register and the streaming interface, which can transmit only one pixel at clock cycle. </p><p>A. C Implementation The C implementation of convolution was based in [Eq. 4] </p><p>and it is fairly straightforward. Follows the sequential separable algorithm implemented. </p><p> The image was first loaded to memory with the OpenCV C </p><p>library. Later, for each input pixel, the column convolution (with mask w2 and size equal to 2*b+1) was applied to it. </p><p>B. Matlab Implementation For Matlab, the conv2() built-in function was used to </p><p>perform the convolution. </p><p>/* Line Convolution*/ for i 0 to number of lines-1 for j 0 to number of columns -1 g(i, j) 0 for l b to -b if j-l &gt;= 0 and j-l &lt; number of columns g(i, j) g(i, j) + f(i, j - l)*w2(b+l) end-if end-for end-for end-for </p><p>/* Column Convolution*/ for i 0 to number of lines-1 for j 0 to number of columns-1 o(i, j) 0 for k a to a if i-k &gt;= 0 and i-k &lt; number of lines o(i, j) o(i, j) + g(i-k, j)*w1(a+k) end-if end-for end-for end-for </p></li><li><p>C. CO</p><p>diffekernkernalgoreal </p><p> Line</p><p>Tor 4 2-D imagpixedoinreduguar</p><p>F(4x4dispmapwayregioS0,0+concloadload</p><p>Fig</p><p>Asyncare gIn o__sycorre</p><p>Icalcuthe o(Fig</p><p>CimagNUMROWNUMfetch</p><p>CUDA ImplemOn CUDA, therent kernels: nel and the secnel. The develrithm of Podimages as inp</p><p>e Convolution</p><p>The threads wlines by 16 cgrid with siz</p><p>ge. Each threals from the in</p><p>ng this, the uced, as long ranteed. </p><p>Figs. 1 and 2 4) instead of laying purposped to the fir, the thread0on), S0,0+block_s</p><p>+5*block_size (rigcerning the acded for each lded to shared m</p><p>gure 1. Example o</p><p>After the loadchronize their going to acceorder to do yncthreads() ectly. </p><p>In the final sulate four outpones that the . 2). </p><p>Concerning thges are notM_LOADS_PW_X is theM_LOADS_Phed from the </p><p>mentation he algorithm the first part i</p><p>cond one is imlopment of thedlozhnyuk [9]puts. </p><p>n Kernel </p><p>were grouped columns and, ze depending ad in the blocnput image toaccess to thas the mem</p><p>illustrate the f its real dimses. The first lrst line of th,0 is mapped </p><p>size, S0,0+2*block_ght apron rectual block sizine) x 4 (num</p><p>memory. </p><p>of a 2-D block of column </p><p>ding stage, alexecution, si</p><p>ess elements tthat, a call is issued an</p><p>stage, each tput pixels, whthread were </p><p>he flexibility t multiple </p><p>PER_THREADe number oPER_THREAD</p><p>main region. </p><p>is implemenis implemente</p><p>mplemented the convolution, extended to</p><p>in 2-D blocksin turn, they won the dimen</p><p>ck is responsibo per-block shhe Global De</p><p>mory coalescin</p><p>general idea mensions (4 line of the 2-De input imageto the pixel</p><p>_size, S0,0+3*blockgion) and s</p><p>ze, 384, 16x6 mber of block</p><p>size 16 (i.e., 16 tkernels. </p><p>ll threads witince, in the nhat were loadto the CUD</p><p>nd the progr</p><p>thread is assihich are in themapped to in</p><p>of the convolof (BLOCKD), in whichof columns D is the nIn order to s</p><p>nted through ed through thehrough the coln was based ono support any</p><p>s with size (4were grouped</p><p>nsions of the ible for fetchinhared memoryevice Memorng restrictions</p><p>with block sizx 16) for im</p><p>D block (Fig. e (Fig. 2). Inls S0,0 (left ak_size, S0,0+4*bloco on. Moreo(number of p</p><p>k lines), pixel</p><p> threads) for line a</p><p>thin a blocknext stage, thrded by other oDA API funram can pro</p><p>igned the tase same positionn the main re</p><p>lution filter, sK_SIZE_ROWh BLOCK_SI</p><p>per block number of psolve this, the</p><p>two e line lumn n the </p><p>y size </p><p>4x16) d in a input </p><p>ng six y. By ry is s are </p><p>ze of mage 1) is </p><p>n this apron ck_size, over, </p><p>pixels ls are </p><p>and </p><p>must reads ones. </p><p>nction oceed </p><p>sk to ns as egion </p><p>some W_X* IZE_ </p><p>and pixels e line </p><p>kernbegi</p><p>Blast prevexacconsAddautoequa Colu</p><p>Twiththe respshar</p><p>conswarp8 linpixeaproreus</p><p>Inlaun(BLAD)of linumfollo</p><p>D. FF</p><p>on Fand </p><p>TA gMeminteg</p><p>SNIOwrapin [laye</p><p>nel is launcheinning of the i</p><p>rowOffset = w </p><p>By doing thiscolumn will </p><p>viously calculctly which wstruct the rowditionally, thomatically saal four and BL</p><p>umn ConvoluThe threads, ih size 8x16 orgrid size dep</p><p>ponsible for fred memory. </p><p>The decision sidering the </p><p>rp access to cones in the bloels loaded to son pixels/numse, since they w</p><p>n the same wnched againLOCK_SIZE_), in which Bines per block</p><p>mber of fetcheowing offset w</p><p>columnOffse height BL NUM_L</p><p>FPGA ImplemFor the FPGAFig. 3 was devVerilog HDL</p><p>The architecturayscale or bimory, is convger values bet</p><p>Such conversiOS II Fast Copper function[12]. Thereforer and is </p><p>ed again withimage: </p><p>width BLOC NUM_LOA</p><p>, every columbe calculated</p><p>ated. Hence, were the last wOffset, increahe memory tisfied for N</p><p>LOCK_SIZE_</p><p>ution Kernel in the column r 8 lines by 16pends on the ifetching six p</p><p>to 16 columnmemory coalontiguous memock size tend shared memor</p><p>mber of outpuwill be loaded</p><p>way as the linn for im</p><p>_COLUMN_YBLOCK_SIZEk and NUM_Ls per thread f</p><p>was used: </p><p>t=LOCK_SIZE_LOADS_PER_</p><p>mentation A implementaveloped with </p><p>L coding. </p><p>ure is responsiinary image inerted to RAWtween 0 (i.e., b</p><p>ion was perfoore [10], the n for decomprre, this const</p><p>controlled </p><p>h the followin</p><p>CK_SIZE_ROADS_PER_TH</p><p>mn between rd, even if soit is not nececalculated coasing the vericoalescing </p><p>NUM_LOADS_ROW_X equ</p><p>n filter, are div6 columns. Asinput image apixels from th</p><p>ns in this blolescing requirmory positionto reduce thery, reduce the</p><p>ut pixels) andd to others sha</p><p>ne kernel, themages notY*NUM_LOAE_COLUMN_LOADS_PERfrom the imag</p><p>_COLUMN_Y_THREAD </p><p>ation, the arcthe assistance</p><p>ible for the fon JPEG forma</p><p>W format, thatblack) and 25</p><p>ormed by the C library libj</p><p>ressing JPEGtitutes the ap</p><p>by the N</p><p>ng offset from</p><p>OW_X *HREAD</p><p>(5</p><p>rowOffset andome of them ssary to deterolumn in ordification overhrequeriments S_PER_THR</p><p>ual 16. </p><p>vided in 2-D bs in the line keand each threhe input imag</p><p>ock size was mrements (i.e.,s). Conversely</p><p>e number of ae ratio (numb</p><p>d increase meared memories</p><p>e column k...</p></li></ul>