Final Thesis

DESCRIPTION

GPGPU driven simulations of zero-temperature 1D Ising model with Glauber dynamics

Transcript

POLITECHNIKA WROCŁAWSKA
WYDZIAŁ INFORMATYKI I ZARZĄDZANIA

GPGPU driven simulations of zero-temperature 1D Ising model with Glauber dynamics

Daniel Kosalla

FINAL THESIS
under supervision of Dr inż. Dariusz Konieczny

Wrocław 2013

Acknowledgments: Dr inż. Dariusz Konieczny

Contents

1. Motivation
2. Target
3. Scope of work
4. Theoretical background and proposed model
   4.1. Ising model
   4.2. Historic methods
   4.3. Updating
   4.4. Simulations
   4.5. Distributions
   4.6. Updating
   4.7. Algorithm
5. General Purpose Graphic Processing Units
   5.1. History of General Purpose GPUs
   5.2. CUDA Architecture
6. CPU Simulations
   6.1. Sequential algorithm
   6.2. Random number generation on CPU
   6.3. CPU performance
7. GPU Simulations - thread per simulation
   7.1. Thread per simulation
   7.2. Running the simulation
   7.3. Solution space
   7.4. Random Number Generators
   7.5. Thread per simulation - static memory
   7.6. Comparison of static and dynamic memory use
8. GPU Simulations - thread per spin
   8.1. Thread per spin approach
   8.2. Concurrent execution
   8.3. Thread communication
   8.4. Race conditions with shared memory
   8.5. Thread per spin approach - reduction
   8.6. Thread per spin approach - flags
   8.7. Thread-per-spin performance
   8.8. Thread-per-spin vs thread-per-simulation performance
9. Bond density for some W0 values
10. Conclusions
11. Future work
Appendix
   A. Sequential algorithm - CPU
   B. Thread per simulation - no optimizations
   C. Thread per simulation - static memory
   D. Thread per spin - no optimizations
   E. Thread per spin - parallel reduction
   F. Thread per spin - update flag

1. Motivation

In the presence of recent developments in SCMs (Single Chain Magnets) [1-4], criticality in 1D Ising-like magnetic chains has turned out to be a promising field of study [5-8]. Some practical applications have already been suggested [2]. Unfortunately, the details of the general mechanism driving these changes in real materials are yet to be discovered. Traditionally, Monte Carlo simulations of the Ising model have been conducted on CPUs (Central Processing Units).
However, with the advent of powerful GPGPUs (General Purpose Graphics Processing Units), a new trend in scientific computing has emerged, enabling more detailed and even faster calculations.

2. Target

The following document describes the developed GPGPU applications capable of producing insight into the underlying physical problem, examines different approaches to conducting Monte Carlo simulations on a GPGPU, and compares the developed parallel GPGPU algorithms with a sequential CPU-based approach.

3. Scope of work

The scope of this work includes the development of 5 parallel GPGPU algorithms, namely:

- Thread-per-simulation algorithm
- Thread-per-simulation algorithm with static memory
- Thread-per-spin algorithm
- Thread-per-spin algorithm with flags
- Thread-per-spin algorithm with reduction

4. Theoretical background and proposed model

4.1. Ising model

Although initially proposed by Wilhelm Lenz, it was Ernst Ising [10] who developed a mathematical model of ferromagnetic phenomena. The Ising model is usually represented by means of a lattice of spins - discrete variables {-1, +1} representing the magnetic dipole moments of molecules in the material. The spins interact with their neighbours, which may cause a phase transition of the whole lattice.

4.2. Historic methods

A Monte Carlo (MC) simulation of the Ising model consists of a sequence of lattice updates. Traditionally, either all spins (synchronous updating) or a single spin (sequential updating) are updated in each iteration, producing the lattice state for future iterations. The update methods are based on so-called dynamics that describe the spin interactions.

4.3. Updating

The idea of a partially synchronous updating scheme has been suggested [5-7]. This c-synchronous mode has a fixed parameter describing the number of spins updated in one time step. However, one can imagine that the number of updated spins/molecules (often referred to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation progresses. If so, it is either linked to some characteristic of the system or may be expressed by some probability distribution (described in subsection 4.5). This approach of changing the c parameter can be applied while choosing spins randomly as well as in clusters (subsection 4.6), but only the latter is considered in this document.

4.4. Simulations

In the proposed model, cL sequential updating is used, with c drawn from the provided distribution. The considered environment consists of a one-dimensional array of L spins s_i = ±1. The index of each spin is denoted by i = 1, 2, ..., L. Periodic boundary conditions are assumed, i.e. s_{L+1} = s_1. It has been shown in [8] that the system under synchronous Glauber dynamics reaches one of two absorbing states - ferromagnetic or antiferromagnetic. Therefore, let us introduce the density of bonds (ρ) as an order parameter:

    \rho = \frac{1}{2L} \sum_{i=1}^{L} (1 - s_i s_{i+1})    (4.1)

As stated in [8], phase transitions in the synchronous updating mode and in the c-sequential mode [7] ought to be rather continuous (in cases other than c = 1 for the latter). A smooth phase transition can be observed in Figure 4.1.

Figure 4.1. The average density of active bonds in the stationary state ⟨ρ_st⟩ as a function of W0 for c = 0.9 and several lattice sizes L. Reproduced from [7]: B. Skorupa, K. Sznajd-Weron, and R. Topolnicki, "Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates", Phys. Rev. E 86, 051113 (2012).

The system is considered at low temperature (T), and therefore T = 0 can be assumed.
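To make the order parameter (4.1) concrete, the following is a minimal host-side sketch (not code from the thesis; the function name bond_density is illustrative) of how ρ can be computed for a chain of L spins stored as +1/-1 integers with periodic boundary conditions:

    #include <stdio.h>

    /* Density of active bonds (eq. 4.1): rho = sum_i (1 - s_i * s_{i+1}) / (2L),
     * with periodic boundary conditions s_{L+1} = s_1. */
    double bond_density(const int *s, int L)
    {
        int active = 0;                          /* bonds with s_i != s_{i+1} */
        for (int i = 0; i < L; ++i) {
            int right = s[(i + 1) % L];          /* wrap-around: s_{L+1} = s_1 */
            active += (1 - s[i] * right) / 2;    /* 0 for parallel, 1 for antiparallel */
        }
        return (double)active / L;               /* equals sum (1 - s_i s_{i+1}) / (2L) */
    }

    int main(void)
    {
        int chain[6] = { 1, 1, -1, -1, 1, -1 };  /* small example chain */
        printf("rho = %f\n", bond_density(chain, 6));
        return 0;
    }

In the fully ferromagnetic state this returns ρ = 0, and in the antiferromagnetic state ρ = 1, matching the two absorbing states described above.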
The Metropolis algorithm can be considered a special case of zero-temperature Glauber dynamics for spin-1/2 systems. Each spin is flipped (s_i → -s_i) with rate W(ΔE) per unit time. For T = 0:

    W(\Delta E) = \begin{cases} 1 & \text{if } \Delta E < 0, \\ W_0 & \text{if } \Delta E = 0, \\ 0 & \text{if } \Delta E > 0 \end{cases}    (4.3)

In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. the Glauber rate W0 = 1/2 or the Metropolis rate W0 = 1) is assumed to be constant. One can imagine that even the W0 parameter could in fact be changed during the simulation process, but that is outside the scope of the proposed model. The system starts in the fully ferromagnetic state (ρ = ρ_f = 0). After each time step the changes are applied to the system and the next time step is evaluated. After a predetermined number of time steps the state of the system is investigated. If the chain has reached the antiferromagnetic state (ρ = ρ_af = 1), or a sufficiently large number of time steps has been inconclusive, the whole simulation is shut down.

4.5. Distributions

During the simulation c is not fixed in time but changes within [0, 1] according to the triangular continuous probability distribution [9] presented in Figure 4.2. While studying different initial conditions for the simulations, the distributions are adjusted so as to provide peak values anywhere in the range [0, 1]. This is because a peak value of 0.5 (as presented in the plot) means that in each time step half of the spins get updated.

Figure 4.2. c can take any value in the interval [0, 1] but is most likely to lie around c = 1/2; other values are possible, but their probability decreases with distance from c = 1/2.

4.6. Updating

The following algorithms make use of the triangular probability distribution to assign an appropriate c value before each time step. After (on average) L updated spins, one Monte Carlo step (MCS) can be distinguished.

4.7. Algorithm

Transforming the above-mentioned rules into a set of instructions yields the following description (pseudocode below). Update cL consecutive spins starting from a randomly chosen one; each change is saved to a new array rather than the old one. After each Stop the updated spins are saved and a new time step can be started.

1. Assign a c value drawn from the given distribution.
2. Choose a random value of i ∈ [0, L].
3. Set max = i + cL.
4. Let s_i be the i-th spin. If s_{i-1} = s_{i+1}, set s'_i = s_{i-1} = s_{i+1}; otherwise flip s_i with probability W0.
5. If i < max, set i = i + 1 and go to step 4.
6. Stop.
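A minimal CPU sketch of one such time step is given below. It is not the author's implementation (that is listed in Appendix A); the helper names triangular_c and time_step are illustrative, a symmetric triangular distribution with mode 1/2 is assumed, and spins are 0-indexed rather than 1-indexed:

    #include <stdlib.h>
    #include <math.h>

    /* Symmetric triangular distribution on [0,1] with mode 0.5,
     * sampled by inverting its cumulative distribution function. */
    double triangular_c(void)
    {
        double u = (double)rand() / RAND_MAX;
        return (u < 0.5) ? sqrt(u / 2.0) : 1.0 - sqrt((1.0 - u) / 2.0);
    }

    /* One time step of the cL-sequential zero-temperature Glauber update
     * (steps 1-6 above). Spins are +1/-1 with periodic boundary conditions.
     * Neighbours are read from s_old; updates are written to s_new. */
    void time_step(const int *s_old, int *s_new, int L, double W0)
    {
        for (int k = 0; k < L; ++k) s_new[k] = s_old[k];   /* copy old state */

        double c  = triangular_c();            /* step 1: draw c */
        int i     = rand() % L;                /* step 2: random start index */
        int count = (int)(c * L);              /* step 3: update cL spins */

        for (int k = 0; k < count; ++k) {      /* steps 4-5 */
            int idx   = (i + k) % L;
            int left  = s_old[(idx - 1 + L) % L];
            int right = s_old[(idx + 1) % L];
            if (left == right)
                s_new[idx] = left;             /* align with equal neighbours */
            else if ((double)rand() / RAND_MAX < W0)
                s_new[idx] = -s_old[idx];      /* dE = 0: flip with probability W0 */
        }
    }

Repeated calls to time_step, with the roles of s_old and s_new swapped after each step, run the simulation until ρ reaches 1 or the step limit is exceeded.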
5. General Purpose Graphic Processing Units

5.1. History of General Purpose GPUs

Traditionally, the GPU in a desktop computer is a highly specialized electronic circuit designed to robustly handle 2D and 3D graphics. In 1992 Silicon Graphics released the OpenGL library, meant as a standardised, platform-independent interface for writing 3D graphics. By the mid 1990s an increasing demand for 3D applications appeared in the consumer market. It was NVIDIA who developed the GeForce 256 and branded it as the world's first GPU (http://www.nvidia.com/page/geforce256.html). The GeForce 256, although one of many graphics accelerators, was the one that marked a very rapid advance in the field, incorporating features such as transform and lighting computations directly on the graphics processor. The release of GPUs capable of handling programmable pipelines attracted researchers to explore the possibility of using graphics processors outside their original use.

Although early GPUs of the early 2000s were programmable only in a way that allowed pixel manipulation, researchers noticed that these manipulations could actually represent any kind of operations, and pixels could virtually represent any kind of data. In late 2006 NVIDIA revealed the GeForce 8800 GTX, the first GPU built with the CUDA Architecture. The CUDA Architecture enables the programmer to use every arithmetic logic unit (ALU) on the GPU (as opposed to the early days of GPGPU, when access to ALUs was granted only via the restricted and complicated interfaces of OpenGL and DirectX). The new family of GPUs started with the 8800 GTX was built with IEEE-compliant ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were equipped not only with an extended set of instructions usable for general purpose computing but also allowed arbitrary read and write operations to device memory. A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took standard C extended with some additional keywords and transformed it into fully featured GPU code capable of general purpose processing. It is important to stress that currently used CUDA C is by far easier to use than OpenGL/DirectX: programmers do not have to disguise their data as graphics and can use industry-standard C or even other languages such as C#, Java or Python (via appropriate bindings). CUDA is now used in various fields of science, ranging from medical imaging and fluid dynamics to environmental science, offering enormous, several-orders-of-magnitude speed-ups (http://www.nvidia.com/object/cuda-apps-flash-new-changed.html). GPUs are not only faster than CPUs in terms of computed data per unit time (e.g. FLOPS - Floating Point Operations Per Second) but also in terms of power and cost efficiency.

5.2. CUDA Architecture

The underlying architecture of CUDA is driven by design decisions connected with the GPU's primary purpose, that is, graphics processing. Graphics processing is usually a highly parallel process; therefore, the GPU also works in a parallel fashion. An important distinction can be made between the logical and the physical layer of the GPU architecture. The programmer decomposes a computational problem into atomic processes (threads) that can be executed simultaneously. This partition usually results in the creation of hundreds, thousands or even millions of threads. For the programmer's convenience, threads can be organized into blocks, which in turn are part of grids. Both blocks and grids are 3-dimensional structures. These spatial dimensions are introduced for easier problem decomposition: as mentioned before, the GPU is meant for graphics processing, which usually involves 2D or 3D sets of data. This grouping is associated not only with the logical decomposition of problems but also with the physical structure of the GPU. The basic unit of execution on the GPU is the warp. A warp consists of 32 threads, each belonging to the same block; if the block is bigger than the warp size, its threads are divided between several warps. Warps are executed on execution units called Streaming Multiprocessors (SMs). Each SM executes several warps (not necessarily from the same block). Physically, each SM consists of 8 streaming processors (SPs, CUDA cores) and 32 basic ALUs; the 8 SPs spend 4 clock cycles executing the same processor instruction, enabling the 32 threads in a warp to execute in parallel.
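As a minimal illustration of this logical hierarchy (an illustrative sketch, not code from the thesis), the following CUDA C fragment launches one thread per array element and computes each thread's global index from its block and thread coordinates:

    #include <cstdio>

    // Each thread derives a global index from its block/thread coordinates and
    // processes one array element; threads that fall past N simply return.
    __global__ void scaleKernel(float *data, int N, float factor)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < N)
            data[idx] *= factor;
    }

    int main()
    {
        const int N = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, N * sizeof(float));
        cudaMemset(d_data, 0, N * sizeof(float));     // initialize device memory

        // 256 threads per block; enough blocks to cover all N elements.
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, N, 2.0f);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

The launch configuration <<<blocks, threadsPerBlock>>> defines the grid and block sizes; the hardware then schedules the resulting threads onto SMs in warps of 32, as described above.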
Each of the threads in a warp can (and usually does) have different data supplied to it, forming what is known as a SIMD (Single Instruction, Multiple Data) architecture.

Figure 5.1. Grid of thread blocks (http://docs.nvidia.com/cuda/cuda-c-programming-guide/)

CUDA also provides a rich memory hierarchy available to every thread, and each of the memory spaces has its own characteristics. The fastest and smallest memory is the per-thread local, register-based memory; it is out of reach of the CUDA programmer and is used automatically. Each thread in a block can make use of shared memory. This memory can be accessed by the different threads in a block and is usually the main medium of inter-thread communication. The slowest memory spaces (but available to every thread) are called global, constant and texture memory respectively; each of them has a different size and purpose, but they are all persistent across kernel launches by the same application.

Figure 5.2. CUDA memory hierarchy (http://docs.nvidia.com/cuda/cuda-c-programming-guide/)
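To illustrate how shared memory acts as the medium of inter-thread communication within a block, the following sketch (not code from the thesis; kernel and variable names are illustrative) sums array elements with a block-wide parallel reduction in shared memory, a pattern similar to the one the thread-per-spin reduction variant in section 8.5 builds on:

    #include <cstdio>

    // Block-wide sum reduction in shared memory: each thread loads one element,
    // then partial sums are combined in log2(blockDim.x) steps.
    // __syncthreads() keeps all threads of the block in step between phases.
    __global__ void blockSum(const int *in, int *out, int n)
    {
        extern __shared__ int partial[];          // one slot per thread in the block

        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;
        partial[tid] = (idx < n) ? in[idx] : 0;   // load into shared memory
        __syncthreads();

        // Tree reduction: the stride halves on each iteration.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = partial[0];         // one partial sum per block
    }

Launched as blockSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(d_in, d_out, n) with a power-of-two block size, the kernel leaves one partial sum per block in global memory, which can then be combined on the host or by a second kernel launch.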