22nd International Conference on Field Programmable Logic and Applications (FPL 2012), Oslo, Norway, 29-31 August 2012


FPGA BASED ACCELERATION OF COMPUTATIONAL FLUID FLOW SIMULATION ON UNSTRUCTURED MESH GEOMETRY

Zoltan Nagy1,2, Csaba Nemes2, Antal Hiba2, Andras Kiss1,2, Arpad Csik3, Peter Szolgay1,2

1 Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, H-1111 Budapest, Hungary
  email: nagyz@sztaki.hu, kissa@sztaki.hu, szolgay@sztaki.hu
2 Faculty of Information Technology, Pazmany Peter Catholic University
  email: nemcs@digitus.itk.ppke.hu, hiban@digitus.itk.ppke.hu
3 Department of Mathematics and Computer Science, Szechenyi Istvan University
  email: dr.arpad.csik@gmail.com

ABSTRACT

Numerical simulation of complex computational fluid dynamics problems evolving in time plays an important role in scientific and engineering applications. Accurate behavior of dynamical systems can be understood using large scale simulations, which traditionally require expensive supercomputing facilities. In this paper a Field Programmable Gate Array (FPGA) based framework is described to accelerate the simulation of complex physical spatio-temporal phenomena. Simulating complicated geometries requires unstructured spatial discretization, which results in irregular memory access patterns that severely limit computing performance. Data locality is improved by a mesh node renumbering technique which results in a sequential memory access pattern. Additionally, storing a small window of cell-centered state values in the on-chip memory of the FPGA increases data reuse and decreases memory bandwidth requirements. Generating the floating-point data path and control structure of an arithmetic unit containing dozens of operators is a very challenging task when the goal is a high operating frequency. The efficiency and use of the framework are demonstrated by a case study solving the Euler equations on an unstructured mesh using the finite volume technique. On the currently available largest FPGA the generated architecture contains three processing elements working in parallel, providing a 75 times speedup compared to a high performance microprocessor.

This research project was supported by the Janos Bolyai Research Scholarship of the Hungarian Academy of Sciences and OTKA Grant No. K84267, and in part by OTKA Grant No. K68322. The support of the grants TAMOP-4.2.1.B-11/2/KMR-2011-0002 and TAMOP-4.2.2/B-10/1-2010-0014 is gratefully acknowledged.

1. INTRODUCTION

The number of transistors integrated into a single integrated circuit doubles every 18 months according to Moore's Law [1]. On recent devices around 4 billion transistors can be implemented. Since the early 2000s the operating frequency of conventional high-performance microprocessors has stagnated in the 3-4 GHz range due to power consumption limitations. As the number of transistors has increased significantly, the main question is how to use them efficiently to create high performance computing systems. One possible solution is to integrate several conventional processors into the same chip, creating a multi-core processor. Another way is to use the general purpose computing power of Graphics Processing Units (GPUs), where hundreds of simpler processing elements can work in parallel. A more efficient way to increase computing power might be to design an application specific accelerator and implement it on a Field Programmable Gate Array (FPGA). Due to the increased transistor density, around one hundred double precision floating-point adders and multipliers can be implemented on recent devices. Several previous studies proved the efficiency of field programmable logic devices in the numerical simulation of various physical phenomena such as electromagnetic [2], transient wave [3][4] and computational fluid dynamics [5][6] simulations.

The most obvious way to solve complex spatio-temporal problems is numerical approximation over a regular mesh structure. Practical applications usually contain complex boundaries, which can be handled more efficiently by unstructured meshes. The resolution of the mesh should be increased near rapid changes in the dynamics or shape of the problem. Unfortunately, conventional microprocessors reach only around 10% utilization during unstructured mesh computations [7] due to the irregular memory access pattern of grid data.

To accelerate the computation, irregular memory access patterns should be hidden by temporarily storing the relevant grid points in the on-chip memory of the FPGA. Mesh points have to be properly ordered to minimize the size of this temporary memory. This optimization task is equivalent to the Matrix Bandwidth Minimization (MBM) problem when the mesh is treated as an undirected graph and the bandwidth of the adjacency matrix of the graph is minimized, as sketched below.
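The paper's own renumbering heuristic is not spelled out in this section, so the following sketch only illustrates the idea with the classic reverse Cuthill-McKee ordering from SciPy: the mesh is treated as a graph, its adjacency matrix is permuted, and the reduced bandwidth bounds the window of cells that must be kept resident on-chip. The mesh and all names are illustrative assumptions, not the authors' tool.

```python
# Illustrative sketch only (not the authors' renumbering tool): reduce the
# adjacency-matrix bandwidth of an unstructured mesh with reverse
# Cuthill-McKee, a standard heuristic for the MBM problem described above.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(adj):
    """Half-bandwidth of a sparse symmetric adjacency matrix."""
    rows, cols = adj.nonzero()
    return int(np.max(np.abs(rows - cols))) if rows.size else 0

# Toy cell-connectivity graph: one node per triangle, one edge per shared face.
edges = [(0, 5), (0, 2), (2, 5), (2, 7), (5, 7), (7, 1), (1, 3), (3, 6), (6, 4)]
n = 8
ij = np.array(edges + [(j, i) for i, j in edges])
adj = csr_matrix((np.ones(len(ij)), (ij[:, 0], ij[:, 1])), shape=(n, n))

perm = reverse_cuthill_mckee(adj, symmetric_mode=True)  # new cell numbering
adj_renumbered = adj[perm, :][:, perm]

print("bandwidth before:", bandwidth(adj))            # size of on-chip window
print("bandwidth after :", bandwidth(adj_renumbered))
```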

When an explicit finite volume method is used during the solution of a PDE, a complex mathematical expression must be computed on the neighborhood of each node. The complexity of the resulting arithmetic unit is determined by the governing equations and the discretization method used. Usually the arithmetic unit is constructed from dozens of floating-point units, which makes manual design very tedious and error-prone. Therefore an automatic tool was developed to generate the arithmetic unit from the discretized governing equations [8].

2. RELATED WORK

Several papers were published in the early 2000s [9, 10] dealing with the acceleration of Partial Differential Equations (PDE) on unstructured meshes. Most of them focused on accelerating Finite Element Methods (FEM), where the global stiffness matrix is given and the resulting large linear system of equations is solved, usually by the iterative Conjugate Gradient (CG) method. The most time consuming operation of this method is a sparse matrix-vector multiplication, therefore most of the papers try to accelerate this part. Though our architecture is designed for explicit unstructured finite volume calculations, examination of these architectures is helpful because similar problems arise in our case too. For example, the adjacency matrix of the mesh is sparse and elements of the state vector are accessed in a random order.
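To make the bottleneck concrete, the kernel these FEM accelerators target is an ordinary CSR sparse matrix-vector product. The toy sketch below (plain Python, purely illustrative) shows why its arithmetic intensity is low: every stored nonzero contributes one multiply-add but requires a value, a column index and an indirect load from the vector.

```python
# Minimal CSR sparse matrix-vector product, the kernel the surveyed FEM
# accelerators spend most of their time in.  One multiply-add per nonzero,
# but each nonzero needs a value, a column index and a random access into x,
# which keeps the FLOPs-per-loaded-byte ratio small.
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # irregular access into x
    return y

# 3x3 example matrix [[4, 1, 0], [1, 3, 0], [0, 0, 2]] in CSR form.
values  = np.array([4.0, 1.0, 1.0, 3.0, 2.0])
col_idx = np.array([0, 1, 0, 1, 2])
row_ptr = np.array([0, 2, 4, 5])
print(csr_spmv(values, col_idx, row_ptr, np.array([1.0, 2.0, 3.0])))  # [6. 7. 6.]
```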

Elkurdi et al. [11] proposed a process where the finite element sparse matrix was reorganized into a banded form in the first step. Then the matrix-vector multiplication was calculated along special pipelineable diagonal stripes, where two successive elements could be processed by a pipelined architecture. Performance of the architecture was determined by the available memory bandwidth and the sparsity of the matrix; however, utilization of the processing elements varied over a very wide range, between 17.74% and 86.24%.

Recently Nagar et al. [12] proposed an architecture using an optimized Streaming Multiply-Accumulator with separate cache memories for matrix and vector data. The implementation platform they used has a special memory architecture providing a high 20 GB/s peak memory bandwidth. Performance of the system with four FPGAs is in the 1.17 to 3.94 GFLOPS range, outperforming a Tesla 1070 GPU. However, utilization of the PEs is around 50%, similarly to the previous architectures, and increasing the number of PEs to occupy the entire FPGA still runs into a memory bandwidth limit.

The surveyed architectures provide general solutions to accelerate FEM calculations on FPGAs, but they suffer from the inherently high memory bandwidth requirement and the small FLOPs per loaded byte ratio of sparse matrix-vector multiplication. On the other hand, utilization of the execution units depends on the structure of the sparse matrix.

The irregular memory access pattern and high memory bandwidth requirement can be eliminated by storing a small part of the grid on-chip and reordering its nodes. The right hand side of the discretized equations must be computed for each volume in every timestep, which requires several floating-point operations, resulting in a better FLOPs per loaded byte ratio and higher utilization of FPGA resources.

3. FLUID FLOWS

A wide range of industrial processes and scientific phenomena involve gas or fluid flows over complex obstacles, e.g. air flow around vehicles and buildings, the flow of water in the oceans or of liquid in BioMEMS. In such applications the temporal evolution of non-ideal, compressible fluids is quite often modelled by the system of Navier-Stokes equations. The model is based on the fundamental laws of mass, momentum and energy conservation, including the dissipative effects of viscosity, diffusion and heat conduction. By neglecting all non-ideal processes and assuming adiabatic variations, we obtain the Euler equations [13, 14], describing the dynamics of dissipation-free, inviscid, compressible fluids. They are a coupled set of nonlinear hyperbolic partial differential equations, expressed in conservative form as

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0 \qquad (1a)$$

$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v}\mathbf{v} + \mathbf{I}p) = 0 \qquad (1b)$$

$$\frac{\partial E}{\partial t} + \nabla \cdot ((E + p)\mathbf{v}) = 0 \qquad (1c)$$

where $t$ denotes time, $\nabla$ is the Nabla operator, $\rho$ is the density, $u$ and $v$ are the x- and y-components of the velocity vector $\mathbf{v}$, respectively, $p$ is the thermal pressure of the fluid, $\mathbf{I}$ is the identity matrix, and $E$ is the total energy density defined by

$$E = \frac{p}{\gamma - 1} + \frac{1}{2}\rho\, \mathbf{v} \cdot \mathbf{v}. \qquad (1d)$$

In equation (1d) the value of the ratio of specific heats is taken to be $\gamma = 1.4$. For later use we introduce the conservative state vector $U = [\rho, \rho u, \rho v, E]^T$, the set of primitive variables $P = [\rho, u, v, p]^T$ and the speed of sound $c = \sqrt{\gamma p / \rho}$. It is also convenient to merge (1a), (1b) and (1c) into hyperbolic conservation law form in terms of $U$ and the flux tensor

$$F = \begin{bmatrix} \rho \mathbf{v} \\ \rho \mathbf{v}\mathbf{v} + \mathbf{I}p \\ (E + p)\mathbf{v} \end{bmatrix} \qquad (2)$$

as

$$\frac{\partial U}{\partial t} + \nabla \cdot F = 0. \qquad (3)$$
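For reference, the conversion from the conservative state $U$ to the primitive variables $P$ and the speed of sound $c$ follows directly from (1d). The short sketch below is illustrative Python with $\gamma = 1.4$ as above, not the FPGA data path; it mirrors the small auxiliary per-cell arithmetic unit mentioned in Section 3.1.

```python
# Sketch of the conservative -> primitive conversion implied by (1d):
# U = [rho, rho*u, rho*v, E]  ->  (rho, u, v, p, c), with gamma = 1.4.
import numpy as np

GAMMA = 1.4

def primitives(U):
    rho, mx, my, E = U
    u, v = mx / rho, my / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))   # invert (1d)
    c = np.sqrt(GAMMA * p / rho)                            # speed of sound
    return rho, u, v, p, c

print(primitives(np.array([1.0, 0.5, 0.0, 2.5])))
```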

3.1. Discretization of the governing equations

A logically structured arrangement of data is a convenient choice for the efficient operation of FPGA based implementations [6]. However, a structured data representation is not flexible enough for the spatial discretization of complex geometries. As one of the main innovative contributions of this paper, here we consider an unstructured, cell-centered representation of the physical quantities. In the following paragraphs we describe the mesh geometry, the update scheme, and the main features of the numerical algorithm.

The computational domain is composed of non-overlapping triangles. The i-th face of triangle $T$ is labelled by $f_i$. The normal vector of $f_i$ pointing outward from $T$, scaled by the length of the face, is $n_i$. The volume of triangle $T$ is $V_T$. Following the finite volume methodology, all components of the volume averaged quantities are stored at the mass center of the triangles.

Application of the cell-centered finite volume discretization method leads to the following semi-discrete form of the governing equations (3):

$$\frac{dU_T}{dt} = -\frac{1}{V_T} \sum_f F_f \cdot n_f, \qquad (4)$$

where the summation runs over all three faces of cell $T$, and $F_f$ is the flux tensor evaluated at face $f$. Let us consider face $f$ in a coordinate frame attached to the face, such that its x-axis is normal to $f$ (see Fig. 1). Face $f$ separates triangle $L$ (left) and triangle $R$ (right). In this case the $F_f \cdot n_f$ scalar product equals the x-component of $F$ (i.e. $F_x$) multiplied by the area of the face. In order to stabilize the solution procedure, artificial dissipation has to be introduced into the scheme. According to the standard procedure, this is achieved by replacing the physical flux tensor by the numerical flux function $F_N$ containing the dissipative stabilization term. A finite volume scheme is characterized by the evaluation of $F_N$, which is a function of both $U_L$ and $U_R$. In this paper we employ the simple and robust Lax-Friedrichs numerical flux function defined as

$$F_N = \frac{F_L + F_R}{2} - (|u| + c)\,\frac{U_R - U_L}{2}. \qquad (5)$$

In the last equation $F_L = F_x(U_L)$ and $F_R = F_x(U_R)$, and the notations $|u|$ and $c$ represent the average value of the $u$ velocity component and of the speed of sound at an interface, respectively.

[Fig. 1. Interface with the normal vector and the cells required in the computation: faces $f_1$, $f_2$, $f_3$ with normals $n_1$, $n_2$, $n_3$, and the left (L) and right (R) cells in the face-attached x-y frame.]

The temporal derivative is discretized by the first-order forward Euler method:

$$\frac{dU_T}{dt} = \frac{U_T^{n+1} - U_T^n}{\Delta t}, \qquad (6)$$

where $U_T^n$ is the known value of the state vector at time level $n$, $U_T^{n+1}$ is the unknown value of the state vector at time level $n+1$, and $\Delta t$ is the time step.

Working out the algebra described so far leads to the discrete form of the governing equations:

$$U_T^{n+1} = U_T^n - \frac{\Delta t}{V_T} \sum_f R_{n_f} F_f |n_f|, \qquad (7)$$

where $R_{n_f}$ is the rotation tensor describing the transformation from the normal-parallel coordinate frame of face $f$ to the x-y frame. The quantity $F_f$ is defined in a coordinate frame attached to face $f$, with such an orientation that the left state is identical to the state of the update triangle $T$, while the right state corresponds to the state of the triangle situated on the opposite side of the face. With these conventions, the normal component of the numerical flux function is given by

$$F_f^{\rho} = \frac{\rho_L u_L + \rho_R u_R}{2} - (|u| + c)\,\frac{\rho_R - \rho_L}{2} \qquad (8a)$$

$$F_f^{\rho u} = \frac{(\rho_L u_L^2 + p_L) + (\rho_R u_R^2 + p_R)}{2} - (|u| + c)\,\frac{\rho_R u_R - \rho_L u_L}{2} \qquad (8b)$$


[Fig. 2. The partitioned arithmetic unit generated from equations (8a) to (8d). The data-flow graph contains multiply, multiply-by-(-1), divide-by-2, add, subtract and delay nodes between the input and output operands.]

$$F_f^{\rho v} = \frac{\rho_L u_L v_L + \rho_R u_R v_R}{2} - (|u| + c)\,\frac{\rho_R v_R - \rho_L v_L}{2} \qquad (8c)$$

$$F_f^{E} = \frac{(E_L + p_L) u_L + (E_R + p_R) u_R}{2} - (|u| + c)\,\frac{E_R - E_L}{2} \qquad (8d)$$
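A plain software rendering of (8a) to (8d) may help to see what the generated arithmetic unit evaluates per face. The sketch below is only a reference model, not the FPGA data path: it assumes that $|u|$ and $c$ are simple arithmetic means of the left and right values, and it uses the standard Lax-Friedrichs dissipation sign of (5).

```python
# Reference-style sketch of the face flux (8a)-(8d); plain Python for clarity.
# States are in the face-attached frame: left = update cell, right = neighbour.
import numpy as np

GAMMA = 1.4

def lax_friedrichs_flux(UL, UR):
    """UL, UR = [rho, rho*u, rho*v, E] -> [F_rho, F_rhou, F_rhov, F_E]."""
    def prim(U):
        rho, mx, my, E = U
        u, v = mx / rho, my / rho
        p = (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))
        return rho, u, v, p, E
    rhoL, uL, vL, pL, EL = prim(UL)
    rhoR, uR, vR, pR, ER = prim(UR)

    # dissipation coefficient: averaged |u| plus averaged speed of sound (assumed means)
    lam = 0.5 * (abs(uL) + abs(uR)) \
        + 0.5 * (np.sqrt(GAMMA * pL / rhoL) + np.sqrt(GAMMA * pR / rhoR))

    central = 0.5 * np.array([
        rhoL * uL + rhoR * uR,                        # centered part of (8a)
        (rhoL * uL**2 + pL) + (rhoR * uR**2 + pR),    # (8b)
        rhoL * uL * vL + rhoR * uR * vR,              # (8c)
        (EL + pL) * uL + (ER + pR) * uR,              # (8d)
    ])
    jump = 0.5 * np.array([rhoR - rhoL,
                           rhoR * uR - rhoL * uL,
                           rhoR * vR - rhoL * vL,
                           ER - EL])
    return central - lam * jump

# Example: left and right states with a jump across the face.
print(lax_friedrichs_flux(np.array([1.0, 0.3, 0.0, 2.6]),
                          np.array([0.8, 0.2, 0.0, 2.0])))
```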

The arithmetic unit for the computation of (8a) to (8d) is generated and partitioned using the algorithm described in [8]. In contrast to common graph partitioning, which cannot target placement objectives, the main idea of the algorithm is to floorplan the vertices of the data-flow graph before the partitioning starts. The vertices, which represent the floating-point units, are placed with a swap-based iterative algorithm to minimize the distance between connected vertices. Using the spatial positions of the vertices, a simple greedy algorithm then creates clusters in which the number of I/Os is limited. In the resulting partition each cluster can be controlled with a simple control unit, and the clusters can be mapped to the FPGA without long interconnecting signals, resulting in a high operating frequency. The partitioned arithmetic unit is shown in Figure 2.
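The generator of [8] is not reproduced here; the toy sketch below only illustrates the two steps named above on a small data-flow graph: a swap-based one-dimensional placement that shortens connected edges, followed by a greedy pass that closes a cluster whenever adding the next operator would exceed an I/O limit. The graph, the 1-D placement and the limit value are illustrative assumptions.

```python
# Toy illustration (not the tool of [8]): placement-aware ordering, then greedy
# clustering of a data-flow graph under an I/O-count limit.
import random

edges = [(0, 2), (1, 2), (2, 4), (3, 4), (4, 6), (5, 6), (6, 7)]  # op dependencies
n_ops, io_limit = 8, 3

def wirelength(pos):
    return sum(abs(pos[a] - pos[b]) for a, b in edges)

# Step 1: swap-based iterative placement minimizing distance of connected vertices.
pos = list(range(n_ops))              # pos[v] = 1-D slot of operator v
random.seed(0)
for _ in range(2000):
    a, b = random.sample(range(n_ops), 2)
    before = wirelength(pos)
    pos[a], pos[b] = pos[b], pos[a]
    if wirelength(pos) > before:      # keep only non-worsening swaps
        pos[a], pos[b] = pos[b], pos[a]

order = sorted(range(n_ops), key=lambda v: pos[v])

# Step 2: greedy clustering along the placed order, limited by cluster I/O count.
def cluster_io(cluster):
    inside = set(cluster)
    return sum(1 for a, b in edges if (a in inside) != (b in inside))

clusters, current = [], []
for v in order:
    if current and cluster_io(current + [v]) > io_limit:
        clusters.append(current)      # close the cluster, start a new one
        current = [v]
    else:
        current.append(v)
clusters.append(current)
print(clusters)
```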

Additionally, a simple arithmetic unit is required to compute the $u$, $v$, $p$, $c$ values for each cell. The two arithmetic units contain 31 adders/subtractors, 38 multipliers, one reciprocal and one square root unit. Multiplication and division by powers of two are implemented by modifying the exponent of the floating-point numbers. A new interface value can be computed in each clock cycle, therefore three clock cycles are required for updating each cell.
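The exponent trick mentioned above can be illustrated in software: halving an IEEE-754 double amounts to decrementing its 11-bit exponent field, so no full multiplier or divider is needed. The sketch below shows the idea for normal, non-zero inputs; the real hardware must of course also handle zeros, subnormals and infinities.

```python
# Sketch of the "modify the exponent" trick for dividing a double by 2:
# decrement the 11-bit exponent field of the IEEE-754 encoding instead of
# instantiating a full divider.  (Normal, non-zero inputs assumed.)
import struct

def halve_via_exponent(x: float) -> float:
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exponent = (bits >> 52) & 0x7FF
    bits = (bits & ~(0x7FF << 52)) | ((exponent - 1) << 52)   # exponent -= 1
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

print(halve_via_exponent(6.5))   # 3.25
print(halve_via_exponent(-1.0))  # -0.5
```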

    4. ARCHITECTURE

    For pr...
