FPGA Based Acceleration of Computational Fluid Flow Simulation on Unstructured Mesh Geometry

22nd International Conference on Field Programmable Logic and Applications (FPL 2012), Oslo, Norway, August 29-31, 2012



Zoltan Nagy1,2, Csaba Nemes2, Antal Hiba2, Andras Kiss1,2, Arpad Csik3, Peter Szolgay1,2

1 Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, H-1111 Budapest, Hungary
email: nagyz@sztaki.hu, kissa@sztaki.hu, szolgay@sztaki.hu

2 Faculty of Information Technology, Pazmany Peter Catholic University
email: nemcs@digitus.itk.ppke.hu, hiban@digitus.itk.ppke.hu

3 Department of Mathematics and Computer Science, Szechenyi Istvan University
email: dr.arpad.csik@gmail.com


Numerical simulation of complex computational fluid dynamics problems evolving in time plays an important role in scientific and engineering applications. The accurate behavior of dynamical systems can be understood using large-scale simulations, which traditionally require expensive supercomputing facilities. In this paper a Field Programmable Gate Array (FPGA) based framework is described to accelerate the simulation of complex physical spatio-temporal phenomena. Simulating complicated geometries requires unstructured spatial discretization, which results in irregular memory access patterns that severely limit computing performance. Data locality is improved by a mesh node renumbering technique, which results in a sequential memory access pattern. Additionally, storing a small window of cell-centered state values in the on-chip memory of the FPGA can increase data reuse and decrease memory bandwidth requirements. Generating the floating-point data path and control structure of an arithmetic unit containing dozens of operators is a very challenging task when the goal is a high operating frequency. The efficiency and use of the framework are described by a case study solving the Euler equations on an unstructured mesh using the finite volume technique. On the currently available largest FPGA the generated architecture contains three processing elements working in parallel, providing a 75 times speedup compared to a high-performance microprocessor.

This research project was supported by the Janos Bolyai Research Scholarship of the Hungarian Academy of Sciences and OTKA Grant No. K84267, and in part by OTKA Grant No. K68322. The support of grants TAMOP-4.2.1.B-11/2/KMR-2011-0002 and TAMOP-4.2.2/B-10/1-2010-0014 is gratefully acknowledged.


The number of transistors integrated into a single integrated circuit doubles every 18 months according to Moore's Law [1]. On recent devices around 4 billion transistors can be implemented. Since the early 2000s the operating frequency of conventional high-performance microprocessors has stalled in the 3-4 GHz range due to power consumption limitations. As the number of transistors has increased significantly, the main question is how to use them efficiently to create high-performance computing systems. One possible solution is to integrate several conventional processors into the same chip, creating a multi-core processor. Another way is to use the general-purpose computing power of Graphics Processing Units (GPUs), where hundreds of simpler processing elements can work in parallel. A more efficient way to increase computing power might be to design an application-specific accelerator and implement it on a Field Programmable Gate Array (FPGA). Due to increased transistor density, around one hundred double-precision floating-point adders and multipliers can be implemented on recent devices. Several previous studies proved the efficiency of field programmable logic devices in the numerical simulation of various physical phenomena such as electromagnetic [2], transient wave [3, 4] and computational fluid dynamics [5, 6] simulations.

The most obvious way to solve complex spatio-temporal problems is numerical approximation over a regular mesh structure. However, practical applications usually contain complex boundaries, which unstructured meshes can handle more efficiently. The resolution of the mesh should be increased near rapid changes in the dynamics or shape of the problem. Unfortunately, conventional microprocessors reach only around 10% utilization during unstructured mesh computations [7] due to the irregular memory access pattern of grid data.

To accelerate the computation, irregular memory access patterns should be hidden by temporarily storing the relevant grid points in the on-chip memory of the FPGA. Mesh points have to be properly ordered to minimize the size of this temporary memory. This optimization task is equivalent to the Matrix Bandwidth Minimization (MBM) problem when the mesh is treated as an undirected graph and the bandwidth of the adjacency matrix of the graph is minimized.
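As an illustration of this MBM-style reordering, the sketch below renumbers a toy mesh with the reverse Cuthill-McKee heuristic. The paper does not name the renumbering algorithm it uses; reverse Cuthill-McKee (available in SciPy) is simply a standard choice for reducing adjacency-matrix bandwidth, and the six-cell mesh is an invented example.

```python
# Sketch: reducing adjacency-matrix bandwidth by renumbering mesh nodes.
# Reverse Cuthill-McKee is an assumed stand-in for the paper's heuristic.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(a):
    """Maximum |i - j| over the nonzeros of sparse matrix a."""
    coo = a.tocoo()
    return int(np.abs(coo.row - coo.col).max())

# Toy unstructured mesh: undirected adjacency of 6 cells (a 6-cycle).
edges = [(0, 3), (0, 5), (1, 2), (1, 4), (2, 5), (3, 4)]
n = 6
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

perm = reverse_cuthill_mckee(adj, symmetric_mode=True)
reordered = adj[perm][:, perm]
print(bandwidth(adj), bandwidth(reordered))  # bandwidth drops after reordering
```

A smaller bandwidth directly bounds how many recently visited cell states the on-chip window has to retain while streaming through the mesh.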


When an explicit finite volume method is used to solve a PDE, a complex mathematical expression must be computed over the neighborhood of each node. The complexity of the resulting arithmetic unit is determined by the governing equations and the discretization method used. Usually the arithmetic unit is constructed from dozens of floating-point units, which makes manual design very tedious and error-prone. Therefore an automatic tool was developed to generate the arithmetic unit from the discretized governing equations [8].
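The core idea behind such a generator can be sketched as treating the discretized expression as a dataflow graph of floating-point operators and scheduling each operator as soon as its inputs are ready (ASAP). The latencies and the expression below are illustrative assumptions, not the tool's actual parameters.

```python
# Sketch: ASAP scheduling of a floating-point dataflow graph, the kind
# of structure an arithmetic-unit generator works with. Latency values
# are assumed pipeline depths, not figures from the paper.

LATENCY = {"add": 12, "mul": 10}  # assumed FP pipeline depths (cycles)

def schedule(nodes):
    """ASAP-schedule a dataflow graph given as name -> (op, inputs);
    leaves have op == 'in'. Returns the cycle each result is ready."""
    ready = {}
    def start(n):
        if n in ready:
            return ready[n]
        op, ins = nodes[n]
        if op == "in":
            ready[n] = 0  # inputs are available immediately
        else:
            ready[n] = max(start(i) for i in ins) + LATENCY[op]
        return ready[n]
    for n in nodes:
        start(n)
    return ready

# flux = (a + b) * c + d, written as a dataflow graph
graph = {
    "a": ("in", []), "b": ("in", []), "c": ("in", []), "d": ("in", []),
    "t1": ("add", ["a", "b"]),
    "t2": ("mul", ["t1", "c"]),
    "flux": ("add", ["t2", "d"]),
}
print(schedule(graph)["flux"])  # total pipeline latency of the unit, in cycles
```

A real generator must additionally insert delay registers so that operands arriving on paths of different depth meet at the right cycle, which is exactly the bookkeeping that makes manual design error-prone.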


Several papers were published in the early 2000s [9, 10] dealing with the acceleration of Partial Differential Equations (PDEs) on unstructured meshes. Most of them focused on accelerating Finite Element Methods (FEM), where the global stiffness matrix is given and the resulting large linear system of equations is usually solved by the iterative Conjugate Gradient (CG) method. The most time-consuming operation of this method is a sparse matrix-vector multiplication, therefore most of the papers try to accelerate this part. Though our architecture is designed for explicit unstructured finite volume calculations, examination of these architectures is helpful because similar problems arise in our case too. For example, the adjacency matrix of the mesh is sparse and elements of the state vector are accessed in a random order.
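The kernel these FEM accelerators target can be made concrete with a minimal CSR sparse matrix-vector multiply. The indirect read `x[col[k]]` is the random, connectivity-dependent access into the state vector described above; the matrix here is a made-up 3x3 example.

```python
# Minimal CSR sparse matrix-vector multiply (the kernel targeted by
# the surveyed FEM accelerators). Note the indirect read x[col[k]]:
# the access pattern into x follows the mesh connectivity.
import numpy as np

def spmv_csr(data, col, rowptr, x):
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += data[k] * x[col[k]]  # irregular, data-dependent access
    return y

# 3x3 example matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
data   = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
col    = np.array([0, 2, 1, 0, 2])
rowptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(data, col, rowptr, x))  # [3. 3. 9.]
```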

Elkurdi et al. [11] proposed a process where the finite element sparse matrix is first reorganized into a banded form. Then the matrix-vector multiplication is calculated along special pipelineable diagonal stripes, where two successive elements can be processed by a pipelined architecture. Performance of the architecture is determined by the available memory bandwidth and the sparsity of the matrix; however, utilization of the processing elements varies over a very wide range, between 17.74% and 86.24%.

Recently Nagar et al. [12] proposed an architecture using an optimized Streaming Multiply-Accumulator with separate cache memories for matrix and vector data. The implementation platform they used has a special memory architecture providing a high 20 GB/s peak memory bandwidth. Performance of the system with four FPGAs is in the 1.17-3.94 GFLOPS range, outperforming a Tesla 1070 GPU. However, utilization of the PEs is around 50%, similarly to the previous architectures, and increasing the number of PEs to occupy the entire FPGA still runs into a memory bandwidth limit.

The surveyed architectures provide general solutions to accelerate FEM calculations on FPGAs, but suffer from the inherent high memory bandwidth requirement and the small FLOPs-per-loaded-byte ratio of sparse matrix-vector multiplication. Furthermore, utilization of the execution units depends on the structure of the sparse matrix.

The irregular memory access pattern and high memory bandwidth requirement can be eliminated by storing a small part of the grid on-chip and reordering its nodes. The right-hand side of the discretized equations must be computed for each volume in every timestep, which requires several floating-point operations, resulting in a better FLOPs-per-loaded-byte ratio and higher utilization of FPGA resources.
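The contrast in arithmetic intensity can be illustrated with a back-of-the-envelope calculation. The per-nonzero cost of double-precision SpMV is fixed by its format; the finite-volume figures (FLOPs per cell, state vector size) are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope FLOPs-per-loaded-byte comparison.
# SpMV: per nonzero, one multiply + one add, loading an 8-byte matrix
# value, a 4-byte column index, and (pessimistically) an 8-byte x entry.
spmv_intensity = 2 / (8 + 4 + 8)

# Explicit finite-volume step: assume ~100 FLOPs per cell on a state
# vector of 4 doubles, read once and written once (4 * 8 * 2 bytes).
# Both figures are assumptions for illustration only.
fv_intensity = 100 / (4 * 8 * 2)

print(round(spmv_intensity, 2), round(fv_intensity, 2))  # 0.1 vs 1.56
```

Even with these rough numbers the finite-volume update performs an order of magnitude more work per byte fetched, which is why it tolerates a far smaller memory bandwidth than SpMV-based FEM solvers.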


A wide range of industrial processes and scientific phenomena involve gas or fluid flows over complex obstacles, e.g. air flow around vehicles and buildings, the flow of water in the oceans, or liquids in BioMEMS. In such applications the temporal evolution of non-ideal, compressible fluids is quite often modelled by the system of Navier-Stokes equations. The model is based on the fundamental laws of mass, momentum and energy conservation, including the dissipative effects of viscosity, diffusion and heat conduction. By neglecting all non-ideal processes and assuming adiabatic variations, we obtain the Euler equations [13, 14], describing the dynamics of dissipation-free, inviscid, compressible fluids. They are a coupled set of nonlinear hyperbolic partial differential equations, expressed in conservative form as
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0 \qquad (1a)$$

$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v} \mathbf{v} + I p) = 0 \qquad (1b)$$

$$\frac{\partial E}{\partial t} + \nabla \cdot ((E + p) \mathbf{v}) = 0 \qquad (1c)$$

where $t$ denotes time, $\nabla$ is the Nabla operator, $\rho$ is the