- Home
- Documents
- [IEEE 2012 22nd International Conference on Field Programmable Logic and Applications (FPL) - Oslo,...

prev

next

out of 8

View

219Download

6

Embed Size (px)

Transcript

FPGA BASED ACCELERATION OF COMPUTATIONAL FLUID FLOW SIMULATION ON

UNSTRUCTURED MESH GEOMETRY

Zoltan Nagy1,2, Csaba Nemes2, Antal Hiba2, Andras Kiss1,2, Arpad Csk3, Peter Szolgay1,2

1 Computer and Automation Research Institute, Hungarian Academy of Sciences

Kende u. 13-17, H-1111, Budapest, Hungary.

email: nagyz@sztaki.hu, kissa@sztaki.hu, szolgay@sztaki.hu2 Faculty of Information Technology, Pazmany Peter Catholic University

email: nemcs@digitus.itk.ppke.hu, hiban@digitus.itk.ppke.hu3 Department of Mathematics and Computer Science, Szechenyi Istvan University

email: dr.arpad.csik@gmail.com

ABSTRACT

Numerical simulation of complex computational fluid dy-

namics problems evolving in time plays an important role in

scientific and engineering applications. Accurate behavior

of dynamical systems can be understood using large scale

simulations which traditionally requires expensive super-

computing facilities. In the paper a Field Programmable

Gate Array (FPGA) based framework is described to ac-

celerate simulation of complex physical spatio-temporal

phenomena. Simulating complicated geometries requires

unstructured spatial discretization which results in irregular

memory access patterns severely limiting computing per-

formance. Data locality is improved by mesh node renum-

bering technique which results in a sequential memory

access pattern. Additionally storing a small window of cell-

centered state values in the on-chip memory of the FPGA

can increase data reuse and decrease memory bandwidth

requirements. Generation of the floating-point data path and

control structure of the arithmetic unit containing dozens of

operators is a very challenging task when the goal is high

operating frequency. Efficiency and use of the framework

is described by a case study solving the Euler equations on

an unstructured mesh using finite volume technique. On the

currently available largest FPGA the generated architecture

contains three processing elements working in parallel pro-

viding 75 times speedup compared to a high performance

microprocessor.

This research project supported by the Janos Bolyai Research Scholar-

ship of the Hungarian Academy of Sciences and OTKA Grant No. K84267

and in part by OTKA Grant No. K68322. The support of the grants

TAMOP-4.2.1.B-11/2/KMR-2011-0002 and TAMOP-4.2.2/B-10/1-2010-

0014 is gratefully acknowledged.

1. INTRODUCTION

The number of transistors integrated into a single integrated

circuit doubles in every 18 months according to Moores

Law [1]. On recent devices around 4 billion transistors can

be implemented. Since the early 2000s the operating fre-

quency of the conventional high-performance microproces-

sors has stopped in the 3-4GHz range due to power con-

sumption limitations. As the number of transistors is in-

creased significantly the main question is how to use them

efficiently to create high performance computing systems.

One possible solution is to integrate several conventional

processors into the same chip creating a multi-core proces-

sor. Another way is to use the general purpose comput-

ing power of Graphics Processing Units (GPU), where hun-

dreds of simpler processing elements can work in parallel.

A more efficient way to increase computing power might

be to design an application specific accelerator and imple-

ment it on a Field Programmable Gate Array (FPGA). Due

to increased transistor density, around one hundred double

precision floating-point adders and multipliers can be im-

plemented on the recent devices. Several previous studies

proved the efficiency of field programmable logic devices in

numerical simulation of various physical phenomena such

as electromagnetic [2], transient wave [3] [4] and computa-

tional fluid dynamics [5] [6] simulations.

The most obvious way to solve complex spatio-temporal

problems is numerical approximation over a regular mesh

structure. Practical applications usually contain complex

boundaries which can be handled by unstructured meshes

more efficiently. The resolution of the mesh should be in-

creased near rapid changes in the dynamics or shape of the

problem. Unfortunately conventional microprocessors have

around 10% utilization during unstructured mesh computa-tions [7] due to the irregular memory access pattern of grid

978-1-4673-2256-0/12/$31.00 c2012 IEEE 128

data.

To accelerate the computation, irregular memory access

patterns should be hidden by temporally storing the relevant

grid points by using the on-chip memory of the FPGA. Mesh

points have to be properly ordered to minimize the size of

this temporary memory. This optimization task is equiva-

lent to the Matrix Bandwidth Minimization (MBM) prob-

lem when the mesh is treated as an undirected graph and

the bandwidth of the adjacency matrix of the graph is mini-

mized.

When an explicit finite volume method is used during

the solution of a PDE, a complex mathematical expression

must be computed on the neighborhood of each node. The

complexity of the resulting arithmetic unit is determined by

the governing equations and the discretization method used.

Usually the arithmetic unit is constructed using dozens of

floating-point units which makes manual design very te-

dious and error-prone. Therefore an automatic tool was de-

veloped to generate the arithmetic unit using the discretized

governing equations [8].

2. RELATED WORK

Several papers were published in the early 2000s [9, 10]

dealing with the acceleration of Partial Differential Equa-

tions (PDE) on unstructured meshes. Most of them were fo-

cused on accelerating Finite ElementMethods (FEM) where

the global stiffness matrix is given and the resulting large

linear system of equations are solved usually by the iterative

Conjugate Gradient (CG) method. The most time consum-

ing operation of this method is a sparse matrix vector mul-

tiplication therefore most of the papers try to accelerate this

part. Though our architecture is designed for explicit un-

structured finite volume calculations, examination of these

architectures is helpful because similar problems arising in

our case too. For example the adjacency matrix of the mesh

is sparse and elements of the state vector are accessed in a

random order.

Elkurdi et al. [11] proposed a process where the finite

element sparse matrix was reorganized into a banded form

in the first step. Then the matrix vector multiplication was

calculated along special pipelineable diagonal stripes, where

two successive elements could be processed by a pipelined

architecture. Performance of the architecture was deter-

mined by the available memory bandwidth and the sparsity

of the matrix, however utilization of the processing elements

was varying in a very wide range between 17.74 86.24%.Recently Nagar et al. [12] proposed an architecture us-

ing an optimized Streaming Multiply- Accumulator with

separate cache memories for matrix and vector data. The

implementation platform they used has special memory

architecture providing high 20GB/s peak memory band-

width. Performance of the system with four FPGAs is in

the 1.17 3.94GFLOPs range outperforming a Tesla 1070GPU. However utilization of the PEs is around 50% simi-larly to the previous architectures and increasing the number

of PEs to occupy the entire FPGA still runs into a memory

bandwidth limit.

The surveyed architectures provide general solutions to

accelerate FEM calculations on FPGAs but suffer from the

inherent high memory bandwidth requirement and the small

FLOPs per loaded bytes ratio of sparse matrix vector mul-

tiplication. On the other hand utilization of the execution

units depends on the structure of the sparse matrix.

Irregular memory access pattern and high memory band-

width requirement can be eliminated by storing a small part

of the grid on-chip and reordering its nodes. Right hand side

of the discretized equations should be computed for each

volume in every timestep, which requires several floating-

point operations, resulting in better FLOPs per loaded bytes

ratio and higher utilization of FPGA resources.

3. FLUID FLOWS

A wide range of industrial processes and scientific phenom-

ena involve gas or fluids flows over complex obstacles, e.g.

air flow around vehicles and buildings, the flow of water

in the oceans or liquid in BioMEMS. In such applications

the temporal evolution of non-ideal, compressible fluids is

quite often modelled by the system of Navier-Stokes equa-

tions. The model is based on the fundamental laws of mass-,

momentum- and energy conservation, including the dissipa-

tive effects of viscosity, diffusion and heat conduction. By

neglecting all non-ideal processes and assuming adiabatic

variations, we obtain the Euler equations [13, 14], describ-

ing the dynamics of dissipation-free, inviscid, compressible

fluids. They are a coupled set of nonlinear hyperbolic partial

differential equations, in conservative form expressed as

t+ (v) = 0 (1a)

(v)

t+

(vv + Ip

)= 0 (1b)

E

t+ ((E + p)v) = 0 (1c)

where t denotes time, is the Nabla operator, is the