FPGA BASED ACCELERATION OF COMPUTATIONAL FLUID FLOW SIMULATION ON UNSTRUCTURED MESH GEOMETRY
Zoltan Nagy1,2, Csaba Nemes2, Antal Hiba2, Andras Kiss1,2, Arpad Csik3, Peter Szolgay1,2
1 Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, H-1111 Budapest, Hungary. email: nagyz@sztaki.hu, kissa@sztaki.hu, szolgay@sztaki.hu
2 Faculty of Information Technology, Pazmany Peter Catholic University. email: nemcs@digitus.itk.ppke.hu, hiban@digitus.itk.ppke.hu
3 Department of Mathematics and Computer Science, Szechenyi Istvan University. email: dr.arpad.csik@gmail.com
ABSTRACT
Numerical simulation of complex computational fluid dynamics problems evolving in time plays an important role in scientific and engineering applications. Accurate behavior of dynamical systems can be understood using large-scale simulations, which traditionally require expensive supercomputing facilities. In this paper a Field Programmable Gate Array (FPGA) based framework is described to accelerate the simulation of complex physical spatiotemporal phenomena. Simulating complicated geometries requires unstructured spatial discretization, which results in irregular memory access patterns that severely limit computing performance. Data locality is improved by a mesh node renumbering technique, which results in a sequential memory access pattern. Additionally, storing a small window of cell-centered state values in the on-chip memory of the FPGA can increase data reuse and decrease memory bandwidth requirements. Generating the floating-point data path and control structure of an arithmetic unit containing dozens of operators is a very challenging task when the goal is a high operating frequency. The efficiency and use of the framework are demonstrated by a case study solving the Euler equations on an unstructured mesh using the finite volume technique. On the largest currently available FPGA the generated architecture contains three processing elements working in parallel, providing 75 times speedup compared to a high-performance microprocessor.
This research project was supported by the Janos Bolyai Research Scholarship of the Hungarian Academy of Sciences and OTKA Grant No. K84267, and in part by OTKA Grant No. K68322. The support of the grants TAMOP-4.2.1.B-11/2/KMR-2011-0002 and TAMOP-4.2.2/B-10/1-2010-0014 is gratefully acknowledged.
1. INTRODUCTION
The number of transistors integrated into a single integrated circuit doubles every 18 months according to Moore's Law [1]. On recent devices around 4 billion transistors can be implemented. Since the early 2000s the operating frequency of conventional high-performance microprocessors has stalled in the 3-4 GHz range due to power consumption limitations. As the number of transistors has increased significantly, the main question is how to use them efficiently to create high-performance computing systems. One possible solution is to integrate several conventional processors into the same chip, creating a multicore processor. Another way is to use the general-purpose computing power of Graphics Processing Units (GPU), where hundreds of simpler processing elements can work in parallel. A more efficient way to increase computing power might be to design an application-specific accelerator and implement it on a Field Programmable Gate Array (FPGA). Due to increased transistor density, around one hundred double-precision floating-point adders and multipliers can be implemented on recent devices. Several previous studies proved the efficiency of field programmable logic devices in the numerical simulation of various physical phenomena such as electromagnetic [2], transient wave [3, 4] and computational fluid dynamics [5, 6] simulations.
The most obvious way to solve complex spatiotemporal problems is numerical approximation over a regular mesh structure. Practical applications usually contain complex boundaries, which can be handled more efficiently by unstructured meshes. The resolution of the mesh should be increased near rapid changes in the dynamics or shape of the problem. Unfortunately, conventional microprocessors achieve only around 10% utilization during unstructured mesh computations [7] due to the irregular memory access pattern of grid data.
978-1-4673-2256-0/12/$31.00 © 2012 IEEE
To accelerate the computation, irregular memory access patterns should be hidden by temporarily storing the relevant grid points in the on-chip memory of the FPGA. Mesh points have to be properly ordered to minimize the size of this temporary memory. This optimization task is equivalent to the Matrix Bandwidth Minimization (MBM) problem when the mesh is treated as an undirected graph and the bandwidth of the adjacency matrix of the graph is minimized.
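The MBM problem is NP-hard, so heuristics are used in practice. As a minimal sketch of the renumbering idea, the classic Cuthill-McKee ordering (an illustrative choice only; the paper does not state which reordering heuristic is used) performs a breadth-first sweep from a low-degree node, which tends to keep connected mesh cells close together in the new numbering:

```python
from collections import deque

def cuthill_mckee(adj):
    """Renumber the nodes of an undirected graph (adjacency dict:
    node -> list of neighbours) with the Cuthill-McKee heuristic to
    reduce the bandwidth of the adjacency matrix."""
    visited, order = set(), []
    # Start each connected component from a minimum-degree node.
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in increasing-degree order.
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    # Map old node id -> new node id.
    return {old: new for new, old in enumerate(order)}

def bandwidth(adj, renum):
    """Bandwidth of the adjacency matrix under the given numbering."""
    return max(abs(renum[v] - renum[w]) for v in adj for w in adj[v])
```

After reordering, every neighbour of a cell lies within `bandwidth` positions of it, which bounds the size of the on-chip window that must be kept resident.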
When an explicit finite volume method is used to solve a PDE, a complex mathematical expression must be computed on the neighborhood of each node. The complexity of the resulting arithmetic unit is determined by the governing equations and the discretization method used. Usually the arithmetic unit is constructed from dozens of floating-point units, which makes manual design very tedious and error-prone. Therefore an automatic tool was developed to generate the arithmetic unit from the discretized governing equations [8].
2. RELATED WORK
Several papers were published in the early 2000s [9, 10] dealing with the acceleration of Partial Differential Equations (PDE) on unstructured meshes. Most of them focused on accelerating Finite Element Methods (FEM), where the global stiffness matrix is given and the resulting large linear system of equations is solved, usually by the iterative Conjugate Gradient (CG) method. The most time-consuming operation of this method is a sparse matrix-vector multiplication, therefore most of the papers try to accelerate this part. Though our architecture is designed for explicit unstructured finite volume calculations, examination of these architectures is helpful because similar problems arise in our case too. For example, the adjacency matrix of the mesh is sparse and elements of the state vector are accessed in a random order.
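For reference, the kernel these architectures accelerate is the sparse matrix-vector product; a minimal software version over the Compressed Sparse Row (CSR) format (illustrative only, not any of the cited implementations) makes the random vector access explicit:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A in CSR format: `values` holds
    the nonzeros row by row, `col_idx` their column indices, and
    row_ptr[i]:row_ptr[i+1] delimits row i."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Random access into x: this is the irregular read that
            # limits utilization on conventional processors.
            y[i] += values[k] * x[col_idx[k]]
    return y
```

Each nonzero contributes one multiply-add for two loaded words (the value and the vector element), which is the low FLOPs-per-loaded-byte ratio discussed below.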
Elkurdi et al. [11] proposed a process where the finite element sparse matrix is first reorganized into a banded form. The matrix-vector multiplication is then calculated along special pipelineable diagonal stripes, where two successive elements can be processed by a pipelined architecture. Performance of the architecture is determined by the available memory bandwidth and the sparsity of the matrix; however, utilization of the processing elements varies in a very wide range, between 17.74% and 86.24%.

Recently Nagar et al. [12] proposed an architecture using an optimized Streaming Multiply-Accumulator with separate cache memories for matrix and vector data. The implementation platform they used has a special memory architecture providing a high 20 GB/s peak memory bandwidth. Performance of the system with four FPGAs is in the 1.17-3.94 GFLOPS range, outperforming a Tesla 1070 GPU. However, utilization of the PEs is around 50%, similarly to the previous architectures, and increasing the number of PEs to occupy the entire FPGA still runs into a memory bandwidth limit.
The surveyed architectures provide general solutions to accelerate FEM calculations on FPGAs but suffer from the inherent high memory bandwidth requirement and the small FLOPs-per-loaded-byte ratio of sparse matrix-vector multiplication. Moreover, utilization of the execution units depends on the structure of the sparse matrix.

The irregular memory access pattern and the high memory bandwidth requirement can be eliminated by storing a small part of the grid on-chip and reordering its nodes. The right-hand side of the discretized equations has to be computed for each volume in every timestep, which requires several floating-point operations, resulting in a better FLOPs-per-loaded-byte ratio and higher utilization of FPGA resources.
3. FLUID FLOWS
A wide range of industrial processes and scientific phenomena involve gas or fluid flows over complex obstacles, e.g. air flow around vehicles and buildings, the flow of water in the oceans or liquid in BioMEMS. In such applications the temporal evolution of non-ideal, compressible fluids is quite often modelled by the system of Navier-Stokes equations. The model is based on the fundamental laws of mass, momentum and energy conservation, including the dissipative effects of viscosity, diffusion and heat conduction. By neglecting all non-ideal processes and assuming adiabatic variations, we obtain the Euler equations [13, 14], describing the dynamics of dissipation-free, inviscid, compressible fluids. They are a coupled set of nonlinear hyperbolic partial differential equations, expressed in conservative form as
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0, \tag{1a}
\]
\[
\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v}\mathbf{v} + \mathbf{I}p) = 0, \tag{1b}
\]
\[
\frac{\partial E}{\partial t} + \nabla \cdot \big((E + p)\mathbf{v}\big) = 0, \tag{1c}
\]
where t denotes time, \nabla is the Nabla operator, \rho is the density, u and v are the x- and y-components of the velocity vector \mathbf{v}, respectively, p is the thermal pressure of the fluid, \mathbf{I} is the identity matrix, and E is the total energy density defined by
\[
E = \frac{p}{\gamma - 1} + \frac{1}{2}\rho\, \mathbf{v} \cdot \mathbf{v}. \tag{1d}
\]
In equation (1d) the value of the ratio of specific heats is taken to be \gamma = 1.4. For later use we introduce the conservative state vector U = [\rho, \rho u, \rho v, E]^T, the set of primitive variables P = [\rho, u, v, p]^T and the speed of sound
c = \sqrt{\gamma p / \rho}. It is also convenient to merge (1a), (1b) and (1c) into hyperbolic conservation law form in terms of U and the flux tensor
\[
\mathbf{F} = \begin{bmatrix} \rho \mathbf{v} \\ \rho \mathbf{v}\mathbf{v} + \mathbf{I}p \\ (E + p)\mathbf{v} \end{bmatrix} \tag{2}
\]
as:
\[
\frac{\partial U}{\partial t} + \nabla \cdot \mathbf{F} = 0. \tag{3}
\]
3.1. Discretization of the governing equations
Logically structured arrangement of data is a convenient choice for the efficient operation of FPGA-based implementations [6]. However, structured data representation is not flexible enough for the spatial discretization of complex geometries. As one of the main innovative contributions of this paper, here we consider an unstructured, cell-centered representation of physical quantities. In the following paragraphs we describe the mesh geometry, the update scheme, and the main features of the numerical algorithm.

The computational domain is composed of non-overlapping triangles. The i-th face of triangle T is labelled by f_i. The normal vector of f_i pointing outward of T and scaled by the length of the face is n_i. The volume of triangle T is V_T. Following the finite volume methodology, all components of the volume-averaged quantities are stored at the mass center of the triangles.
Application of the cell-centered finite volume discretization method leads to the following semi-discrete form of the governing equations (3):
\[
\frac{dU_T}{dt} = -\frac{1}{V_T} \sum_f \mathbf{F}_f \cdot \mathbf{n}_f, \tag{4}
\]
where the summation runs over all three faces of cell T, and \mathbf{F}_f is the flux tensor evaluated at face f. Let us consider face f in a coordinate frame attached to the face, such that its x-axis is normal to f (see Fig. 1). Face f separates triangle L (left) and triangle R (right). In this case the \mathbf{F}_f \cdot \mathbf{n}_f scalar product equals the x-component of \mathbf{F} (i.e. \mathbf{F}_x) multiplied by the area of the face. In order to stabilize the solution procedure, artificial dissipation has to be introduced into the scheme. According to the standard procedure, this is achieved by replacing the physical flux tensor by the numerical flux function F_N containing the dissipative stabilization term. A finite volume scheme is characterized by the evaluation of F_N, which is a function of both U_L and U_R. In this paper we employ the simple and robust Lax-Friedrichs numerical flux function defined as
\[
F_N = \frac{F_L + F_R}{2} - (\bar{u} + \bar{c})\, \frac{U_R - U_L}{2}. \tag{5}
\]
In the last equation F_L = F_x(U_L) and F_R = F_x(U_R), and the notations \bar{u} and \bar{c} represent the average value of the u velocity component and the speed of sound at an interface, respectively.

Fig. 1. Interface with the normal vector and the cells required in the computation

The temporal derivative is discretized by the first-order forward Euler method:
\[
\frac{dU_T}{dt} = \frac{U_T^{n+1} - U_T^n}{\Delta t}, \tag{6}
\]
where U_T^n is the known value of the state vector at time level n, U_T^{n+1} is the unknown value of the state vector at time level n+1, and \Delta t is the time step.
Working out the algebra described so far leads to the discrete form of the governing equations, in which the numerical flux term is computed as
\[
U_T^{n+1} = U_T^n - \frac{\Delta t}{V_T} \sum_f R_{n_f} \mathbf{F}_f\, |\mathbf{n}_f|, \tag{7}
\]
where R_{n_f} is the rotation tensor describing the transformation from the normal-parallel coordinate frame of face f to the x-y frame. The quantity \mathbf{F}_f is defined in a coordinate frame attached to face f, with such an orientation that the left state is identical to the state of the updated triangle T, while the right state corresponds to the state of the triangle situated at the opposite side of the face. With these conventions, the normal components of the numerical flux function are given by
\[
F_f^{\rho} = \frac{\rho_L u_L + \rho_R u_R}{2} - (\bar{u} + \bar{c})\, \frac{\rho_R - \rho_L}{2} \tag{8a}
\]
\[
F_f^{\rho u} = \frac{(\rho_L u_L^2 + p_L) + (\rho_R u_R^2 + p_R)}{2} - (\bar{u} + \bar{c})\, \frac{\rho_R u_R - \rho_L u_L}{2} \tag{8b}
\]
Fig. 2. The partitioned arithmetic unit generated from equations (8a) to (8d)
\[
F_f^{\rho v} = \frac{\rho_L u_L v_L + \rho_R u_R v_R}{2} - (\bar{u} + \bar{c})\, \frac{\rho_R v_R - \rho_L v_L}{2} \tag{8c}
\]
\[
F_f^{E} = \frac{(E_L + p_L)u_L + (E_R + p_R)u_R}{2} - (\bar{u} + \bar{c})\, \frac{E_R - E_L}{2} \tag{8d}
\]
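For reference, equations (8a) to (8d) map directly onto a scalar flux kernel; the sketch below evaluates the same expressions in software (how \bar{u} and \bar{c} are averaged is not fully specified in the text, so the arithmetic means used here, and the absolute value on \bar{u}, are assumptions):

```python
from math import sqrt

GAMMA = 1.4  # ratio of specific heats, as in equation (1d)

def lax_friedrichs_flux(rhoL, uL, vL, pL, rhoR, uR, vR, pR):
    """Normal components of the Lax-Friedrichs numerical flux,
    equations (8a) to (8d), in the face-attached coordinate frame."""
    # Total energy densities from equation (1d).
    EL = pL / (GAMMA - 1.0) + 0.5 * rhoL * (uL * uL + vL * vL)
    ER = pR / (GAMMA - 1.0) + 0.5 * rhoR * (uR * uR + vR * vR)
    # Interface averages of velocity and speed of sound
    # (arithmetic means are an assumption here).
    ubar = 0.5 * (uL + uR)
    cbar = 0.5 * (sqrt(GAMMA * pL / rhoL) + sqrt(GAMMA * pR / rhoR))
    lam = abs(ubar) + cbar  # dissipation coefficient, (u-bar + c-bar) in the text
    f_rho = 0.5 * (rhoL * uL + rhoR * uR) - 0.5 * lam * (rhoR - rhoL)
    f_rhou = 0.5 * ((rhoL * uL * uL + pL) + (rhoR * uR * uR + pR)) \
        - 0.5 * lam * (rhoR * uR - rhoL * uL)
    f_rhov = 0.5 * (rhoL * uL * vL + rhoR * uR * vR) \
        - 0.5 * lam * (rhoR * vR - rhoL * vL)
    f_E = 0.5 * ((EL + pL) * uL + (ER + pR) * uR) - 0.5 * lam * (ER - EL)
    return f_rho, f_rhou, f_rhov, f_E
```

For a uniform fluid at rest the dissipation terms vanish and only the pressure contribution to the momentum flux (8b) survives, which is a quick sanity check on the kernel.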
The arithmetic unit for the computation of (8a) to (8d) is generated and partitioned using the algorithm described in [8]. In contrast to common graph partitioning, which cannot target placement objectives, the main idea of the algorithm is to floorplan the vertices of the dataflow graph before the partitioning starts. The vertices, which represent the floating-point units, are placed with a swap-based iterative algorithm that minimizes the distance between connected vertices. Using the spatial positions of the vertices, a simple greedy algorithm then creates clusters in which the number of I/Os is limited. In the resulting partition each cluster can be controlled by a simple control unit, and the clusters can be mapped to the FPGA without long interconnecting signals, resulting in a high operating frequency. The partitioned arithmetic unit is shown in Figure 2.
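As an illustration of the clustering step only (the swap-based placement of [8] is omitted, and all names here are hypothetical), a greedy sweep over floorplanned vertex positions that closes a cluster whenever its I/O count would exceed the limit might look like:

```python
def greedy_clusters(positions, edges, max_io):
    """Group dataflow-graph vertices into clusters with a bounded
    number of I/O signals.  `positions` maps vertex -> (x, y) from a
    prior floorplanning step; `edges` is a list of (u, v) pairs."""
    def io_count(cluster):
        # Every edge crossing the cluster boundary needs an I/O signal.
        return sum(1 for u, v in edges if (u in cluster) != (v in cluster))

    clusters, current = [], set()
    # Sweep vertices in floorplan order so neighbours land together.
    for v in sorted(positions, key=lambda v: positions[v]):
        if current and io_count(current | {v}) > max_io:
            clusters.append(current)
            current = set()
        current.add(v)
    if current:
        clusters.append(current)
    return clusters
```

Because the sweep follows the floorplan, vertices that were placed close together end up in the same cluster, which is what keeps the interconnecting signals short after mapping.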
Additionally, a simple arithmetic unit is required to compute the u, v, p, c values for each cell. The two arithmetic units contain 31 adders/subtractors, 38 multipliers, one reciprocal unit and one square root unit. Multiplication and division by powers of two is implemented by modifying the exponent of the floating-point numbers. A new interface value can be computed in each clock cycle, therefore three clock cycles are required for updating each cell.
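The per-cell computation of u, v, p and c simply inverts the definitions of Section 3; a software sketch of the same arithmetic (the FPGA unit evaluates it with dedicated reciprocal and square-root operators):

```python
from math import sqrt

GAMMA = 1.4  # ratio of specific heats

def primitives(rho, rhou, rhov, E):
    """Recover the primitive variables (u, v, p) and the speed of
    sound c from the conservative state [rho, rho*u, rho*v, E],
    inverting the total-energy definition (1d)."""
    u = rhou / rho
    v = rhov / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))
    c = sqrt(GAMMA * p / rho)
    return u, v, p, c
```

Note the single division (by rho) and single square root per cell, matching the one reciprocal and one square-root unit in the generated hardware.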
4. ARCHITECTURE
For pr...