The 9th International Conference on INFOrmatics and Systems (INFOS2014) – 15-17 December Parallel and Distributed Computing Track
Development of Parallel Chemical Transport Modeling using Graphical Processing Units
A. Ali, S.E. Amin, H.H. Ramadan, M.F. Tolba
Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
[email protected]
Abstract—This paper describes the need for, and the proposed design of, high performance computation for a chemical transport model, with particular emphasis on Aerosol Optical Depth (AOD) determination. Throughout this paper, the AOD measurements of the Ozone Monitoring Instrument (OMI) are assimilated with the Polyphemus model. The model simulation is carried out on three NVIDIA Graphics Processing Units (GPUs). Several issues related to the model analysis are studied: time splitting to reduce the truncation error, and dimension splitting to enhance the multi-processing of chemical transport models. The steps, strategies, processes, and optimization approaches of the GPU computations for the chemical transport model implementation are outlined. Additionally, a technique for randomly accessing matrix rows and columns is discussed. The technique can be applied to matrices of different sizes, making it suitable for the different memory sizes of GPU accelerator units. The performance and acceleration, in terms of speedup, of the proposed techniques are studied on selected graphics cards. Our proposed enhancements reach an average speedup of 3.31.
Keywords—air quality; chemical transport model; parallel processing; GPUs; speedup.
I. INTRODUCTION
The recent predominance of multicore microprocessor chip architectures is driven by heat dissipation, increasing power consumption, and related issues. Climate forecast software and Chemical Transport Models (CTMs) are two examples of multi-physics simulations with a high level of parallelism. The purpose of this study is to improve the computational efficiency of the CTM so that the computational demands of long time periods or large modeling domains can be met within a reasonable simulation time. GPUs are highly parallel, many-core processors with a very large amount of computational horsepower and very high memory bandwidth [1]. Previous work porting the KPP Rosenbrock chemical integrator to GPUs using CUDA was completed by Michalakes et al. [2, 3], demonstrating approximately a two-fold speedup for KPP Rosenbrock chemistry on the GPU. A number of studies have experimented with using CUDA on GPUs for large-scale applications. References [4, 5, and 6] reported effective speedups ranging from 10x to 20x for finite element
Navier-Stokes solvers on more complex problems. Brandvik et al. [7] and Hagen et al. [8] also showed 10x to 20x speedups for 3D Euler solvers. Stantchev et al. [9] achieved up to a 14x speedup for plasma turbulence modeling using a Hasegawa-Mima equation solver. Throughout this study, the AOD measurements of OMI are investigated and used in the assimilation process. Two algorithms are applied to retrieve aerosol parameters from OMI radiance measurements, referred to as the near-UV algorithm and the multi-wavelength algorithm, respectively. OMI aerosol optical thicknesses are retrieved better by the multi-wavelength algorithm than by the near-UV algorithm [10]. In this study, AOD at a wavelength of 440 nm is used and applied in all subsequent assimilation and validation processes. The paper is organized as follows: the regional 3D chemical transport model is first described. The next section discusses the development of the GPU code and its optimization for Polyphemus, including the mathematics and implementation details for implicit and explicit transport methods. The following section elaborates on performance results for the transport model as it calculates AOD over Europe, and quantifies the advantages of some significant improvements. Lastly, the final section presents the conclusions and open issues.
II. MODELING DEVELOPMENT AND ACCELERATION OF PARALLEL CTM
The CTM components used in this paper are illustrated in this section. Because system configuration and implementation details matter, the specific model mechanisms and data partitioning techniques are demonstrated in sufficient detail. In addition, this work investigates the benefits of running the CTM on multicore parallel processors, in order to assess the ability of high performance computation to reduce the overall running time of the CTM. The Polyphemus system can handle applications involving passive tracers, radioactive decay, photochemistry, and aerosol dynamics, from local to continental scales [11, 12, and 13]. Every CTM solves a mass balance equation for the concentrations of trace species to determine the fate of pollutants in the atmosphere [14]. These equations are described in detail in [15 and 16]. Transport and reaction are given by:
\frac{\partial c_i}{\partial t} = -\nabla \cdot (u\,c_i) + \nabla \cdot (K\,\nabla c_i) + \left(\frac{\partial c_i}{\partial t}\right)_{\mathrm{chem}} \qquad (1)
Here, c_i denotes the mean concentration of species i. The terms on the right-hand side of Equation (1) represent the changes in concentration due to the following processes:
-\nabla \cdot (u\,c_i): advection, i.e., transport by the wind, where u is the wind velocity vector.
\nabla \cdot (K\,\nabla c_i): turbulent diffusion, with K the turbulent diffusion tensor.
\left(\partial c_i / \partial t\right)_{\mathrm{chem}}: chemical conversion in the gas phase.

Fig. 2. 3D time splitting: the x and y dimensions are advanced by Δt/2 and the z dimension by Δt.
By adopting a simple spatial discretization on a uniform grid (x_i = i\,l) with grid width l = 1/n, solutions to Equation (1) can be approximated via finite difference methods. This yields a second-order diffusion discretization and a third-order upwind-biased advection discretization. Therefore, as shown in Fig. 1, the domain of the chemical concentration matrix is organized in terms of columns and rows. This approach, known as dimensional splitting, has several benefits: a high level of parallelism is exposed by reducing an n-dimensional problem to a group of one-dimensional problems.
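For illustration only (the exact stencils used in Polyphemus may differ), a third-order upwind-biased advection discretization and a second-order centered diffusion discretization on a uniform grid with spacing \Delta x, assuming a wind component u > 0, can be written as:

\left.\frac{\partial c}{\partial x}\right|_i \approx \frac{2c_{i+1} + 3c_i - 6c_{i-1} + c_{i-2}}{6\,\Delta x}, \qquad \left.\frac{\partial^2 c}{\partial x^2}\right|_i \approx \frac{c_{i+1} - 2c_i + c_{i-1}}{\Delta x^2}.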
III. TRANSPORT MODELING USING GPU
GPUs devote more transistors to data processing and fewer to data caching compared to CPUs [1]. The GPU design allows every processor to execute the same part of the program simultaneously, while memory access time is hidden by additional computation. Therefore, GPUs execute data-parallel programs very fast. Initially, GPUs were designed for the game market, which needs high-definition 3D graphics chipsets. Recently, GPUs have been used to run general, scientific, and high performance applications quickly and cheaply. A number of GPU programming languages have been developed. NVIDIA has developed CUDA as a parallel programming model and software environment for developing GPU applications based on the C programming language [18]. The GeForce 8 GPU, for example, has 128 thread processors, also known as stream processors (graphics programmers call them "programmable pixel shaders"); its architecture is shown in Fig. 3 [19]. The first step needed to allow multiple grid cells (see Equation 1) to run at the same time is to send all of the data from the host to the device at the same time. Polyphemus stores the transport data in a number of three- and four-dimensional arrays. CUDA does not currently provide functionality for passing arrays of more than three dimensions to the GPU. Therefore, each array was copied into a one-dimensional array, with the data for each grid cell listed one after another within each array. This allows a single function call to pass all of the data of each array from the host to the device.
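As a minimal sketch of this flattening step (the array names, sizes, and indexing order are illustrative assumptions, not Polyphemus identifiers), a 3D concentration field can be packed into a contiguous 1D buffer and copied to the device in a single transfer:

#include <cuda_runtime.h>
#include <vector>

// Illustrative grid dimensions (not the actual Polyphemus domain).
const int NX = 64, NY = 64, NZ = 32;

// Flatten a conceptual 3D array c[z][y][x] into one contiguous buffer,
// listing the data for each grid cell one after another.
inline int idx(int x, int y, int z) { return (z * NY + y) * NX + x; }

int main() {
    std::vector<double> h_conc(NX * NY * NZ);
    // ... fill h_conc from the host-side model arrays ...

    double* d_conc = nullptr;
    cudaMalloc(&d_conc, h_conc.size() * sizeof(double));

    // One transfer moves the whole field from host to device.
    cudaMemcpy(d_conc, h_conc.data(), h_conc.size() * sizeof(double),
               cudaMemcpyHostToDevice);

    // ... launch transport kernels operating on d_conc ...

    cudaFree(d_conc);
    return 0;
}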
Fig. 1. 3-D spatial domain partitioning in Polyphemus.
Applying dimensional splitting to these equations introduces a local truncation error at each time step. The time splitting technique can reduce this error [17]. Fig. 2 shows a first-order time splitting method that partitions the three-dimensional problem into three one-dimensional problems.
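Writing S_x, S_y, and S_z for the one-dimensional solution operators in each direction, the splitting pattern of Fig. 2, with half steps of Δt/2 in x and y around a full step of Δt in z, can be sketched as (an illustrative form; the exact sequence used in Polyphemus may differ):

c(t + \Delta t) \approx S_x\!\left(\tfrac{\Delta t}{2}\right) S_y\!\left(\tfrac{\Delta t}{2}\right) S_z(\Delta t)\, S_y\!\left(\tfrac{\Delta t}{2}\right) S_x\!\left(\tfrac{\Delta t}{2}\right) c(t).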
Fig. 3. Architecture of Nvidia GeForce 8 [19].
Next, the device functions must be modified to access the correct elements of each large array. The thread index for each grid cell is calculated using the CUDA block dimension and grid dimension variables, allowing each thread to access the data of a different grid cell.
An if statement verifies that the thread index is not larger than the number of grid cells; this prevents threads from attempting to access nonexistent data, since the total number of threads may be slightly larger than the number of grid cells in order to make the thread block size a multiple of the warp size [20].
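A minimal sketch of this indexing pattern and of the corresponding launch configuration is shown below; the kernel body, names, and the block size of 256 are illustrative placeholders, not the actual Polyphemus device functions or settings.

__global__ void transport_kernel(double* conc, int n_cells) {
    // Global thread index derived from the block and grid dimensions.
    int cell = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the total thread count is rounded up to a multiple of the
    // block size, so some threads may fall beyond the last grid cell.
    if (cell >= n_cells) return;

    // ... update the flattened transport data for this grid cell ...
    conc[cell] += 0.0;  // placeholder computation
}

void launch_transport(double* d_conc, int n_cells) {
    int threads = 256;                                // a multiple of the warp size
    int blocks  = (n_cells + threads - 1) / threads;  // ceiling division
    transport_kernel<<<blocks, threads>>>(d_conc, n_cells);
}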
The last step is to modify the execution configuration to use multiple threads and multiple thread blocks. A number of different configurations were tested, with the fastest multiple-grid-cell transport times being similar to the CPU times. The next step is to develop a driver API version, which allows data to be left on the GPU across multiple global GPU function calls, reducing the number of data transfers between the CPU and GPU and therefore the running time.

A. GPU Setup and Model Simulation
The hardware setup was a Lenovo workstation equipped with a quad-core 2.4 GHz CPU and 8 GB RAM, running Windows 7. Calculations with CUDA 5.0 were made on three graphics cards installed in the same PC, with the following parameters:
1) NVIDIA GeForce GT 440 (1 GB, 96 cores)
2) NVIDIA GeForce GT 640 (2 GB, 384 cores)
3) NVIDIA GeForce GTX 660 (2 GB, 960 cores)
In this work, the three cards are referred to as GPU1, GPU2, and GPU3, respectively.
This section briefly reviews the different optimization attempts for Polyphemus transport on the GPU. A number of optimizations proved to be successful and were fully implemented for Polyphemus transport. Experimenting with the number of registers and the thread block size allowed us to find the best settings for the transport methods; many settings gave similarly good performance, but some gave much slower performance. The fast math functions provided by CUDA allowed some computations to be performed faster without sacrificing accuracy. A number of different memory configurations were tested. Global memory allowed data to be transferred between the CPU and GPU; however, performing computations on data in global memory was very slow. Local memory performed very quickly for per-thread data. Texture memory showed a performance similar to global memory for storing simulation data, allowing more data to be transferred from the CPU to the GPU if needed while providing similar performance [21]. Shared memory allowed multiple threads to work together to compute each row/column of data. The main transport loops over each chemical concentration were parallelized, allowing the number of threads computing each row/column of data to be set equal to the number of chemical species per grid cell. The smaller loops at the beginning of these methods, which calculate the Jacobian and other data, were also parallelized. Due to the small number of independent rows/columns of data, computing multiple rows/columns of data per thread block allowed more GPU threads to be used at once. Using the driver API and shuffling subroutines allowed data to be left on the GPU between transport methods, reducing the number of memory transfers between the host and device.
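As a rough sketch of this mapping (the array layout, names, and launch parameters are assumptions for illustration, not the Polyphemus implementation), one block can process one row while one thread handles one chemical species, staging the row in shared memory:

// Assumes the block is launched with exactly n_species threads and with
// n_cols * n_species * sizeof(double) bytes of dynamic shared memory.
__global__ void row_transport(const double* conc_in, double* conc_out,
                              int n_cols, int n_species) {
    extern __shared__ double scratch[];
    int species = threadIdx.x;            // one thread per chemical species
    int row_id  = blockIdx.x;             // one block per row of the domain
    double* my_row = scratch + species * n_cols;

    // Stage this species' row from global memory into fast shared memory.
    for (int i = 0; i < n_cols; ++i)
        my_row[i] = conc_in[(row_id * n_cols + i) * n_species + species];

    // ... 1D advection/diffusion update along my_row (stencil omitted) ...

    // Write the updated row back to global memory.
    for (int i = 0; i < n_cols; ++i)
        conc_out[(row_id * n_cols + i) * n_species + species] = my_row[i];
}
// Example launch:
// row_transport<<<n_rows, n_species, n_cols * n_species * sizeof(double)>>>(d_in, d_out, n_cols, n_species);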
IV. RESULTS AND VALIDATION
Several experiments are conducted to verify and validate the model prediction using different acceleration techniques. The analysis helps to measure speedup and efficiency. Tests are performed to find the optimal settings of the key simulation parameters for each method and to compare the different methods. Each GPU is also compared to the original single-threaded CPU implementation. All results use identical inputs and identical scientific processes, and are for the prediction of AOD for the year 2011.
The chemical transport model used in this study is Polair3D, coupled with the aerosol model SIREAM, which is plugged into the chemistry-transport model Polair3D. SIREAM stands for SIze-REsolved Aerosol Model and is described in [22]. SIREAM contains 16 aerosol species: 5 inorganic species (ammonium, sulfate, nitrate, chloride and sodium), 8 organic species predicted by the Secondary Organic Aerosol Model, SORGAM [23], and 3 primary species (mineral dust, black carbon and primary organic species). All these models are embedded in the Polyphemus system and are available at http://cerea.enpc.fr/polyphemus/. The simulations described here are carried out at a continental scale, over Europe, and for one year (2011). Polyphemus shows a tendency to underestimate AOD and to overestimate nitrate concentrations in winter time; other models show similar behavior. The configuration of the simulations of this paper is essentially the same as in [13], and its main characteristics follow the steps described in [24]. The wind field matrix, the diffusion matrix, and the concentration matrix are the main memory requirements of Polyphemus; NY×NX×NZ are the dimensions of these matrices. Like nearly all fixed-grid chemical transport models, Polyphemus makes use of dimensional splitting to reduce a 3D problem to a group of 1D problems. The concentration matrix is updated in columns (the discretization is over the y-axis) and in rows (the discretization is over the x-axis); consequently, these matrices are naturally stored as 2D arrays by Polyphemus. The concentration matrix stores the chemical concentrations for every one of the c chemical classes at every grid point, for a total of c×NY×NX×NZ double-precision variables. In these trials, this amounts to around 500.34 MB of data, or about 700,000 double-precision variables. The wind field covers two double-precision variables for every grid point, equal to nearly 100.68 MB of data, and the diffusion matrix stores one floating point variable for every grid point, for a total of 50.34 MB of data. At every step, almost 220.31 MB of data is computed, and the size of the final result file exceeds 41.1 GB.

B. GPU Design Strategies
There are many features in CUDA GPUs that trade off generality and ease of programming (CUDA provides its own compiler, emulation mode, debugging tool
and profiler) in order to optimize the performance of applications under certain circumstances that often occur. Fortunately, we can take advantage of these GPU features in such circumstances, and they can be exploited with some code transformations in the CTM implementation.
• Memory bandwidth: Each processor should perform uniform memory accesses (e.g., thread0 accesses address0, thread1 accesses addr0+4, thread2 accesses addr0+8, etc.) in order to achieve peak memory bandwidth. When the memory accesses are uniform, many memory accesses can be grouped into a single large memory access (termed a coalescing operation) to achieve high memory bandwidth. In CUDA 1.2-compatible GPUs (and later families), memory coalescing is performed as long as all SPs within an SM access the same memory segment, in any order (NVIDIA Programming Guide); the approaches suggested in this paper are still applicable, as they improve coalesced operation.
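As a generic illustration of the access patterns involved (not Polyphemus code), consecutive threads reading consecutive addresses can be served by a few coalesced transactions, whereas a strided pattern scatters the accesses across many segments:

__global__ void coalesced_read(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // thread k reads base + 4*k: coalesced
}

__global__ void strided_read(const float* in, float* out, int n, int stride) {
    // Assumes 'in' holds at least n * stride elements.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];  // neighboring threads hit distant addresses: uncoalesced
}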
Fig. 4 shows the elapsed run time and the relative speedup of the three GPU implementations. The GTX 660 (GPU3) gives the best results, owing to its larger number of stream processors and to the size and bandwidth of its memory. The speedup factor remains moderate because, for large problems, CUDA overhead (such as initializing the kernel and copying memory back and forth between CPU memory and GPU memory) significantly reduces the performance gains of the fast on-GPU matrix computations. GPU performance is also constrained by occupancy, which is around 60%: maximizing occupancy by increasing the number of simultaneously running blocks can reduce performance, and the running threads require a large amount of data, which increases latency. Another set of experiments examines the effect of the input data size on execution time and speedup. The model is run with input data covering 1 month, 3 months, 6 months, and 12 months, respectively. Table 1 and Fig. 5 show the execution time and speedup for the different data sizes.
• Memory usage: The memory used by the different data structures in the simulator strongly affects the memory bandwidth and the scale of the CTM simulations. In our approach, we minimize memory usage by incorporating sparse connectivity and by using a reduced Address-Event-Representation (AER) format for storing firing information. Other compression techniques that eliminate repetition can be applied to further reduce memory usage [25].
• Parallelism: The application must be mapped in a data-parallel fashion to use the GPU resources effectively. Dividing a task into parallel work is based on three concepts: thread, block, and grid. A group of threads forms a block and a group of blocks forms a grid, addressed through indices. The allocation of each data part to a thread or to a block is based on these ID indices, and the subdivision of the data can be as fine as the bit level.
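For instance (a generic sketch, not tied to the Polyphemus data layout), a 2D domain can be covered by a grid of 2D thread blocks, each element being assigned to a thread through its block and thread indices:

__global__ void grid_map(float* field, int nx, int ny) {
    // Each thread derives its (x, y) element from its block and thread IDs.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        field[y * nx + x] += 1.0f;   // placeholder per-element work
}

void launch_grid_map(float* d_field, int nx, int ny) {
    dim3 block(16, 16);                       // 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x,   // enough blocks to cover the domain
              (ny + block.y - 1) / block.y);
    grid_map<<<grid, block>>>(d_field, nx, ny);
}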
The speedup S of the calculation is computed as the ratio of the CPU time T_CPU to the GPU time T_GPU (Equation 2):

S = \frac{T_{CPU}}{T_{GPU}} \qquad (2)

T_CPU and T_GPU were measured as the statistical median of ten consecutive runs.
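As a worked example using the measurements reported in Table 1, the one-month simulation on GPU3 yields

S = \frac{350.1}{94.6} \approx 3.70.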
Fig. 4. Elapsed time in minutes and speedup of the Polyphemus implementation using three different graphics cards.
• Minimize thread divergence: By architecture, current CUDA GPUs group threads into warps (small batches) of 32 threads each (for compute capability 1.3 devices) and execute them using a single instruction register. Threads in a warp are free to take different execution paths (branch divergence), which often leads to sub-optimal performance; the highest performance is achieved when all threads in a warp execute the same instruction. It is important to notice that these factors are interrelated and must be optimized together to execute effectively on GPUs. For example, a technique that improves memory usage might reduce the overall parallelism, which lowers the effective performance.
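The contrast can be sketched as follows (generic illustration): in the first kernel the branch condition differs between adjacent threads of the same warp, so both paths are executed serially; in the second the condition is uniform across the warp, so no divergence occurs:

__global__ void divergent_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)          // condition alternates within a warp: divergent
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void uniform_kernel(float* data, int n, int mode) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (mode == 0)           // same value for every thread in the warp: no divergence
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}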
TABLE 1. SUMMARY OF EXECUTION TIME IN MINUTES AT DIFFERENT INPUT SIZES.

Exp          CPU Time   GPU 1    GPU 2    GPU 3
1 Month      350.1      233.4    122.3    94.6
3 Months     1050.3     750.5    470.3    294.4
6 Months     2170.6     1888.6   989.3    688.2
12 Months    4551.3     3975.9   2299.8   1444.4
C. Speedup
The GPU implementation using CUDA is applied to the CTM with the configuration described above. The speedup is defined in Equation (2), and the resulting timings are reported in Fig. 4 and Table 1.
TABLE 2. GPU RUNNING TIME (MINUTES) FOR EACH IMPLICIT TRANSPORT METHOD.

Transport Direction   GPU Time   Speedup   Occupancy
X-Transport           232.43     3.23      60%
Y-Transport           171.95     3.54      60%
Z-Transport           130.73     3.18      60%
Average               178.3      3.31      60%
Fig. 5. Speedup of using GPUs with varying input size.
D. Sensitivity Analysis
First, the implementation settings of the chemical transport model must be established by varying the size of each thread block and the maximum number of registers per thread block for each of the three transport subroutines. For the thread block tests, the maximum number of registers for each method is used. Since 66 threads can be run to compute each grid cell, the number of threads is selected as a multiple of 66. The data for up to four grid cells can be stored in shared memory at once, permitting up to 264 threads to run. The register tests use 264 threads.
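A sketch of how such a parameter sweep can be timed is given below; the candidate block sizes follow the multiples-of-66 scheme described above, while the kernel name and timing scaffolding are illustrative assumptions reusing the earlier transport_kernel sketch:

#include <cstdio>

void sweep_block_sizes(double* d_conc, int n_cells) {
    const int candidates[] = {66, 132, 198, 264};   // threads per block tested
    for (int t : candidates) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        int blocks = (n_cells + t - 1) / t;

        cudaEventRecord(start);
        transport_kernel<<<blocks, t>>>(d_conc, n_cells);  // kernel from the earlier sketch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %d: %.2f ms\n", t, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
}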
Fig. 6. Comparison of the running time of the implicit transport methods on the GPU for different thread block sizes.

Fig. 6 demonstrates that 132 and 264 threads give the best results, with 264 threads producing better results for X and Y transport and 132 threads producing slightly better results for Z transport. As a result, 264 threads and the maximum number of registers are used for subsequent implicit transport tests. Second, the GPU implicit implementation is compared against the original implicit transport routine. Table 2 shows running times between 130.73 and 232.43 minutes and speedups ranging from 3.18 to 3.54 for the implicit transport methods. These timings do not take memory transfer times into account. Additionally, the occupancy numbers show that the fastest explicit transport settings do not fully take advantage of the power of the GPUs, using 60% of the GPU at best.

V. CONCLUSIONS

Throughout this study, the design and implementation of a high performance computing scheme to reduce the execution time of the chemical transport model have been discussed. Specifically, we have presented innovative GPU parallel techniques for parameter modeling and for space and time partitioning of CTM prediction, and implemented them on three different GPU cards. Clearly, a diverse methodology is essential to fully exploit the benefit of accelerator-based CPUs that execute chemical transport models. Through developing transport and chemistry code for the GPU, we conclude that GPUs can provide a significant speedup for chemical transport models: within our improvement methods, increasing the partitioning minimized the resources used by each thread and simplified the code. The three experiments show that the GPU implementation takes advantage of shared memory and a small memory footprint to achieve up to a 3.31 times speedup, with a moderate GPU occupancy of 60%.

REFERENCES

[1] Eller, P., Cheng, J.-R.C., & Albert, D. (2010). Acceleration of 2-D Finite Difference Time Domain Acoustic Wave Simulation Using GPUs. High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2010 DoD, pp. 350-356. doi: 10.1109/HPCMP-UGC.2010.43.
[2] Michalakes, J., & Vachharajani, M. (2008). GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letters, Vol. 18, No. 4, World Scientific, Dec., 531-548.
[3] Michalakes, J., Vachharajani, M., Linford, J., & Sandu, A. (2009). GPU Acceleration of a Chemistry Kinetics Solver. http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem.htm. NCAR.
[4] Goddeke, D., Buijssen, S.H.M., Wobker, H., & Turek, S. (2009). GPU Acceleration of an Unmodified Parallel Finite Element Navier-Stokes Solver. High Performance Computing and Simulation.
[5] Elsen, E., LeGresley, P., & Darve, E. (2008). Large calculation of the flow over a hypersonic vehicle using a GPU. J. Comput. Phys., 227(4), 10148-10161.
[6] Thibault, Julien C., & Inanc Senocak (2009). CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows. In Proceedings of the 47th AIAA ASM, 2009-758.
[7] Brandvik, T., & Pullan, G. (2008). Acceleration of a 3D Euler solver using commodity graphics hardware. 46th AIAA Aerospace Sciences Meeting and Exhibit.
[8] Hagen, T.R., Lie, K., & Natvig, J. (2006). Solving the Euler equations on graphics processing units. ICCS 2006, volume 3994 of LNCS, 220-227.
[9] Stantchev, G., Juba, D., Dorland, W., & Varshney, A. (2009). Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence. Comput. Sci. Eng., 11(2), 52-59.
[10] Ali, A., Amin, S.E., Ramadan, H.H., & Tolba, M.F. (2011). Ozone monitoring instrument aerosol products: Algorithm modeling and validation with ground based measurements over Europe. The 2011 International Conference on Computer Engineering & Systems (ICCES'2011).
[11] Sartelet, K.N., Debry, E., Fahey, K.M., Roustan, Y., Tombette, M., & Sportisse, B. (2007). Simulation of aerosols and gas-phase species over Europe with the Polyphemus system, Part I: model-to-data comparison for 2001. Atmos. Environ., 29, 6116-6131.
[12] Quelo, D., Krysta, M., Bocquet, M., Isnard, O., Minier, Y., & Sportisse, B. (2007). Validation of the Polyphemus platform on the ETEX, Chernobyl and Algeciras cases. Atmos. Environ., 41(26), 5300-5315.
[13] Korsakissok, I., & Mallet, V. (2008). Comparative study of Gaussian dispersion formulae within the Polyphemus platform: evaluation with Prairie Grass and Kincaid experiments. J. Appl. Meteorol.
[14] Ray, J., Kennedy, C.A., Lefantzi, S., & Najm, H.N. (2003). High-order spatial discretizations and extended stability methods for reacting flows on structured adaptively refined meshes. In Proceedings of Third Joint Meeting of the U.S. Sections of the Combustion Institute, Chicago, USA.
[15] Sandu, A., Daescu, D.N., Carmichael, G.R., & Chai, T. (2005). Adjoint sensitivity analysis of regional air quality models. J. Comput. Phys., 204, 222-252.
[16] Hirsch, C. (1988). Numerical Computation of Internal and External Flows 1: Fundamentals and Numerical Discretization. John Wiley and Sons, Chichester.
[17] Hundsdorfer, W. (1996). Numerical solution of advection-diffusion-reaction equations. Technical report, Centrum voor Wiskunde en Informatica.
[18] CUDA Visual Profiler 1.1 Documentation (distributed with SDK 2.1), (2008).
[19] CUDA Zone (2009). http://www.nvidia.com/object/cuda home.html, NVIDIA.
[20] NVIDIA CUDA (2008). Compute Unified Device Architecture: Programming Guide Version 2.0.
[21] NVIDIA CUDA (2008). Compute Unified Device Architecture: Reference Manual 2.0.
[22] Debry, E., Fahey, K., Sartelet, K., Sportisse, B., & Tombette, M. (2007). Technical note: A new SIze-REsolved Aerosol Model. Atmos. Chem. Phys., 7, 1537-1547.
[23] Schell, B., Ackermann, I.J., Hass, H., Binkowski, F.S., & Ebel, A. (2001). Modeling the formation of secondary organic aerosol within a comprehensive air quality model system. J. Geophys. Res., 106, 28275-28293.
[24] Tombette, M., Mallet, V., & Sportisse, B. (2009). PM10 data assimilation over Europe with the optimal interpolation method. Atmos. Chem. Phys., 9, 57-70. doi:10.5194/acp-9-57-2009.
[25] Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high-connectivity network simulation with distributed computing. Neural Computation, 17(8), 1776-1801.