18th World Conference on Nondestructive Testing, 16-20 April 2012, Durban, South Africa
Graphical Processing Units (GPU)-based Modeling for Acoustic and Ultrasonic NDE

Nahas CHERUVALLYKUDY, Krishnan BALASUBRAMANIAM and Prabhu RAJAGOPAL
Centre for NDE, Indian Institute of Technology Madras, Chennai 600036, T.N., India
Abstract: With rapidly improving computational power, numerical models are being developed for ever more complex problems that cannot be solved analytically, making them increasingly computationally intensive. Parallel computing has emerged as an important paradigm for speeding up such models. In recent years, graphics processing units (GPUs) have become among the most widely available massively parallel devices on the commodity market. General-purpose computing on these devices went mainstream after the introduction of the NVIDIA 'compute unified device architecture' (CUDA). Here we develop CUDA implementations of the finite difference time domain (FDTD) scheme for three-dimensional problems of acoustic and two-dimensional problems of ultrasonic NDE. Simulations are run on the commodity GPU GeForce 9800 GT. Results show a strong improvement in computational speed compared to a serial implementation on a CPU. We also discuss the accuracy of GPU-based computation.

Keywords: GPU Computing, CUDA, Acoustic Emission, Elastic wave propagation
Introduction

From the introduction of microprocessors until the early 2000s, programmers relied on faster processors to increase the speed of their programs, as software was mostly written serially and computers also executed it serially. Hardware manufacturers built more powerful computers by increasing processor clock frequencies, which decreased the average time taken to execute an instruction. However, as frequencies rose, power consumption also rose dramatically, and cooling the computer became a major issue. Manufacturers thus realized that computers could not simply become faster: instead they had to grow 'wider'. Instead of a single very fast core, two or more cores of moderate speed started to appear in computers. This approach has its own problems: serially coded software does not run faster on wider computers; programs have to be 'parallelized'. In parallel computing a problem is solved using multiple processing units simultaneously. The concept of parallel programming is not entirely new to the computing world: from the 1970s onward, vector processors ran parallel programs in high-end systems such as supercomputers and high-performance computing installations. However, with the introduction of multi-core processors in desktop PCs, parallel programs became a mainstream paradigm in day-to-day information processing.

Graphical processing units (GPUs) were introduced to personal computers well before multi-core CPUs. GPUs are processors dedicated to accelerating the building of images in the frame buffer intended for output to a display. These processors are highly parallel and multithreaded, and are much more efficient at handling workloads that apply a single instruction to large amounts of data. Their high throughput on single-instruction, multiple-data algorithms made them suitable for video games and high-quality video playback and processing systems. In a GPU, more transistors are devoted to data processing than to data caching and flow control, so GPUs are well suited to data-parallel algorithms with high floating-point intensity. After the introduction of single-precision floating-point arithmetic to mainstream consumer graphics cards, programmers realized that multi-core GPUs could be used even for non-graphical algorithms of a single-instruction, multiple-data nature. Using shading-language application programming interfaces (APIs), programmers attempted to harness the massively parallel GPU for general-purpose scientific and engineering computations. In late 2006, NVIDIA introduced its parallel computing architecture, the 'compute unified device architecture', or CUDA [1]. With the introduction of CUDA, programming GPUs for general-purpose computing became much more affordable and accessible to the average programmer. A cross-platform programming interface, OpenCL [2], was added to the scene in 2008, but by then CUDA had already taken the lead in the scientific and high-performance computing world by providing a mature compiler, better documentation, performance libraries and high-end graphics cards such as the Tesla line that are dedicated solely to computation.
The CUDA programming model

CUDA extends the standard C/C++ languages to provide an interface to the GPU. Programs are divided into kernels and host code: kernels are compiled for and executed on the GPU, while the host code is executed on the CPU. Kernels are the compute-intensive portions of the program, whereas the host code sets up the environment for the program. Kernels are C functions that are executed 'N' times in parallel by 'N' different CUDA 'threads'; each instance of the kernel executes as a thread. These threads are grouped into 'blocks', which are further grouped into 'grids'. Each thread has its own private memory, and all threads in the same block can access a shared memory. Apart from these two types there is a third memory, known as global memory, which can be accessed by any thread in the kernel and is the main memory for computations. As the number of cores in CPUs and GPUs grows, the challenge is to create programs that scale automatically when the hardware is upgraded. Since CUDA groups threads into blocks, and each block can be executed on any processing core without disturbing any other block in the grid, the programming model is highly scalable. CUDA programs are executed similarly to other programs: the host code starts execution on the CPU, the kernel then executes on the GPU, and at the end of GPU execution control is returned to the CPU. The NVIDIA CUDA compiler (nvcc) is used to compile the kernel, and a normal C compiler such as Microsoft Visual C++ is used to compile the host code; on a Linux platform the CUDA compiler can be used along with GCC. A minimal illustration of this model is sketched below. In this paper, using CUDA, we develop parallel finite difference schemes for two areas of interest to the NDE community, namely acoustic (sound in fluids) and ultrasonic (elastic) wave propagation.
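To make the host/kernel split concrete, the following minimal sketch (not taken from the authors' code) shows a trivial kernel, its launch configuration in terms of blocks and grids, and the host-side transfers between CPU and GPU memory:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each of the N threads scales one array element.
// blockIdx, blockDim and threadIdx are built-in CUDA variables.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overshoot
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                  // about one million elements
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);      // host (CPU) array
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;                               // device (GPU) array in global memory
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Launch configuration: a grid of blocks, each block holding 256 threads.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d, 2.0f, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);            // expect 2.0

    cudaFree(d);
    free(h);
    return 0;
}
```

A file like this would typically be compiled with nvcc (for example, `nvcc example.cu -o example`), which forwards the host portion to the system C/C++ compiler as described above.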
CUDA implementation of acoustic wave propagation

The three-dimensional acoustic wave equation can be written as

$$\frac{\partial^2 p}{\partial t^2} = c^2 \left( \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2} \right) \qquad (1)$$
where $p$ represents the pressure at a point in space and $c$ the speed of acoustic waves in the (fluid) medium. Upon discretization using an explicit, centered-space finite difference scheme, Eq. (1) takes the form

$$p^{n+1}_{i,j,k} = 2p^{n}_{i,j,k} - p^{n-1}_{i,j,k} + c^2\,\Delta t^2 \left[ \frac{p^{n}_{i+1,j,k} - 2p^{n}_{i,j,k} + p^{n}_{i-1,j,k}}{\Delta x^2} + \frac{p^{n}_{i,j+1,k} - 2p^{n}_{i,j,k} + p^{n}_{i,j-1,k}}{\Delta y^2} + \frac{p^{n}_{i,j,k+1} - 2p^{n}_{i,j,k} + p^{n}_{i,j,k-1}}{\Delta z^2} \right] \qquad (2)$$

where $p^{n}_{i,j,k}$ represents the pressure at the spatial point $(i,j,k)$ at the $n$th time step.
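The explicit scheme of Eq. (2) is only conditionally stable. The bound is not quoted in the text, but for this update with equal grid spacing $\Delta x = \Delta y = \Delta z = h$ the standard Courant-Friedrichs-Lewy (CFL) condition is

$$c\,\Delta t \le \frac{h}{\sqrt{3}},$$

which is the stability criterion referred to below when the temporal step is chosen.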
The whole space at the $n$th time step is represented by a three-dimensional matrix. To increase the efficiency of computation, the three-dimensional matrix is decomposed into one-dimensional arrays and stored in memory. In sequential computing each array element is updated in order by a spatial loop, which is enclosed in a time loop. In our GPU-based parallel implementation the spatial loop is replaced by a CUDA kernel, which updates the entire spatial matrix in one go; since each time step depends on the previous one, the time loop cannot be parallelized. The block size of the kernel was calculated using the CUDA occupancy calculator; for maximum computational efficiency the block size was fixed at 16 x 16, so that 256 threads are available in one block. The grid size is calculated according to the size of the problem domain: if there are M rows, N columns and Z slices, the grid width is (N / block width) x Z and the grid height is M / block height. The computational scheme developed for our CUDA implementation is represented in Figure 1, and an illustrative kernel sketch is given below, after the system specification.

System specification

A normal desktop PC was used. An Intel Core i3-530 processor clocked at 2.93 GHz with 3.24 GB of RAM was the host platform. An NVIDIA GeForce 9800 GT with 112 CUDA cores clocked at 1.37 GHz and 512 MB of RAM was the GPU device. The operating system was 32-bit Windows 7 (Professional edition).
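The sketch below indicates how the pressure update of Eq. (2) and the serial host-side time loop might map onto CUDA, following the block and grid layout described above. The array names, the flattening convention and the rigid (untouched) boundary handling are illustrative assumptions, not the authors' actual code; the device buffers dOld, dCur and dNew are assumed to have been allocated with cudaMalloc and initialized, as in the earlier sketch.

```cuda
// Illustrative sketch: one thread updates one (i,j,k) pressure value per time
// step following Eq. (2). Fields are flattened 3-D arrays of size NX*NY*NZ.
__global__ void acousticStep(const float *pOld, const float *pCur, float *pNew,
                             float c2dt2, float invDx2, float invDy2, float invDz2,
                             int NX, int NY, int NZ)
{
    // Grid layout as described in the text: gridDim.x = (NX/blockDim.x) * NZ,
    // so blockIdx.x encodes both the column-block and the slice index k.
    int blocksPerRow = gridDim.x / NZ;                                // = NX / blockDim.x
    int k = blockIdx.x / blocksPerRow;                                // slice index
    int i = (blockIdx.x % blocksPerRow) * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;                    // row index

    if (i <= 0 || i >= NX - 1 || j <= 0 || j >= NY - 1 || k <= 0 || k >= NZ - 1)
        return;                                       // leave boundary points untouched

    int idx = (k * NY + j) * NX + i;                  // flattened 1-D index
    float lap = (pCur[idx + 1]     - 2.0f * pCur[idx] + pCur[idx - 1])     * invDx2
              + (pCur[idx + NX]    - 2.0f * pCur[idx] + pCur[idx - NX])    * invDy2
              + (pCur[idx + NX*NY] - 2.0f * pCur[idx] + pCur[idx - NX*NY]) * invDz2;
    pNew[idx] = 2.0f * pCur[idx] - pOld[idx] + c2dt2 * lap;           // Eq. (2)
}

// Host-side time loop (the part that cannot be parallelized): the kernel is
// launched once per time step and the three pressure buffers are rotated.
void run(float *dOld, float *dCur, float *dNew, float c2dt2,
         float invDx2, float invDy2, float invDz2,
         int NX, int NY, int NZ, int nSteps)
{
    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((NX / 16) * NZ, NY / 16);               // grid sized as in the text
    for (int n = 0; n < nSteps; ++n) {
        acousticStep<<<grid, block>>>(dOld, dCur, dNew, c2dt2,
                                      invDx2, invDy2, invDz2, NX, NY, NZ);
        float *tmp = dOld; dOld = dCur; dCur = dNew; dNew = tmp;  // rotate buffers
    }
}
```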
Figure 1: CUDA implementation of acoustic wave propagation
Problem specification

Sound propagation through air was chosen as a simple illustrative problem. As we did not use absorbing or radiation conditions at the boundary, a relatively small domain of 1 m³ was selected to save computational time. A toneburst pulse was used to excite the medium at the center for a minimum of 100 time steps. The space was divided into 100 cells in each dimension, and the temporal step was chosen to satisfy the stability criterion.

Results

The problem took about 25 seconds to compute 500 time steps on the CPU, but only 0.12 seconds on the GPU, a speed-up of roughly 200 times. Figure 2 shows an A-scan obtained from a point near the center of the problem domain. The results are not amenable to easy physical interpretation because of multiple reflections from the domain boundaries. The small difference between the amplitude values obtained on the CPU and on the GPU is perhaps due to their different architectures: the GeForce 9800 GT is not an IEEE 754-compliant processor [3] and is intended mainly for the home and gaming market. We are currently investigating this issue further, and are also looking into whether IEEE-compliant, double-precision-capable GPUs improve the results.
Figure 2: A-scan monitored at the mid-point of the problem domain
CUDA implementation of elastic wave propagation

We chose the first-order velocity-stress formulation to represent elastic wave propagation. Assuming plane strain conditions, we can then write:

$$\frac{\partial v_x}{\partial t} = \frac{1}{\rho}\left(\frac{\partial \sigma_{xx}}{\partial x} + \frac{\partial \sigma_{xy}}{\partial y}\right) \qquad (3)$$

$$\frac{\partial v_y}{\partial t} = \frac{1}{\rho}\left(\frac{\partial \sigma_{xy}}{\partial x} + \frac{\partial \sigma_{yy}}{\partial y}\right) \qquad (4)$$

$$\frac{\partial \sigma_{xx}}{\partial t} = (\lambda + 2\mu)\frac{\partial v_x}{\partial x} + \lambda\frac{\partial v_y}{\partial y} \qquad (5)$$

$$\frac{\partial \sigma_{yy}}{\partial t} = \lambda\frac{\partial v_x}{\partial x} + (\lambda + 2\mu)\frac{\partial v_y}{\partial y} \qquad (6)$$

$$\frac{\partial \sigma_{xy}}{\partial t} = \mu\left(\frac{\partial v_x}{\partial y} + \frac{\partial v_y}{\partial x}\right) \qquad (7)$$

where ρ is the density, λ and μ are Lamé's constants, $v_x$ and $v_y$ are the particle-velocity components and $\sigma_{xx}$, $\sigma_{yy}$ and $\sigma_{xy}$ are the stress components. This system of equations is discretized using an explicit centered finite difference scheme on a staggered grid with the leapfrog algorithm [5]. In the leapfrog algorithm the field components are updated alternately in time: the velocity components are calculated first, then the stress components from the velocity components, then the velocity components again from the stress components, and so on. After discretization, the equation for the velocity component $v_x$ becomes, for example,
$$v_x^{k+0.5}(i+0.5,\,j+0.5) = v_x^{k-0.5}(i+0.5,\,j+0.5) + \frac{\Delta t}{\rho}\left[\frac{\sigma_{xx}^{k}(i+1,\,j+0.5) - \sigma_{xx}^{k}(i,\,j+0.5)}{\Delta x} + \frac{\sigma_{xy}^{k}(i+0.5,\,j+1) - \sigma_{xy}^{k}(i+0.5,\,j)}{\Delta y}\right] \qquad (9)$$
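As an illustration of how one such update maps onto the GPU, the sketch below shows what a kernel for the $v_x$ update of Eq. (9) might look like; the array names, staggering convention and two-dimensional layout are illustrative assumptions rather than the authors' actual code.

```cuda
// Illustrative sketch: one CUDA thread updates one staggered v_x point per time
// step following Eq. (9). Fields are stored as flattened 2-D arrays of size NX*NY;
// vx, sxx and sxy are device pointers assumed to be allocated with cudaMalloc.
__global__ void updateVx(float *vx, const float *sxx, const float *sxy,
                         float dtOverRho, float invDx, float invDy,
                         int NX, int NY)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index on the staggered grid
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index on the staggered grid
    if (i >= NX - 1 || j >= NY - 1) return;          // stay inside the grid

    int idx = j * NX + i;                            // flattened 1-D index
    float dsxx_dx = (sxx[idx + 1]  - sxx[idx]) * invDx;   // difference of sigma_xx in x
    float dsxy_dy = (sxy[idx + NX] - sxy[idx]) * invDy;   // difference of sigma_xy in y
    vx[idx] += dtOverRho * (dsxx_dx + dsxy_dy);           // leapfrog update of v_x
}
```

The host would launch this kernel and its counterparts for the other field components alternately inside the serial time loop, just as in the acoustic case.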
In the same manner, discretized equations can be obtained for all field components. Again, the spatial loop is enclosed in a temporal loop. In this case there are five field variables, two velocity components and three stress components, so at least five grids are needed. The block size and grid size were calculated in the same way as discussed above.

Problem and system specification

An aluminium sample 30 mm wide and 15 mm high, with density 2667 kg/m³, longitudinal wave velocity 6396 m/s and shear wave velocity 3103 m/s, was chosen for the simulation. The excitation consisted of a 5-cycle Hanning-windowed toneburst pulse centered at 2.25 MHz, applied for the first 120 time steps at the center of the model domain. The sample was divided into a grid of 317 x 158 cells with a time step of 9.89 x 10⁻⁹ s. The problem was again solved on the CPU and GPU devices described earlier.

Results

While the CPU took 1.500 seconds to solve the problem for 1000 time steps, the corresponding GPU implementation took only 0.030 seconds, a speed-up of about 50 times. Figure 3 shows the A-scan obtained from a point near the center of the sample.
Figure 3: A-scan from point (75, 100)
Again, we believe the discrepancy in the results, especially in the multiply-scattered signals, is due to the low-end GPU used; we are currently investigating this issue.
Conclusions and future work

Explicit finite difference time domain (FDTD) implementations of two-dimensional elastic wave propagation and three-dimensional acoustic wave propagation were developed for GPU computation using NVIDIA CUDA. GPU computing is shown to be effective in increasing computational speed. However, the underlying architecture of the GPU device used strongly affects the achievable accuracy. Ongoing and further work addresses these accuracy issues as well as the extension of the CUDA implementations to other areas of NDE.
References
1. NVIDIA CUDA C Programming Guide, Version 4.0 (5/6/2011), http://developer.nvidia.com/category/zone/cuda-zone
2. The OpenCL Specification, Version 1.1, revision 44 (6/1/2011), http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
3. N. Whitehead and A. Fit-Florea, Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs, http://developer.nvidia.com/content/precisionperformance-floating-point-and-ieee-754-compliance-nvidia-gpus
4. B. Deschizeaux and J.-Y. Blanc, in GPU Gems 3, edited by Hubert Nguyen, Ch. 38, Addison-Wesley, Boston, 2008, pp. 831-850
5. C.T. Schroder and W.R. Scott, IEEE Transactions on Geoscience and Remote Sensing, vol. 38, pp. 1505-1512 (2000)