Computer Physics Communications 177 (2007) 298–306

Quantum Monte Carlo on graphical processing units

Amos G. Anderson a,*, William A. Goddard III a, Peter Schröder b

a Materials and Process Simulation Center, Division of Chemistry and Chemical Engineering, California Institute of Technology (MC 139-74), Pasadena, CA 91125, USA
b Department of Computer Science, California Institute of Technology, Pasadena, CA 91125, USA

* Corresponding author. E-mail addresses: [email protected] (A.G. Anderson), [email protected] (W.A. Goddard III), [email protected] (P. Schröder).

Received 29 November 2006; received in revised form 10 March 2007; accepted 14 March 2007
Available online 30 March 2007

doi:10.1016/j.cpc.2007.03.004
Abstract

Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time-independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters, but, as we report here, it also has a high degree of data parallelism. This facilitates the use of recent technological advances in Graphical Processing Units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as 30× speedup, the overall application runs up to 6× faster than an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speedup possible for QMC on advanced hardware, exploring a path toward providing QMC-level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully compliant IEEE floating point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this reduces overall performance, we demonstrate that the proposed new algorithm can match CPU single precision.

PACS: 07.05.Bx; 02.70.Ss; 02.60.Dc; 89.20.Ff

Keywords: Graphical processing units; Quantum Monte Carlo; Matrix multiplication; Floating point error; Kahan summation formula; De-normals
1. Introduction

The rapid increase in GPU floating point performance and their excellent flops/$ characteristics suggest that they may provide cost effective solutions for scientific computation problems. Given that the GPU computing model (1) is quite different from standard CPU models, (2) lacks a fully compliant IEEE floating point implementation, and (3) is optimized for very specific graphics-type computational kernels, it is not clear a priori which scientific computing tasks are cost effective on GPUs. A number of scientific computing algorithms have been pursued on the GPU, e.g., fluid simulations [1,2], elasticity [3],
and general finite element methods [4]. At the level of computational mathematics kernels, we have seen work on LU decomposition [5], matrix/vector products [6–14], iterative solvers [10,15], and transforms such as Fourier and Wavelet [7,16–18]. In some cases the results can be disappointing relative to highly tuned CPU implementations, in particular when high precision answers are required, or when problem sizes do not hit a particular sweet spot (e.g., large matrices or power-of-2 sized data structures). With continuing hardware development these performance barriers are being ameliorated, and with the recent announcement by nVidia of double precision availability on the GPU in 2007, computational precision is a fading problem as well.

In this paper we consider quantum chemistry computations, the heart of which is the computation of the electronic structure of a given molecule using the quantum mechanical equations of motion. This information is critical for, among other tasks,
finding optimized geometric structures for the molecule, finding reaction pathways, obtaining vibrational information, and providing a basis for developing higher level approximation methods, including molecular dynamics simulations. Accurate results have application in catalysis, nanotechnology, drug design, and fuel cells, among many others. Due to the large state space (3N coordinates for N electrons) and the nonlinear nature of the time-independent Schrödinger equation, exact results are all but impossible. Consequently a variety of approximation algorithms have been developed. One such approach, Quantum Monte Carlo (QMC) [19], is based on the stochastic evaluation of the underlying integrals and is guaranteed to produce accurate answers in the limit of infinite state space sampling. Even though a very large number of samples is typically required, QMC is easily parallelizable and scales as O(N^3) (albeit with a very large prefactor). This motivates a search for computational augmentation.

We report on our implementation of QMC on the nVidia 7800 GTX and compare it against a 3.0 GHz Intel P4, considered to be representative of similar levels of technology. These technologies are improving very fast, both for CPUs and for GPUs. Currently, however, the time to doubled performance on GPUs is noticeably shorter than for CPUs, leading to increasing performance advantages for GPUs if a computation maps well enough onto the GPU. Since CPUs are beginning to follow the same multicore technology trend, the notion that precision issues are temporary is reinforced. In the present paper, scientific results as well as underlying formalisms were simplified for purposes of presentation and to focus on the essential computational aspects. We admit that it is unclear how single precision results might be useful, especially for an algorithm designed to produce highly accurate results. In the meantime, our single precision implementation is presented. Aside from the performance of individual kernels, we consider (1) precision issues arising from the noticeable differences from single precision IEEE floating point arithmetic, (2) performance issues arising from the specific sizes of matrices we must use, and (3) the overall performance of an end-to-end application when compared against a heavily tuned CPU based version.

2. Intro to Graphical Processing Units

GPUs have received much interest outside the graphics world recently due to their immense processing power, even though they are actually devices designed for very specialized tasks. Many reviews of GPU adaptability and compatibility are already available [1,20,21], and we do not attempt to improve upon them. In addition, specialized programming environments [6,22,23] have been developed for GPUs, specifically designed to smooth the porting of non-graphics applications, and GPU vendors themselves have recently released general purpose GPU programming environments. Our approach was to start from the ground up in hopes of squeezing the best performance we can from the device. To describe our techniques a truncated description of the technology is required. The motivating principle for GPU design is that
simple calculations do not need general processors, so the addition of an auxiliary processor could both speed up graphics related calculations and free the CPU to complete other tasks. Since graphical calculations most typically involve drawing 2D images of colors ultimately intended for a screen, GPUs start with pixels (more generally referred to as fragments or texels) as the atomistic unit of data. Fragments are manifested here as 4 single precision floats, aliased as xyzw channels. A 2D array of fragments is called a texture, and is the fundamental storage class. A GPU will stream a region of a texture through an array of simple fragment processors (our nVidia 7800 GTX has 24), where each of these will produce one fragment as output. A programmer can utilize this process by designating a kernel for the fragment processors to use, resulting in the evaluation of data for a specified region in a texture. This entire procedure is commonly referred to as a pass. A kernel is a small program which in the graphics context would typically perform some shading calculation. There is nothing in principle preventing the user from writing a “shader” which performs some scientifically relevant computation using the broad class of functions available at the programmable shader level. In practice, many considerations are necessary in order to maximize efficiency.

Graphics processing can be thought of as a sophisticated queuing system where a CPU sends a list of tasks to one (or more) connected GPUs and collects the results when the calculations are complete. This means that there are also processor communication factors that need to be included. As far as the GPU itself is concerned, we mention here the following considerations:

– padding empty slots in texture data with 0 whenever data dimensions do not match dimensions on the GPU,
– running as many passes with a kernel as possible before swapping it for another, since the GPU can only have one kernel loaded at a time,
– careful data arrangement,
– a tuning of how much of the computation as a whole should be assigned to each kernel,
– and in general, keeping the GPU busy at all times.

Before discussing how these concerns play out in our setting, we give a brief high level introduction to Quantum Monte Carlo computations to understand the needed computational components which we seek to map to the GPU.

3. Intro to Quantum Monte Carlo

The most important information about a molecule is its ground state energy, calculated by means of the time-independent Schrödinger equation

E = \frac{\int \Psi(\bar{r})\, \hat{H}\, \Psi(\bar{r})\, d\bar{r}}{\int \Psi^2(\bar{r})\, d\bar{r}},    (1)

where Ψ(r̄) : R^{3N} → R is the wave function, mapping the 3N Cartesian coordinates of N electrons into a probability amplitude related to the probability density in Eq. (4). (Eq. (1)
includes the common restriction that Ψ(r̄) is a real valued function.) The Hamiltonian operator Ĥ is given by

\hat{H} = -\frac{1}{2}\nabla^2 + V(\bar{r}),    (2)

where the Laplacian is over all 3N electronic coordinates and calculates the kinetic energy (in the unitless Hartree measure) of the electrons in the molecule. The V(r̄) term represents the potential energy due to Coulomb interactions between all pairs of electrons and nuclei. The energy E is the eigenvalue of Ĥ operating on the eigenfunction Ψ(r̄). The ground state energy is the lowest such eigenvalue, and is of primary interest here.

There are many methods to calculate Eq. (1) with varying degrees of accuracy and computational complexity. The highly accurate QMC family of algorithms [24] uses Metropolis [25] integration to fine tune the result provided by a cheaper method. It uses the local energy

E_L(\bar{r}) = \frac{\hat{H}\Psi(\bar{r})}{\Psi(\bar{r})} = -\frac{1}{2}\frac{\nabla^2\Psi(\bar{r})}{\Psi(\bar{r})} + V(\bar{r}),    (3)

which represents an evaluation of the energy for a set of electronic coordinates. In terms of the stationary probability distribution of electrons

\rho(\bar{r}) = \frac{\Psi^2(\bar{r})}{\int \Psi^2(\bar{r})\, d\bar{r}},    (4)

we can transform Eq. (1) into the Monte Carlo integration form

E = \int \rho(\bar{r})\, E_L(\bar{r})\, d\bar{r} = \lim_{N_t \to \infty} \frac{1}{N_t} \sum_{t=1}^{N_t} E_L(\bar{r}_t).    (5)
Here r̄_t are a series of electronic coordinates generated with respect to ρ(r̄) by some importance sampling scheme [26]. Since the error scales as 1/√N_t in Monte Carlo methods, a rather large number of samples is required to achieve useful accuracies. Additionally, it is common to run several independent series, called walkers, in order to minimize the error due to serial correlation between the N_t data points.

In terms of computational complexity, the difficulty for QMC lies in the evaluation of ∇²Ψ(r̄_t) for each E_L(r̄_t), as well as the evaluation of Ψ(r̄_t) and ∇Ψ(r̄_t), which are used for importance sampling. The most common functional form for Ψ(r̄) has at least three nested stages of evaluation. At the first stage, we place a collection of N_bf basis functions centered at the nuclei in the 3D coordinate space. Typically a given nucleus is associated with multiple basis functions. Basis function j takes as argument the local coordinates of a given electron i relative to the nucleus, r_ij = r_i − R_j. The best results are achieved with the following functional form:

\chi_j(x_{ij}, y_{ij}, z_{ij}) = x_{ij}^{k_j}\, y_{ij}^{l_j}\, z_{ij}^{m_j} \sum_{n_j} a_{n_j} e^{-b_{n_j} r_{ij}^2}.    (6)

For each basis function, R_j, k_j, l_j, m_j, n_j, a_{n_j} and b_{n_j} are parameters given as input to the QMC program. The k_j, l_j, m_j ∈ N⁺ parameters give the basis function the required symmetry, and n_j ∈ N⁺ helps select the quality of fit. The other parameters are all real numbers.
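To make Eq. (6) concrete, the following is a minimal CPU-side sketch of how a single contracted Gaussian basis function could be evaluated; the struct layout and field names are ours for illustration and do not reflect the QMcBeaver data structures. The gradient and Laplacian needed by the algorithm follow by differentiating the same expression analytically.

#include <cmath>
#include <vector>

// Illustrative only: the fields mirror the parameters of Eq. (6);
// the actual QMcBeaver data layout may differ.
struct GaussianBasisFunction {
    int k, l, m;                 // angular powers k_j, l_j, m_j
    std::vector<double> a, b;    // contraction coefficients a_nj and exponents b_nj

    // chi_j(x, y, z) for one electron, with (x, y, z) relative to the nucleus R_j.
    double value(double x, double y, double z) const {
        const double r2 = x * x + y * y + z * z;
        double radial = 0.0;
        for (std::size_t n = 0; n < a.size(); ++n)
            radial += a[n] * std::exp(-b[n] * r2);   // sum_n a_n exp(-b_n r^2)
        return std::pow(x, k) * std::pow(y, l) * std::pow(z, m) * radial;
    }
};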
The second stage of evaluation takes linear combinations of basis functions to create molecular orbitals. The kth orbital is given by φ_k(r_i) = Σ_j χ_j(r_ij) c_jk, where c_jk ∈ R are coefficients input to QMC. These orbitals represent the spread of the electron across the entire molecule. Finally, the third stage of evaluation relevant to this study is the Slater determinant, chosen for its antisymmetric properties. For the N_s electrons of a given quantum spin (N = N_α + N_β ∼ 2N_α) the determinant is a function of the φ_k (which in turn are functions of the χ_j(r_ij)):

D_s(\bar{r}_s) = \det M_s(\bar{r}_s) =
\begin{vmatrix}
\phi_1(r_1) & \phi_2(r_1) & \cdots & \phi_{N_s}(r_1) \\
\phi_1(r_2) & \phi_2(r_2) & & \\
\vdots & & \ddots & \\
\phi_1(r_{N_s}) & & & \phi_{N_s}(r_{N_s})
\end{vmatrix}    (7)

(here we partition r̄ into r̄_α and r̄_β), and the wavefunction is Ψ(r̄) = D_α(r̄_α) D_β(r̄_β).

To calculate the kinetic energy, we first obtain ∇_i²φ_k(r_i) = Σ_j ∇_i²χ_j(r_ij) c_jk, and then sum the contributions from all the electrons in all the orbitals:

\nabla^2 \Psi(\bar{r}) = \Psi(\bar{r}) \sum_{s \in \{\alpha,\beta\}} \sum_{i,k \in N_s} \bigl[M_s^{-1}(\bar{r}_s)\bigr]_{ki}\, \nabla_i^2 \phi_k(r_i).    (8)
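As an illustration of how the spin-s contribution to Eq. (8) can be assembled once the inverse Slater matrix is available, here is a minimal CPU-side sketch; the matrix layout and function name are ours, not QMcBeaver's.

#include <vector>

// Spin-s contribution to Eq. (8):
//   sum_{i,k} [M_s^{-1}]_{ki} * lap[i][k],  where lap[i][k] = ∇_i² φ_k(r_i).
double laplacian_over_psi_spin(const std::vector<std::vector<double>>& m_inv,
                               const std::vector<std::vector<double>>& lap) {
    const std::size_t n = m_inv.size();   // number of electrons of this spin
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            sum += m_inv[k][i] * lap[i][k];   // equivalent to trace(M_s^{-1} * L)
    return sum;   // add the α and β contributions, then multiply by Ψ(r̄)
}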
A similar procedure is followed for calculating the gradient of the wavefunction for each electron, with the exception that the final summation results in a vector of gradients.

To summarize the algorithm, we are given a set of nuclear coordinates, basis function parameters, and the c_jk, which describe the wavefunction as fit by some other (more approximate and cheaper) method. Additionally, we choose some parameters, including the number of steps N_t, the number of walkers W, an initial guess scheme for positions r̄ of all the electrons, as well as several parameters relating to the importance sampling. Although specific choices are often related to the computational resources available and to the importance sampling method used, W is usually O(10) to O(10^3), N_t is O(10^4) to O(10^8), and the dimensions of c_jk are usually between O(10) and O(10^3), depending upon the molecule. With these in hand, the algorithm can be stated as shown in Algorithm 1 (the ⊗ represents matrix multiplication), where simplifications have been included based on assumptions about the importance sampling. The high degree of parallelism is evident, since each processor can calculate all the linear algebra for its walkers and only needs to produce a single value: the energy.¹

One big advantage of QMC relative to alternative methods is the freedom one has in choosing the functional form of Ψ(r̄). This is exploited by multiplying the Slater determinant wave function with a set of pairwise interaction terms which explicitly model electron correlation by employing inter-electronic coordinates. The only condition is that these terms, called Jastrow functions, preserve the antisymmetry of the wave function.

¹ While some QMC algorithms only update one electron per Monte Carlo step, our method updates all at once [26].
E_sum ← 0
for w = 1 to W do
    r_ij ← initialize()
    for t = 1 to N_t do
        for s = α and s = β do
            M_s ← χ_j(r_ij) ⊗ c_jk
            X_s ← ∂χ_j(r_ij)/∂x_i ⊗ c_jk
            Y_s ← ∂χ_j(r_ij)/∂y_i ⊗ c_jk
            Z_s ← ∂χ_j(r_ij)/∂z_i ⊗ c_jk
            L_s ← ∇_i²χ_j(r_ij) ⊗ c_jk
        end for
        Jastrow ← J(r̄)
        Ψ ← det M_α ∗ det M_β ∗ Jastrow
        E_sum ← E_sum + E_L(M_s, Jastrow, {derivatives} ...)
        r_ij ← sampling(Ψ, r_ij, X_s, Y_s, Z_s, L_s)
    end for
end for
E_avg ← E_sum/(N_t ∗ W)

Algorithm 1. The QMC algorithm.
To satisfy this condition, we use the functional form

J(\bar{r}) = \prod_{p<q} e^{u_{pq}(r_{pq})},    (9)

which provides a term for each particle–particle interaction, where

u_{pq}(r_{pq}) = \frac{\sum_{\kappa=1}^{\Gamma} a_{pq\kappa}\, r_{pq}^{\kappa}}{1 + \sum_{\kappa=1}^{\Lambda} b_{pq\kappa}\, r_{pq}^{\kappa}},    (10)

and p and q index all electrons and nuclei, and r_pq is the distance separating the two particles. The number of terms (Γ and Λ) is arbitrary, and depends on the quality of fit. These parameters, along with a_pqκ, b_pqκ ∈ R, are input to the QMC algorithm. With this modification, our wave function is now Ψ_QMC(r̄) = D_α(r̄_α) D_β(r̄_β) J(r̄), and there are chain rule effects for the gradient and Laplacian. The rationale for these additional terms is the improved convergence obtained if the wave function is a better approximation of the eigenfunction of Ĥ to begin with. Jastrow functions involving 3 particles were not considered here.

Within the family of QMC algorithms, there are two popular varieties. The first is called Variational Monte Carlo (VMC), in which the procedure described in this section is employed to provide an exact integration for the given wavefunction. The method is termed variational since it is commonly coupled with a wavefunction optimization step. Diffusion Monte Carlo (DMC) uses the wavefunction only as a guide. Instead of a direct integration, it has a mechanism to project out a (mostly) correct wavefunction, and thus provide exact energies for the system. That said, a DMC calculation will converge better for higher quality wavefunctions. The subject matter considered here is agnostic to this choice, except that DMC includes slightly more computational effort than VMC.
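For concreteness, a minimal sketch of how one Padé-style term u_pq(r) from Eq. (10) might be evaluated is given below; the function and container names are ours, and the actual kernels also need the gradient and Laplacian of this expression.

#include <vector>

// Illustrative evaluation of one term u_pq(r) from Eq. (10).
// The vectors a and b hold a_pqκ (κ = 1..Γ) and b_pqκ (κ = 1..Λ).
double u_pq(double r, const std::vector<double>& a, const std::vector<double>& b) {
    double numerator = 0.0, rk = r;
    for (double a_k : a) { numerator += a_k * rk; rk *= r; }    // Σ a_κ r^κ
    double denominator = 1.0;
    rk = r;
    for (double b_k : b) { denominator += b_k * rk; rk *= r; }  // 1 + Σ b_κ r^κ
    return numerator / denominator;
}

// The pair factor contributed to J(r̄) is then exp(u_pq(r_pq)).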
4. Implementation on the GPU

The QMcBeaver [27] code, under development in our group to perform QMC calculations, was used as the CPU implementation on which to base our study of a GPU implementation. In order to locate the computationally expensive components in the code, we minimize file I/O, ignore localization procedures which lead to sparser matrices [28,29], and we only consider single determinant, restricted Hartree–Fock wavefunctions. Moving all electrons at once allows us to use the highly optimized matrix multiplication routines available in the ATLAS 3.7.11 [30,31] BLAS library, and to use the LAPACK extension to ATLAS to perform the necessary matrix inversions. Using this representation of QMC as our starting point, we find that the computational effort on the CPU for N electrons is approximately 11% focused on the 10 dense matrix multiplications at O(N^3) each, 73% on the 10 basis function set evaluations at O(N^2) each, and 4% on the (electron–electron) pairwise Jastrow function evaluations at O(N^2). These fractional estimates are relatively stationary for molecules with as many as 150 electrons. The leading components not yet ported to the GPU include matrix inversion and electron–nuclear Jastrow functions, as well as other processes specific to DMC.

For the molecule sizes we are targeting, the matrices are small and rectangular; specializations currently overlooked in GPU code. Combined with the fact that the c_jk matrix can be reused for all matrix multiplications, we pursued several optimization strategies in detail. In particular, all of our kernels were designed to evaluate as many walkers simultaneously as GPU hardware limitations permit.

4.1. Walker batch scheme

The GPU pipeline is very deep, so there is a substantial overhead cost for any calculation we wish to perform. This cost comes in terms of work the GPU has to do to prepare for a given calculation, effort needed to move the GPU into full production efficiency, and any costs incurred by traversing the CPU/GPU boundary. It can be amortized by processing as many fragments simultaneously on the GPU as possible. For Monte Carlo type algorithms we can accomplish this by increasing the number of walkers processed per GPU pass. This has allowed us to tune both the size of the problem and the texture aspect ratio to the GPU. For example, we can arrange our data in GPU memory according to an empirically optimized pattern such as 4 rows by 4 columns, so that each pass amounts to 16 walker evaluations in parallel.

4.2. Basis function evaluation

The number of basis functions as well as their controlling parameters are chosen according to chemical considerations. Typical are 5 basis functions for each Hydrogen and 15 basis functions for each atom from Lithium to Neon, leading to a matrix aspect ratio of between 4 and 8. The choice of basis set and all associated parameters are held fixed during a run, and evaluation
only depends on the 3N electronic coordinates, producing value, gradient, and Laplacian.

4.2.1. Kernel 1: Data generation

The major choice regarding basis function evaluation (Eq. (6)) concerns the organization of the output data: different regions of one output texture, or separation by channel (xyzw), resulting in two output textures. We opted for keeping the output in different regions so as to allow specialization (i.e. derivatives) of the kernels. As regards input data re-use, we opted for evaluating a single basis function for 4 electrons. This choice minimizes texture lookups and increases instruction parallelism, since only one n_j from Eq. (6) is used in the same fragment.

4.2.2. Kernel 2: Layout conversion

Most matrix multiplication approaches on the GPU pack 2 × 2 submatrices into a single xyzw memory slot, and we employed this layout as well. The basis function evaluation output is in 4 × 1 layout, necessitating a conversion, which we also used to filter out any bad values. Due to the batching (Section 4.1) texture layout, fences between rows and columns of walkers required special maintenance at this stage.
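As an illustration of the 2 × 2 packing itself, a CPU-side sketch of the index mapping might look as follows; the channel order is inferred from the pixel products used for the matrix multiplication kernels (Section 4.3), and the 4 × 1 source layout and walker fences are omitted.

#include <vector>

// One RGBA texel holding a 2×2 sub-block; channel order assumed to be:
//   x = (r, c), y = (r, c+1), z = (r+1, c), w = (r+1, c+1).
struct Texel { float x, y, z, w; };

// Pack a row-major H×W matrix (H and W even; pad with zeros otherwise) into a
// (H/2)×(W/2) grid of texels. A CPU-side illustration of the layout only;
// on the GPU this step is performed by the conversion kernel.
std::vector<Texel> pack2x2(const std::vector<float>& m, int h, int w) {
    std::vector<Texel> out((h / 2) * (w / 2));
    for (int r = 0; r < h; r += 2)
        for (int c = 0; c < w; c += 2)
            out[(r / 2) * (w / 2) + c / 2] = {
                m[r * w + c],       m[r * w + c + 1],
                m[(r + 1) * w + c], m[(r + 1) * w + c + 1]
            };
    return out;
}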
Fig. 1. The cost of correcting for the summation error in multiplication of square matrices. Indicated is the number of multiplications performed simultaneously, reusing the multiplicand.
4.3. Matrix multiplication

For purposes of performance comparison we used the ATLAS 3.7.11 [30,31] library's single precision matrix multiplication on our 3 GHz Pentium 4 as a CPU benchmark. For the GPU, several studies of matrix multiplication performance have been performed [7–9,11,13,14], so our main focus is on the performance for the (relatively) small rectangular matrices we encounter in our application, as well as on the fact that we use the same multiplicand for all multiplications. For the 2 × 2 layout the inner product for the pixel at C[i,j] becomes the series of pixel products

for(k=0; k < N; k++){
    C[i,j] += A[i,k].xxzz*B[k,j].xyxy + A[i,k].yyww*B[k,j].zwzw;
}
Fig. 2. The dimension of the inner product is 6 times the short dimension shown. The multiplicand is reused for all 5 multiplications.
As Figs. 1 and 2 indicate, the set up overhead is most noticeable for smaller matrices. When performing calculations using rectangular matrices, the set up costs can be quenched almost entirely. It is also apparent that for some domains the GPU has significant performance gains relative to the CPU, when CPU cache peculiarities play a role. Although the KSF error correcting algorithm (described in Section 5.2) negates most speedup gains for the particular technologies compared here, the hidden advantage remaining is that the calculation is performed on the GPU, minimizing GPU/CPU communication.

4.4. Jastrow functions

The third most computationally demanding component of our QMC algorithm is the evaluation of the pairwise Jastrow function in Eq. (9). For the GPU implementation, we focused on porting the electron–electron terms (electron–nuclei terms are substantially fewer). We need to evaluate N choose 2 polynomials (one for each electron–electron pair), which are then summed. Since the parameters in Eq. (10) differ between same- and opposite-spin electron pairs, texture data is partitioned in order to allow kernel specialization.
Fig. 3. Helium calculation showing the average and the error as the calculation progresses. The calculation was done at dt = 0.001, with 200 walkers each on 128 CPU processors.
We proceed in 3 steps: Kernel 1 evaluates the magnitude and normalized vector between all pairs of electrons, for a total of 4 values per fragment. Kernel 2 finds the value, Laplacian, and gradient of Eq. (9), writing the first two to one texture and the latter three to another. Kernel 3 computes the sums, maintaining the electron indices for the gradient summands.

5. GPU floating point error

One of the goals of quantum chemistry is the calculation of the electronic energy of a molecule with sufficient accuracy, stated as 1 to 2 kcal/mol. To this end the absolute error of the final result must not be worse than 1 × 10^-3 Hartrees. An appropriately parameterized QMC calculation can meet this criterion given enough Monte Carlo iterations. For this study, we want to consider whether single precision is satisfactory. To test this, three simple DMC calculations were performed on a large CPU cluster to compare numerically a result calculated in double precision with exactly the same calculation in single precision. First, a calculation is performed on a Helium atom using a 17s basis set [32] and a 2 determinant expansion in natural orbitals obtained using GAMESS [33]. Fig. 3 shows that the single and double precision results are very similar, where the exact answer is approximately −2.903724 hartrees [34]. Second, the torsional barrier in ethane was studied using the cc-pCVTZ [35] basis set with CCSD(T) optimized Eclipsed and Gauche configurations [36]. Fig. 4 again shows similar results between single and double precision, where the experimental value is 2.73 kcal/mol [36].

While these results are by no means conclusive, especially since the quality of the result is dependent upon the quality of the wavefunction, they provide evidence that single precision is not altogether unreasonable. This can be seen since the iterates are decoupled to some degree from each other by random numbers, and since the accumulation of the Monte Carlo statistics itself happens in double precision.
Fig. 4. Ethane calculation showing the average and the error as the calculation progresses. The calculation was done at dt = 0.005, with 200 walkers each on 128 CPU processors.
Furthermore, if a pathological electronic configuration is identified, it can always be more delicately handled on the CPU in double precision. Lastly, single precision QMC calculations might be useful in an independent VMC wavefunction optimization calculation. Since DMC only employs the wavefunction as a guide, variationally optimized parameters are far less restrictive in terms of precision.

As far as our nVidia 7800 GTX GPU is concerned, we studied the floating point error to obtain a best estimate for single point evaluations. We considered two principal sources of error relevant to our problem, as compared to the level of error available on a CPU: underflow and effects of rounding. The evaluation of basis functions (Eq. (6)), for example, can easily underflow if the exponent −b_nj r_ij² is too negative. We investigated whether the lack of de-normals on GPUs was a problem, since this means a GPU will underflow faster than a CPU. As regards rounding, the IEEE floating point standard calls for a relative error of ±0.5 × 10^-7 in the basic arithmetic operations for single precision. On current GPUs the relative error in these operations appears to be [37] at least ±0.5 × 10^-7 and as much as ±1.0 × 10^-7. For dense linear algebra this yields a difference in error between CPU and GPU computed results.

5.1. Underflow corrections

To begin with, it is questionable whether one would permit de-normals to be included in calculations even on some CPUs. Many processor manufacturers elect software implementations of de-normals, which severely penalize the processing speed. Since we were unable to get decent timing results in matrix multiplication on the CPU unless de-normals were flushed to zero before multiplication, our performance comparisons actually already represent a lack of de-normals on both processors. Basis function evaluation involves exponentials with arguments negative enough to cause underflow, an effect we do not want to ignore. To avoid underflow error one may simply scale relevant variables to avoid the de-normal range, but one must do so carefully to avoid the worse problem of overflow. The effect of this type of error depends heavily on the distribution of parameters, which is highly specific to our application.
float4 T = 0, C = 0, Y = 0, S = 0;
int j = 0;
while(j < N){
    Y = A[i,k].xxzz*B[k,j].xyxy - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
    Y = A[i,k].yyww*B[k,j].zwzw - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
    j++;
}
return S;

Algorithm 2. KSF-corrected GPU matrix multiplication.

Fig. 5. KSF corrects for rounding error in matrix multiplication. The resultant matrix is 1000 × 1000, and the operand data is sampled from a uniform distribution [0, 1].
Thus we measured the effect of these shifts on the final calculated E_L(r̄) for each iteration, compared to the same calculation as performed on the CPU in double precision. The effect of shifting the exponential turns out to be relatively small for the set of parameters we considered. We conclude that shifting helps, but the lack of de-normals on the GPU turned out not to be a significant source of error. For parameter sets which consistently produce de-normals, single precision should probably be avoided entirely.

5.2. Kahan method

Dense matrix multiplication is the most significant source of error in our computations when run on the GPU. Fig. 5 shows the roundoff error inherent in matrix multiplication, as estimated by multiplying two matrices created with a uniform distribution of data. As a function of the dimension of the inner product, we calculate the relative error averaged over all the elements in the resultant 1000 × 1000 matrix, using CPU double precision as our reference data. The problem is due to the propagation of errors, which scales approximately linearly with the length of the inner products. A CPU typically minimizes this by performing the calculations at a higher precision than the data type.

When summing a sequence of n floating point numbers using the basic formula Σ_j x_j, the floating point result is Σ_j x_j(1 + δ_j), where the perturbation error is bounded by |δ_j| < (n − j)ε and ε is the machine error. To compensate for the propagation of errors we use the Kahan summation formula (KSF) [38,39] in the context of matrix multiplication. This alternative method for summing a sequence of n numbers is shown below:

S = x[1]; C = 0;
for(j=2; j<=n; j++){
    Y = x[j] - C;
    T = S + Y;
    C = (T - S) - Y;
    S = T;
}
Fig. 6. The “QMC-Distributed” data for the multipliers was generated either on the CPU or on the GPU, and the matrix multiplication was either corrected using KSF or left as the standard method.
This method is algebraically equivalent, but if these steps are preserved during compilation, the algorithm has the power to produce the result Σ_j x_j(1 + δ_j) + O(nε²) Σ_j |x_j|, where |δ_j| ≤ 2ε [40]. To explain this algorithm, one first observes that the low order bits of Y are lost when adding it to S. These bits can be recovered with the correction term C. The value for C is found by subtracting Y from the part of Y which is properly accounted for in the sum (the parentheses are critical). This is not the only summation improvement available, although it does compete well [41]. A simple modification makes the KSF suitable for use in matrix multiplications, as shown in Algorithm 2. Here (i, j) represents the coordinates of the element in the product matrix we are working on. It is important to note that the propagation error in addition is corrected for, but not any error due to multiplication, even though such corrections are possible [42]. However, as Fig. 5 shows, the improvement is enough to even beat single precision on the CPU for long enough inner products.

To estimate the improvement that KSF provides for our QMC methods, we move to a “QMC distribution” of data for our multiplier matrices, while keeping the multiplicand (representing c_jk) as a uniformly random matrix. The distribution was formed by generating a representative set of basis function parameters and a pseudo-random configuration of electrons.
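As a toy, CPU-side illustration of the kind of error the KSF removes (not part of the original study), the following C++ snippet compares naive single precision accumulation, Kahan-compensated accumulation, and a double precision reference; note that aggressive compiler options such as -ffast-math may optimize the compensation away.

#include <cstdio>
#include <vector>

// Compare naive versus Kahan-compensated accumulation in single precision.
// The magnitude of the error depends on the data and is illustrative only.
int main() {
    std::vector<float> x(1000000, 0.1f);

    float naive = 0.0f;
    for (float v : x) naive += v;

    float s = 0.0f, c = 0.0f;               // running sum and compensation
    for (float v : x) {
        float y = v - c;                    // subtract previously lost low-order bits
        float t = s + y;
        c = (t - s) - y;                    // recover what this addition dropped
        s = t;
    }

    double reference = 0.0;                 // double precision reference
    for (float v : x) reference += v;

    std::printf("naive  %.6f\nkahan  %.6f\ndouble %.6f\n", naive, s, reference);
    return 0;
}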
Name           Formula       Number of    Number of         Total speedup       Basis function   Jastrow
                             electrons    basis functions   Standard    KSF     speedup          speedup
Acetic acid    CH3COOH       32           80                3.2         3.1     18.2             0.7
Benzaldehyde   C6H5CHO       56           150               4.4         4.1     25.9             2.1
[10]Annulene   C10H10        70           200               6.3         5.6     30.2             3.4
Diazobenzene   C12H10N2      96           326               5.3         4.5     31.6             6.4
Lysine         C6H14N2O2     102          280               4.5         3.9     29.2             7.2
Arginine       C6H14N4O2     116          387               4.9         4.1     28.5             9.3
HMX            C4H8N8O8      152          516               6.6         5.3     33.3             14.0

Fig. 7. QMC performance results on arbitrary molecules picked to represent varying problem sizes. Speedup is defined as the time spent processing on the CPU divided by the time spent processing on the GPU.
This distribution was evaluated either on the GPU or on the CPU and then sent to the GPU for multiplication. The relative error was again estimated against double precision on the CPU. Although the results in Fig. 6 have a higher variance, they show that using the KSF method we are able to obtain results approximately equivalent to CPU single precision.

6. Results

To test the GPU port of our code, we sample 7 arbitrary molecules spanning the range over which we wish to measure performance. In Fig. 7 we present the speedup for the calculation time spent on equivalent tasks performed on both our 7800 GTX GPU and our 3 GHz Pentium 4, as well as compare the final cost of incorporating the KSF correction. To further illustrate our performance gains, we plot the speedups in Fig. 8. We ran the calculations long enough to converge the speedup ratio. It is evident that for the range of molecules considered, the speed penalty incurred with KSF rose as the matrix multiplication cost became more prominent. The KSF formula served to keep the relative error in the calculated E_L(r̄) approximately constant across all molecules at 1 × 10^-6. It is worth noting that KSF did not make a significant difference in either speed or accuracy for many of the smaller molecules.

To provide an estimate for the impact of these speedup factors, we point out that for HMX, the calculation is now 5 to 7 times faster. In terms of the original total cost, matrix multiplication, which formerly composed 15% of the cost, is now only 4% (non-KSF); the basis function cost went from 73% to 2.2%; and the electron–electron Jastrow evaluations, which used to cost 3.5% of the effort, are now 0.3%. If we approximate the effect of improving GPU technology over CPU technology, as well as the possibility of multiple GPUs per CPU, by setting these residual percentages to 0%, the remaining, originally unaccounted-for 8% suggests a theoretical speedup factor of 13. A recent calculation [43] on free-base porphyrin, which has 162 electrons and 938 basis functions in the cc-pVDZ basis set, cost 40,000 CPU hours on an IBM SP POWER3+ cluster. Thus, ignoring the precision issue, we speculate that this calculation could theoretically cost 3,000 processor hours.

Although some of the performance numbers for the individual kernels are very good, the code suffers from Amdahl's Law type inefficiencies because of diminishing returns discovered during porting.
Fig. 8. Problem size is defined as the number of basis functions × the number of electrons. The data points are from the arbitrary molecules listed in Fig. 7.
This is for several reasons. A few of the elements of the computation, like the Monte Carlo statistical manipulations, cannot be permitted to run in single precision. Furthermore, there are several portions of the code for which a GPU port is currently unsuitable due to a lack of sufficient data parallelism, either as O(N) components or as problems with GPU-unfriendly data interdependencies. With increasing capability on the GPU, more of the code will become amenable to porting. It is clear, however, that a Gustafson's Law [44] type of advantage is available on the GPU. Specifically, if basis function and Jastrow function evaluations can be considered as essentially free, then one is encouraged to employ whatever functional form is deemed best, regardless of computational complexity. This is likely to increase both the quality of individual iterates and the overall convergence characteristics of a Monte Carlo calculation. Of course this assumes that these advantages are not washed out by precision errors stemming from other parts of the code.

7. Conclusion

QMC type algorithms for first principles chemistry calculations are simple to parallelize and capable of exploiting the data parallel aspects of GPU based computing. While the matrix sizes needed in actual application practice are on the small side, recent generation GPUs coupled with a few tricks have become significantly better at achieving high performance at these sizes. The overall result is a 3× to 6× speedup in the end-to-end simulation application with a modest increase in hardware cost, making this a very cost effective solution. The lack of full IEEE floating point support is perhaps the most critical issue for QMC. We were able to correct for the error propagation,
albeit only with a performance penalty due to the more complex evaluation cost of the Kahan summation formula. Clearly a more complete IEEE floating point treatment would be an excellent improvement, and forthcoming improvements will be welcomed. Beyond that, we note that due to the rapid evolution of GPU hardware (and the associated driver software), attaining a sweet spot in the performance landscape is a never-ending quest of parameter and algorithm tweaking. We speculate that adoption of the GPU as a computational engine will be greatly facilitated if approaches such as ATLAS [8,30] and application specific libraries can be further brought to the GPU arena.

Acknowledgements

This research was supported by the DOE-ASCI and DOE-ASC grants at Caltech. We would like to thank nVidia for providing the 7800 GTX used in this study. This research has been supported in part by the DOE (W-7405-ENG-48/B341492 and DE-FG02-04ER25657) and the Caltech Center for Mathematics of Information.

References

[1] M. Pharr (Ed.), GPU Gems, vol. 2, Addison-Wesley, 2005.
[2] Z. Fan, F. Qiu, A. Kaufman, S. Yoakum-Stover, GPU cluster for high performance computing, in: Proc. of ACM/IEEE Superc. Conf., 2004, p. 47.
[3] J. Georgii, R. Westermann, A multigrid framework for real-time simulation of deformable bodies, Comput. Graphics 30 (3) (2006).
[4] D. Göddeke, R. Strzodka, S. Turek, Accelerating double precision FEM simulations with GPUs, in: Proc. of ASIM 2005, 2005, pp. 139–144.
[5] N. Galoppo, N.K. Govindaraju, M. Henson, D. Manocha, LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware, in: Proc. of ACM/IEEE Superc. Conf., Washington, DC, USA, IEEE Computer Society, 2005, p. 3.
[6] nVidia, Compute unified device architecture, http://developer.nvidia.com/cuda.
[7] N.K. Govindaraju, S. Larsen, J. Gray, D. Manocha, A memory model for scientific algorithms on graphics processors, in: Proc. of ACM/IEEE Superc. Conf., Washington, DC, USA, IEEE Computer Society, 2006.
[8] C. Jiang, M. Snir, Automatic tuning matrix multiplication performance on graphics hardware, in: PACT'05, 2005, pp. 185–196.
[9] K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix–matrix multiplication, in: Proc. of ACM/EG Conf. on Graph. HW, 2004, pp. 133–137.
[10] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graph. 22 (3) (2003) 917–924.
[11] J.D. Hall, N.A. Carr, J.C. Hart, Cache and bandwidth aware matrix multiplication on the GPU, Technical Report UIUCDCS-R-2003-2328, University of Illinois, 2003.
[12] J. Krüger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Trans. Graph. 22 (3) (2003) 908–916.
[13] Á. Moravánszky, Dense matrix algebra on the GPU, NovodeX AG, 2003.
[14] S. Larsen, D. McAllister, Fast matrix multiplies using graphics hardware, 2001.
[15] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, A multigrid solver for boundary value problems using programmable graphics hardware, in: Proc. of ACM/EG Conf. on Graph. HW, 2003, pp. 102–111.
[16] F. Xu, K. Mueller, Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware, in: Proc. of ACM/IEEE Superc. Conf., vol. 52, 2005, pp. 654–663.
[17] P.A. Heng, J.Q. Wang, T.T. Wong, C.S. Leung, Discrete wavelet transform on GPU, in: ACM Workshop on GPGPU, 2004.
[18] K. Moreland, E. Angel, The FFT on a GPU, in: Proc. of ACM/EG Conf. on Graph. HW, 2003, pp. 112–119.
[19] D.M. Ceperley, B.J. Alder, Quantum Monte Carlo, Science 231 (4738) (1986) 555–560.
[20] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum 26 (1) (2007) 80–113.
[21] D. Luebke, M. Harris, J. Krüger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, A. Lefohn, GPGPU: General purpose computation on graphics hardware, in: Course Notes, 2004.
[22] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: Stream computing on graphics hardware, in: Proc. of ACM SIGGRAPH, 2004.
[23] M. McCool, S. Du Toit, Metaprogramming GPUs with Sh, AK Peters, Ltd., 2004.
[24] P.J. Reynolds, D.M. Ceperley, B.J. Alder, W.A. Lester, Fixed-node Quantum Monte Carlo for molecules, J. Chem. Phys. 77 (11) (1982) 5593–5603.
[25] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. 21 (1953) 1087.
[26] C.J. Umrigar, M.P. Nightingale, K.J. Runge, A diffusion Monte Carlo algorithm with very small time-step errors, J. Chem. Phys. 99 (4) (1993) 2865–2890.
[27] D.R. Kent IV, M.T. Feldmann, D.R. Fisher, A.G. Anderson, QMcBeaver v20051021©, 2001–2007.
[28] A.J. Williamson, R.Q. Hood, J.C. Grossman, Linear-scaling quantum Monte Carlo calculations, Phys. Rev. Lett. 87 (24) (2001) 246406.
[29] A. Aspuru-Guzik, R. Salomon-Ferrer, B. Austin, W.A. Lester, A sparse algorithm for the evaluation of the local energy in quantum Monte Carlo, J. Comput. Chem. 26 (7) (2005) 708–715.
[30] R.C. Whaley, A. Petitet, J.J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Comput. 27 (1–2) (2001) 3–35.
[31] R.C. Whaley, J.J. Dongarra, Automatically tuned linear algebra software, in: Proc. of ACM/IEEE Superc. Conf., 1998.
[32] S. Huzinaga, E. Miyoshi, M. Sekiya, A method of generating an effective orbital set for configuration interaction calculations, J. Chem. Phys. 100 (2) (1994) 1435–1439.
[33] M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery Jr., General atomic and molecular electronic structure system, J. Comput. Chem. 14 (11) (1993) 1347–1363.
[34] C. Schwartz, Experiment and theory in computations of the He atom ground state, Int. J. Modern Phys. 15 (4) (2006) 877–888.
[35] D.E. Woon, T.H. Dunning Jr., Gaussian basis sets for use in correlated molecular calculations. V. Core-valence basis sets for boron through neon, J. Chem. Phys. 103 (11) (1995) 4572–4585.
[36] A.G. Csaszar, W.D. Allen, H.F. Schaefer III, In pursuit of the ab initio limit for conformational energy prototypes, J. Chem. Phys. 108 (23) (1998) 9751–9764.
[37] K.E. Hillesland, A. Lastra, GPU floating-point paranoia, in: GP2, 2004, p. C-8.
[38] W. Kahan, Pracniques: Further remarks on reducing truncation errors, Comm. ACM 8 (1) (1965) 40.
[39] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surv. 23 (1) (1991) 5–48.
[40] D.E. Knuth, Seminumerical Algorithms, third ed., The Art of Computer Programming, vol. 2, Addison-Wesley, 1997.
[41] J.M. McNamee, A comparison of methods for accurate summation, SIGSAM Bull. 38 (1) (2004) 1–7.
[42] T.J. Dekker, A floating point technique for extending the available precision, Numer. Math. 18 (3) (1971) 224–242.
[43] A. Aspuru-Guzik, O. El Akramine, J.C. Grossman, W.A. Lester Jr., Quantum Monte Carlo for electronic excitations of free-base porphyrin, J. Chem. Phys. 120 (7) (2004) 3049–3050.
[44] J.L. Gustafson, Reevaluating Amdahl's law, Comm. ACM 31 (5) (1988) 532–533.