
Citation: AIP Conference Proceedings 1862, 030146 (2017); https://doi.org/10.1063/1.4991250. Published by the American Institute of Physics.


Parallelization Strategies for Continuum-Generalized Method of Moments on the Multi-Thread Systems

A. Bustamam1, T. Handhika2, a), Ernastuti2, and D. Kerami1

1 Department of Mathematics, Faculty of Mathematics and Natural Sciences (FMIPA), Universitas Indonesia, Depok 16424, Indonesia
2 Computational Mathematics Study Center, Gunadarma University, Depok, Indonesia

a) Corresponding author: [email protected]

Abstract. The Continuum-Generalized Method of Moments (C-GMM) covers the shortfall of the Generalized Method of Moments (GMM), which is not as efficient as the Maximum Likelihood estimator, by using a continuum set of moment conditions in a GMM framework. However, this computation takes a very long time because of the optimization of the regularization parameter. Unfortunately, these calculations are processed sequentially, whereas in fact all modern computers are now supported by hierarchical memory systems and hyper-threading technology, which allow for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on multi-thread systems. First, parallel regions are detected in the original C-GMM algorithm. Two parallel regions in the original C-GMM algorithm contribute significantly to the reduction of computational time: the outer-loop and the inner-loop. This parallel algorithm is then implemented with the standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.

INTRODUCTION

The uncertainty of the real world often causes a problem to be modeled as a function of a number of unknown parameters. These parameters need to be estimated in advance so that the model is useful for understanding the behavior of the related issues. This also happens in the field of finance, including asset pricing. There are several statistical methods that can be used to estimate the parameters of a model, for example, Maximum Likelihood Estimation (MLE) and the Generalized Method of Moments (GMM). MLE requires information on the probability density function (pdf), which makes the estimator efficient. In fact, the pdf is frequently not available in analytical form, unlike the characteristic function, which is used in GMM. However, the efficiency of the GMM estimator depends on an appropriate set of moment conditions. Accordingly, Marine Carrasco and Jean-Pierre Florens [1] developed a method which combines the attractive features of GMM with the efficiency of MLE in one framework, called the Continuum-Generalized Method of Moments (C-GMM), which relies on a continuum of moment conditions in a GMM procedure. To improve the objectivity of the C-GMM method, an optimization stage is required for a new parameter, known as the regularization parameter [2]. It is optimized through a two-stage estimation in which the optimal regularization parameter is obtained by determining the regularization parameter that minimizes the Mean-Squared Error (MSE) of the C-GMM estimator based on sampled simulation [3]. This, of course, takes a very long time, especially when the calculation is processed sequentially. The resulting latency of the calculation process particularly impacts the arbitrage problem in asset pricing [4, 5, 6]. Meanwhile, all modern computers are now supported by hierarchical memory systems and hyper-threading technology, which allow for parallel computing.
Parallel computing uses multiple compute resources simultaneously to solve a computational problem that is divided into several independent discrete tasks which can be solved concurrently on different processors [7]. Figure 1 illustrates a comparison of the time required by sequential and parallel calculation for a problem that can be divided into n independent tasks: parallel computing with multiple processors is faster than sequential computing on a single processor.

International Symposium on Current Progress in Mathematics and Sciences 2016 (ISCPMS 2016) AIP Conf. Proc. 1862, 030146-1–030146-9; doi: 10.1063/1.4991250. Published by AIP Publishing. 978-0-7354-1536-2/$30.00

FIGURE 1. The Comparison of Computational Time: Sequential vs. Parallel Computing

Prior to 1990, Intel had already put a million transistors onto a single chip. Since then computer technology has continued to evolve, especially in high-performance computing (HPC). Where several Central Processing Units (CPUs) previously behaved as a shared-memory parallel machine, a single multi-core CPU can now perform the parallelization by replicating substantial parts of a processor's logic on one chip. Memory distribution constraints in the multi-core configuration led to computers built with hierarchical memory systems, which supply the processor with data and instructions at high rates by using a small, expensive and very fast memory called cache memory [8]. In addition to hierarchical memory systems, Intel's hyper-threading technology also improves calculation performance through the thread concept. A thread is a runtime entity that can independently execute a stream of instructions; it only needs its own program counter and an area in memory to save its variables, including registers and a stack [8]. Multiple threads may be executed on a single core, or on multiple cores, via context switches. A multi-thread system starts sequentially as a single thread of execution, known as the initial thread. It then creates a team of threads to start the parallelization. The initial thread becomes the master of the team and collaborates with the others to execute the code dynamically. Finally, the team joins back into the single initial thread while all the others terminate. Unfortunately, many programmers still build their programs sequentially on parallel computer architectures and do not take advantage of this HPC technology to improve the performance of their calculations.
This paper aims to design a parallel algorithm for C-GMM on multi-thread systems in order to reduce the latency and approximate real-time performance.

CONTINUUM-GENERALIZED METHOD OF MOMENTS

This paper restricts attention to a random vector Markov process $x_t \in \mathbb{R}^p$ whose distribution is indexed by a finite-dimensional parameter $\theta \in \mathbb{R}^q$ with true value $\theta^0$. The moment function of this random vector Markov process is as follows:

$$h_t(\tau;\theta) = \left(e^{i\tau_2 x_{t+1}} - \psi_\theta(\tau_2 \mid x_t)\right) e^{i\tau_1 x_t}, \qquad (1)$$

where $\psi_\theta(\tau_2 \mid x_t)$ is the conditional characteristic function of $x_{t+1}$ given $x_t$ and $\tau = (\tau_1, \tau_2)' \in \mathbb{R}^2$. Next, let $\pi$ be a probability density function (pdf) on $\mathbb{R}^2$ and $L^2(\pi)$ the Hilbert space of complex-valued functions that are square integrable with respect to $\pi$, i.e. [3]:

$$L^2(\pi) = \left\{ f:\mathbb{R}^2 \to \mathbb{C} \;\middle|\; \int f(\tau)\,\overline{f(\tau)}\,\pi(\tau)\,d\tau < \infty \right\},$$

where $\overline{f(\tau)}$ denotes the complex conjugate of $f(\tau)$. By considering the inner product $\langle f,g\rangle = \int f(\tau)\,\overline{g(\tau)}\,\pi(\tau)\,d\tau$, with $\|f\|^2 = \langle f,f\rangle$ and $\hat h_T(\tau;\theta) = \frac{1}{T}\sum_{t=1}^{T} h_t(\tau;\theta)$, the efficient C-GMM estimator is given by [1]:

$$\hat\theta_T = \arg\min_\theta \left\langle K^{-1}\hat h_T(\cdot;\theta),\, \hat h_T(\cdot;\theta)\right\rangle,$$

where $K$ is the Hilbert-Schmidt integral operator, i.e. the asymptotic covariance operator associated with the moment conditions, which satisfies

$$(Kf)(\tau_1) = \int k(\tau_1,\tau_2)\, f(\tau_2)\,\pi(\tau_2)\,d\tau_2,$$

where $k(\tau_1,\tau_2) = E\!\left[h_t(\tau_1;\theta)\,\overline{h_t(\tau_2;\theta)}\right]$ is the kernel. This kernel can be estimated by $\hat k_T(\tau_1,\tau_2) = \frac{1}{T}\sum_{t=1}^{T} h_t(\tau_1;\hat\theta^1)\,\overline{h_t(\tau_2;\hat\theta^1)}$ with a sample of size $T$, where $\hat\theta^1$ is a consistent first-step estimator [3]. Hence, the estimated Hilbert-Schmidt integral operator is denoted by $K_T$. Unfortunately, $K_T$ is singular, so that $K_T^{-1}$ needs to be regularized, via the Tikhonov regularization $(K_T^{\alpha})^{-1} = (K_T^2 + \alpha I)^{-1}K_T$ with regularization parameter $\alpha > 0$, which yields

$$\hat\theta_T(\alpha) = \arg\min_\theta \left\langle (K_T^{\alpha})^{-1}\hat h_T(\cdot;\theta),\, \hat h_T(\cdot;\theta)\right\rangle, \qquad (2)$$

where $\pi$ is the pdf of random variables which have a standard bivariate normal distribution. In most real-life problems, this regularization parameter is unknown and needs to be estimated from a sample of size $T$. The regularization parameter can be obtained by minimizing the trace of the MSE matrix of $\hat\theta_T(\alpha)$, defined as follows [2]:

$$\alpha_T = \arg\min_{\alpha\in[\alpha_L,\alpha_U]} \Sigma_T(\alpha,\theta^0),$$

where $\Sigma_T(\alpha,\theta^0) = E\big\|\hat\theta_T(\alpha)-\theta^0\big\|^2$. This raises problems related to the possible infiniteness of $\Sigma_T(\alpha,\theta^0)$ and the unknown distribution of $\hat\theta_T(\alpha)-\theta^0$ for a finite sample. To hedge against such situations, consider the optimal regularization parameter [3]:

$$\alpha_T = \arg\min_{\alpha\in[\alpha_L,\alpha_U]} \Sigma_T(\alpha,\theta^0,\delta), \qquad (3)$$

where

$$\Sigma_T(\alpha,\theta^0,\delta) = (1-\delta)\,E\!\left[\rho_T(\alpha,\theta^0)^2\,\mathbb{1}\{\rho_T(\alpha,\theta^0) < c_\delta\}\right] + \delta c_\delta^2 \qquad (4)$$

is the truncated MSE of $\hat\theta_T(\alpha)$, with $\rho_T(\alpha,\theta^0) = \big\|\hat\theta_T(\alpha)-\theta^0\big\|$ and $c_\delta$ satisfying $\delta = \Pr\!\left(\rho_T(\alpha,\theta^0) > c_\delta\right)$. Nevertheless, the true parameter value is unknown, so it needs to be estimated via a parametric bootstrap:

$$\hat\theta_T = \arg\min_\theta \big\|\hat h_T(\cdot;\theta)\big\|^2, \qquad (5)$$

which is defined as a C-GMM estimator of $\theta$. $\hat\theta_T$ is then used to simulate $M$ independent samples of size $T$ from $M$ arbitrary initial values: the $j$-th sample $X_T^{(j)}(\hat\theta_T)$, for $j = 1,2,\cdots,M$, satisfies $x_t^{(j)} = g\big(x_{t-1}^{(j)}, \hat\theta_T, \varepsilon_t^{(j)}\big)$ for $t = 1,2,\cdots,T$, where $g(\cdot)$ is a three times continuously differentiable function with respect to $\hat\theta_T$ and the $\varepsilon_t^{(j)}$ are independent and identically distributed white noise whose distribution is known and does not depend on $\hat\theta_T$. Based on each simulation, the parameter is then estimated by $\hat\theta_{j,T}\big(\alpha_i,\hat\theta_T\big)$, denoting the C-GMM estimator at (2) with the $i$-th regularization parameter from the $j$-th sample, for every value of the regularization parameter that is evaluated. Hence, the optimal regularization parameter at equation (3) is obtained by selecting the point in the grid for the regularization parameter which satisfies:

$$\hat\alpha_T^M\big(\hat\theta_T\big) = \arg\min_{\alpha\in[\alpha_L,\alpha_U]} \hat\Sigma_T^M\big(\alpha,\hat\theta_T,\delta\big), \qquad (6)$$

where $\hat\Sigma_T^M\big(\alpha,\hat\theta_T,\delta\big)$ is the truncated MSE estimator of $\Sigma_T(\alpha,\theta^0,\delta)$ in equation (4), given by:

$$\hat\Sigma_T^M\big(\alpha,\hat\theta_T,\delta\big) = \frac{1}{M}\sum_{j=1}^{M} \rho_{j,T}\big(\alpha,\hat\theta_T\big)^2\,\mathbb{1}\big\{\rho_{j,T}\big(\alpha,\hat\theta_T\big) \le \hat c_\delta\big\} + \delta \hat c_\delta^2, \qquad (7)$$

where $\frac{1}{M}\sum_{j=1}^{M}\mathbb{1}\big\{\rho_{j,T}\big(\alpha,\hat\theta_T\big) \le \hat c_\delta\big\} = 1-\delta$ and $\rho_{j,T}\big(\alpha,\hat\theta_T\big) = \big\|\hat\theta_{j,T}\big(\alpha,\hat\theta_T\big) - \hat\theta_T\big\|$. Under some assumptions, using $\hat\alpha_T^M\big(\hat\theta_T\big)$ at (6) to estimate $\hat\theta_T(\alpha)$ in equation (2) does not affect the consistency, asymptotic normality, and efficiency of the C-GMM estimator [3]. Therefore, the feasible efficient C-GMM estimator, which depends on the optimal regularization parameter, is defined as follows:

$$\hat\theta_T\big(\hat\alpha_T^M\big) = \arg\min_\theta \left\langle \big(K_T^{\hat\alpha_T^M}\big)^{-1}\hat h_T(\cdot;\theta),\, \hat h_T(\cdot;\theta)\right\rangle. \qquad (8)$$

PARALLELIZATION STRATEGIES

The previous section briefly discussed how to estimate the model parameter by using the C-GMM method. The manual calculation process is clearly extremely complex, so a tool such as a computer is needed to program this method. First, however, the C-GMM method has to be rewritten as a structured and systematic sequential algorithm, so as to minimize programming errors. To understand parallelization strategies for the sequential C-GMM algorithm, the algorithm will be written in terms of vectors and matrices. Let $\bar\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_N)'$ be the vector of the grid for the regularization parameter, with size $N$. Besides, the $M$ independent simulated samples of size $T$ are given by:

$$X_T^M = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(M)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(M)} \\ \vdots & \vdots & \ddots & \vdots \\ x_T^{(1)} & x_T^{(2)} & \cdots & x_T^{(M)} \end{pmatrix},$$

where $x_t^{(j)}$ denotes the $j$-th simulated sample at the $t$-th time. As mentioned earlier, $X_T^{(j)}$ denotes the $j$-th column vector of $X_T^M$, defined as the $j$-th simulated sample of size $T$. Hence, the original C-GMM algorithm is given in Fig. 2. In this paper, the optimal parameters are obtained through the Nelder-Mead algorithm [10] combined with numerical integration. The input variables consist of the number of observations and simulations, the initial guess for the model parameter, the maximum iteration and tolerance in the optimization stages, and the interval values of the grids for $\tau$ and the regularization parameter together with the number of points in each grid. The sequential algorithm in Fig. 2 was successfully programmed in C. Its result shows that the required computational time is quite long. This raises the latency issues that have implications for the arbitrage problems in asset pricing, as mentioned in the introduction. This algorithm will therefore be parallelized to reduce the latency of the calculation process. There are a few important things to note when designing an effective and efficient parallel algorithm, such as data dependence, load balance, concurrency and data synchronization [7]. Because of data/instruction dependence, only some parts of the algorithm have a chance to be parallelized (parallel regions). There are four parallel regions in the original C-GMM algorithm, i.e. STEP 1, STEP 3, the outer-loop and the inner-loop. Implementation results for the original C-GMM algorithm show that only the computation of the outer-loop and the inner-loop requires a long time. Meanwhile, the other parallel regions, STEP 1 and STEP 3, only require a computational process measured in seconds in the experiments, which vary the interval values and the number of points in the grid of each variable as well as the number of simulations.

FIGURE 2. Determination of Optimal Parallel Region with and without Data Preprocessing

Furthermore, parallelization strategies for the parallel regions that contribute significantly to the reduction of computational time will be discussed in the next two subsections.

Parallel Region: The Outer-Loop at the Original C-GMM Algorithm

The outer-loop of the original C-GMM algorithm calculates $\hat\Sigma_T^M\big(\alpha_i,\hat\theta_T,\delta\big)$ for each point $\alpha_i$ in the grid for the regularization parameter that is evaluated. This results in a vector of size $N$, i.e. $\hat\Sigma_T^M\big(\bar\alpha,\hat\theta_T,\delta\big)$. Suppose that there are $p$ threads that can be used for distributing data/instructions, with thread IDs starting from zero and ending at $p-1$. The outer-loop of the original C-GMM algorithm is parallelized by dividing $\bar\alpha$ into $k^{**}$ independent subvectors, where the $k$-th subvector, of size $N_k$, is denoted by $\bar\alpha_{N_k} = \big(\alpha_{1_k}, \alpha_{2_k}, \cdots, \alpha_{{N_k}_k}\big)'$ for $0 \le k \le k^{**}-1 \le p-1$ and $\sum_{k=0}^{k^{**}-1} N_k = N$. Denoting another subvector of $\bar\alpha$, of size $N_{k^*}$, by $\bar\alpha_{N_{k^*}}$, it holds that $\alpha_{i_k} \ne \alpha_{l_{k^*}}$ for any elements $\alpha_{i_k}$ and $\alpha_{l_{k^*}}$ in the two different subvectors $\bar\alpha_{N_k}$ and $\bar\alpha_{N_{k^*}}$, respectively, for all $1 \le i_k \le N_k$, $1 \le l_{k^*} \le N_{k^*}$ and $0 \le k < k^* \le k^{**}-1 \le p-1$. This parallelization strategy divides the calculation of $\hat\Sigma_T^M\big(\bar\alpha,\hat\theta_T,\delta\big)$ among $k^{**}$ threads, where the $k$-th thread processes the calculation of a vector of size $M$, $\hat\theta_T^M\big(\alpha_{i_k}\big)$, in each of its $N_k$ iterations, producing a vector of size $N_k$, i.e. $\hat\Sigma_T^M\big(\bar\alpha_{N_k},\hat\theta_T,\delta\big)$. Finally, all of these vectors are joined into a vector of size $N$, $\hat\Sigma_T^M\big(\bar\alpha,\hat\theta_T,\delta\big)$. The flowchart in Fig. 3 illustrates the parallelization strategy for the outer-loop of the original C-GMM algorithm with $k^{**} = p$.

FIGURE 3. The Parallel Programming Flowchart for the Outer-Loop of the Original C-GMM Algorithm


Parallel Region: The Inner-Loop at the Original C-GMM Algorithm

Different from the outer-loop of the original C-GMM algorithm, the inner-loop optimizes $\hat\theta_{j,T}(\alpha_i)$ for each sample $X_T^{(j)}$ that was simulated in STEP 3, given $\alpha_i$. This results in a matrix of size $M \times N$, i.e. $\hat\theta_T^M(\bar\alpha)$. In this case, the inner-loop of the original C-GMM algorithm is parallelized by dividing the simulated samples, $X_T^M$, into $k^{**}$ independent submatrices. The $k$-th submatrix, of size $T \times M_k$, is given by:

$$X_T^{M_k} = \begin{pmatrix} x_1^{(1_k)} & x_1^{(2_k)} & \cdots & x_1^{({M_k}_k)} \\ x_2^{(1_k)} & x_2^{(2_k)} & \cdots & x_2^{({M_k}_k)} \\ \vdots & \vdots & \ddots & \vdots \\ x_T^{(1_k)} & x_T^{(2_k)} & \cdots & x_T^{({M_k}_k)} \end{pmatrix},$$

where $0 \le k \le k^{**}-1 \le p-1$ and $\sum_{k=0}^{k^{**}-1} M_k = M$, as illustrated in Fig. 4 for $k^{**} = p$. Denoting another submatrix of $X_T^M$, of size $T \times M_{k^*}$, by $X_T^{M_{k^*}}$, it holds that $X_T^{(j_k)} \ne X_T^{(m_{k^*})}$ for any column vectors $X_T^{(j_k)}$ and $X_T^{(m_{k^*})}$ in the two different submatrices $X_T^{M_k}$ and $X_T^{M_{k^*}}$, respectively, for all $1 \le j_k \le M_k$, $1 \le m_{k^*} \le M_{k^*}$ and $0 \le k < k^* \le k^{**}-1 \le p-1$.

FIGURE 4. The Parallel Programming Flowchart for the Inner-Loop of the Original C-GMM Algorithm

The flowchart in Fig. 4 shows the calculation process of $\hat\theta_T^M(\alpha_i)$ being divided among $k^{**}$ threads, where the $k$-th thread processes the calculation of a subvector $\hat\theta_T^{M_k}(\alpha_i)$ of size $M_k$. This parallelization is programmed in $N$ iterations.

RESULTS AND DISCUSSION

In Flynn's taxonomy, the parallelization of both parallel regions, the outer-loop and the inner-loop of the original C-GMM algorithm, is classified as Single Program Multiple Data (SPMD) [11], because each thread processes a different data set using the same formula. In this paper, the parallel algorithm is programmed with Open Multi-Processing (OpenMP) on a platform consisting of a personal computer equipped with an Intel Core i7 Quad-Core at 2.3 GHz and 8 GB of RAM. OpenMP is a shared-memory application programming interface (API) which provides compiler directives to facilitate shared-memory parallel programming [8]. This makes parallel programming easier and prevents a number of programming errors through a structured and systematic approach. OpenMP can be written in three languages: C, C++ and Fortran; in this paper, the parallel algorithm is also programmed in C, with OpenMP 3.1. The parallelization strategies for the original C-GMM algorithm are implemented for the Cox-Ingersoll-Ross (CIR) model, which models the short-rate at time $t$, denoted by $r_t$, through the following stochastic differential equation [12]:

$$dr_t = \kappa(\mu - r_t)\,dt + \sigma\sqrt{r_t}\,d\tilde W_t, \qquad (9)$$

with $\kappa$, $\mu$, $\sigma$ and $r_0$ non-negative and $\tilde W$ a Brownian motion under a risk-neutral probability measure $\tilde{\mathbb{P}}$, where:
$\kappa$ : speed of adjustment,
$\mu$ : reversion level of the short-rate,
$\sigma^2 r_t$ : variance of the short-rate.
It is seen that the drift coefficient depends on $r_t$: if $r_t$ is less than $\mu$, the drift coefficient becomes positive, and vice versa. Therefore, the short-rate moves toward a reversion level at a pace which depends on the speed of adjustment. Meanwhile, the diffusion coefficient shows that the CIR model implicitly assumes the short-rate to be a non-negative stochastic process. Let $\theta = (\kappa, \mu, \sigma)'$ to implement the C-GMM method in estimating the parameters of the CIR model.
Moreover, the conditional characteristic function of the CIR model in equation (9) for $r_{t+\Delta}$, given $r_t$, can be determined as follows [13]:

$$\psi(\tau \mid r_t; \kappa, \mu, \sigma) = \left(1 - \frac{i\tau\sigma^2\left(1 - e^{-\kappa\Delta}\right)}{2\kappa}\right)^{-2\kappa\mu/\sigma^2} \exp\!\left(\frac{i\tau e^{-\kappa\Delta}}{1 - \dfrac{i\tau\sigma^2\left(1 - e^{-\kappa\Delta}\right)}{2\kappa}}\, r_t\right), \qquad (10)$$

where $r_{t+\Delta}$ has a Poisson-mixing-Gamma distribution [14]. However, this paper limits the discussion to finite values of the MSE only, i.e. $\delta = 0$, and assumes $\Delta = 1$ to simplify the problem. A sample of size $T = 250$ is generated, with a weekly sampling frequency in mind, based on true parameter values assumed equal to the parameter estimates obtained by Gallant and Tauchen [15], i.e. $\kappa = 0.00285$, $\mu = 8.7403509$ and $\sigma = 0.0275$. The performance of the parallel algorithm for C-GMM is analyzed on several subsets of this generated data series, where the number of simulations and the number of points in the grids for $\tau$ and the regularization parameter are kept the same, i.e. 50 and 30, respectively, across all scenarios, to keep the results consistent for each number of observations used in the parallel programming experiment. The speed-up is a measure of the effectiveness of the designed parallel algorithm [16], defined as follows:

$$S(n,p) = \frac{T(n,1)}{T(n,p)},$$

where $T(n,p)$ is the computational time required by the parallel algorithm to process a problem of size $n$ with $p$ threads $(p > 1)$, and $T(n,1)$ denotes the time of the sequential algorithm.

FIGURE 5. The Speed-up of Parallel C-GMM Algorithm

Figure 5 presents the speed-up of the parallel C-GMM algorithm, for the outer-loop, the inner-loop and their combination, with 8 threads for all scenarios. It shows that, on average, the outer-loop parallelization is the best strategy in this experiment, followed by the combination (outer-inner-loop) and the inner-loop-only region. Relatively, however, the larger the number of observations, the faster the speed-up of the parallel (inner-loop) C-GMM algorithm, while the opposite holds for the two others. This might be caused by the optimization of $\hat\theta_{j,T}(\alpha_i)$ in the inner-loop region, which is directly related to the number of observations. Different results might be obtained if the experiment were done using more threads and a different generated sample.

CONCLUSIONS

Only two of the four parallel regions in the original C-GMM algorithm, specifically in the optimization of the regularization parameter, contribute significantly to the reduction of computational time: the outer-loop and the inner-loop. Both are classified as the SPMD model. These parallelization strategies are effectively programmed in a stochastic environment with a standard shared-memory API, i.e. OpenMP 3.1, using compiler directives in C with 8 threads, and are implemented in the parameter estimation of the Cox-Ingersoll-Ross model. Moreover, the outer-loop parallelization is the best strategy for any number of observations in this experiment. The parallel algorithm for C-GMM can be further developed on the Graphics Processing Unit (GPU) by using the Compute Unified Device Architecture (CUDA) tool for implementation on the GPU card [17]. Moreover, parallelization can also be designed for a development of the C-GMM method itself, i.e. the Indirect Continuous-Generalized Method of Moments (IC-GMM) [18].

REFERENCES

1. See supplementary material at https://computing.llnl.gov/tutorials/parallel_comp/ for an introduction to parallel computing.
2. A. Bustamam, K. Burrage, and N. A. Hamilton, IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 679–692 (2012).
3. M. Carrasco and J. P. Florens, Econometr. Theor. 6, 797–834 (2000).
4. M. Carrasco, M. Chernov, J. P. Florens, and E. Ghysels, J. Econom. 140, 529–573 (2007).
5. M. Carrasco and R. Kotchoni, Cirano Scientific Series 22, 1 (2013), see www.cirano.qc.ca/pdf/publication/2013s-22.pdf
6. G. C. Cawley and N. L. C. Talbot, J. Mach. Learn. Res. 11, 2079–2107 (2010).
7. B. Chapman, G. Jost, and R. V. D. Pas, Using OpenMP: Portable Shared Memory Parallel Programming (The MIT Press, Cambridge, 2008).
8. J. C. Cox, J. E. Ingersoll, and S. A. Ross, Econometrica 53, 385–408 (1985).
9. D. Gaffen, The Wall Street Journal (2009), available at https://blogs.wsj.com/marketbeat/2009/03/09/measuring-arbitrage-in-milliseconds/
10. A. R. Gallant and G. Tauchen, J. Am. Stat. Assoc. 93, 10–24 (1998).
11. R. Kotchoni, Comput. Stat. Data Anal. 76, 464–488 (2014).
12. J. A. Nelder and R. Mead, Comput. J. 7, 308–313 (1965).
14. B. Parhami, Introduction to Parallel Processing: Algorithms and Architectures (Kluwer Academic, New York, 2002).
15. S. Patterson, The Wall Street Journal: Business (2010), see https://www.wsj.com/articles/SB10001424052748703340904575285002267286386
16. S. Rosov, AAM Journal of Investments and Pensions 19 (2014), see http://www.asiaasset.com/aam/201411/1114_insight.aspx
17. K. Singleton, J. Econom. 102, 111–141 (2001).
18. B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers (Prentice Hall, New Jersey, 1998).
19. H. Zhou, J. Comput. Financ. 5, 89–122 (2001).

