Utilizing general-purpose computing on graphic processing units for engineering algorithm speedup.
Teodor Ande Elstad Simen Haugerud Granlund
Master of Science in Engineering and ICT Submission date: June 2014 Supervisor: Ole Ivar Sivertsen, IPM Co-supervisor: Bjørn Haugen, IPM Adel Chemaly, Technosoft Inc. Alok Mathur, Technosoft Inc. Norwegian University of Science and Technology Department of Engineering Design and Materials
THE NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY DEPARTMENT OF ENGINEERING DESIGN AND MATERIALS
MASTER THESIS SPRING 2014
FOR STUD.TECHN. TEODOR ANDRE ELSTAD AND STUD.TECHN. SIMEN HAUGERUD GRANLUND

UTILIZING GENERAL-PURPOSE COMPUTING ON GRAPHIC PROCESSING UNITS (GPGPU) FOR ENGINEERING ALGORITHM SPEED UP
Utnytte beregningskraft i grafiske prosessorer (GPGPU) til å øke beregningshastighet for ingeniøralgoritmer

Most software tools for KBE development are today based around special purpose modeling frameworks like the AML KBE language developed by TechnoSoft Inc. While this approach has been very successful, there are continuously requirements for extending the capabilities, both regarding new functionality and computational performance. TechnoSoft would like to have certain algorithms sped up by utilizing general-purpose computing on graphics processing units (GPGPU). The goal of the master thesis will be to implement a small library of GPGPU-enhanced algorithms that could be used in TechnoSoft applications. The technology to be investigated is NVIDIA, including the CUDA architecture and OpenCL.

Towards the end of the work with the master assignment, when algorithms for grid smoothing for CFD analysis are developed, the candidates will study if these algorithms are well suited for speed up by graphic processor implementation.

The assignment includes:
1. A study of advantages offered by using GPGPU. How to best utilize the high capacity for parallel processing? What kind of problems can be sped up?
4. Select engineering algorithms for test applications, in cooperation with the advisers, that are well suited and relevant for speed up implementation on the graphic processor.
5. Extensive performance testing of different implementations on different problem sizes and hardware. Are we getting good speed up on real life problems and hardware, or is the overhead associated with parallel computing hampering performance?

Three weeks after start of the thesis work, an A3 sheet illustrating the work is to be handed in. A template for this presentation is available on the IPM web site under the menu “Masteroppgave” (http://www.ntnu.no/ipm/masteroppgave). This sheet should be updated one week before the Master's thesis is submitted.

Performing a risk assessment of the planned work is obligatory. Known main activities must be risk assessed before they start, and the form must be handed in within 3 weeks of receiving the problem text. The form must be signed by your supervisor. All projects are to be assessed, even theoretical and virtual ones. Risk assessment is a running activity, and must be carried out before starting any activity that might lead to injury to humans or damage to materials/equipment or the external environment. Copies of signed risk assessments should also be included as an appendix of the finished project report.

The thesis should include the signed problem text, and be written as a research report with summary both in English and Norwegian, conclusion, literature references, table of contents, etc. During preparation of the text, the candidate should make efforts to create a well arranged and well written report. To ease the evaluation of the thesis, it is important to cross-reference text, tables and figures. For evaluation of the work, a thorough discussion of results is appreciated.

The thesis shall be submitted electronically via DAIM, NTNU's system for Digital Archiving and Submission of Master's theses.

Contact persons:
From the industry: Alok Mathur, Technosoft Inc., email: [email protected]; Adel Chemaly, Technosoft Inc., email: [email protected]
From NTNU: Bjørn Haugen
Utilizing general-purpose computing on graphic processing units for engineering algorithm speed up Master thesis spring 2014
Stud.techn. Simen Haugerud Granlund, Stud.techn. Teodor Ande Elstad Department of Engineering Design and Materials Faculty of Engineering Science and Technology Norwegian University of Science and Technology
Abstract

The goal of this thesis is to investigate the possible benefits of using general-purpose computing on graphics processing units (GPGPU) in order to speed up the execution of calculations in engineering applications. The thesis focuses primarily on speeding up the process of analyzing laser scan point clouds in software developed by TechnoSoft Inc. This is achieved by developing faster algorithms for solving the k-nearest neighbors (kNN) and All-kNN problems, optimized for point cloud data. A parallel brute-force algorithm is developed, which is capable of solving the kNN problem up to 70 times faster than a similar algorithm developed by Garcia et al. [13], when working on comparable data. By utilizing the k-d tree data structure, a parallel tree-build and query algorithm is developed, suitable for solving the All-kNN problem. This algorithm improves the result obtained by the brute-force algorithm by up to 300 times.
Sammendrag

Målet med denne avhandlingen er å utforske mulighetene knyttet til å anvende GPGPU (general-purpose computing on graphics processing units) for å forbedre ytelsen til tunge beregninger i programvarebaserte ingeniørverktøy. Avhandlingen tar for seg en problemstilling knyttet til analyse av punktskyer, i programvare utviklet av TechnoSoft Inc. Ytelsen i denne programvaren blir forsøkt økt ved å utvikle raskere algoritmer for løsning av kNN-problemet (k nærmeste naboer) og Alle-kNN-problemet, optimalisert for punktskybaserte data. En GPU-parallellisert brute-force-algoritme blir utviklet, som er i stand til å løse kNN-problemet 70 ganger raskere enn en tilsvarende algoritme utviklet av Garcia et al. [13], anvendt på tilsvarende data. Ved å anvende k-d-trær blir en GPU-parallellisert algoritme utviklet, egnet for å løse Alle-kNN-problemet. Denne algoritmen forbedrer resultatet oppnådd gjennom bruk av brute-force-algoritmen med opptil 300 ganger.
Preface

This thesis is based on work conducted by Stud.techn. Simen Haugerud Granlund and Stud.techn. Teodor Ande Elstad in the spring of 2014, as a part of a collaboration between the Department of Engineering Design and Materials, Norwegian University of Science and Technology (NTNU), and TechnoSoft Inc. (TSI). The authors would like to thank Professor Ole Ivar Sivertsen, NTNU, and Alok Mathur, TSI, for help, guidance, feedback and inspiration throughout the project, as well as Associate Professor Bjørn Haugen and Adel Chemaly at TechnoSoft Inc. for their support.
Trondheim, June 10, 2014
Simen Haugerud Granlund, Teodor Ande Elstad
Table of Contents
Preface  5
Table of Contents  11
List of Tables  13
List of Figures  16
List of Algorithms  17
List of Research Questions  19
Abbreviations  20

1  Introduction  1

2  Background  3
   2.1  The k-nearest neighbors search problem  3
   2.2  A short introduction to parallel programming principles  4
        2.2.1  Shared memory architecture  5
        2.2.2  Distributed memory architecture  6
        2.2.3  Parallel speedup  6
        2.2.4  The APOD design cycle  7
   2.3  A short introduction to GPU programming and CUDA  7
        2.3.1  General-purpose computing on graphics processing units  8
        2.3.2  NVIDIA GPU architecture  8
        2.3.3  CUDA programming model  10

3  The quest for a faster kNN search  11
   3.1  A short evaluation of OpenCL and CUDA  12
        3.1.1  Matrix multiplication benchmark  12
        3.1.2  A quick evaluation of available documentation  14
   3.2  A brute-force based approach  14
        3.2.1  Progress made by Garcia et al.  15
        3.2.2  Optimizing the brute-force algorithm for point cloud data  15
   3.3  Application of k-d trees to the kNN problem  20
        3.3.1  Why k-d trees?  20
        3.3.2  Building k-d trees for point cloud data  21
        3.3.3  Querying the k-d tree  23
        3.3.4  Testing a serial k-d tree based kNN solver  25
   3.4  Development of a parallel k-d tree build algorithm  28
        3.4.1  From recursive to iterative implementation  29
        3.4.2  Parallelization strategy  29
        3.4.3  Parallelization of Subtree-Balance  31
        3.4.4  Further improvements  34
   3.5  Development of a parallel k-d search algorithm  37
        3.5.1  Parallelization strategy  37
        3.5.2  From recursive to iterative implementation  38
        3.5.3  CUDA implementation  38
   3.6  CUDA Optimizations  43

4  Final results, discussion and conclusion  45
   4.1  Test environment  45
   4.2  Final results and discussion  46
        4.2.1  Solving the kNN problem  46
        4.2.2  Solving the All-kNN problem  49
        4.2.3  Parallelization performance increase  52
   4.3  Conclusion  55

Bibliography  56

Appendix A  Data sheets  59

Appendix B  Source code documentation  63
   B.1  Api documentation  63
   B.2  Installation instructions  66

Appendix C  Source code  71
   C.1  API header  71
   C.2  Brute Force  72
        C.2.1  Bitonic sort version  72
        C.2.2  min-reduce version  79
   C.3  k-d tree build  88
        C.3.1  Radix select  89
        C.3.2  Parallel quick select  97
        C.3.3  Multiple radix select  102
        C.3.4  Parallel k-d tree build  111
   C.4  k-d tree search  119
        C.4.1  CUDA k-d tree search  119
        C.4.2  OpenMP k-d tree search  133

Appendix D  Risk assessment  139
List of Tables
3.1.1  Query results from Stackoverflow  14
3.1.2  Query results from Google  14
3.2.1  Selected results from Figure 3.2.2.  19
3.4.1  Development of a k-d tree build with a million points, showing how the different tree level operations vary throughout the build process.  35
4.1.1  Tabulated information about the three test environments.  46
List of Figures
2.3.1  A visualization of the memory hierarchy in CUDA.  9
2.3.2  The relationship between threads and blocks in a grid [3].  10
3.1.1  Matrix multiplication benchmark results  13
3.2.1  A visualization of the parallel min-reduce operation.  18
3.2.2  Three different kNN brute-force implementations. The timing results are based on a k equal to 10.  19
3.3.1  A set of points on a plane, with a possible k-d tree indicated.  21
3.3.2  Tree representation of the points in Figure 3.3.1.  22
3.3.3  Timing results for recursive k-d tree building  25
3.3.4  Timing results for mean query time with k equal to one  26
3.3.5  Comparison of mean query time with k equal to one with fast brute-force and recursive k-d tree based algorithms  27
3.3.6  Comparison of timing of n queries with k equal to one with fast brute-force and recursive k-d tree based algorithms  27
3.4.1  Development of subtasks as the k-d tree generation progresses. It shows, at each tree level, how many nodes there are to parallelize and how big each node balancing is.  30
3.4.2  An illustration of radix selection [9].  32
3.4.3  Timing results from an intermediate version of the parallel k-d tree build algorithm.  35
3.4.4  Visualization of the final improvements on the k-d tree build implementation.  36
3.5.1  The stack's memory usage, compared to the amount of shared memory. Here k was set to 100.  41
3.5.2  Search time comparison between different stack memory types. The tests are done with k equal to 10 and 1·10^6 queries per tree size.  42
3.5.3  Timing results from two different k-heap implementations, with varying k and 1·10^6 reference points.  42
3.6.1  Three different memory transactions, where A and B result in good coalesced and cached transactions, while C shows a strided access pattern with bad coalescing.  44
4.2.1  Comparison of brute-force algorithm developed by Garcia et al. and min-reduce based brute-force algorithm developed in this thesis. k is fixed at 10 and m is increasing.  47
4.2.2  Comparison of min-reduce brute-force and k-d tree based algorithms for solving the kNN problem for k = 100 and increasing m ≤ 1e7.  48
4.2.3  Comparison of min-reduce brute-force and k-d tree based algorithms for solving the kNN problem for m = 1.0·10^6 and increasing k ≤ 200.  49
4.2.4  Comparison of min-reduce brute-force and k-d tree based algorithms with CPU and GPU parallelized query. The graph compares runtime for solving the All-kNN problem for k = 100 and increasing m.  50
4.2.5  Comparison of min-reduce brute-force and GPU parallelized k-d tree based algorithms for solving the All-kNN problem for m = 1e6 and increasing k ≤ 200.  51
4.2.6  Comparison of runtime for GPU (GTX 560) and CPU (OpenMP) parallelized k-d tree based n query.  52
4.2.7  Comparison between serial and parallel k-d tree build performance.  53
4.2.8  Parallel speedup for the k-d tree implementation for varying values of m.  53
4.2.9  Comparison between serial and parallel All-kNN query performance.  54
4.2.10 Parallel speedup comparison for the All-kNN query between the CUDA and OpenMP implementation.  55
List of Algorithms
3.1  General work distribution in CUDA  16
3.2  Iterative Bitonic sort  17
3.3  Serial ⊗-reduction  18
3.4  Recursive k-d tree build  22
3.5  Recursive kNN k-d tree search  24
3.6  Iterative k-d tree build  29
3.7  Parallel subtree balance  33
3.8  Iterative kNN k-d tree search  39
List of Research Questions
RQ 1. Can high performance be achieved by a parallel brute-force kNN algorithm on large point clouds?

RQ 2. Can a parallel brute-force kNN algorithm be fast enough to solve the All-kNN problem within reasonable time?

RQ 3. Is it possible to use a k-d tree to increase the performance of kNN queries, compared to a parallel brute-force solution?

RQ 4. Is it possible to use a k-d tree to increase the performance of All-kNN queries, compared to a parallel brute-force solution?

RQ 5. Is it possible to parallelize the k-d tree build algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?

RQ 6. Is it possible to parallelize the All-kNN query algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?
Abbreviations

ALU    = Arithmetic logic unit
APOD   = Assess, Parallelize, Optimize, Deploy
ATLAS  = Automatically Tuned Linear Algebra Software
BLAS   = Basic Linear Algebra Subprograms
DMA    = Distributed memory architecture
CPU    = Central processing unit
CUDA   = Compute Unified Device Architecture
GPGPU  = General-purpose computing on graphics processing units
GPU    = Graphics processing unit
kNN    = k-nearest neighbors
MPI    = Message Passing Interface
OpenCL = Open Computing Language
OpenMP = Open Multi-Processing
RQ     = Research question
SIMT   = Single-Instruction, Multiple-Thread
SM     = Streaming Multiprocessor
SMA    = Shared memory architecture
TSI    = TechnoSoft Inc.
Chapter 1
Introduction

In recent years, hardware developers have increased the performance of new systems by utilizing the power of parallel processing. The demand for better computer graphics has been a driving force for creating more powerful parallel computing hardware. Powerful graphics cards, which feature a highly parallel architecture, are today available in the consumer market at a relatively low price.

Unfortunately, this parallel power is not automatically utilized by traditional algorithms written with sequential execution in mind. In addition, parallelization is not a magic bullet. It can deliver blisteringly fast execution when the type of problem permits it, but for many classes of problems, benefiting from parallel processing capabilities is not trivial, or even possible.

Due to this need for different algorithms and specialized knowledge in order to harness the power of parallel execution, a lot of software is currently not benefiting from the possible performance of modern hardware. This is a point to be concerned with in relation to engineering software, where complex, time consuming computations are commonplace.

In our thesis, we have studied how to utilize the power of general-purpose computing on graphics processing units (GPGPU) in engineering software applications. We have studied this problem by developing faster, parallel algorithms for use in point cloud analysis software made by TechnoSoft Inc. (TSI).

During the course of our work, we discovered that some of the assignment topics were not relevant in the way initially envisioned. At the same time, new possibilities arose that were not accounted for during the draft of the assignment text. In collaboration with our thesis advisors, we therefore chose to deviate somewhat from the original assignment. An evaluation of GPGPU libraries like ViennaCL and cuBLAS was expected to be relevant for the thesis, but this proved to be entirely irrelevant for the particular problem we studied, since we ended up not being able to rely on any prefabricated GPGPU library.
In addition, the grid smoothing algorithms for CFD analysis were not developed in time to be a part of our work, so this part of the assignment was left out as well.
Chapter 2
Background

The purpose of this chapter is to cover some background material, giving the reader a foundation for the later chapters. An introduction to the kNN problem is given, and its relation to the All-kNN and Q-kNN problems is discussed. In addition, relevant parallel programming principles, practices and tools are presented.
2.1 The k-nearest neighbors search problem
The k-nearest neighbors (kNN) search problem is the problem of locating the k closest neighbors to a given query point, with k being some positive natural number. This is intuitively a quite simple problem to grasp, and for most of our purposes an intuitive understanding of the problem will be sufficient, but let us start by looking a bit closer at the properties of kNN search in general.

Consider a point p in d-dimensional space, p = [x1, x2, ..., xd]. Given a set of m such points, S = [p1, p2, ..., pm], an additional query point q, also in d-dimensional space, and some distance metric D = f(p, q), the kNN problem is to find the k points in S such that the sum of distances to q is minimized. We introduce the term reference points for the set S, to differentiate it from the query points Q.

It is worth noting that since we are not limited, either in the number of dimensions or in the distance metric we choose, the kNN problem is applicable in many different situations. If we e.g. wanted to construct a system for finding beer with similar flavor to one beer we just tasted, we could build a database of different beers, categorized by flavor dimensions like bitterness, maltiness, sweetness, and so on. Then, to find beers similar in flavor to the one we just tasted, we would perform a kNN query on this database, using a suitable distance metric, and the beer we just tasted as our query point.
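Stated compactly, using notation that simply transcribes the prose above (it is ours, not taken verbatim from the thesis), the problem is to find the subset T of the reference points that minimizes the total distance to q:

    kNN(q, S, k) = arg min_{T ⊆ S, |T| = k}  Σ_{p ∈ T} D(p, q)

With the Euclidean metric used for point clouds later in this text, D(p, q) = (Σ_{i=1..d} (x_i − q_i)²)^(1/2), with d = 3.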
The general nature of the kNN problem makes it relevant in many research and industrial settings, from 3-d object rendering and content based image retrieval to statistics and biology [14, Introduction].

When querying point cloud data, one can make some simplifications to this general kNN problem. In our research we are only concerned with three spatial dimensions, and their Euclidean relations. This is not a universal way to simplify the kNN problem for point cloud data, and other dimensions might be interesting to add for other applications. Other metrics commonly related to point cloud data, e.g. the color value of each point, could also be interesting to include. But for most applications, TSI's application included, three dimensions and a Euclidean distance metric are all we need. Throughout this text we will use the term kNN to refer to this simplification of the kNN problem.

When studying point clouds, it can be interesting to find the k closest points to every point in the point cloud. In order to compute this, one would simply perform kNN queries using all the points in the point cloud as query points. To refer to this variant of the kNN search problem, we will use the term All-kNN. Another similar variant of the kNN problem is the application of the kNN algorithm to a set of query points of size q. In this version of the problem, the query point set is not limited to the points in the point cloud; it can be any set of three dimensional points. We will refer to this problem variant as Q-kNN, and note that All-kNN is a subproblem of Q-kNN.
2.2 A short introduction to parallel programming principles
Parallel programming is programming targeting parallel computing, where sections of the program are executed simultaneously on multiple cores, processors, machines, or other suitable environments. We use the term parallelization to mean the transformation of computational instructions intended for sequential execution into simultaneous, or parallel, execution. Parallelization can be introduced at several different levels in a computer. Bit-level parallelization, where bit level operations are parallelized within a processor, as is the case for 64-bit, 32-bit, 16-bit, etc. processors, is a common low level form. In order to avoid confusion, this text is not concerned with such levels of parallelization, but will instead focus on higher level parallelization, related to developing and implementing algorithms in a regular programming language. We will also use the term parallel unit as a general term for a single computational unit in a theoretical parallel machine with an unlimited number of parallel units.

It is easy to grasp that parallelization can speed up the execution of a program.
Consider a problem where we want to add one to every element in a list of numbers. In a sequential program, we could go through the list of numbers, adding one to each element in each step. This would roughly take the time needed for adding one number to another, times the number of elements in the list. On our theoretical parallel machine, we can simply assign every element in the list to a different parallel unit and ask each unit to add one to its assigned number. This would take much less time than the sequential algorithm, just the time needed for a single addition, since all numbers in the list are incremented at the same time.

Adding one to the elements of a list is a trivial problem, and unfortunately not all problems can be parallelized as easily as this. Even simple problems can be hard to parallelize. Consider adding all the numbers together, instead of adding one to each number. This is a very similar problem; we still only have to add a number to each of the elements in the list. The problem is that we do not know what to add to a given element before all the previous elements have been added together. We could calculate the sub-sums of small subsections of the list in parallel, and then add the resulting sub-sums together in a sequential fashion, but then a part of our program does not execute in parallel. This exemplifies that for many problems we cannot entirely parallelize the execution, because some data has to be transferred between the threads. We need a way to let the parallel units communicate with each other.

Communication is often a large factor in limiting the performance that can be harnessed from parallelization. Communication can be the source of implied sequential execution, since the parallel units have to wait for data from another unit. It is also often inherently slow, and carries a high overhead due to low data transfer speeds between hardware components. The possibility of minimizing the amount of communication is therefore a major factor in determining if a sequential algorithm can be successfully parallelized.
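As a concrete illustration of the partial-sum strategy described above, the following small C/OpenMP sketch lets each thread accumulate a private sub-sum before the partial results are combined. It is only meant to illustrate the idea; the function name and structure are ours, not code from the thesis.

    #include <omp.h>

    /* Sum the elements of a. Each thread adds up its own chunk in parallel
     * (the independent part); OpenMP then combines the per-thread partial
     * sums (the part that requires communication between parallel units). */
    double parallel_sum(const double *a, int n)
    {
        double total = 0.0;
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < n; ++i)
            total += a[i];   /* accumulates into a thread-private copy of total */
        return total;        /* partial sums have been combined at this point */
    }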
2.2.1 Shared memory architecture
In a parallel computer using a shared memory architecture (SMA), the different parallel units all share a global memory. The parallel units are usually different processors, often located within the same physical chip, like a multi-core CPU. This is the architecture used by most modern desktop computers. Different varieties exist, with separate processor cache for each processor core, shared cache, or even a combination with shared L2 cache and distributed L1 cache. From the point of view of this thesis, all these varieties fall under the SMA classification.

Shared memory is easy to work with and understand, and communication between different parallel units can be facilitated quite easily, and relatively fast, by reading and writing the same memory. The drawback is that the programmer must ensure that the different parallel units do not try to access the same memory at the same time, or in the wrong order. SMA still works well for smaller parallel computer systems.
2.2.2 Distributed memory architecture
In a computer system using a distributed memory architecture (DMA), each parallel unit contains both a processor and memory. The processor can only access data from its own local memory, and if data is needed from another parallel unit, it has to be transferred from that unit's local memory into the memory of the unit in need of the data. DMA computer systems usually scale better when using a high number of parallel units, compared to SMA systems, since one memory does not have to serve all parallel units. The drawback is that communication carries a higher overhead, since data has to be transferred between the different local memories of each unit.

Computer systems using a distributed memory architecture often resemble many individual computers working in parallel on the same problem, and communicating using specialized networking components. This is the architecture favored in today's supercomputers. Many varieties exist, especially concerning the network layout between the different machines. Since each individual parallel unit in such systems can usually be considered an individual computer, each unit often has an internal shared memory architecture, like a desktop computer. This makes the entire system capable of harnessing the strengths of both SMA and DMA, but increases the complexity the programmer has to handle. As we will discover later in the text, GPUs are organized using this kind of hybrid architecture.
2.2.3 Parallel speedup
Parallel speedup, or just speedup as it is often called, is a measure of how much faster a parallel algorithm is compared to its sequential counterpart. Let Ts be the execution time of the sequential algorithm, and Tp be the execution time of the parallel algorithm on a system with p parallel units. The speedup Sp is then defined as Sp = Ts/Tp. In the ideal situation, the relation between the speedup and the number of parallel units will be linear. Due to overhead related to communication, and the use of a more complex framework for the parallel code, this is usually not possible to achieve. We therefore have the parallel efficiency metric Ep = Sp/p, which describes how much is lost to such factors.

Speedup and efficiency can be a good measure of how well an algorithm is parallelized, but they cannot necessarily be used to determine if one parallel algorithm is faster than another. Many inefficient brute-force algorithms can be parallelized with excellent speedup and efficiency, but the resulting algorithm will often be considerably slower than the parallel version of a better sequential algorithm, or even just a better sequential algorithm. Parallelizing bad code will result in just as bad code, and a thorough study of efficient algorithms should always be carried out before attempting any parallelization.
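As a small worked example of the two metrics (the numbers are chosen purely for illustration): if a sequential algorithm runs in Ts = 12 s and its parallel version runs in Tp = 2 s on p = 8 parallel units, then

    Sp = Ts / Tp = 12 / 2 = 6    and    Ep = Sp / p = 6 / 8 = 0.75,

so a quarter of the ideal linear speedup is lost to communication and other overhead.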
2.2.4 The APOD design cycle
In order to work efficiently with parallelization problems, many programmers adopt a work-flow similar to the APOD design cycle [2, Preface]. The cycle consists of four steps:

1. Assess: Locate the parts of the code which take up most of the run-time. Re-writing an entire application for parallel execution is usually very hard and time consuming, if it is even possible. The best result for the user might be to just parallelize the one algorithm that is taking a long time to execute. Parallelization might not even be the answer; a faster sequential algorithm could also be a solution, or part of the solution.

2. Parallelize: Investigate if any pre-written library of parallel functions could be used as part of, or as the entire, solution. Try to pinpoint which parts of the code can execute in parallel, and which parts are dependent on communication. If the code is inherently dependent on communication, sequential alternatives more suitable for parallelization might be researched. Just do not be tempted to use an inefficient algorithm because it is easy to parallelize.

3. Optimize: Depending on the parallel computer system, programming language, and other tools you are using, a number of conventional optimization strategies might be applied. This should not be forgotten when writing parallel code. On a CUDA GPU you will often want to check that you are using the optimal type of memory for different tasks, minimize the divergence in the code, and try to optimize the use of arithmetic operations.

4. Deploy: Run your application on real hardware, test thoroughly, and compare the results to the original. Did the parallelization actually increase the performance of the application? Deploy the code to potential users. They will benefit from the increased performance, and you will get feedback if any bugs exist.
2.3 A short introduction to GPU programming and CUDA
Moore's law has been a gift to all programmers for the past 50 years. The law predicts that the performance of integrated circuits will double every two years, and it has become the de facto standard for computer processing capabilities since it was first stated. However, since the so called Power Wall in 2002, the world of computer hardware performance has been changing. In order to keep up with Moore's law, hardware vendors have been moving from single core processors, where all computation is performed by one fast processor core, to processors with multiple cores working in parallel. As a result, many programs and algorithms have to be rewritten in order to benefit from this new hardware architecture.

In recent years, many tools for working with parallel programming have been developed.
Frameworks and libraries like OpenMP, CUDA, MPI and OpenCL are some of the more notable examples. These tools support various types of parallel hardware architecture. OpenMP supports parallelization targeting shared memory architectures. It is an implementation of multithreading, whereby a master thread divides the workload among different forked slave threads. The Message Passing Interface (MPI) is a library specification for message passing, often used for parallel programming targeting distributed memory architectures. It is maintained and promoted by a committee consisting of a wide selection of academics, hardware vendors and software vendors.

In the consumer market, GPUs represent hardware with a high number of parallel units compared to the price. The NVIDIA GeForce GTX 480 GPU is, for example, able to execute 23,040 threads in parallel [16]. The reason GPUs can have such a massive amount of parallel units at this price point, compared to traditional hardware like a CPU, is that each parallel thread is very lightweight. Individually, each thread has relatively low performance, but together they can achieve an extremely high instruction throughput. This makes targeting the GPU a good solution for high performance parallel computing on a desktop computer.
2.3.1 General-purpose computing on graphics processing units
GPGPU is the utilization of a GPU to perform heavy computations in applications, computations normally handled by the CPU. This is most efficiently accomplished by using GPGPU specific programming tools. Two widely used approaches are the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL). OpenCL is a low-level framework for heterogeneous computing on both CPUs and GPUs. It includes a programming language, based on C99, for writing kernels. Kernels are methods that execute on the GPU. It also includes an API that is used to define and control the platform. In contrast to the open OpenCL, the dominant proprietary framework, CUDA, is designed only for GPU programming. Like OpenCL, it is based on a programming language and a programming interface. Since CUDA is created by the hardware vendor, it is developed in close proximity with the hardware. For a deep and thorough survey of GPGPU programming, techniques and applications, see John D. Owens' article from 2007 [19].
2.3.2 NVIDIA GPU architecture
A CPU is usually optimized for low memory latency. This enables the CPU to quickly fetch data and instructions from memory. For a chip to achieve low latency, it needs to have a large amount of cache available. This makes it hard to physically fit very many cores on a single chip.
The GPU is optimized for high throughput. Throughput is a metric of how fast a processor is able to process data, and this is desirable when processing graphics, where data has to be updated quickly in order to redraw the screen. To achieve high throughput, a large number of cores, or ALUs, is needed. GPUs are therefore usually organized with a small control unit and cache, but a large number of cores.

An NVIDIA GPU is built around a scalable array of multi-threaded Streaming Multiprocessors (SMs). A multiprocessor is designed to execute hundreds of threads concurrently. It is organized according to an architecture called Single-Instruction, Multiple-Thread (SIMT), where each SM creates, manages, schedules and executes parallel threads in groups of 32, called warps. The threads composing a warp start at the same program address, but they have their own register state and instruction address, and are therefore free to branch and execute independently [3].
Figure 2.3.1: A visualization of the memory hierarchy in CUDA.
Another important part of the GPU architecture is the memory hierarchy, see Figure 2.3.1. Global memory is the bottom-most part of this hierarchy and is analogous to RAM on the CPU. The global memory can be used to transfer data to and from the host CPU. Each SM contains a fast L1 on-chip memory, which is divided into a shared memory and a cache. The fast shared memory is shared among the threads in a block, and resembles the shared memory found on the CPU. The threads also have their own 32-bit registers. Other types of memory also exist. Constant memory and texture memory are read-only, and therefore highly cacheable. Compared to the CPU, the peak floating-point capability and memory bandwidth of the GPU is an order of magnitude higher [20].

As better hardware is developed by NVIDIA, some properties change. These changes are clustered into versions, called compute capabilities. All NVIDIA devices are backward compatible, so a device with a compute capability of 3.x also has all the properties found in compute capability 1.x.
2.3.3 CUDA programming model
The CUDA programming model is designed to make heterogeneous GPU/CPU programming easier. The GPU works as a computation accelerator for the CPU, and the programming model should therefore be a bridge between the two. CUDA creates this bridge through a runtime library, a compiler and C language extensions. The C language extensions enable the programmer to create and launch methods on the GPU, through the runtime library. These methods are called kernels.

A CUDA program is based on data-parallel behavior, where threads are executed in parallel. The execution of a kernel is organized as a grid of blocks consisting of threads, see Figure 2.3.2. When a kernel grid is launched from the host CPU, the blocks of the grid are enumerated and distributed among the SMs. The blocks then execute concurrently on each SM. Only threads within a block can communicate with each other. This is done by creating synchronization barriers. Communication can also be facilitated through the fast shared memory located on the individual SMs. Each block and thread has its own id, which is often used to determine what portion of the data the thread should process.
Figure 2.3.2: The relationship between threads and blocks in a grid [3].
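To make the model above concrete, the following minimal CUDA program allocates global memory, launches a grid of blocks, and lets each thread use its block and thread id to pick the element it should process. It is an illustrative sketch (the block size and names are our own assumptions), not code from the thesis.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Kernel: each thread adds one to its assigned element, using the
    // block/thread id scheme described above.
    __global__ void add_one(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the last partial block
            data[i] += 1.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);          // host (CPU) buffer
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, bytes);                      // allocate global GPU memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                               // illustrative block size
        int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n
        add_one<<<blocks, threads>>>(d_data, n);         // launch the kernel grid

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        printf("first element after kernel: %f\n", h_data[0]);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }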
Chapter 3
The quest for a faster kNN search

TechnoSoft Inc. is currently developing a point cloud analysis software library. This software library is developed with the goal of making comparisons between 3D models of an engineering design and laser scans of the finished product. Such a comparison would help engineers to pinpoint production errors faster, and more accurately.

In order to get good precision in the laser scan data, a large number of points in 3D space has to be recorded. This set of data is commonly called a point cloud, and it can easily consist of 1·10^6 to 1·10^8 3D points, or even more, as a larger number of recorded points gives better accuracy in the data. Processing such a large number of points is a time consuming task, and reducing the time required for individual operations becomes important, since they will be repeated many times. Working on point cloud data is also a problem seemingly suitable for parallelization, since operations on individual points could be executed concurrently.

TSI has analyzed their current point cloud analysis algorithms, and determined that a lot of time is spent solving the All-kNN problem for the point clouds. They would therefore like to improve performance by developing GPU parallelized algorithms, capable of solving the kNN and All-kNN problems for a high number of points, at least in the order of 1·10^7 to 1·10^8, and a low value of k ≤ 100.

In this chapter, we will try to develop algorithms capable of solving the kNN and All-kNN problems with the performance required by TSI.
3.1 A short evaluation of OpenCL and CUDA
As mentioned in Section 2.3.1, there are two dominant frameworks for GPGPU programming, CUDA and OpenCL. They both have their strengths and weaknesses, and in order to determine which was the most suitable for our work, we performed a small evaluation of both. This evaluation was based on a short benchmark test, where a matrix multiplication application was developed in both frameworks. The development time for both the CUDA and OpenCL matrix multiplier was limited, in order to highlight any differences in ease of use between the two frameworks. In addition, a quick analysis of available documentation for both frameworks was made, using common online search engines. In all our tests CUDA outperformed OpenCL. Although our tests were very limited in scope, they support the opinion that CUDA is currently faster and better documented than OpenCL. If the portability offered by OpenCL is not required, we would recommend using CUDA for GPGPU programming.
3.1.1 Matrix multiplication benchmark
In order to compare the performance differences between CUDA and OpenCL, a simple matrix multiplication algorithm was implemented in both CUDA and OpenCL. These implementations were based on examples provided by NVIDIA and AMD. In order to establish a baseline to which the CUDA and OpenCL results could be compared, additional implementations of the matrix multiplication algorithm were made, both as a naive serial implementation in C and as a highly optimized implementation using the Automatically Tuned Linear Algebra Software (ATLAS [1]) implementation of BLAS. Finally, a highly optimized CUDA implementation was made using the cuBLAS [4] library.

The test algorithm multiplies two square matrices of size NxN. This is an interesting problem to use for performance benchmarking for a number of reasons:

• Matrix multiplication is often used as a subroutine in more advanced mathematical algorithms.
• Matrix multiplication can be parallelized over a large number of computational cores, making it suitable for GPGPU programming.
• The mathematics of matrix multiplication is trivial, making it an easy to understand example problem.

The implementations were tested on the test environments described in Table 4.1.1. The results are presented in Figure 3.1.1. We see that the naive serial implementation quickly becomes unusable, due to a rapid increase in run time. The improvement gained by using ATLAS BLAS is very large compared to the naive implementation, although it cannot keep up with the run times achieved by the CUDA and OpenCL implementations.
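For reference, the kind of naive kernel such a benchmark is built around can be sketched as below: one thread per output element and no shared-memory tiling. This is our own illustration, not the NVIDIA or AMD example code actually used in the tests.

    // Naive N x N matrix multiplication: C = A * B, row-major storage.
    // Each thread computes one element of C.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }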
Figure 3.1.1: Matrix multiplication benchmark results
The difference between CUDA and OpenCL is quite small compared to the naive and BLAS implementations, but the CUDA implementation is on average about twice as fast as the OpenCL implementation. This is quite a big difference, and it could be related to all tests being run on an NVIDIA graphics card. It might also have been caused by differences in quality between the NVIDIA and AMD examples.
Looking at the results for the cuBLAS implementation, we can also see the impact of using a highly optimized library for GPGPU programming. The cuBLAS implementation is faster than using the basic CUDA example, indicating that proper use of libraries can be very beneficial.
It is also important to note that this is a very small test. In order to conclude whether CUDA is indeed faster than OpenCL, one would have needed to implement a wide selection of algorithms and test them on several different hardware configurations. Although this test is inconclusive regarding that question, the results seem to support several older investigations concluding that CUDA is faster than OpenCL, one notable example being A Comprehensive Performance Comparison of CUDA and OpenCL [11] by Jianbin Fang et al.
3.1.2 A quick evaluation of available documentation
When we were installing CUDA and OpenCL, and implementing our test algorithms, we relied on the online documentation available for the two GPGPU frameworks. Our subjective experience was that finding good documentation for CUDA was a lot easier than for OpenCL. In order to investigate this, we made a series of queries for pages related to CUDA and OpenCL on Google, Google Scholar and Stackoverflow.com (a popular programming Q&A site). The results are shown in the following tables (all data from 16 Jan 2014).

Query                 No of Stackoverflow.com results
Tagged OpenCL         1935
Tagged CUDA           6137
Open search OpenCL    5818
Open search CUDA      16174

Table 3.1.1: Query results from Stackoverflow

Query                          No of Google results    No of Google Scholar results
opencl paralell programming    322000                  7480
cuda paralell programming      558000                  17100
opencl gpgpu                   558000                  5230
cuda gpgpu                     816000                  13500
opencl programming             875000                  8160
cuda programming               2790000                 22700

Table 3.1.2: Query results from Google
3.2 A brute-force based approach
As a starting point in our quest for a fast kNN search, we investigated relevant literature. Consulting our advisors led us to two papers by Garcia et al. [13, 14]. In these papers, a parallel brute-force algorithm is developed, capable of solving the kNN problem orders of magnitude faster than comparable fast serial algorithms. Unfortunately, the algorithm developed by Garcia et al. has some limitations, especially regarding the number of points that the kNN problem can be solved for. This could be due to the general nature of the algorithm developed by Garcia et al., as it is optimized for solving problems in a large number of dimensions. By optimizing the algorithm for point cloud data, we could be able to get around this limitation.

RQ 1. Can high performance be achieved by a parallel brute-force kNN algorithm on large point clouds?
Given a fast kNN algorithm for point cloud data, the All-kNN problem can be solved easily, by applying the algorithm sequentially for all points in the cloud. This, however, is a strategy sensitive to minor inefficiencies in the kNN algorithm. The algorithm will have to be very fast for a single problem in order to solve the All-kNN problem within reasonable time. Whether this is achievable with a fast brute-force based algorithm will have to be investigated.

RQ 2. Can a parallel brute-force kNN algorithm be fast enough to solve the All-kNN problem within reasonable time?
3.2.1 Progress made by Garcia et al.
The algorithm developed by Garcia et al. is general in nature, and solves the kNN problem for any given dimension. It does this by performing the following three steps:

1. Compute all distances between the query point q and all reference points in S.
2. Sort the distances.
3. Pick the k shortest distances.

If this general brute-force algorithm is used on n query points, the time complexity will be O(n·m·d). To an experienced programmer, this might seem like an inefficient choice for a kNN algorithm. Usually, brute-force based algorithms are frowned upon, but when parallelization is taken into account, they can be a valid choice. Brute-force algorithms tend to perform a lot of isolated computations, which can be easily parallelized. Combined with the potentially poor performance of the serial brute-force algorithm, this gives great speedup, and in some cases actually good performance. In addition, brute-force algorithms tend to be very robust on different data, and behave in a predictable manner.
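The three steps translate directly into a simple serial reference implementation. The sketch below, in plain C and restricted to three dimensions, is only meant to make the steps concrete; the actual implementations are listed in Appendix C.2, and a full version would also keep track of which reference points the k smallest distances belong to.

    #include <stdlib.h>

    typedef struct { float x, y, z; } Point3;

    static int cmp_dist(const void *a, const void *b)
    {
        float da = *(const float *)a, db = *(const float *)b;
        return (da > db) - (da < db);
    }

    /* Steps 1-3: compute all distances to q, sort them, keep the k smallest.
     * Only the distances are returned here; indices are omitted for brevity. */
    void knn_brute_force(const Point3 *ref, int m, Point3 q, int k, float *out_k_dists)
    {
        float *dist = malloc(m * sizeof *dist);
        for (int i = 0; i < m; ++i) {                   /* 1. compute distances   */
            float dx = ref[i].x - q.x, dy = ref[i].y - q.y, dz = ref[i].z - q.z;
            dist[i] = dx * dx + dy * dy + dz * dz;      /* squared distance suffices */
        }
        qsort(dist, m, sizeof *dist, cmp_dist);         /* 2. sort the distances  */
        for (int i = 0; i < k && i < m; ++i)            /* 3. pick the k smallest */
            out_k_dists[i] = dist[i];
        free(dist);
    }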
3.2.2 Optimizing the brute-force algorithm for point cloud data
In order to improve the performance of the general brute-force algorithm, we want to restrict it to 3D space, and optimize it for low values of k. We also want to iron out any implementation choices that limit the number of points, m, the algorithm can operate on.

Let us start by addressing the limit on the number of points. The implementation made by Garcia et al. only supports problem sizes up to 65535. This limit corresponds to the number of theoretical blocks a CUDA kernel is allowed to spawn. The limitation is therefore only an implementation issue, not an issue with the underlying algorithm, and can be solved by applying a general partitioning algorithm. This algorithm splits the work amongst different CUDA blocks, and is shown in Algorithm 3.1.

Additional improvements can be made to the general base algorithm if we take advantage of our restrictions to the kNN problem. With the dimension fixed to three, the Euclidean distance can be calculated in a more optimized fashion.
Algorithm 3.1 General work distribution in CUDA
Let L be any dividable work quantity of size l.

function CUDA-KERNEL(L)
    b ← blockIdx.x            ▷ Current block id.
    d ← gridDim.x             ▷ Number of theoretical blocks in current grid.
    while b < l do
        DO-WORK(L(b))
        b ← b + d
    end while
end function
CUDA consists of lightweight threads, which makes reducing the amount of arithmetic calculations important for achieving good performance. The general base algorithm, which has to take any number of dimensions into account, uses two vectors and cuBLAS to calculate the distance. This increases the complexity and data bandwidth required, and can be avoided in our case.

The final, and maybe the most important, optimization we want to make is changing the sorting operation used to sort the computed distances. For problem instances in a low number of dimensions, the most time consuming operation is the sorting operation. For instance, given a search with 8 dimensions, the sort consumes 62% of the runtime [13].
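Combining the work distribution of Algorithm 3.1 with the fixed three-dimensional distance calculation discussed above could look like the following CUDA sketch. The thread-level striding and all names are illustrative assumptions, not the thesis implementation.

    struct Point3 { float x, y, z; };

    // Each thread computes squared distances for a strided subset of the
    // m reference points, so any m can be handled regardless of grid size
    // (the pattern of Algorithm 3.1, applied here at thread granularity).
    __global__ void compute_dists(const Point3 *ref, int m, Point3 q, float *dist)
    {
        int i      = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (; i < m; i += stride) {
            float dx = ref[i].x - q.x;
            float dy = ref[i].y - q.y;
            float dz = ref[i].z - q.z;
            dist[i] = dx * dx + dy * dy + dz * dz;   // no cuBLAS call needed in 3-d
        }
    }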
Bitonic sort

It is proven that general sorting can be performed with a time complexity of O(m log(m)) [10]. This is a costly operation if one only needs the smallest k values in a list, as is the case with the brute-force algorithm. To improve this, a sorting algorithm that sorts the list in a linear fashion, where the smallest elements are sorted first, could be used. This strategy is applied in the general brute-force algorithm. However, using this strategy often forces us to select a sorting algorithm with bad time complexity. For instance, the insertion sort algorithm used in the general brute-force algorithm [10] has a time complexity of O(m^2); the time complexity of finding the k smallest points will therefore be O(m·k). Asymptotically, this would give better timing results in cases where k is smaller than log(m). Let us analyze a case with k = 100. For the insertion sort to gain any asymptotic advantage over the best sorting algorithms, the problem size has to be larger than 2^100, which requires a point cloud of about 1.3·10^30 points. This is unrealistic for our problem. Choosing another sorting algorithm could therefore be a better strategy.

Graham Nolan discusses the possibility of improving Garcia's algorithm by using bitonic sort, and he states that it gives a significant improvement [18]. Bitonic sort is a known O(m log(m)) algorithm, and is based around a sorting network. The network is a series of interleaving bitonic sequences.
A sequence is bitonic if it monotonically increases and then monotonically decreases [10]. An iterative version of the bitonic sort is described in Algorithm 3.2.

Algorithm 3.2 Iterative Bitonic sort
Input: A list L with length m.
Output: A sorted list, L.
P ← {2^i | i ∈ N}

function BITONIC-SORT(L)
    for all {p ∈ P | p ≤ m} do
        for all {k ∈ P | p ≥ k > 0} do
            for all 0 ≤ i < m do
                pos ← k ⊕ p               ▷ ⊕ is the bitwise xor operator
                if pos < i then
                    if ¬(i & p) then      ▷ & is the bitwise and operator
                        COMPARE(L(i), L(pos))
                    end if
                    if (i & p) then
                        COMPARE(L(pos), L(i))
                    end if
                end if
            end for
        end for
    end for
end function

function COMPARE(a, b)
    if a > b then
        SWAP(a, b)
    end if
end function
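For comparison, a textbook iterative bitonic sorting network can be written in plain C as below. It follows the common formulation with index variables i, j and k (and assumes the length m is a power of two) rather than the exact notation of Algorithm 3.2.

    /* Sort m floats in place with a bitonic network; m must be a power of two. */
    void bitonic_sort(float *a, int m)
    {
        for (int k = 2; k <= m; k <<= 1) {              /* size of the bitonic sequences */
            for (int j = k >> 1; j > 0; j >>= 1) {      /* compare distance */
                for (int i = 0; i < m; ++i) {
                    int partner = i ^ j;                /* bitwise xor gives the partner */
                    if (partner > i) {
                        int ascending = ((i & k) == 0); /* direction from bitwise and */
                        if ((ascending && a[i] > a[partner]) ||
                            (!ascending && a[i] < a[partner])) {
                            float tmp = a[i];
                            a[i] = a[partner];
                            a[partner] = tmp;
                        }
                    }
                }
            }
        }
    }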
Min-reduce

Sorting the distances, with an O(m log(m)) time complexity, still looks like a high price to pay to get the smallest values, especially if k is reasonably small. Do we need to sort the list in the first place? An algorithm that is more suitable, and also highly parallelizable, is the reduce operation. Cormen [10] defines the ⊗-reduction of an array d of size m, where ⊗ is any associative operator, to be the value y given by the following formula: y = d[1] ⊗ d[2] ⊗ · · · ⊗ d[m]. In the serial case, this is a typical linear algorithm with time complexity O(m), as shown in Algorithm 3.3.
Algorithm 3.3 Serial ⊗-reduction
Let ⊗ be any associative operator.

function REDUCE(d, ⊗)
    y ← d[0]
    for i ← 1, m do
        y ← y ⊗ d[i]
    end for
end function
Since the operator ⊗ is associative, there is no difference in which order the values are combined, or whether it is done in parallel. A tree based approach, like the one in Figure 3.2.1, can be used. It is a good parallelization strategy, where every possible independent subtask is parallelized. Each tree level performs its associative operations in parallel, and the results are combined as the tree levels progress downwards, until only one element remains. The parallel equivalent of Algorithm 3.3 therefore runs in O(log(m)) time.
Figure 3.2.1: A visualization of the parallel min-reduce operation.
To solve our problem, the associative operation has to be the minimum operator. Implementing the min-reduce algorithm can easily be done in CUDA, but in order to get the best performance, some optimization techniques, like loop unrolling, sequential addressing and warp unrolling, described in Section 3.6, should be applied.
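A basic CUDA sketch of such a min-reduce, before the unrolling and addressing optimizations mentioned above are applied, is shown below. Each block reduces its chunk of the distance array to one partial minimum; a second pass (not shown) reduces the per-block results. The code is our own illustration, not the implementation in Appendix C.2, and it assumes blockDim.x is a power of two.

    #include <float.h>

    // Reduce each block's chunk of dist[] to its minimum value.
    // out[blockIdx.x] receives one partial minimum per block.
    // Launch as: min_reduce<<<blocks, threads, threads * sizeof(float)>>>(dist, m, out);
    __global__ void min_reduce(const float *dist, int m, float *out)
    {
        extern __shared__ float sdata[];                 // one float per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < m) ? dist[i] : FLT_MAX;        // load, pad with +inf
        __syncthreads();

        // Tree-based reduction within the block (compare Figure 3.2.1).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = sdata[0];                  // this block's partial minimum
    }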
Results

Two variations of our restricted brute-force algorithm were implemented, one using the bitonic-sort strategy, and one using the min-reduce operator. Both were compared to the algorithm developed by Garcia et al. The complete implementations can be found in Appendix C.2.

Figure 3.2.2 shows the timing results with k equal to 10 for the three different brute-force implementations. We see that the general brute-force algorithm has the worst performance. This is as expected, since it is developed for solving a more general problem, and does not use the optimizations described in the previous sections.
The results for the brute-force implementation using bitonic sort show that Graham Nolan's idea of improving the sorting algorithm has a huge impact. It is almost five times faster than Garcia's implementation, as shown in Table 3.2.1. As a note, bitonic sort has a sorting network that is best suited for lengths that are a power of two. This explains the two drops observed in the graph.
Figure 3.2.2: Three different kNN brute-force implementations. The timing results are based on a k equal to 10.
Reference points (m)    6.0 · 10^5    1.1 · 10^7
Garcia                  231.8 ms      -
Bitonic sort            48.1 ms       1077.2 ms
Min-reduce              3.3 ms        54.2 ms
Table 3.2.1: Selected results from Figure 3.2.2.
The big winner in this comparison is the min-reduce version. It is 70 times faster than the general brute-force algorithm, and almost 15 times faster than the bitonic version.
Although the min-reduce brute-force algorithm is the best choice in this test for low values of k, this is not always the case. Performing k min-reduce operations takes O(k log(m)) time, because one min-reduce has a time complexity of O(log(m)). If we increase k towards m, the result would be a sorted list, and the time complexity will go towards O(m log(m)). This is the same time complexity as our bitonic sort, but a sorting operation with min-reduce has a bigger constant time penalty than the bitonic version. However, for our use case, where k is kept reasonably small, the min-reduce method is far superior.
3.3 Application of k-d trees to the kNN problem
A common strategy for improving the performance of repeated queries in a large dataset is to organize the dataset into some data structure suited for fast querying. This strategy trades the additional time required to build a data structure for increased performance on each query. In Section 3.2 we developed an optimized parallel brute-force algorithm for performing kNN queries on a large point cloud. In this section we will investigate the possibility of improving on the brute-force algorithm by using the k-d tree data structure.

RQ 3. Is it possible to use a k-d tree to increase the performance of kNN queries, compared to a parallel brute-force solution?

RQ 4. Is it possible to use a k-d tree to increase the performance of All-kNN queries, compared to a parallel brute-force solution?

A brief argument for why k-d trees are well suited for kNN query operations is given, then we will present the k-d tree data structure, and show how it can be used for operating on three-dimensional point cloud data. Finally, a set of tests is performed on implementations of the k-d tree based algorithms, in order to determine the possible benefits of a parallel k-d tree based algorithm.
3.3.1 Why k-d trees?
A large part of this thesis is devoted to applying k-d trees to the kNN problem. The reader might ask themselves why this is so. Other data structures exist which are optimized for querying geometrical data. Why choose to investigate k-d trees in particular? Part of the explanation has to do with the scope and time resources available for the work in this thesis. Performing a full analysis and parallelization of every possible data structure, and their associated query algorithms, would simply not be feasible within our time frame. That said, k-d trees are a very attractive data structure for our use case.

• k-d trees are easy to understand and implement, leaving more time to thoroughly investigate parallelization of the algorithms.

• k-d trees are a very minimal data structure, and balanced k-d trees are complete binary trees. This makes reducing the amount of additional memory required, beyond the 3-d points themselves, a relatively simple task. This is important considering the memory bounds on GPUs, and the time penalty associated with moving data from system memory to GPU memory.

• k-d trees are well adapted to performing associative queries, where the query is for a point that is not equal to, but close to, the query point.

• Studies on parallel kNN queries based on k-d trees have been documented in the literature with encouraging results [8, 19, 21].
3.3.2 Building k-d trees for point cloud data
A k-d tree can be thought of as a binary search tree in k dimensions. A binary search tree is constructed such that, for a given node, one child-subtree consists of elements smaller than the current node, and the other child-subtree consists of elements larger than the current node. The same strategy is applied when constructing a k-d tree, but at each level we sort the child-subtree elements according to one selected dimension, called the discriminant for this level. This discriminant is cycled through the different dimensions as we move down each level in the tree. A formal description of k-d trees is given by Jon Louis Bentley in the paper Multidimensional Binary Search Trees Used for Associative Searching [7]. Let us have a look at an example using data in two dimensions. Figure 3.3.1 shows a set of points on a two-dimensional plane. The lines through each point indicate the split plane formed by the discriminant associated with the different points.
Figure 3.3.1: A set of points on a plane, with a possible k-d tree indicated.
The corresponding k-d tree is shown in Figure 3.3.2. Note that lower values in each level are placed in the left branches, and higher values are placed in the right branches. By extending this example with three fixed dimensions for the spatial dimensions, x, y, and z, we get a k-d tree suitable for storing point cloud data. It is possible to construct several algorithms for building k-d trees from a set of points, and one simple approach is using a recursive function. Algorithm 3.4 shows pseudocode for such a simple tree-building algorithm.
Figure 3.3.2: Tree representation of the points in Figure 3.3.1.
In the pseudocode, we have chosen to represent the different dimensions as natural numbers. This means that x is represented by 0, y is represented by 1, z is represented by 2, and so on. Given a set of points, P, in k-space, and an initial split dimension, i, it constructs a balanced k-d tree.

Algorithm 3.4 Recursive k-d tree build
function BUILD-KD-TREE(P, i)
    if P.length = 0 then                ▷ We have reached the end of a branch
        return NIL
    else
        m ← Median(P)
        Let L be all elements of P < m in dimension i
        Let H be all elements of P > m in dimension i
        i′ ← (i + 1) mod k              ▷ k = 3 for a three dimensional k-d tree
        m.left ← BUILD-KD-TREE(L, i′)
        m.right ← BUILD-KD-TREE(H, i′)
    end if
    return m
end function

Algorithm 3.4 starts by checking if there are any more points left in P. If not, it returns NIL as an end-of-branch marker. If there still are points left, the algorithm selects the median point, m, as the root node. Then it sorts all remaining points into a collection of points lower than the median, L, and higher than the median, H. The dimension, i, is incremented, and the Build-KD-Tree function is called recursively on both collections of points. Finally the root node is returned, so it can be assigned as the child of its parent node, or be used as a global root node. It is worth noting that the performance of this k-d tree build algorithm is sensitive to the
choice of median-finding algorithm, since we will be querying for the median O(m) times. Simply sorting the collection P and selecting the median from the middle of the sorted collection will not give optimal results. Fortunately, several O(m) median-selection algorithms exist [10], quickselect being the choice for our initial implementations. Given a fixed number of dimensions, this gives an algorithm with a time complexity of O(m log(m)) [12]. A final note about Algorithm 3.4 is that it does not handle points with duplicate values in one dimension. If the algorithm were to be fed a point collection where all points had the same value for x, it would not be able to handle it, since such a point does not explicitly belong in L or H. Several modifications can be made to handle this case. We can choose to place all conflicting median points, except one, in either L or H. The problem with this solution is that we are not guaranteed to get a balanced tree. If we were to have a set of points where all points were the same, we would not get a tree at all, but just one long branch of length m. Another strategy is to try to place the conflicting medians equally in L and H. This way the median we select will be the midmost element in the point collection, retaining the balance in the finished k-d tree. Given that we consider that duplicate median points can be located in both subtrees of a node, this will not affect search operations on the tree, as we will see later.
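For illustration, a minimal serial sketch of Algorithm 3.4 in C++ is given below. It assumes a simple pointer-based node type (the array layout used later in the parallel implementation differs), and uses std::nth_element, a quickselect-style partial partition, as the O(m) median selection. The struct and function names are illustrative only.

#include <algorithm>
#include <cstddef>

struct Point { float x[3]; Point *left, *right; };

// Builds a balanced k-d tree over the range [first, last) and returns its root.
Point *buildKdTree(Point *first, Point *last, int dim)
{
    if (first >= last) return nullptr;               // end of branch

    Point *mid = first + (last - first) / 2;         // median position
    // Partition the range so that everything before mid is smaller than the
    // median in the current dimension, and everything after is larger.
    std::nth_element(first, mid, last,
        [dim](const Point &a, const Point &b) { return a.x[dim] < b.x[dim]; });

    int next = (dim + 1) % 3;                        // cycle the discriminant
    mid->left  = buildKdTree(first, mid, next);
    mid->right = buildKdTree(mid + 1, last, next);
    return mid;
}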
3.3.3 Querying the k-d tree
With a k-d tree we can perform efficient searches for the closest point to a given point in O(log(m)) average time [12]. By maintaining a collection of the k closest points during execution of the query, we can even perform kNN searches. An example of a kNN search algorithm is shown in Algorithm 3.5. The procedure takes the root of a k-d tree, r, and a query point, q, for which we want to find the k closest points. In addition, it requires an initial dimension, i, which should be the same as the initial dimension used when building the tree. It uses this data to manipulate a collection of the k closest points to q. This collection is called the k-heap, K. The k-heap is a data structure with some special properties. You can query it for the maximum distance value of the k points stored in it, and it will only store a predetermined number of points. If you try to insert more points than the predetermined number of points, it will discard the highest values, and only keep the k lowest values. This data structure can easily be implemented as a modified max-heap [10, Chapter 6]. When the size of the heap is lower than k, it is used in the usual manner, but when the heap is of size k, a slight modification to the insertion operation is made. Instead of adding the new element to the heap, the new element is swapped with the maximum value of the heap, if it is lower than the current maximum value in the k-heap. Then the heap is re-balanced using the standard max-heap balance algorithm. In our code, we assume the k-heap to be filled at the start with k points, either from a random sample of points from the k-d tree, or with positive infinity. This way we do not need to check if the heap is filled during the recursive execution of the procedure.
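A minimal, hedged sketch of such a k-heap insertion is given below, assuming an array-based max-heap of fixed capacity k that has been pre-filled with positive infinity; the field and function names are illustrative, not the ones used in the appendices.

#include <utility>

struct KHeap {
    float *dist;     // squared distances; dist[0] is the current maximum
    int   *index;    // point index associated with each distance
    int    k;        // fixed capacity
};

// Inserts a candidate only if it is closer than the current k-th best point.
void kheapInsert(KHeap &h, float d, int idx)
{
    if (d >= h.dist[0]) return;          // not among the k closest so far

    h.dist[0]  = d;                      // replace the current maximum
    h.index[0] = idx;

    int i = 0;                           // sift the new root down (max-heapify)
    for (;;) {
        int l = 2 * i + 1, r = l + 1, largest = i;
        if (l < h.k && h.dist[l] > h.dist[largest]) largest = l;
        if (r < h.k && h.dist[r] > h.dist[largest]) largest = r;
        if (largest == i) break;
        std::swap(h.dist[i], h.dist[largest]);
        std::swap(h.index[i], h.index[largest]);
        i = largest;
    }
}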
Algorithm 3.5 Recursive kNN k-d tree search
procedure KNN-KD-TREE(K, r, q, i)
    if r = NIL then                     ▷ We have reached the end of a branch
        return
    end if
    d ← Distance(r, q)
    dx ← r.x[i] − q.x[i]
    if d < K.max then                   ▷ Is r closer to q than the current k best points?
        r.distance ← d
        Insert(K, r)
    end if
    i′ ← (i + 1) mod k                  ▷ k = 3 for a three dimensional k-d tree
    if dx > 0 then                      ▷ Select t and o so we traverse towards the closest point first
        t ← r.left, o ← r.right
    else
        t ← r.right, o ← r.left
    end if
    KNN-KD-TREE(K, t, q, i′)
    if dx² < K.max then                 ▷ Can there be closer points in the other subtree?
        KNN-KD-TREE(K, o, q, i′)
    end if
end procedure
Algorithm 3.5 starts by checking if we have reached the end of a branch. If not, it calculates the Euclidean distance between the query point, q, and the current root point, r. Calculating this distance is a costly step, since it usually involves calculating a square root. This can be circumvented in the implementation by using the square of the Euclidean distance as the distance metric, instead of the actual distance. This will not make a difference for the algorithm. The distance, dx, between the current root and the query point in dimension i is also calculated. The algorithm then checks if the current root point is closer to the query point than one of the points in the k-heap. If this is the case, it inserts the current root into the k-heap. The next dimension, i′, is calculated, and then the algorithm determines if it should traverse to the right or left child node first. For efficient querying, we want to traverse down the branch that would contain the query point. In other words, if the query point is lower than the current root point in the current dimension, we want to traverse to the left child, and vice versa. The child node that we want to traverse first is often called the target, and its corresponding subtree is often called the target subtree. In the algorithm the symbol t is used to represent the target. The child and child-subtree that are not chosen for immediate traversal are called other and other-subtree. In the algorithm the symbol o is used to represent other. The ability to prune away the other subtree, given our current best estimates stored in the k-heap and the distance dx, is what makes the k-d tree efficient for kNN searches.
After recursively investigating the target subtree, we ask if our estimates in the k-heap are better than the distance dx, remembering that the distances stored in the k-heap are squared. If this is the case, we know that there cannot be a closer point in the other subtree, and we can prune it from our search. If not, we have to check the other subtree as well. When the procedure terminates, the k closest points to the query point are stored in the k-heap.
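As a small illustration of the squared-distance metric mentioned above (names are illustrative): since x ↦ x² is monotonic for non-negative x, every comparison in the k-heap and in the pruning test remains valid if the square root is never taken, as long as dx is also squared before it is compared against the stored values.

__host__ __device__ inline float dist2(const float a[3], const float b[3])
{
    float dx = a[0] - b[0];
    float dy = a[1] - b[1];
    float dz = a[2] - b[2];
    return dx * dx + dy * dy + dz * dz;   // squared Euclidean distance, no sqrt
}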
3.3.4 Testing a serial k-d tree based kNN solver
In order to gain some real-world insight into the performance characteristics of k-d tree building and querying, serial implementations of the build and query algorithms were made. These implementations are available in Appendix C.3 and Appendix C.4.1. These two implementations were then subjected to several tests, using the test setup described in Section 4.1. All tests were performed on a set of randomly generated 3-d points, with the number of points ranging from 10^5 to 1.41 · 10^7. The results of these tests are summed up in the following figures. Figure 3.3.3 shows the timing results for the recursive k-d tree build algorithm.
Figure 3.3.3: Timing results for recursive k-d tree building
We observe that constructing a k-d tree for a large number of points is a costly operation. Given a tree of size 1.41 · 10^7, the algorithm uses nearly 9 seconds to construct the tree. We also note that the timing results seem to scale linearly in relation to the number of points. This relates nicely to the calculated time complexity of the algorithm. Figure 3.3.4 shows the timing results for querying a k-d tree of a given size. The k-d tree is queried for one point with k = 1. Since we are interested in investigating the average
performance, 10^5 consecutive queries were timed, and the average value for one query was calculated.
Figure 3.3.4: Timing results for mean query time with k equal to one
We see that querying the k-d tree is very fast on average. Querying for one point in a tree of size 1.41 · 10^7 takes about 0.0014 milliseconds. It has to be taken into account that a query with k = 1 will give the best query time, since the time complexity of the query algorithm scales with k. Still, for queries with a low k, we should expect good performance. The graph also seems to scale with the logarithm of the number of points, as expected from the time complexity calculation. In order to try to answer RQ 3, we compare against the timing results obtained from the fastest brute-force algorithm developed in Section 3.2. Figure 3.3.5 compares the average time required for building a k-d tree of a given size and performing a single k = 1 query, to the time required to compute the same result with the fastest brute-force algorithm obtained in Section 3.2. In this comparison, the k-d tree based algorithm does not seem like a good option. When performing just one query, the additional time required to build the k-d tree heavily outweighs the benefit of the improved query time, compared to the brute-force solution. This result is to be expected, since we are not really utilizing the benefit of the k-d tree, but it is still an important point that a brute-force algorithm can be very efficient for certain use-cases. Let us finally look at some results more closely related to the use-case given by TSI. Figure 3.3.6 makes the same comparison as Figure 3.3.5, but instead of comparing the time taken to perform one query, n repeated queries are performed, with n being the size of the k-d tree.
Figure 3.3.5: Comparison of mean query time with k equal to one with fast brute-force and recursive k-d tree based algorithms
Figure 3.3.6: Comparison of timing of n queries with k equal to one with fast brute-force and recursive k-d tree based algorithms
We observe that in this use-case, the k-d tree based approach has much better results than the brute-force based approach. Now the k-d tree only has to be built once, but we benefit
from the decreased query time in all n queries. Performing n queries on a point cloud of size 1.41 · 10^7 with the brute-force based algorithm takes about 9.7 · 10^5 seconds, or about 11 days. With the recursive k-d tree based algorithm, the same operation can be calculated in just over a minute. Considering the needs of TSI, it seems that this approach is worth developing further into a parallel algorithm. Despite these initial positive results, some problems are apparent from our initial tests. The k-d tree building algorithm is very slow. Given that we want to perform kNN queries on point clouds larger than 1.41 · 10^7, finding an efficient parallelization of this algorithm would be very beneficial. This is not as trivial as it might seem, as tree-based algorithms do not lend themselves very well to trivial parallelization. When scaling the number of repeated queries from one to n, we observed the huge impact a seemingly small change in the time required for performing one query had on the time needed to compute the entire result. A change from several milliseconds to a fraction of a millisecond might seem trivial, but given enough repeated queries, this was the difference between minutes and days of computation time. Will we be able to keep the query time down when increasing the value of k, and when moving the computation over to the GPU, which generally has a slower clock cycle than the CPU? In the next sections we will address these challenges, along with others, and develop a parallel algorithm for performing kNN queries based on k-d trees.
3.4 Development of a parallel k-d tree build algorithm
The build process is by far the most expensive operation on a k-d tree, and parallelizing it could reduce the overall runtime significantly when solving the kNN problem. This is not as straightforward as it might seem. The serial k-d tree build algorithm is usually implemented as a recursive function, since recursive functions tend to go along well with tree-based data structures. On the other hand, performance in CUDA is based on efficient use of a massive number of lightweight threads, and to get a fast algorithm one has to split the work between as many threads as possible. This is only possible if it is easy to divide the work into independent subtasks, where data communication is kept at a minimum. In a recursive context, execution flow is hidden inside each thread's call stack. Information needed by other threads is therefore not easy to obtain. To solve this, an iterative approach should be used, which can be implemented with a globally accessible execution flow. This will, however, give a more complex k-d tree build algorithm, and it is not certain that this algorithm is easy to parallelize.

RQ 5. Is it possible to parallelize the k-d tree build algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?

In order to investigate RQ 5, we have to look a bit closer at the different steps of the k-d tree build algorithm and investigate different parallelization strategies.
3.4.1 From recursive to iterative implementation
Before we dive into the parallelization strategy and how the parallelization can be done, let us try to make an iterative solution. We can start by enumerating the different steps in the recursive implementation.

1. Find the median of the points along a specified axis. This median point becomes the value of the current node.

2. Sort all points with lower values than the median to the left of the median, and all the points with higher values than the median to the right.

3. Perform this algorithm recursively on the left and right set of nodes.

From these steps one can see that for each node in the k-d tree, one has to partition a list around its median. We will call this operation Balance-Subtree. If we analyze Algorithm 3.4, we see that there are two recursive calls. This is logical, because we are building a binary tree where each node has two children. The interesting observation is that a node balance is only dependent on the parent node. This means that each tree level is independent and can be processed iteratively. The k-d tree construction basically boils down to successively balancing each node in the tree. This leads to a basic reimplementation, see Algorithm 3.6. It goes through each level of the tree, starting at the top, and balances each node successively down the tree.

Algorithm 3.6 Iterative k-d tree build
Input: An array of points, T
Output: T, as a k-d formatted array
function BUILD-KD-TREE(T)
    for all L ∈ {all levels in T} do
        for all S ∈ L do
            d ← |L| mod k               ▷ k = 3 for a three dimensional k-d tree
            BALANCE-SUBTREE(S, d)
        end for
    end for
end function
3.4.2 Parallelization strategy
Now that an iterative solution has been created, let us start looking at how this algorithm might be parallelized. First, a good overall parallelization strategy has to be found. A good strategy manages to easily split the main task into small individual subtasks that can be performed in parallel, while maintaining a minimal need for communication between the subtasks. When we converted our k-d tree algorithm from a recursive to an iterative solution, some interesting observations were made. One observation is that a node balance is only dependent
on the parent node. This means that each tree level is independent, which acts as a good start for our parallelization strategy. This also implies that all subtrees in the k-d tree generation are independent. Hence, the subtrees corresponding to the left and right child of a node can be processed in parallel without any communication. The data is also independent, as a result of how we represent the tree as an array. By data independent, it is meant that the data structure can easily be partitioned between the subtasks. In our case this means successively partitioning the tree into contiguous subtrees. These observations imply that we can divide the tree levels into dependent tasks, where each node balance in a tree level is an independent subtask. This gives us a number of parallel tasks that increases by a power of two as we go down each tree level. The subtree size will decrease by a factor of two in each downward step, see Figure 3.4.1. This parallelization strategy gives many concurrent operations at the lower levels of the tree, but at the initial levels we will hardly get any parallelization at all. To compensate for this, one could seek to parallelize the work done in each Balance-Subtree procedure, which also acts as a parallelization strategy. Both strategies can be used in conjunction with each other. The parallel node-balance algorithm can be used to speed up the early iterations, where the number of nodes in a tree level is small, as well as to further parallelize the subtasks in later tree-level iterations. This strategy also fits well with our choice of tree representation. One parallel operation can now take the tree representation, split it into subtrees, and balance each one.
Figure 3.4.1: Development of subtasks as the k-d tree generation progresses. It shows, at each tree level, how many nodes there are to parallelize and how large each node-balancing task is.
CUDA parallelization

CUDA has a special architecture that should be taken into account when parallelizing an algorithm. To efficiently use CUDA, the program has to keep thousands of threads
occupied, otherwise the benefit of CUDA disappears. The CUDA programming model is built up of a grid of independent blocks, i.e. execution can not be synchronized across blocks. Execution can only be synchronized between the 1 to 1024 theoretical threads launched inside a block [3]. Thread synchronization is important when multiple threads cooperate on one task, because at some point information has to be exchanged. Our parallelization strategy states that we have to balance one tree level after another, since they are dependent. This implies that the threads need to communicate between each tree level. One CUDA kernel should therefore balance a complete tree level. The other alternative would be to build the whole tree in one block, which would restrict our kernel to be executed on only one SM. The next step is to split the tree-level balance between the CUDA resources. The number of nodes in a tree level increases by a power of two as we go down the tree. Figure 3.4.1 shows that our kernel, the tree balance, changes throughout the build process. First only one node needs to be balanced, i.e. only one parallel operation. At the end there are m different nodes to work on. The problem size also changes: at the top it is m per node, and it goes towards 1 as the tree level increases. These varying problem sizes and subtask counts make it hard to create a good work distribution between the CUDA resources. At the top part of the tree it is optimal to use many blocks to balance a node, but at the end, it is desirable to balance many nodes inside a block. We choose a middle ground, balancing one node in one block. This means that there should be an overall good performance, with a peak at the middle tree levels.
3.4.3 Parallelization of Subtree-Balance
With the overall parallelization strategy planned, we can start investigating the most time-consuming operation, balancing a tree level. We have already determined that the parallelization should be done in one block, which means that one operation is done per SM. In other words, the task can potentially be parallelized between 1024 theoretical threads. Let us start investigating different approaches. The main operation is to find a median. As we have seen in Section 3.3, many algorithms for finding the median exist. Since we now want to implement the algorithm with CUDA, the environment has changed, and quickselect may not be the best alternative anymore. The first problem with quickselect is that it is recursive, which makes it hard to parallelize on CUDA. Therefore it may be profitable to look at other, more parallelization-friendly, algorithms. First, reusing the bitonic sort was investigated. Given a sorted list one can find the median directly, by simply looking at the midmost element of the array. The partitioning is also done in the process. Unfortunately this strategy proved unsuccessful, as re-purposing the bitonic algorithm for such a task proved difficult. The reason for this is that a pure bitonic sorting network only manages to sort lists with a length that is a power of two. The normal solution is to create a longer list than needed, but wasting this much memory on the GPU is not a very good solution.
Another way to make bitonic sort work with a list of any size is described by K. E. Batcher [6]. The description of this solution is very lengthy and has been omitted, since it introduces a lot of divergence that would be detrimental to performance on the GPU. We also have the inherent downside of sorting a list in order to find the median, since O(m) algorithms for finding the median exist, compared to the O(m log(m)) time required by sorting. Bucket-sort and radix-sort based algorithms are investigated in the paper Fast K-selection Algorithms for Graphics Processing Units by Alabi et al. [5]. The big difference between them is the constant time penalty. The radix-based variant has a more exact time complexity of O(b · m), where b is the number of bits in each number, while the penalty for bucket select is O(a · m), where a denotes the degree of agglomeration in the values. In other words, the algorithm is weak when the points are clustered together. Their results show that bucket select normally is slightly faster, except when a is high. Although bucket select normally has better results, we expect a high degree of agglomeration in our application, so we choose radix select.

Radix select

The radix select is based on a bitwise partitioning, much like radix sort [10, Chapter 8.3]. In each step, elements are partitioned into two subgroups based on the current bit. Then the subgroup that contains the median is determined, and the search continues in that subgroup until the median is found.
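A hedged, serial illustration of most-significant-bit-first radix selection on unsigned integer keys is sketched below (floating-point keys can first be mapped to an order-preserving unsigned representation). It returns the value with the given zero-based rank, e.g. rank = n/2 for the median. In the parallel variant of Algorithm 3.7, the zero-counting inner loop is instead split across threads and the counts are combined with a sum-reduction.

#include <cstddef>

unsigned int radixSelect(const unsigned int *a, size_t n, size_t rank)
{
    unsigned int prefix = 0;   // bits of the selected value decided so far
    unsigned int mask   = 0;   // which bits of the prefix are decided

    for (int bit = 31; bit >= 0; bit--) {
        unsigned int b = 1u << bit;
        size_t zeros = 0;

        // Count the candidates (elements matching the prefix) with a 0 in this bit.
        for (size_t i = 0; i < n; i++)
            if ((a[i] & mask) == prefix && (a[i] & b) == 0)
                zeros++;

        if (rank < zeros) {
            // The wanted element lies in the zero-bit subgroup; this bit stays 0.
        } else {
            prefix |= b;       // it lies in the one-bit subgroup
            rank   -= zeros;   // skip past the zero-bit subgroup
        }
        mask |= b;
    }
    return prefix;             // the value of the rank-th smallest element
}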
Figure 3.4.2: An illustration of radix selection [9].
When it comes to creating a parallelization strategy for radix select, it is first advisable to take a look at a highly optimized radix sort, like the variant introduced by Merrill [17]. The radix select can easily be reduced from a radix sort, and many concepts can therefore
be reused. Another interesting implementation is the radix select by Alabi et al. [5]. Both use a parallelization strategy that splits each radix partition into parallel operations. The way our and Alabi's solutions differ from Merrill's solution is that we start on the most significant digit, since a least-significant-digit approach will not reduce the k'th order statistic problem in each step.
Our implementation

Our implementation is based on Algorithm 3.6. Calling Balance-Subtree on all nodes in a tree level is performed on the GPU as a CUDA kernel. The outer for-loop is executed on the CPU, and launches the kernel for increasing tree levels, until the entire tree is built. The complete implementation can be found in Appendix C.3.4. Algorithm 3.7 is parallelized only within a single CUDA block. This means that the parallel threads are able to communicate. The algorithm is based around a repeat-until loop, which basically does all the work. The loop keeps track of a partition array. This array is sliced up by giving each thread a portion of the array; each thread then counts how many zeros it finds at the current bit position. The cut, as the arrows in Figure 3.4.2 show, is calculated by doing a reduce-sum operation [15]. This is repeated until the partition contains the median, which is when every bit is used or when the partition size is one. After the loop, the array is transformed such that the median is in the center, with smaller elements on the left and larger elements on the right.

Algorithm 3.7 Parallel subtree balance
Input: A subtree S of length m, and dimension d
Output: A balanced subtree, S
function BALANCE-SUBTREE(S, d)
    Let l and u be the lower and upper bound of the current partition
    Let P be all nodes in S
    repeat
        for all {p ∈ P | l < p < u} do
            Z(t) ← Occurrences of zeros in current bit, b, found by thread, t
        end for
        c ← SUM-REDUCE(Z)               ▷ c is the cut of the current partition P
        if u − c ≥ m/2 then
            u ← u − c
        else
            l ← u − c
        end if
        b ← b + 1                       ▷ Move to the next bit
        SYNCHRONIZE-THREADS()
    until u − l < 1 ∨ b > RADIX(p ∈ P)
    PARTITION(S, P)
end function
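As an illustration of the host-side driver implied by Algorithms 3.6 and 3.7, a minimal sketch is given below. It assumes the tree is stored as a flat array of nodes on the device and that one CUDA block balances one subtree; the kernel name and the KdNode type are illustrative only, not the actual code in Appendix C.3.4.

void buildKdTree(KdNode *d_tree, int m)
{
    const int threadsPerBlock = 256;                 // tuning parameter

    // Level l contains 2^l subtrees. One kernel launch balances a whole level;
    // returning to the host between launches acts as the global synchronization
    // point between dependent tree levels.
    for (int level = 0; (1 << level) <= m; level++) {
        int subtrees = 1 << level;                   // one block per subtree
        int dim = level % 3;                         // cycle through x, y, z
        balanceSubtrees<<<subtrees, threadsPerBlock>>>(d_tree, m, level, dim);
        cudaDeviceSynchronize();
    }
}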
In CUDA, thread instructions are executed in warps of 32 threads. Control flow divergence within a warp can therefore significantly reduce the instruction throughput. This is because the different execution paths must be serialized, as all the threads in a warp share the same program counter [2]. The total number of instructions in a warp will therefore increase. Any conditional operator, e.g. if and switch, should be used with care, since it may branch the control flow. Optimizing the use of conditionals, in order to reduce the amount of branching in the program control flow, will give better performance. In our implementation, all threads perform the same work in every iteration. This gives low thread branching. The loop is also executed an almost equal number of times by all threads, once for each bit used to represent the points. The one if statement in the iteration can be reduced to contain only one statement, which makes the divergent thread branch small. The code is therefore good with regard to divergence. Another aspect to consider is the CUDA memory hierarchy. It is beneficial to use the fastest suitable memory. In our case, this includes the shared memory, which is a fast memory shared between all threads in a block. The downside is the memory size; there may not be enough space to store our subtree. Although we may not be able to store the whole subtree, there are some data we can put in shared memory. The zero-counter array, shown as Z(t) in Algorithm 3.7, is a perfect candidate. Every thread only uses one integer, and it is shared between all threads throughout the execution. Placing it in shared memory therefore gives a big improvement. The complete implementation can be found in Appendix C.3.3.
3.4.4 Further improvements
Let us take a step back and look at RQ 5. Parallelization of the build algorithm has been achieved. Initial optimization has been performed, so according to the APOD design cycle, we should test our implementation. With the test setup as described in Section 4.1, the current version of the algorithm gave results as shown in Figure 3.4.3. The most interesting observation is the big jumps in the graph. If one looks closely, these jumps happen every time the problem size exceeds a power of two; for example, when the size passes 8,388,608, the timing increases from 2703 ms to 4335 ms. The results of further investigation are shown in Table 3.4.1. It shows how long each tree level takes, and how the different tree-level operations vary throughout the build process. The table reveals some weaknesses of our algorithm, which stem from how the CUDA resources were divided in Section 3.4.2. It performs badly when the problem size or the number of subtrees is relatively large. The potential for parallelizing the workload for the first and last iterations is not being fully utilized. This is due to the implementation forcing one version of the radix-select algorithm to work on all problem types. This is not optimal for dividing CUDA resources, and as a result, we get high penalties when the problem reaches unsuitable values.
Figure 3.4.3: Timing results from an intermediate version of the parallel k-d tree build algorithm.
Tree level    Time [ms]    Subtrees    Size
1             52           1           1000000
2             26           2           500000
3             13           4           250000
4             8            8           125000
5             7            16          62500
6             6            32          31250
7             6            64          15625
8             6            128         7812
9             7            256         3906
10            7            512         1953
11            8            1024        976
12            10           2048        488
13            16           4096        244
14            26           8192        122
15            52           16384       61
16            105          32768       30
17            202          65536       15
18            389          131072      7
19            768          262144      3
Table 3.4.1: Development of a k-d tree build with a million points, showing how the different tree-level operations vary throughout the build process.
This hypothesis can also explain the big jumps in runtime. The observations above correlate perfectly with the tree height, since the height of a binary tree is the binary logarithm of the tree size. This implies that the jumps happen when an additional tree level is required
in order to fit the tree. Tuning the algorithm to alternate between different algorithms to balance a subtree eliminates this problem. This removes the penalty for calculating the median at unsuitable problem sizes. Our current implementation only uses one block per subtree, which means we are only able to use one SM to balance the subtree. By utilizing several blocks at the same time, big subtrees could be processed faster. However, since all blocks are used to balance one subtree, only one subtree can be balanced at a time. The implementation details are omitted here, but can be found in Appendix C.3.1. In short, the implementation follows the outline of the radix-select implementation, but the thread synchronization is performed on the CPU, between kernel launches. This enables us to communicate between the different blocks. We would also like to improve performance when Balance-Subtree is applied to many small subtrees. For instance, at level 18 there are 131072 different subtasks with an average size of only 7. The previous implementation divided all these subtasks between a small number of SMs, typically 8–32 on current NVIDIA GPUs. The algorithm then uses too many cores to balance a subtree of only 7 elements, which is not an efficient way to divide resources.
Figure 3.4.4: Visualization of the final improvements to the k-d tree build implementation.
Letting just one thread handle the Balance-Subtree operation for small subtrees would let us process more subtrees in parallel, improving performance. With this parallelization, each thread is responsible for its own subtree, and communication with other threads is no longer needed. With the need for communication eliminated, we can utilize Algorithm 3.4. Now that a lot of improvements have been made, let us take a look at the results. Figure 3.4.4 compares our intermediate result with our new and improved version, and the changes made
a huge impact on the performance. The jumps disappeared, and were replaced by a faster and smoother curve, indicating that the CUDA resources are well balanced. The final implementation can be found in Appendix C.3.2.
3.5 Development of a parallel k-d search algorithm
In this section, we will investigate parallelization of the k-d tree based query algorithm, applied to the All-kNN problem. The following research question is stated:

RQ 6. Is it possible to parallelize the All-kNN query algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?

To investigate RQ 6, we will devise a parallelization strategy, rewrite the k-d search algorithm as an iterative algorithm, and optimize our implementation for CUDA execution. Finally, results obtained from testing this implementation are presented.
3.5.1 Parallelization strategy
Solving the All-kNN problem can be done by repeated application of the kNN query algorithm. This is easily parallelized by distributing individual kNN queries across the available parallel units. However, there are still some possible pitfalls to address. Should a query be done in one block, should each query be handled single-handedly by one thread, or should we use one thread per k in a query? We have previously determined that a single query on a k-d tree of size m will on average visit log(m) nodes. This indicates that not a lot of GPU resources are needed to perform an individual query. Assigning an entire CUDA block to one query therefore seems excessive. Combined with the communication-heavy nature of the query algorithm, the best parallelization strategy is therefore to use one thread per query and distribute the queries equally amongst the GPU's SMs. Let us now consider if we can use Algorithm 3.5 directly with this parallelization strategy. As discussed in previous sections, GPUs and recursion don't get along well, the main drawback being the inherent need for communication between the recursive calls. Unfortunately, even with our current parallelization strategy, there are still reasons not to use a recursive algorithm. The GPU threads are lightweight, with restricted available memory and cache. This means that the call stack, where the program's call frames are managed, is relatively small. Given a non-tail-recursive algorithm, the program context and instructions are appended to the call stack at each recursive call. This will eventually fill up the limited call stack available to an individual CUDA thread. It might be possible to still use a recursive algorithm, given that the call stack never gets too large.
To determine if the recursive k-d tree query algorithm can fit within a CUDA thread, call stack tests were performed. On an individual block, 64 threads were spawned, each querying a k-d tree of increasing size. The results showed that when the k-d tree size passed 1 · 10^5 points, unknown errors started to appear. This indicated a stack overflow. Divergence also needs to be considered. In a recursive algorithm, the decision of whether a recursive call should be made or not is entirely up to a single thread. Once two threads have made different decisions, there is no guarantee that they will stay synchronized. Both problems would be solved by rewriting Algorithm 3.5 into an iterative algorithm.
3.5.2 From recursive to iterative implementation
To rewrite Algorithm 3.5 into an iterative algorithm by explicitly managing the recursion stack, some properties about how the search traverses the k-d tree are needed. From Algorithm 3.5 one can see that this is a variant of depth-first traversal, since the work of the current node is done before and between the recursive calls. This traversal is also the best strategy in a binary tree search, because the pruning of subtrees is maximized. How to traverse a standard binary search tree in an iterative fashion is described in Cormen [10, Chapter 12], but since this is a k-d tree search the implementation is slightly different, as shown in Algorithm 3.8. The algorithm works in the same way as the recursive algorithm, but adds a stack, S, called the s-stack, and a while loop in order to handle the tree traversal iteratively. While there is an element assigned to the root variable, r, the algorithm will traverse down the target branch, updating the dimension, i, calculating the distance, dx, and determining the target, t, and other, o, child node. Then it will collect r, o, i and dx into one element, and push it on the s-stack. Finally the root variable is assigned to the target child, or NIL if we have reached the end of a branch. While there still are elements in the s-stack, but r is assigned NIL, we are traversing back up a branch. While this is happening, the algorithm pops elements from the s-stack and determines if they should be added to the k-heap, before it determines if it needs to investigate the other branch of this node. If that is the case, the other node is assigned to r, and the algorithm will traverse down this subtree using the previously stated rules.
3.5.3 CUDA implementation
Our simple parallelization strategy, combined with an iterative implementation of the k-d tree search algorithm, resulted in a trivial CUDA implementation, as we did not need to parallelize the iterative search algorithm itself. The implementation can be found in its entirety in Appendix C.4.1. In addition, we will highlight some implementation details, and look at the results obtained from this code.
Algorithm 3.8 Iterative kNN k-d tree search
procedure ITERATIVE-KNN-KD-TREE(K, r, q)
    Let S be a stack for collecting tree nodes
    i ← 2
    while !S.empty or r != NIL do
        if r = NIL then
            r ← POP(S)
            i ← r.dimension
            if r.dx² < K.max then       ▷ Can there be closer points in the other subtree?
                r ← r.other
            else
                r ← NIL
            end if
        else
            d ← DISTANCE(r, q)
            if d < K.max then           ▷ Is r closer to q than the current k best points?
                r.distance ← d
                INSERT(K, r)
            end if
            i ← (i + 1) mod k           ▷ k = 3 for a three dimensional k-d tree
            r.dimension ← i
            r.dx ← r.x(i) − q.x(i)
            if r.dx > 0 then            ▷ Select t and o so we traverse towards the closest point first
                t ← r.left, r.other ← r.right
            else
                t ← r.right, r.other ← r.left
            end if
            PUSH(S, r)
            r ← t
        end if
    end while
end procedure
Algorithm 3.8 does not have a lot of divergence, and the remaining branching can be further reduced. If threads in a warp are traversing completely different parts of the tree, they will access different nodes. This is called data divergence. The solution is to let each warp search for points that are located close together in the k-d tree. This will cause all the threads to traverse down the tree along roughly the same branch, reducing the data divergence. Due to the nature of our k-d tree implementation, this can be achieved by feeding the points to the search algorithm in the order they are placed in the k-d tree. The explicit stack also raises an interesting question about where to store this new stack. This is data that is modifiable and thread-independent, which means that the possible memory options are shared memory, local memory and global memory. Local memory is the memory each thread can allocate dynamically from the heap. Global memory is a possible candidate. It has enough space, it is modifiable, and it is accessible to all threads. The drawback is the access time; it takes around 400–600 clock cycles [2], and it would therefore be beneficial to use some other kind of memory. Shared memory would be a perfect candidate, because the memory is fast and the need to communicate between blocks is nonexistent. The only drawback is the amount of shared memory available, which is around 49 kB on current NVIDIA GPUs. The iterative search algorithm uses one stack and one heap, both stored as arrays in memory. The number of arrays is dependent on the number of threads used in each block. The s-stack array size is dependent on how many elements the depth-first tree traversal needs to store. If one looks at how the algorithm handles the stack, one can see that elements are pushed on the way down, and popped on the way up. This means that the stack will never be longer than the tree height. One stack element uses 16 bytes of space, which means that the stack memory usage is Θ(16 log₂(n)T). Here T represents the number of threads and n is the k-d tree size. The k-heap array size depends on the number of closest neighbors, k, and one element uses 8 bytes. This implies that its memory usage will be Θ(8kT). In Figure 3.5.1 the memory usage of each stack is compared to the available shared memory. Some basic assumptions and approximations have been made in regard to the data. Threads are only compared in multiples of 32, since this is the warp size and therefore gives the most optimal thread counts. The value of k depends on the problem at hand, and as our application only needs a value of 100, that value is used. Figure 3.5.1 shows that the k-heap will not fit in shared memory. Already at a thread count of 64 the memory is filled up. The size of the k-heap is also highly dependent on the size of k, which is hard to predict. However, locating the s-stack in shared memory looks promising. The memory size also has a relatively low asymptotic growth, O(log(n)), in regard to the tree size. To decide which kind of memory is the optimal location for the s-stack, Figure 3.5.2 has been created. Surprisingly, shared memory looks like the slowest alternative. One likely reason is that elements in shared memory are synced between all threads in a block. This property is not needed for the s-stack, since the s-stack is only used by one thread. Although global and local memory are presumably stored in the same place, there are some
Figure 3.5.1: The stacks' memory usage, compared to the amount of shared memory. Here k was set to 100.
noticeable differences that can explain the time gap between them. The cache may be a factor. The cache is placed on the same on-chip memory as the shared memory, and should therefore be equally fast. The difference is that caching is not programmable and is therefore not controlled by the programmer. However, some properties of the local memory may suggest that it is a more likely candidate to be cached. The local memory is thread-dependent and is not accessible to other threads or blocks, as global memory is. The compiler can therefore logically infer that the data is not going to be modified by other threads, and caching becomes much more likely. Figure 3.5.1 also shows us that the cache can fit the whole s-stack in a block, which correlates with the timing results. To encourage cache use even further, CUDA gives a runtime option to dedicate more of the on-chip memory to caching. Testing different memory locations for the s-stack showed that the memory location is important for performance. Even small improvements in the performance of the s-stack give a significant improvement in overall runtime. This implies that the k-heap should be highly optimized. Figure 3.5.3 shows the runtime difference between two k-heap variants. One uses a bubble sort [10] like implementation. It works by always keeping a sorted list. Elements are inserted by placing them at the end of the list, and swapping them with the adjacent element until they are in the right place. The other method is based on a heap-sort implementation, as explained in Section 3.3.3. The performance difference lies in the insertion time, where the bubble variant has O(n) time complexity, while the heap-sort variant has O(log₂(n)). This results in an almost 7 times faster k-heap, with a k value of only 100.
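As a hedged sketch of the data behind the size estimates above, the following shows a plausible 16-byte s-stack entry and 8-byte k-heap entry, matching the Θ(16 log₂(n)T) and Θ(8kT) expressions; the actual layout in Appendix C.4.1 may differ.

struct SStackEntry {     // one entry per tree level on the way down
    int   node;          // index of the visited node
    int   other;         // index of the "other" child, for possible backtracking
    int   dim;           // split dimension at this level
    float dx;            // signed distance to the split plane
};                       // 4 x 4 bytes = 16 bytes

struct KHeapEntry {      // one entry per neighbour candidate
    float dist;          // squared distance to the query point
    int   index;         // index of the candidate point
};                       // 2 x 4 bytes = 8 bytes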
Figure 3.5.2: Search time comparison between different stack memory types. The tests are done with k equal to 10 and 1 · 10^6 queries per tree size.
Figure 3.5.3: Timing results from two different k-heap implementations, with varying k and 1 · 10^6 reference points.
OpenMP

The high impact the stack had on performance raises an interesting question in regard to RQ 6. Could a parallel implementation on the CPU outperform the GPU version, when latency effects, as the stacks showed, have such a huge impact on performance? The CPU has a lot more cache than the GPU, and would therefore not be affected as much by memory overhead. The question is whether this is enough to offset the lower number of parallel threads on the CPU.
For this to be investigated properly, an OpenMP version of the k-d tree search has to be created. The parallelization strategy is the same as for the CUDA implementation, only differing in implementation details. Since the CPU has vastly more cache and the system memory latency is not very high, we do not need to consider memory usage in the same manner as with the CUDA implementation. The implementation can be found in Appendix C.4.2.
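A minimal, hedged sketch of such an OpenMP variant is shown below, assuming an iterative serial query routine and per-query result buffers; the function, type and buffer names are illustrative, not those of Appendix C.4.2. Since each query is independent, a simple parallel for over the query points mirrors the one-thread-per-query strategy used on the GPU.

#include <omp.h>
#include <cstddef>

void allKnnOpenMP(const KdNode *tree, const Point *queries, int n, int k,
                  int *result)                        // k neighbour indices per query
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int q = 0; q < n; q++) {
        // Each CPU thread keeps its own s-stack and k-heap on its (large)
        // thread stack, so no synchronization between queries is needed.
        knnKdTree(tree, queries[q], k, &result[(size_t)q * k]);
    }
}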
3.6 CUDA Optimizations
Throughout this quest many optimizations have been made, some focusing on the algorithmic aspect, others more on implementation. This section focuses on different CUDA optimizations and why they are necessary with regard to performance. Some performance considerations, like divergence, have already been mentioned, due to their direct relation to the different implementations. Other important factors are occupancy, coalescing, loop unrolling, and block and thread load balancing.

Occupancy is a metric which relates to how many active warps there are on an SM. Earlier we have talked about how thread instructions are executed sequentially, resulting in alternating warps: one warp is paused while another is executing. The ability to hide the time a stalled warp spends waiting for data increases with the number of warps per SM. One should note that high occupancy does not always result in high performance, but low occupancy will always result in an inability to hide latency, which results in bad performance. The struggle to always have the right amount of occupancy also relates to dividing CUDA resources. As well as keeping the right number of warps on an SM, one must also keep every SM active, forcing the algorithm to work over unsynchronizable blocks. The number of blocks that is optimal to keep active changes with different GPUs. It is therefore important to think about how many blocks and threads are launched with each kernel. We have solved this issue with methods that, based on different algorithmic parameters, calculate how many threads and blocks are needed for a particular launch.

Coalescing accesses to global memory is probably one of the most performance-increasing optimizations in CUDA, especially in our memory-intense application. Global memory that is loaded and stored by threads in a warp can be coalesced into only one transaction, if the right conditions are met. How a device coalesces memory depends on the compute capability, but some basic properties are common. A warp's access will coalesce into a number of transactions that equals the number of cache lines needed to service all the threads in the warp. Devices with compute capability 2.x will by default cache directly to L1, which has 128-byte lines. Higher capabilities will always cache to L2, which has 32-byte segments [2]. If we focus on compute capability 2.x, Figure 3.6.1 shows how memory accesses are coalesced. Green indicates memory lines that are retrieved, while blue indicates lines that are not retrieved. The first case illustrates perfect coalescing: a warp performs a sequential 128-byte transaction that fits perfectly in a 128-byte line. The second shows a misaligned sequential retrieval, resulting in two transactions.
Figure 3.6.1: Three different memory transactions, where A and B result in well-coalesced and cached transactions, while C shows a stride access pattern with bad coalescing.
The third uses a strided access pattern with an offset of 128, resulting in bad coalescing. We have tried to maximize coalescing by always using sequential addressing. This kind of addressing can be achieved in many ways. One way, which we have used throughout the code, is based on how data are partitioned and iterated. The generic partitioning algorithm, Algorithm 3.1, that we used to expand Garcia's algorithm, shows how this can be done. The last optimization keyword we would like to introduce is unrolling. This is a technique we have used in many of our algorithms, like min-reduce, and also in some of our utility functions, for example to accumulate an array. Unrolling is a standard technique in ordinary high-performance serial programming, optimizing pipelining, and it is given an extra dimension in CUDA. Loop unrolling is the procedure of rewriting a loop, containing conditional operators, into hard-coded sequential steps. This way, the result of conditional operators may be determined at compile time, eliminating branching of the control flow. In a CUDA context this is of course also the case, but in addition it will minimize divergence. The idea of loop unrolling can also be applied to warps. This is called warp unrolling, and it can be used if we know we are within a single warp. The result is that no expensive thread synchronization is needed, since the threads of a warp execute in lock-step and each accesses its own memory location.
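A hedged sketch of the warp-unrolling idea, applied to the last steps of the min-reduction from Section 3.2, is shown below (function name illustrative). Once 32 or fewer elements remain, the active threads all belong to one warp and execute in lock-step, so the loop can be unrolled and the __syncthreads() calls dropped. The volatile qualifier prevents the compiler from caching the shared-memory values in registers between the unrolled steps, which is the usual idiom on the pre-Volta hardware used in this thesis.

// Called with tid < 32 from a block of at least 64 threads, once the
// shared-memory reduction has shrunk the active range to 64 elements.
__device__ void warpMinReduce(volatile float *s, unsigned int tid)
{
    s[tid] = fminf(s[tid], s[tid + 32]);
    s[tid] = fminf(s[tid], s[tid + 16]);
    s[tid] = fminf(s[tid], s[tid + 8]);
    s[tid] = fminf(s[tid], s[tid + 4]);
    s[tid] = fminf(s[tid], s[tid + 2]);
    s[tid] = fminf(s[tid], s[tid + 1]);
}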
Chapter 4
Final results, discussion and conclusion

4.1 Test environment
To test and investigate our results, three test environments have been used, as shown in Table 4.1.1. They all have different properties that are used to test different aspects of our algorithms. Test environment 3 has a normal Windows setup, with a graphics card that is used for both display rendering and CUDA computation. On the other hand, test environment 1 uses the same hardware, but is set up with a dedicated GPU. CUDA computations done in an environment without a dedicated GPU are forced to split the resources with the running OS, which implies some restrictive properties. For instance, it is normal for an OS to kill long-running GPU processes. On Windows the normal timeout is around 30 seconds. To get around this restriction, some editing in regedit is necessary. The last environment, number 2, is built on an Amazon Web Services (AWS) cloud instance. Using AWS enables us to test implementations on a more powerful GPU than what we currently have in our personal possession. The biggest advantage of the AWS GPU for our applications is the increased on-board GPU memory. This enables us to store larger point clouds directly on the GPU. The increased number of CUDA cores also makes it possible to test performance ratios and different variants of resource distributions. All tests were run multiple times and averaged, to get more accurate results. It should also be noted that it is common practice to warm up the GPU before any test, because it may take as long for the CUDA runtime to create a CUDA context as it does to launch the kernel itself. All tests used synthetic data, generated as uniformly distributed random points in a unit cube. Where needed, random points were generated and stored to disk. This data could
Test environment    1                     2                     3
OS                  Ubuntu 14.04          Ubuntu 12.04          Windows 7
OS type             x64                   x64                   x64
Kernel              3.13.0-24-generic     3.2.0-58-virtual      Windows 7
CPU                 i7-2600K              E5-2670               i7-2600K
CPU memory          7.8 GB                16 GB                 7.8 GB
GPU                 GeForce GTX 560 Ti    NVIDIA GRID K520      GeForce GTX 560 Ti
GPU memory          1024 MB               4095 MB               1024 MB
Dedicated GPU       Yes                   Yes                   No
CUDA cores          384                   1536                  384
CUDA capability     2.1                   3.0                   2.1
CUDA driver         5.5                   5.5                   5.5
CUDA runtime        5.5                   5.5                   5.5
Table 4.1.1: Tabulated information about the three test environments.
then be used as the source for several different tests, eliminating deviations between tests used for comparing different implementations. Using uniformly distributed data is not necessarily a good representation of all real-world point cloud data, but this should not affect our results for the brute-force and k-d tree build algorithms, given that these algorithms have a non-stochastic runtime that is not affected by the location of our test points. Given that we always balance the k-d tree, the k-d query algorithm should not be significantly affected by changing the point distribution, although slight deviations might be observed. For the reasons above, and given the available time for this thesis, we have chosen not to include other point distributions in our tests, but this could be an interesting study for further work.
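As an illustration of this test setup, the following is a minimal sketch of how uniformly distributed points in the unit cube can be generated, and how the GPU can be warmed up before timing. The function names are assumptions for illustration, not the exact names used in our test code.

#include <stdlib.h>
#include <cuda_runtime.h>

// Fill a host array with n uniformly distributed random points in the unit cube,
// stored as interleaved x, y, z coordinates.
void generateUnitCubePoints(float *points, int n)
{
    for (int i = 0; i < 3 * n; ++i)
    {
        points[i] = (float) rand() / (float) RAND_MAX;
    }
}

// Force the CUDA runtime to create its context before any timed kernel launch,
// so context creation is not counted as part of the measured runtime.
void warmUpGpu()
{
    cudaFree(0);
    cudaDeviceSynchronize();
}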
4.2
Final results and discussion
During the development of our algorithms, we have presented many intermediate results, in order to argue for the design and implementation choices made. In this section, we will present our final results for the GPU parallelized brute-force, GPU parallelized and CPU parallelized k-d tree based kNN algorithms. The different research questions stated during development of our algorithms are revisited, and answers are given, based on the presented results.
4.2.1
Solving the kNN problem
In Section 3.2 we started our quest for a faster kNN search by investigating a brute-force algorithm pioneered by Garcia et al. The following research question was asked.
RQ 1. Can high performance be achieved by a parallel brute-force kNN algorithm on large point clouds?
In order to discuss this question, we have to clarify what we consider to be high performance in this context. The work of Garcia et al. contains the fastest brute-force algorithm for solving the kNN problem that we have found in the current literature. We would therefore consider a kNN algorithm to have high performance if it is able to solve the kNN problem for point cloud data, in the 3D format specified by TSI, at speeds comparable to the algorithm developed by Garcia et al. Although the algorithm developed by Garcia et al. is the fastest brute-force algorithm we have managed to find, this is not as high a benchmark as it might initially seem. The algorithm developed by Garcia et al. is designed to solve a more general version of the kNN problem than required by TSI. Where we are only concerned with solving the kNN problem in three dimensions, Garcia's algorithm will solve problems stated in any dimension. Given that we solve a more restricted version of the kNN problem, anything less than speeds comparable to the implementation made by Garcia et al. could not be considered high performance in our eyes.
Figure 4.2.1: Comparison of the brute-force algorithm developed by Garcia et al. and the min-reduce based brute-force algorithm developed in this thesis. k is fixed at 10 and m is increasing.
In Section 3.2.2, we discussed Figure 3.2.2. As a reminder, part of this figure is presented as Figure 4.2.1, which shows the result of a comparison between the brute-force algorithm developed by Garcia et al. and our min-reduce brute-force algorithm developed in Section 3.2.2. The test is performed with a low value of k = 10, and focuses on performance for large numbers of points, m. In this test, our min-reduce brute-force algorithm is
shown to be almost 70 times faster than the algorithm developed by Garcia et al. In addition, it is capable of solving the kNN problem for much larger values of m. We therefore conclude that RQ 1 can be answered with yes: high performance can be achieved with a brute-force based algorithm. In Section 3.3 we introduced the k-d tree, a data-structure with a known O(log m) nearest neighbor query algorithm. We therefore presented a new research question, which we try to answer in Figure 4.2.2.
RQ 3. Is it possible to use a k-d tree to increase the performance of kNN queries, compared to a parallel brute-force solution?
Figure 4.2.2: Comparison of min-reduce brute-force and k-d tree based algorithms for solving the kNN problem for k = 100 and increasing m ≤ 1e7.
The graph compares the runtime of the k-d tree build and query algorithms developed in Chapter 3 with the min-reduce brute-force algorithm. All test data is generated using test environment 1. Figure 4.2.3 compares the same two algorithms, but for a fixed value of m and an increasing value of k. This test is also performed using test environment 1. Both Figure 4.2.2 and 4.2.3 show that this algorithm is slower than the brute-force approach. This answers RQ 3. In order to solve the kNN problem, the k-d tree based solution first has to construct the k-d tree, then query it for the closest points. Given that our previous calculation shows that the time complexity of the k-d tree build algorithm is larger than the time complexity of the brute-force algorithm, this result should come as no surprise. Constructing the k-d tree data-structure is simply a waste of resources if just one kNN query is to be performed.
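The relationship for a single kNN query can be summarized with a back-of-the-envelope comparison, assuming one brute-force kNN query costs roughly O(m) work for fixed k, and using the build and query complexities discussed in Chapter 3:

\[ T_{\text{brute}}(m) = O(m) \;<\; O(m \log m) + O(\log m) = T_{\text{build}}(m) + T_{\text{query}}(m) \]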
Figure 4.2.3: Comparison of min-reduce brute-force and k-d tree based algorithms for solving the kNN problem for m = 1e6 and increasing k ≤ 200.
4.2.2
Solving the All-kNN problem
The All-kNN problem was studied in relation to both the brute-force algorithm and the k-d tree based algorithm. In Section 3.2.2 we proposed RQ 2. We also wanted to compare our parallel k-d tree based algorithm to the brute-force algorithm, as postulated in RQ 4.
RQ 2. Can a parallel brute-force kNN algorithm be fast enough to solve the All-kNN problem within reasonable time?
RQ 4. Is it possible to use a k-d tree to increase the performance of All-kNN queries, compared to a parallel brute-force solution?
Figure 4.2.4 compares the runtime of the min-reduce brute-force algorithm to the k-d tree based algorithm, for increasing values of m ≤ 1e7 and a fixed value of k = 100. The data series for the min-reduce algorithm is estimated from the data obtained in Figure 4.2.2. This estimation is valid, since all GPU resources are used to perform one kNN query when using the brute-force algorithm. Solving the All-kNN problem with the brute-force algorithm can therefore only be performed as repeated application of the brute-force kNN algorithm, given a reasonable hardware setup with one CUDA enabled GPU. The data series for the k-d tree based algorithms are, on the other hand, generated from actual runtime results. This is because the k-d query algorithm is developed as a parallelized Q-kNN query, where the individual kNN queries are run in parallel, instead of parallelizing one single kNN query, as is the case with the brute-force algorithm. We have developed two different implementations of this k-d tree based Q-kNN query: one parallelized on the GPU, annotated with GTX k-d build + m queries, and one parallelized on the CPU, annotated with CPU k-d build + m queries. All tests were performed using test environment 1.
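Under the same assumptions as above, a rough estimate shows why the tree pays off when all m points are queried:

\[ T^{\text{all}}_{\text{brute}}(m) \approx m \cdot O(m) = O(m^2), \qquad T^{\text{all}}_{\text{kd}}(m) \approx O(m \log m) + m \cdot O(\log m) = O(m \log m) \]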
Figure 4.2.4: Comparison of min-reduce brute-force and k-d tree based algorithms with CPU and GPU parallelized query. The graph compares runtime for solving the All-kNN problem for k = 100 and increasing m.
Figure 4.2.5 compares the GPU parallelized k-d tree algorithm with the parallel brute-force algorithm, for a fixed value of m and increasing values of k ≤ 200. Test environment 1 is also used in this comparison. Both graphs clearly show the benefit of using a k-d tree based algorithm for solving the All-kNN problem. As discussed in Section 3.3.4, part of this increased performance over a brute-force based solver is due to the k-d query algorithm being able to execute in O(log m) time. Since we do not have to rebuild the tree between the individual kNN queries when performing an All-kNN query, the reduction in query runtime has a larger impact on the overall execution of the algorithm. Another important factor is that the individual k-d kNN queries are executed in parallel. This is possible for a k-d tree based query algorithm, since the resource requirements of each individual kNN query are lower than for the brute-force algorithm. These two improvements combined result in an algorithm that is capable of solving the All-kNN problem for values of m and k that would not be feasible with a brute-force algorithm. This answers RQ 4. Figure 4.2.4 shows that the brute-force algorithm is capable of solving the All-kNN problem for small point clouds, although significantly slower than the k-d tree based algorithm. With a point cloud containing just 10000 points, the brute-force algorithm will take at least 10 s to execute, compared to the 100 ms required by the k-d tree based algorithm. Although slow for low values of m, the brute-force algorithm computes the answer within arguably
Figure 4.2.5: Comparison of min-reduce brute-force and GPU parallelized k-d tree based algorithms for solving the All-kNN problem for m = 1e6 and increasing k ≤ 200.
reasonable time. When m is increased, this changes. For a point cloud of size 1e7, the brute-force algorithm would require about 3e9 ms to compute the answer (3e9 ms = 3e6 s, which is roughly 833 hours, or almost 34 days). This would most certainly not be considered to be within reasonable time. The answer to RQ 2 is therefore dependent on the size of m. It can be argued that the brute-force algorithm is capable, but not the best alternative, for solving the All-kNN problem for small values of m. In Figure 4.2.4 the impression is that the GPU and CPU parallelized k-d algorithms perform similarly. We will therefore investigate additional results, to get a better understanding of how they compare. Figure 4.2.6 compares the CPU and the GPU parallelized k-d query algorithms. Test environment 1 is again used. In Figure 4.2.6 the CPU based parallelization is slower than the GPU based parallelization. Although this difference is smaller than the difference between the brute-force algorithm and the k-d tree based algorithm, it is still significant. This indicates that, for this algorithm, the benefit of faster individual cores and less overhead related to memory transfer on the CPU is not enough to offset the drawback of the CPU having far fewer parallel cores than the GPU. Using the GPU parallelized version, where possible, is therefore recommended.
Figure 4.2.6: Comparison of runtime for the GPU (GTX 560) and CPU (OpenMP) parallelized k-d tree based queries.
4.2.3
Parallelization performance increase
Parallelization of the k-d tree build was introduced with RQ 5, in Section 3.4.
RQ 5. Is it possible to parallelize the k-d tree build algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?
This research question is based on the complex nature of the k-d tree build, and the uncertainty of whether it can achieve an acceptable parallel speedup. The question was investigated through implementation prototypes, together with a thorough discussion of the parallelization strategy and the intermediate results. Figure 4.2.7 tries to answer RQ 5, by comparing the serial and parallel k-d tree build implementations. Both graphs follow the same trend, which correlates with the shared time complexity of O(m log m). We see that the impact of the parallel overhead decreases as the problem size increases, and the benefit of multiple cores becomes more and more dominant, resulting in a faster parallel implementation. To get a better picture of the parallel improvement, it is natural to talk about parallel speedup. Figure 4.2.8 shows how the parallel speedup develops as the problem size increases. Here we see that the speedup starts below 1, indicating that the serial version is faster than the parallel version, but from Figure 4.2.7 one can see that the time to build such small k-d trees is almost negligible. As the problem size increases, the trend quickly changes, until the speedup flattens out. The speedup increases as the problem size allows utilization of more and more threads, until the limit is reached, and the curve flattens out
Figure 4.2.7: Comparison between serial and parallel k-d tree build performance.
into a lower gradient.
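The parallel speedup plotted in Figure 4.2.8 is here taken as the usual ratio between serial and parallel runtime for the same problem size:

\[ S(m) = \frac{T_{\text{serial}}(m)}{T_{\text{parallel}}(m)} \]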
Figure 4.2.8: Parallel speedup for the k-d tree implementation for varying values of m.
With the complex nature of the k-d tree build process, a speedup of three is acceptable, and we consider the overall performance increase to be significant compared to the serial algorithm, answering RQ 5. Parallelization of the All-kNN query was introduced with RQ 6, in Section 3.5.
RQ 6. Is it possible to parallelize the All-kNN query algorithm in such a way that it gives a significant speed improvement compared to the serial algorithm?
Figure 4.2.9: Comparison between serial and parallel All-kNN query performance.
Figure 4.2.9 displays the results from the two different parallel All-kNN query implementations, CUDA and OpenMP, compared to the serial version. The linear trend, also found in the k-d tree build algorithm, is not surprising, as the time complexity of all the algorithms is O(m log m). The parallel improvement only shows in the gradients of these curves, which is reasonable, because the work is simply divided amongst more cores. In both the OpenMP and the CUDA implementation the parallel improvement is significant. If we look at the parallel speedup, shown in Figure 4.2.10, we can again conclude that the OpenMP version is outperformed by the CUDA implementation. The trend resembles what we saw for the k-d tree build parallelization, only this time the speedup approaches 50 in the CUDA version. This correlates well with the discussion in Section 3.5.1, and we can answer RQ 6: our All-kNN query has a significant parallel improvement. A final note is that the speedup for the k-d tree based All-kNN algorithms is lower than the speedup for both our and Garcia's [13] brute-force implementations, which shows that a high speedup does not equal a fast implementation for this problem.
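To illustrate the CPU parallelization strategy used in the OpenMP variant, here is a minimal self-contained sketch of distributing independent kNN queries over CPU threads. The struct definitions and the queryKnn stand-in are simplified assumptions for illustration, not the actual code in mpQueryAll.

#include <omp.h>
#include <stddef.h>

struct Point { float p[3]; };                           /* simplified stand-in for point.h */
struct Node  { struct Point point; int left, right; };  /* simplified stand-in for the tree node */

/* Hypothetical single-query kNN search, standing in for the real k-d tree traversal. */
static void queryKnn(struct Point q, struct Node *tree, int n_tree, int k, int *out)
{
    (void) q; (void) tree; (void) n_tree;
    for (int j = 0; j < k; ++j) out[j] = j;             /* placeholder result */
}

/* Each query point is independent, so the loop over queries can be
   distributed across CPU threads with a single OpenMP pragma. */
void mpQueryAllSketch(struct Point *query_points, struct Node *tree,
                      int n_qp, int n_tree, int k, int *result)
{
    #pragma omp parallel for
    for (int i = 0; i < n_qp; ++i)
    {
        queryKnn(query_points[i], tree, n_tree, k, result + (size_t) i * k);
    }
}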
Figure 4.2.10: Parallel speedup comparison for the All-kNN query between the CUDA and OpenMP implementation.
4.3
Conclusion
In this thesis, we have investigated the possible benefits of using general-purpose computing on graphics processing units in order to speed up the execution of calculations in engineering applications. We have investigated this topic by improving the performance of point cloud analysis in engineering software developed by TechnoSoft Inc. By utilizing the parallelization possibilities offered by CUDA enabled GPUs, and optimizing our algorithms for 3D point cloud data, we have been able to develop fast algorithms for solving the kNN and All-kNN problems. The parallel brute-force algorithm developed in this thesis is 70 times faster than the brute-force algorithm developed by Garcia et al. [13] on comparable problem sizes. Considering that the algorithm developed by Garcia et al. is significantly faster than conventional libraries, being up to 407 times faster than Matlab and up to 148 times faster than ANN [13, Table 1], this is a notable result. A parallel k-d tree based Q-kNN algorithm has also been developed in this thesis, and optimized for solving the All-kNN problem. The parallel k-d tree based algorithm is able to solve the All-kNN problem 300 times faster than the parallel brute-force implementation, and this could enable All-kNN analysis of much larger point clouds than was previously feasible. In addition, all algorithms have been implemented with memory scalability in mind, resulting in a finished library of algorithms which solves kNN and All-kNN problems faster, and for larger point clouds, than the alternatives known from the literature. This library is currently being integrated into point cloud analysis software at TechnoSoft Inc. In conclusion, our results indicate that large runtime improvements can be achieved in engineering software by utilizing the parallel performance of GPUs to speed up time-consuming algorithms.
56
Bibliography
[1] Automatically Tuned Linear Algebra Software (ATLAS) homepage. (available online at http://math-atlas.sourceforge.net/), 2014.
[2] CUDA C Best Practices Guide. (available online at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/), 2014.
[3] CUDA C Programming Guide. (available online at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz32C74scPM), 2014.
[4] cuBLAS homepage. (available online at https://developer.nvidia.com/cuBLAS), 2014.
[5] T. Alabi, J. D. Blanchard, B. Gordon, and R. Steinbach. Fast k-selection algorithms for graphics processing units. J. Exp. Algorithmics, 17:4.2:4.1–4.2:4.29, Oct. 2012.
[6] K. E. Batcher. Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, AFIPS '68 (Spring), pages 307–314, New York, NY, USA, 1968. ACM.
[7] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sept. 1975.
[8] S. Brown and J. Snoeyink. GPU nearest neighbor searches using a minimal kd-tree. In Proc. MASSIVE, June 2010. (available online at http://cs.unc.edu/~shawndb/).
[9] J. S. K. T. S.-H. C. Cayman Mitchell, Nelson Schoenbrot. Radix selection algorithm for the k order statistic. May 2012.
[10] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
[11] J. Fang, A. L. Varbanescu, and H. J. Sips. A comprehensive performance comparison of CUDA and OpenCL. In Proc. International Conference on Parallel Processing (40th ICPP'11) USB, pages 216–225, Taipei, Taiwan, Sept. 2011. IEEE Computer Society.
[12] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, Sept. 1977.
[13] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using GPU. CoRR, abs/0804.1448, 2008.
[14] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 3757–3760, Sept. 2010.
[15] M. Harris. Optimizing parallel reduction in CUDA. (available online at http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf), 2014.
[16] T. Karras. Collision detection on the GPU. 2012.
[17] D. Merrill and A. S. Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(2):245–272, 2011.
[18] G. Nolan. Improving the k-nearest neighbour algorithm with CUDA. 2009.
[19] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.
[20] Y. L. L. J. Shenshen Liang, Cheng Wang.
[21] K. Zhou, Q. Hou, R. Wang, and B. Guo. Real-time kd-tree construction on graphics hardware. ACM Trans. Graph., 27(5):126:1–126:11, Dec. 2008.
58
Appendix
A
Data sheets
59
[Data sheets: raw timing measurements (in ms) behind the figures in Chapter 4. One sheet varies the number of reference points m, up to 1.0e7, with k fixed at 100; another varies k over 1, 50, 100, 150 and 200 with m fixed at 1.0e6, with averages over repeated runs. The recorded data series are: GTX brute-force m queries, GTX brute-force query, CPU k-d build + m queries, CPU k-d m query, GTX k-d build + m queries, GTX k-d build + query, GTX k-d m query, GTX k-d query and GTX k-d build. The numerical tables are not reproduced here.]
Appendix
B
Source code documentation B.1
API documentation
63
KNN GPGPU Documentation

Includes

point.h
Contains definitions of the different point struct data-types used by the knn gpgpu algorithms.
Members

void buildKdTree(struct Point *points, int n, struct Node *tree)
Accepts a list of n Points. Builds a balanced k-d tree from these points on the GPU, and writes this tree to the Node list tree.

void queryAll(struct Point *query_points, struct Node *tree, int n_qp, int n_tree, int k, int *result)
Queries a previously built k-d tree of size n_tree for the k closest neighbors to the points specified in the query_points list of size n_qp. The index locations of the k closest points are written to the result array. Uses a wrapper to partition the problem, in order to handle memory overflow.

void cuQueryAll(struct Point *query_points, struct Node *tree, int n_qp, int n_tree, int k, int *result)
Queries a previously built k-d tree of size n_tree for the k closest neighbors to the points specified in the query_points list of size n_qp. The index locations of the k closest points are written to the result array.

void mpQueryAll(struct Point *query_points, struct Node *tree, int n_qp, int n_tree, int k, int *result)
Performs the same operation as cuQueryAll, but is parallelized on the CPU using OpenMP instead of CUDA.

void knn_brute_force_garcia(float *ref_host, int ref_width, float *query_host, int query_width, int height, int k, float *dist_host, int *ind_host)
Performs a brute-force kNN search based on the code written by Garcia.

void knn_brute_force(float *ref_host, int ref_nb, float *query_host, int dim, int k, float *dist_host, int *ind_host)
Performs an improved brute-force kNN search.
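A minimal usage sketch of the documented functions is shown below. The header name knn_gpgpu.h and the error-free flow are assumptions for illustration; in practice return values and available GPU memory should be checked.

#include <stdlib.h>
#include "knn_gpgpu.h"   /* assumed name of the header exposing the functions documented above */

int main(void)
{
    int n = 100000, k = 10;
    struct Point *points = malloc(n * sizeof(struct Point));
    struct Node  *tree   = malloc(n * sizeof(struct Node));
    int *result          = malloc((size_t) n * k * sizeof(int));

    /* ... fill points with the point cloud ... */

    buildKdTree(points, n, tree);             /* build a balanced k-d tree on the GPU */
    queryAll(points, tree, n, n, k, result);  /* All-kNN: query the tree with every point */

    /* result[i * k + j] holds the index of the j-th closest neighbor of point i */

    free(points);
    free(tree);
    free(result);
    return 0;
}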
1
Utils

size_t getFreeBytesOnGpu()
Returns the current amount of free memory on the GPU in bytes.

void cuSetDevice(int device)
Sets device as the current device for the calling host thread.

int cuGetDevice()
Returns the device on which the active host thread executes the device code.

int cuGetDeviceCount()
Returns the number of devices accessible.

size_t getNeededBytesForBuildingKdTree(int n_tree)
Returns the needed bytes on the GPU to build a tree of size n_tree.

size_t getTreeSize(int n_tree)
Returns the size in bytes of a tree with length n_tree.

size_t getNeededBytesForQueryAll(int n_qp, int k, int n_tree)
Returns the needed bytes on the GPU to perform a queryAll operation on CUDA.

size_t getNeededBytesInSearch(int n_qp, int k, int n_tree, int thread_num, int block_num)
Returns the needed bytes on the GPU to perform a queryAll operation on CUDA without taking the tree size into account.
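A sketch of how the memory utilities can be used as a guard before launching a full queryAll on the GPU. The partitioning fallback that queryAll performs internally is not shown, and the header name is the same assumption as above.

#include <stdio.h>
#include "knn_gpgpu.h"   /* assumed header name, as above */

/* Returns 1 if an All-kNN query of this size is expected to fit in free GPU memory. */
int queryAllFitsOnGpu(int n_qp, int k, int n_tree)
{
    size_t needed     = getNeededBytesForQueryAll(n_qp, k, n_tree);
    size_t free_bytes = getFreeBytesOnGpu();

    if (needed > free_bytes)
    {
        printf("queryAll needs %zu bytes, but only %zu bytes are free on the GPU.\n",
               needed, free_bytes);
        return 0;
    }
    return 1;
}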
2
B.2
66
Installation instructions
Installation notes for Ubuntu 13.04

Installing CUDA
sudo apt-get install nvidia-cuda-toolkit

Installing Git
apt-get install git

Installing CMake
sudo apt-get install cmake

Build with
...\tsi-gpu> mkdir build && cd build
...\tsi-gpu/build> cmake ../
...\tsi-gpu/build> make

All executables will be in /build/bin and all libraries will be in /build/lib/.
1
Installation notes on an Amazon instance

1. Create an Amazon instance by following Amazon's instructions (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-setup-for-amazon-ec2.html). Select ubuntu-precise-12.04-amd64-server, and select the GPU instance type g2.2xlarge.

2. After the instance is set up, SSH into it.

3. Set up the dependencies needed to install CUDA (gcc):
sudo apt-get update
sudo apt-get install gcc

4. Download and install CUDA. For our choice of OS, grab the following .deb file:
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1204/x86_

5. Then run:
sudo dpkg -i cuda-repo-ubuntu1204_5.5-0_amd64.deb
sudo apt-get update
sudo apt-get install cuda

6. Set up the environment by running the following lines. Add them to ~/.bashrc to make the change permanent.
export PATH=/usr/local/cuda-5.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64:$LD_LIBRARY_PATH

7. Install the CUDA samples (optional) to some directory:
cuda-install-samples-5.5.sh .

8. Verify that an example works:
cd /1_Utilities/deviceQuery
make
./deviceQuery

9. Now that CUDA is installed, let's start building the project. First, clone the project from GitHub:
sudo apt-get install git
git clone https://github.com/hgranlund/tsi-gpgpu.git

10. Install CMake and build the project:
cd tsi-gpgpu
sudo apt-get install cmake
mkdir build && cd build
cmake ..
make

11. Now that the project is built, test the solution with:
make test
2
70
Appendix
C
Source code C.1
API header
#ifndef KNN_GPGPU
#define KNN_GPGPU

#include "point.h"

void buildKdTree(struct Point *points, int n_tree, struct Node *tree);
void queryAll(struct Point *h_query_points, struct Node *h_tree, int n_qp, int n_tree, int k, int *h_result);
void cuQueryAll(struct Point *query_points, struct Node *tree, int n_qp, int n_tree, int k, int *result);
void mpQueryAll(struct Point *query_points, struct Node *tree, int n_qp, int n_tree, int k, int *result);
void knn_brute_force_garcia(float *ref_host, int ref_width, float *query_host, int query_width, int height, int k, float *dist_host, int *ind_host);
void knn_brute_force(float *ref_host, int ref_nb, float *query_host, int dim, int k, float *dist_host, int *ind_host);

// #### Utils
size_t getFreeBytesOnGpu();
void cuSetDevice(int device);
int cuGetDevice();
int cuGetDeviceCount();

// Tree build
size_t getNeededBytesForBuildingKdTree(int n_tree);
size_t getTreeSize(int n_tree);

// Search
size_t getNeededBytesForQueryAll(int n_qp, int k, int n_tree);
size_t getNeededBytesInSearch(int n_qp, int k, int n_tree, int thread_num, int block_num);

#endif // KNN_GPGPU

C.2
Brute Force
#ifndef DATA_TYPES
#define DATA_TYPES

struct Distance
{
    int index;
    float value;

    // Assignment operator for volatile-qualified instances, needed when the
    // struct is kept in volatile shared memory during warp-synchronous reduction.
    __device__ __host__ volatile Distance &operator=(volatile Distance &a) volatile
    {
        index = a.index;
        value = a.value;
        return *this;
    }
};

#endif // DATA_TYPES

C.2.1
Bitonic sort version
#ifndef KNN_BRUTE_FORCE
#define KNN_BRUTE_FORCE

__global__ void cuComputeDistanceGlobal(float *ref, int ref_nb, float *query, int dim, float *dist);
__global__ void cuBitonicSort(float *dist, int *ind, int n, int dir);
__global__ void cuParallelSqrt(float *dist, int k);

void bitonic_sort(float *dist_dev, int *ind_dev, int n, int dir);
void knn_brute_force_bitonic_sort(float *ref_host, int ref_nb, float *query_host, int dim, int k, float *dist_host, int *ind_host);

#endif

// Includes
#include
#include <stdio.h>
#include