Generating example tuples for Data-Flow programs in Apache Flink
Master Thesis by Amit Pawar
Submitted to Faculty IV, Electrical Engineering and Computer Science, Database Systems and Information Management Group, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, as part of the ERASMUS MUNDUS programme IT4BI at the Technische Universität Berlin
July 31, 2015
Thesis Advisor: Johannes Kirschnick
Thesis Supervisor: Prof. Dr. Volker Markl
Eidesstattliche Erklärung
Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst, andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den benutzten Quellen wörtlich und inhaltlich entnommenen Stellen als solche kenntlich gemacht habe.
Statutory Declaration I declare that I have authored this thesis independently, that I have not used other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources.
Berlin, July 31, 2015
Amit Pawar
GENERATING EXAMPLE TUPLES FOR DATA-FLOW PROGRAMS IN APACHE FLINK by Amit Pawar Database Systems and Information Management Group Electrical Engineering and Computer Science Master's in Information Technology for Business Intelligence
Abstract
Dataflow programming is a programming paradigm in which computational logic is modeled as a directed graph from the input data sources to the output sink. The intermediate nodes between sources and sink act as processing units that define what action is to be performed on the incoming data. Due to its inherent support for concurrency, dataflow programming is a natural choice for many data-intensive parallel processing systems and is used extensively in the current big-data market. Among the wide range of parallel processing platforms available, Apache Hadoop (with its MapReduce framework), Apache Pig (which runs on top of MapReduce in the Hadoop ecosystem), and Apache Flink and Apache Spark (each with their own runtime and optimizer) are examples that leverage the dataflow programming style. Dataflow programs can process terabytes of data efficiently, but data at such a scale introduces difficulties: understanding the complete dataflow (what is the output of the dataflow or of any intermediate node), debugging (it is impractical to track large-scale data throughout the program using breakpoints or watches), and visual representation (it is quite difficult to display terabytes of data flowing through the tree of nodes). This thesis aims to address these limitations for dataflow programs on the Apache Flink platform using the concept of Example Generation, a technique to generate sample example tuples after each intermediate operation from source to sink. This allows the user to view and validate the behavior of the underlying operators and thus the overall dataflow. We implement the example generator algorithm for a defined set of operators and evaluate the quality of the generated examples. For ease of visual representation of the dataflow, we integrate this implementation with the Interactive Scala Shell available in Apache Flink.
Acknowledgements
I would like to express my deep sense of gratitude to my advisor Johannes Kirschnick for his excellent guidance, scientific advice and constant encouragement throughout this thesis work. His apt assistance helped me understand and tackle many challenges during the research, implementation and writing of this thesis. I would like to thank Prof. Dr. Volker Markl and Dr. Ralf-Detlef Kutsche for their kind coordination efforts. I would also like to thank Nikolaas Steenbergen, Stephan Ewen and all the members of the Flink dev user group for their kind advice and clarifications on understanding the platform. I express my sincere thanks to all the staff and professors of the IT4BI programme for their help and support during my Master's degree. Finally, I would like to thank all my friends from the IT4BI programme, generations I and II, for their constant support and cherished memories over the past two years.
Chapter 1
Introduction
Dataflow programming has gained popularity with the booming big-data market over the past decade. It is a data processing paradigm in which data movement is the focal point, in contrast to traditional programming (object-oriented, imperative, or procedural), where passing control from one object to another forms the basis of the programming model. Distributed data processing frameworks leverage dataflow programming by splitting and distributing a large dataset across different computing nodes, where each node performs the dataflow operations on its local data. A dataflow program can be seen as a directed graph from the source nodes to the sink nodes, where source nodes represent the input datasets consumed and sink nodes represent the output datasets generated by executing the program. All the intermediate nodes between source and sink act as processing units that define what action is to be performed on the incoming data. These actions fall into two categories: i. general relational algebra operators (e.g., join, cross, project, distinct), and ii. user-defined function operators (e.g., flatmap, map, reduce). Apache Hadoop [1] with its MapReduce [2] framework uses the dataflow programming style. Apache Flink [3] is a distributed streaming dataflow engine that falls into a newer category of big-data frameworks, an alternative to Hadoop's MapReduce, along with Apache Spark [4]. Other dataflow programming systems include Apache Pig [5], Aurora [6], Dryad [7], River [8], Tioga [9] and CIEL [10]. This thesis is based on dataflow programs in Apache Flink, where we generate sample examples after each intermediate node (operator) from source to sink, allowing the Flink user to:
1. View and validate the behavior of the underlying set of operators, and thus understand and learn the complete dataflow
2. Optimize the dataflow by determining the correct set of operators to achieve the final output
3. Monitor iterations in the case of an iterative KDDM (Knowledge Discovery and Data Mining) algorithm
4. Understand the behavior of user-defined function (UDF) operators (otherwise seen as a black box)
1.1 Motivation
Dataflow programming, like any other programming paradigm, is an iterative, incremental process. To arrive at a final, correct version of the dataflow, the user may go through several iterations. Each iteration consists of steps such as coding a dataflow, building and executing it on the respective system, and finally analyzing the output or the error log to determine whether the dataflow produced the expected outcome; if not, the user revises the program (normally via debugging) and repeats with a new iteration. Any programming paradigm dealing with a large-scale dataset incurs long execution and testing times, and the same applies to dataflow programming. Hence, this iterative model of development is time consuming and therefore inefficient when handling large-scale data. The whole process can be made more efficient if the user can verify the underlying execution of the dataflow, i.e., verify the execution at each node (operator). This can be done by checking the dataset that is consumed and generated at any given operator, which in turn allows the user to pinpoint and rectify the logic or error (if any). Visualizing the dataflow with example datasets after each operator execution allows the user to test the assumptions made in the program; in a way, it removes the need for debugging with breakpoints and watches. The overall process of example generation after each operator lets the user learn about the logic of the individual operators as well as the complete dataflow program. A dataflow program is a tree of operators from source to sink, where one or more sources (leaves) converge into a single sink (root) via different intermediate nodes.
To get optimal performance from the program, choosing the appropriate operator type is of utmost importance. For example, the Join operator in Apache Flink can be executed in multiple ways, such as Repartition or Broadcast, depending on the input size and order (a hedged code sketch follows at the end of this passage). With these options, the user can select the best operator variant to optimize the overall performance. Similarly, the user can decide to replace Join, or any costly operator, with a transformation, an aggregation, or another cheaper operator, as long as the final objective is accomplished. Dataflow programming can thus be seen as a plug-and-play model, where the user plugs/unplugs (adds/removes) operators and then plays (executes) the complete dataflow in order to decide on the most suitable set of operators for the final version of the program. In such a scenario, having a concise set of examples (instead of large datasets) that completely illustrates the dataflow is of great help to the user, as it avoids repeated cost- and time-heavy executions of the program after each plugging or unplugging.

Dataflow programming frameworks such as Flink and Spark are well suited for the implementation of machine-learning algorithms for Knowledge Discovery and Data Mining (KDDM). KDDM is in itself an iterative process, where a model (classification, clustering, etc.) is repeatedly trained for better prediction accuracy. For example, the k-means clustering algorithm, an iterative refinement process, is used in many data mining applications. In the k-means algorithm, given k clusters and n observations as input, the observations are iteratively allotted to the most appropriate cluster based on the computation of the cluster mean. This problem is NP-hard, and hence training such predictive models can be a cumbersome process, in which constant fine-tuning (via re-sampling, dataset feature changes, or mathematical re-computation) is needed. The overall modeling process in a dataflow environment can be complicated [11] and is often pictured as a black box, because the user has little insight into how the newly tuned dataflow might behave and has to wait until the process finishes before the output can be verified and the necessary action taken. Sample examples that demonstrate the dataflow quickly are of great help to the user, as they open up the black box and provide a quick and efficient way to fine-tune the dataflow and, in turn, the predictive model.
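The join-strategy choice mentioned above can be sketched on Flink's Java DataSet API as follows (a minimal, hedged example; the file names and field positions are assumptions):

```java
import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinHintSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical inputs: a small lookup table and a large fact dataset.
        DataSet<Tuple2<Integer, String>> small =
                env.readCsvFile("small.csv").types(Integer.class, String.class);
        DataSet<Tuple2<Integer, Double>> large =
                env.readCsvFile("large.csv").types(Integer.class, Double.class);

        // Broadcast the small input to all nodes instead of repartitioning both;
        // JoinHint.REPARTITION_HASH_FIRST etc. select repartition strategies instead.
        small.join(large, JoinHint.BROADCAST_HASH_FIRST)
             .where(0).equalTo(0)
             .print();
    }
}
```

Passing JoinHint.OPTIMIZER_CHOOSES (the default) instead leaves the strategy choice to the Flink optimizer.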
Among the several operators available in dataflow programming, UDF operators pose a readability problem to a new user (one who is new to the system or the code), as it might be difficult to guess the purpose of the respective UDF. It can be a grouping function, a transformation function, or a partition function; to interpret a UDF operator, one needs to investigate at the code level, which can be a tedious task for a new user. Instead, if the same user has access to the input consumed and the output generated by the UDF operator, it is easier to reason about the behavior of that operator.

The idea of presenting examples after each operator has been realized before and is used for testing and diagnostic purposes in Apache Pig. Apache Pig, a high-level dataflow language and execution framework for parallel computation, runs on top of the MapReduce framework in the Hadoop ecosystem. Pig features the Pig Latin language layer, which simplifies MapReduce programming via its set of operators. One of its diagnostic operators is ILLUSTRATE, which displays sample examples after each statement in a step-by-step execution of a sequence of statements (a dataflow program), where each statement represents an operator. It allows the user to review how data is transformed through a sequence of Pig Latin statements (forming a dataflow). This feature is included in Pig's testing and diagnostics package, as it allows the user to test dataflows on small datasets and get faster turnaround times. In this thesis, we enhance the understanding of large-scale dataflow program execution by introducing a new feature in Apache Flink which helps the user by providing sample examples at the operator level, similar to the ILLUSTRATE function in Apache Pig.

This thesis work, like Apache Pig, is based on an example generator [12]. The algorithm used in the example generator works by retrieving a small sample of the input data and then propagating this data through the dataflow, i.e., through the operator tree. However, some operators, such as JOIN and FILTER, can eliminate these sample examples from the dataflow. For example, consider an operator performing a join on two input datasets A(x,y) and B(x,z) on the common attribute x (the join key): if both A and B contain many distinct values for x, initial sampling at A and B may yield unmatched x values. Hence, the join may not produce an output example due to the absence of a common join key in both samples. Similarly, a filter operator executed on a sample dataset might produce an empty result if no input example satisfies the filtering predicate. To address such issues, the algorithm in [12] generates synthetic example data, which allows the user to examine the complete semantics of the given dataflow program.
This thesis concentrates on dataflow programs in Apache Flink. Our main goal is to implement an ILLUSTRATE-like example generator feature of Pig in Flink, such that it helps users tackle the problems faced with respect to dataflow programming in a big-data environment. As this thesis is oriented towards the implementation of a new feature in Flink, we try to answer the following questions:
1. What are the challenges in implementing the example generator, and how can they be tackled?
2. How can the concept of the example generator be realized in Flink? What inputs are required for the implementation, and how can they be obtained from Flink?
3. What extra features of Flink can be exploited in order to differentiate this implementation from others?
4. How well does the implemented algorithm perform with respect to the defined metrics?
1.3 Outline
The outline below lays out the approach we followed in this thesis, which subsequently helped us answer the questions raised in the previous section.
• A comprehensive study of dataflow programming and the systems that take advantage of this paradigm (Chapter 2)
• A broad introduction to the example generation problem, its challenges, and the approach towards the solution (Chapter 3)
• Implementation of the example generator algorithm in Flink, with an extensive description of the input construction, the various algorithm steps, and the Flink features used (Chapter 4)
• Experiments evaluating the performance of the implemented algorithm against the defined metrics, executing the implemented code on different dataflow programs and diverse datasets, covering the operators as well as the stages of the algorithm (Chapter 5)
• A brief overview of related work in the field of example generation (Chapter 6)
• A conclusion discussing the results and findings (Chapter 7)
Chapter 2
Dataflow Programming
Dataflow programming is a programming paradigm that internally represents applications as a directed graph, similar to a dataflow diagram [13]. A program is represented as a set of nodes (also called blocks) with input and/or output ports. These nodes act as sources, sinks, or processing blocks for the data flowing through the system. Nodes are connected by directed edges that define the flow of data between them. The model is reminiscent of the Pipes and Filters software architecture, where Filters are the processing nodes and Pipes serve as the passage for the data streams between the filters. One of the main reasons dataflow programming surged with the rise of big data is its inherent support for concurrency, which allows increased parallelism. In a dataflow program, each node is internally an independent processing block, i.e., each individual node can process on its own once it has its respective input data. This kind of execution enables data streaming: an intermediate node in the dataflow starts working as soon as data arrives from the previous node and transfers its output to the next node. Hence, there is no need to wait for the previous node to finish its complete execution. In a big-data environment, the aforementioned characteristics, parallelism and streaming, allow dataflow programs to run on a distributed system with computer clusters. Such a system splits the large dataset into smaller chunks and distributes them across cluster nodes; each node then executes the dataflow on its local chunk. The sub-results produced at all the nodes are combined to get the final result for the large dataset. This is the gist of how
a dataflow program is executed on a big-data system with a distributed processing environment. In this chapter, we discuss different big-data analytics systems that take advantage of the dataflow programming paradigm.
2.1 Apache Hadoop
Apache Hadoop [1] is a big-data framework that allows distributed processing of large-scale datasets across clusters of computers using simple programming models. Hadoop leverages dataflow programming via its MapReduce module. Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters in a reliable, fault-tolerant manner. A typical MapReduce program consists of a chain of mappers (with a Map method) and reducers (with a Reduce method), in that order. The Map and Reduce methods are second-order functions that consume an input dataset (either the whole or part of a large dataset) and a user-defined function (UDF) that is applied to the corresponding input data. In a MapReduce program, the mappers and reducers form the processing nodes of the dataflow, with sources and sinks (of the respective formats) explicitly specified. Despite being able to scale to very large datasets, Hadoop MapReduce has documented limitations; those related to dataflow programming are listed as follows:
• Lack of support for dedicated relational algebra operators such as Join, Cross, Union and Filter [14, 15]. These operators are frequently used in many iterative algorithms; e.g., Join is an important operator in the PageRank algorithm. This forces the user to custom-code the logic for these common operations, and the programmer needs to think in terms of map and reduce to implement them.
• Lack of inherent support for iterative programming [16], an integral feature of any dataflow programming framework and useful for many data analytics algorithms, e.g., the k-means algorithm, an iterative clustering process for finding k clusters. The workaround for iteration in MapReduce is to have
an external driver program that repeatedly invokes the mappers and reducers (a sketch follows below). But this comes with the added overhead of increased I/O latency and serialization issues, with no explicit support for specifying a termination condition.
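A minimal sketch of such a driver loop, assuming hypothetical KMeansMapper/KMeansReducer UDF classes and a hypothetical centroidsUnchanged convergence test; the Hadoop Job API calls themselves are standard:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeKMeansDriver {
    private static final int MAX_ITERATIONS = 20;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        boolean converged = false;
        for (int i = 0; i < MAX_ITERATIONS && !converged; i++) {
            // One MapReduce job per iteration; output of iteration i feeds iteration i+1.
            Job job = Job.getInstance(conf, "kmeans-iteration-" + i);
            job.setJarByClass(IterativeKMeansDriver.class);
            job.setMapperClass(KMeansMapper.class);    // hypothetical UDF
            job.setReducerClass(KMeansReducer.class);  // hypothetical UDF
            FileInputFormat.addInputPath(job, new Path("centroids/" + i));
            FileOutputFormat.setOutputPath(job, new Path("centroids/" + (i + 1)));
            job.waitForCompletion(true);
            // Termination must be checked outside MapReduce, e.g., by comparing
            // the centroid files of two consecutive iterations (hypothetical helper).
            converged = centroidsUnchanged(conf, i, i + 1);
        }
    }
}
```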
The first limitation, the lack of relational operators, is addressed in the Hadoop ecosystem by another project, Apache Pig.
2.2 Apache Pig
Apache Pig is a high-level dataflow language and execution framework for parallel computation on Apache Hadoop. Apache Pig [5] was initially developed at Yahoo to allow Hadoop users to focus on analyzing datasets rather than investing time in writing complex code with the map and reduce operators. Internally, Pig is an abstraction over MapReduce, i.e., all Pig scripts are converted into Map and Reduce tasks by its compiler. In this way, Pig makes programming MapReduce applications easier. The language of the platform is a simple scripting language called Pig Latin [17], which abstracts from the Java MapReduce idiom into a form similar to SQL. Pig Latin allows the user to write a dataflow that describes how the data will be transformed (via aggregation, join, or sort), as well as to develop their own functions (UDFs) for reading, processing, and writing data. A typical Pig program consists of the following steps:
1. LOAD the data for manipulation.
2. Run the data through a set of transformations. These transformations can be either relational algebra transformations (join, cross, filter, etc.) or user-defined functions. (All transformations are internally translated to Map and Reduce tasks by the compiler.)
3. DUMP (display) the data to the screen, or STORE the results in a file.
When relating a Pig program to a dataflow, the LOAD operators are the source nodes, the set of transformations forms the processing nodes, and DUMP or STORE are the sink nodes.
Pig addresses the limitations of MapReduce by providing a suite of relational operators for ease of data manipulation [5]. However, in order to achieve iteration, as well as other control-flow structures (if-else, for, while, etc.), one needs to use Embedded Pig, where Pig Latin statements and Pig commands are nested into scripting languages such as Python, JavaScript or Groovy. This reduces the simplicity of programming in Pig (one of its major selling points), as it introduces a JDBC-like compile, bind, run model that adds the overhead of complex invocations. As Pig translates all statements and commands from the scripts into MapReduce tasks, it is considered slower than well-implemented MapReduce code. Although Pig offers better scope for optimization than hand-written MapReduce, an optimized Pig script can at best perform on par with MapReduce code.
2.3 Apache Flink
Apache Flink (formerly known as Stratosphere) [3] is a data processing system and an alternative to Hadoop's MapReduce module. Unlike Pig, which runs on top of MapReduce, Flink comes with its own runtime, a distributed streaming dataflow engine that provides data distribution and communication for distributed computations over data streams. It features powerful programming abstractions in multiple languages, Java and Scala, giving the user different language options for programming a dataflow. Flink also supports automatic program optimization, allowing the user to focus on other data handling issues. It has native support for iterative programming via iteration operators such as bulk iterations and incremental iterations, and it supports programs consisting of large directed acyclic graphs (DAGs) of operations. One of the essential components of the Flink framework is the Flink Optimizer [18], which provides automatic optimization for a given Flink job and offers techniques to minimize the amount of data shuffling, thus producing an optimized data processing pipeline. The Flink Optimizer is based on the PACT [19] (Parallelization Contract) programming model, which extends the concepts of MapReduce but is also applicable to more complex operations. This allows Flink to support relational operators such as Join, Cross, Union, etc. The output of the Flink optimizer is a compiled and optimized PACT program, which is essentially
a DAG-based dataflow program. This is how dataflow programming is achieved in Apache Flink. A typical Flink program consists of the same basic steps:
1. Load/create the initial data.
2. Specify transformations of this data.
3. Specify where to put the results of the computations.
4. Trigger the program execution.
Step 1 specifies the source nodes of the dataflow program, the transformations of Step 2 form the processing nodes, and Step 3 specifies the sink nodes. Flink's runtime natively supports iterative programming [20], the feature lacking in MapReduce and not easily accessible in Pig, through its two types of iteration operators: Bulk and Delta. These operators encapsulate a part of the program and execute it repeatedly, feeding the result of one iteration back into the next. They also allow the termination criterion to be declared explicitly, as the sketch below shows. Such an iterative processing system makes the Flink framework extremely fast for data-intensive and iterative jobs compared to Hadoop's MapReduce and Apache Pig.
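A minimal sketch of a bulk iteration on the DataSet API; the Newton-step computation of sqrt(2) is an illustrative assumption, not an example from the thesis:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Start the iteration with an initial guess and an upper bound of 100 steps.
        IterativeDataSet<Double> loop = env.fromElements(1.0).iterate(100);

        // The step function: one Newton iteration for sqrt(2).
        DataSet<Double> step = loop.map(new MapFunction<Double, Double>() {
            @Override
            public Double map(Double x) {
                return (x + 2.0 / x) / 2.0;
            }
        });

        // closeWith feeds the step result back into the next iteration; an
        // overload closeWith(step, terminationCriterion) stops early once the
        // criterion DataSet becomes empty.
        DataSet<Double> result = loop.closeWith(step);
        result.print();
    }
}
```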
2.4 Comparison of Big-Data Analytics Systems
The implementation of a sample dataflow program, Word Count (reads text from files or a string variable and counts how often each word occurs), is presented in the Appendix for each framework, Hadoop MapReduce (A.1.2), Pig (A.1.1), and Flink (A.1.3), in order to observe the respective programming techniques and to get an idea of the effort needed to code the same program. As shown in Table 2.1, Apache Flink has a clear advantage over the other two systems, with its faster processing times and native support for relational operators as well as iterative programming. This makes Flink a natural choice when implementing a dataflow program for a large-scale dataset.
                          | Apache Hadoop (MR)                  | Apache Pig                 | Apache Flink
Framework                 | MapReduce                           | MapReduce                  | Flink optimizer and Flink runtime
Language Supported        | Java, C++, etc.                     | Pig Latin                  | Java, Scala, Python (beta)
Dataflow Processing Nodes | Mappers and Reducers                | Suite of operators, UDFs   | Suite of operators
Ease of Programming       | Simple, but tricky when             | Simple                     | Simple
                          | implementing join or similar        |                            |
                          | operators                           |                            |
Relational Operators      | Not natively supported,             | Natively supported         | Natively supported
                          | implemented via MR                  |                            |
Iterative Programming     | Not natively supported              | Supported via Embedded Pig | Natively supported
Processing Time           | Slow                                | Fast                       | Fastest

Table 2.1: Comparison of big-data systems in the context of dataflow programming
2.5 Limitations of Dataflow Programming
As discussed in [13], the major limitations of the dataflow programming paradigm are visual representation and debugging. In a big-data environment, with large-scale data flowing, these limitations are even more severe [21]. It is quite impractical to visually represent terabytes of data flowing through the tree of operators, and to denote what action is being performed on that data at each operator. To track down an error in a program, the user often introduces breakpoints and watches to monitor the flow of execution along with changes in variable values or output data; this approach, however, is futile when it comes to data of such vast size. This thesis work addresses these limitations in the Apache Flink environment. For a Flink dataflow program, we generate a concise set of example data at each node in the operator tree, such that the user can validate the behavior of each operator as well as the complete dataflow. To an extent, this eliminates the need for debugging (via breakpoints, watches, etc.), as the user can diagnose an error, after seeing the flow of sample examples in the dataflow, by locating the problematic operator in the tree and rectifying the logic at that very operator. A visual representation of a dataflow is available in Flink via its Web Client (the job submission interface), though the flow of data is not part of this representation. This thesis work integrates with Flink's Interactive Scala Shell, a new feature in Flink, to display the set of examples for a dataflow
program executed via the interactive shell, thus making the generated examples visually accessible to the user. In this way, we address the limitations of the dataflow programming paradigm in a big-data system, Apache Flink. In the next chapter, we define the example generation problem and explain the different approaches to generating a concise set of examples that allows the user to reason about the complete dataflow.
Chapter 3
Example Generation
In this chapter, we describe the problem of generating example records for dataflow programs and survey the existing approaches, along with the challenges faced and their drawbacks. We also describe the theoretical terms and concepts used throughout this thesis, with a brief introduction to the algorithm that improves on the drawbacks of the existing approaches.
3.1 Definition
A dataflow program is a directed graph G = (V, E), where V is the set of nodes denoting operators (sources, data transformation/processing nodes, sinks) and E is the set of directed edges denoting the flow of data from one operator to the next. Example generation is the process of producing a concise set of examples after each operator in the dataflow, such that it allows a user to understand the complete semantics of the dataflow program. Let us demonstrate this concept using an input dataflow that returns a list of highly populated countries.
Figure 3.1: Dataflow that returns highly populated countries
Figure 3.1 shows a dataflow that LOADs two datasets, Countries (ISO Code, Name) and Cities (Name, Country, Population, in millions), and performs a JOIN on the attribute Name/Country (Countries/Cities). The joined result is then grouped by country, and an aggregation (SUM) is performed on the population. Finally, we filter out the countries with a population of at most 4 million and present the remaining list as output. Example generation applied to the dataflow from Figure 3.1 produces the output shown in Figure 3.2, with sample examples after each operator to help the user understand the individual operators as well as the complete dataflow.
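For concreteness, the dataflow of Figure 3.1 can be sketched on Flink's Java DataSet API as follows (a hedged sketch; the file names and field layout are assumptions):

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class PopulousCountries {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // LOAD: Countries(isoCode, name) and Cities(name, country, populationInMillions)
        DataSet<Tuple2<String, String>> countries =
                env.readCsvFile("countries.csv").types(String.class, String.class);
        DataSet<Tuple3<String, String, Double>> cities =
                env.readCsvFile("cities.csv").types(String.class, String.class, Double.class);

        DataSet<Tuple2<String, Double>> result =
                // JOIN on Countries.Name == Cities.Country
                countries.join(cities).where(1).equalTo(1)
                        .with(new JoinFunction<Tuple2<String, String>,
                                Tuple3<String, String, Double>, Tuple2<String, Double>>() {
                            @Override
                            public Tuple2<String, Double> join(Tuple2<String, String> country,
                                    Tuple3<String, String, Double> city) {
                                return new Tuple2<>(country.f1, city.f2);
                            }
                        })
                        // GROUP by country and SUM the population
                        .groupBy(0).sum(1)
                        // FILTER: keep countries with population > 4 million
                        .filter(new FilterFunction<Tuple2<String, Double>>() {
                            @Override
                            public boolean filter(Tuple2<String, Double> row) {
                                return row.f1 > 4.0;
                            }
                        });

        result.print(); // sink
    }
}
```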
Figure 3.2: Sample output of Example Generation on the dataflow from Figure 3.1
This allows the user to verify the behavior of each operator just by looking at the output sample, e.g., verifying whether the aggregation has the intended effect or the filter behaves correctly. Instead of cross-checking the whole logic, the user can now target only the faulty operators. Figure 3.1 and Figure 3.2 are, respectively, the input and the output of any example generation algorithm. Let us now briefly explain the terms used throughout this thesis with respect to the dataflow example from Figure 3.1.
3.2 Dataflow Concepts
Source: The Load operators are the sources in the above dataflow, as they read the data from the input files/tables/collections.

Sink: The operator that produces (by storing in a file or displaying) the final result is the sink. Filter is the sink in the above dataflow.

Downstream pass: Starting from the sources, we move in the direction of the sink, i.e., from the Loads to the Filter.

Upstream pass: Starting from the sink, we move in the direction of the sources, i.e., from the Filter to the Loads.

Downstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator1 is the downstream neighbor of Operator2. Join is the downstream operator of both Loads; similarly, Group is the downstream neighbor of Join, and Filter of Group.

Upstream operator/neighbor: If Operator1 consumes the output of Operator2, then Operator2 is the upstream neighbor of Operator1. Group is the upstream neighbor of Filter; similarly, Join is of Group, and the Loads are of Join.

Operator Tree: The complete chain of operators from the sources to the sink makes up the operator tree. The operator tree is the representation of the given dataflow program in terms of the operators used. All these concepts are visualized in Figure 3.3.
Figure 3.3: Concepts related to a dataflow program
3.3 Example Generation Properties
Let us now briefly explain the properties that define a good example generation technique.

Completeness: The examples generated at each operator in the dataflow should collectively be able to illustrate the semantics of that operator. For example, in the dataflow output of Figure 3.2, the examples before and after the Filter operator clearly explain the semantics of filtering by population, as examples with low population are not propagated to the output. The same can be said of the Group operator: the examples illustrate the meaning of grouping by country and aggregating (via sum) the population. Completeness is the most important property, as it allows the user to verify the behavior of each and every operator in the dataflow.

Conciseness: The set of examples generated after each operator should be as small as possible, so as to minimize the effort of examining the data in order to verify the behavior of that operator. The output presented in Figure 3.2 cannot be considered concise, as the behavior of the operators could be explained with a smaller set of examples, as shown in Figure 3.4.

Realism: An example is considered real if it is present in the respective source file/table/collection. Thus, the set of examples generated by any operator must be a subset of the examples derivable from the sources in order to be considered real. In the dataflow output shown in Figure 3.2, assume that all the examples generated by the Load operators come from the respective source files; this means all the examples in the output of Figure 3.2 are real.

In Figure 3.4, we have altered the output of the dataflow (Figure 3.2) to make it more concise and thus ideal with respect to the properties defined.
Figure 3.4: Ideal set of examples satisfying all properties
3.4 Example Generation Techniques
In this section, we discuss the techniques that can be considered for example generation, the problems related to each approach, and how we can overcome them.
3.4.1 Downstream Propagation
The simplest way to generate examples at each operator in the dataflow is to sample a few examples from the source files and push them through the operator tree, executing each operator and recording the results after each execution. The problem with this approach is that the sampled examples do not always allow all operators in the operator tree to achieve total completeness. For example, in the dataflow from Figure 3.1, initial sampling might produce examples that cannot be joined, resulting in a lack of completeness (as shown in Figure 3.5). This can be averted by increasing the sample size and pushing as many examples as possible through the operator tree [22], but doing so violates the property of conciseness. As we seek an approach that generates a complete and concise set of examples, relying only on downstream propagation cannot be the choice for our implementation.
Figure 3.5: Incompleteness is often the case in downstream propagation
3.4.2 Upstream Propagation
The second approach that can be considered for example generation is to move from the sink to the sources, generating output examples based on the characteristics of each operator (joined examples, crossed examples, or filtered examples). From these outputs we generate the corresponding inputs and propagate upstream recursively until we reach the sources. This approach works well when the operator behavior is known (join, union, cross, etc.) but fails when a UDF operator is part of the operator tree. A UDF being a black box, it is hard to predict its behavior beforehand, so this approach fails to generate examples and results in incompleteness. The only way to generate examples for UDF operators is to push data through them in the downstream direction. Hence, neither downstream nor upstream propagation alone is sufficient to generate an ideal set of examples satisfying both completeness and conciseness. Nonetheless, a combination of both, together with pruning of redundant examples, leads to a better set of results.
3.4.3 Example Generator Algorithm
The algorithm with the steps 1. Downstream Pass, 2. Pruning, 3. Upstream Pass, 4. Pruning (in that order)
was proposed in [12] and was observed to be more efficient (at generating a complete and concise set of examples) than the downstream-only and upstream-only approaches. To ensure the completeness of all operators in an operator tree, [12] introduces the Equivalence Class Model.

Equivalence Class Model
For a given operator O, a set of equivalence classes εO = {E1, E2, ..., Em} is defined such that each class denotes one aspect of the operator's semantics. For example, a FILTER operator's equivalence classes are εfilter = {E1, E2}, where E1 denotes the set of examples passing the filtering predicate and E2 denotes the set of examples failing it. In Section 4.3 we define the equivalence classes for the set of supported operators. The following is the pseudo-code of the algorithm [12] used in this thesis for example generation.

Algorithm 1 Example Generator
Step 1: Use a downstream pass to generate possible sample examples at each operator in the operator tree.
Step 2: Using the equivalence class model, check each operator for completeness; for any incompleteness identified, generate sample examples using an upstream pass.
Step 3: Use a pruning pass to remove redundant examples from the operators.
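As an illustration of the equivalence class model, the following hedged sketch (not the thesis implementation) checks whether a set of examples witnesses both equivalence classes of a FILTER operator:

```java
import java.util.List;

import org.apache.flink.api.common.functions.FilterFunction;

public class FilterCompleteness {
    // Returns true if the examples cover both equivalence classes of a FILTER:
    // E1 (at least one example passes the predicate) and
    // E2 (at least one example fails it).
    public static <T> boolean isComplete(List<T> examples, FilterFunction<T> predicate)
            throws Exception {
        boolean coversE1 = false, coversE2 = false;
        for (T example : examples) {
            if (predicate.filter(example)) {
                coversE1 = true;
            } else {
                coversE2 = true;
            }
        }
        return coversE1 && coversE2;
    }
}
```

Step 2 of the algorithm triggers the upstream pass exactly when such a check fails for some operator.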
Let us now formally introduce the example generation problem with respect to Apache Flink that we address in this thesis.

3.5 Example Generation In Apache Flink
As discussed in Section 1.2, we implement the ILLUSTRATE feature available in Apache Pig on the completely distinct computing platform of Apache Flink. This work is based on the paper [12], which describes and proves the best possible
way to generate a set of complete, concise and real examples for a given dataflow program. In the implementation, we have considered a few notable differences between Pig and Flink, and thereby defined our requirements:
1. Flink is a multi-language framework (with support for Java, Scala and Python), in contrast to Pig, which uses a textual scripting language called Pig Latin. In our implementation, we focus on Flink's Java jobs.
2. Pig allows ILLUSTRATE either on a complete dataflow script or after any given dataset (also known as a relation) in the script. We have adapted this to support only a complete Flink job, i.e., a dataflow from source to sink. We are of the opinion that illustrating the complete dataflow is more practical than illustrating after every dataset, because all the used datasets are eventually displayed in a complete job illustration anyway.
3. Pig's input is transformed into a relation; a relation is a collection of tuples, similar to a table in a relational database. In Flink the input can be of any format (txt, csv, hdfs, collection), but for our implementation we need to convert these inputs into tuple format before performing any transformation on them, as our intention is to find example “tuples”.
These are the differences that we observed and accordingly adapted in our implementation. To give an overall picture of the implementation, let us briefly describe the basic life-cycle when the Flink illustrator, the module that generates sample examples, is invoked by a Flink job:
1. For a submitted job, an operator tree is created.
2. This operator tree forms the main input for our algorithm; the different passes then act on this input to generate complete, real and concise examples.
3. The generated examples are then displayed to the user (via the console or the Scala shell).
Having given an overview of the example generation concepts and introduced the algorithm and the equivalence class model, the next chapter provides a detailed explanation of the implementation with respect to Apache Flink: the process of
generating the input (the operator tree), followed by the list of supported Flink operators and their respective equivalence classes, and each pass of the algorithm. Finally, we describe the Flink features that have been integrated with the implementation, making it different and novel.
Chapter 4
Implementation
Having defined what example generation is and how it is useful in Apache Flink, this chapter gives an in-depth account of how the concept is realized in Flink. The topic is best treated under the following headings:
1. Operator Tree
2. Supported Operators
3. Equivalence Class
4. Algorithm
5. Flink add-ons
4.1 Operator Tree
An operator tree is a chain or list of single operators that starts from the input sources and ends at the output sink, with one or more data processing/transforming operators between the two. The operator tree is created once the submitted Flink job invokes the illustrator module. Before going into further detail on how we build an operator tree for a given job, let us define the Single Operator object, which is the granular unit of an operator tree.
4.1.1 Single Operator
The Single Operator object defines the different properties of the operator under consideration. The idea behind formalizing our own operator object, rather than using the one available from Flink, was: i. We were not going to use all the properties of Flink's operator class, as some of them add no value to our implementation, e.g., the degree of parallelism and compiler hints. ii. A few things needed specifically for our implementation were missing from the available Flink operator class, e.g., having the join keys easily available for a given input, having the output of each operator available for further manipulation, and having details that verify that the operator produced an output for the given set of inputs. Our Single Operator class along with its properties is shown in Figure 4.1.
Figure 4.1: The SingleOperator class and its properties
SingleOperator
  equivalenceClasses: List
  operator: Operator
  parentOperators: List
  operatorOutputAsList: List
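Rendered as a hedged Java sketch (the EquivalenceClass element type and the concrete generics are assumptions for illustration; the thesis implementation may differ):

```java
import java.util.List;

import org.apache.flink.api.common.operators.Operator;

// A sketch of the SingleOperator structure from Figure 4.1.
public class SingleOperator {
    private List<EquivalenceClass> equivalenceClasses; // operator semantics to cover (hypothetical type)
    private Operator<?> operator;                      // the wrapped Flink operator
    private List<SingleOperator> parentOperators;      // upstream neighbors in the operator tree
    private List<Object> operatorOutputAsList;         // recorded example output of this operator
    // getters and setters omitted for brevity
}
```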