Apache Flink – A System for Batch and Real-Time Stream Processing, Lecture Notes, Winter Semester 2016/2017, Ludwig-Maximilians-University Munich, © Prof. Dr. Matthias Schubert 2016
Introduction to Apache Flink
• Apache Flink is an open source stream processing framework
  • low latency
  • high throughput
  • stateful operators
  • distributed execution
• developed at the Apache Software Foundation
• version 1.0.0 was released in March 2016 and is used in production
Flink Software Stack
Research legacy of current systems:
• Apache Hadoop: MapReduce (OSDI '04)
• Apache Tez: Dryad, Nephele (EuroSys '07)
• Apache Flink: PACTs (SOCC '10, VLDB '12)
• Apache Spark: RDDs (HotCloud '10, NSDI '12)
Architecture
[Figure: Flink process model – Flink Client, Job Manager and Task Managers communicate via actor systems]
• Flink Client: code written against an API is translated by the graph builder & optimizer into a dataflow graph, which is submitted to the Job Manager
• Job Manager: receives the dataflow graph and coordinates its execution; contains the Scheduler and the Checkpoint Coordinator
• Task Managers: execute subtasks in task slots; each Task Manager has a Memory/IO Manager and a Network Manager; data streams are exchanged directly between Task Managers
• Job Manager and Task Managers exchange task status, heartbeats, statistics and checkpoint triggers
Dataflow Graphs
• all APIs (e.g. DataSet, DataStream) compile to dataflow graphs
[Figure: example dataflow graph with sources Src1/Src2, a stateful operator OP1, intermediate streams IS1/IS2 and sink Snk1]
• (stateful) operators (filters, joins, ...) = nodes
• (intermediate) data streams = edges
• for parallel processing the graph is split:
  • operators are executed as parallel subtasks
  • streams are split into stream partitions
• streams may be exchanged point-to-point, broadcast, merged, fanned out or repartitioned
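As an illustration of how a program maps onto such a graph, here is a minimal DataStream sketch; the socket sources, ports and the filter operator are assumptions for this example and not part of the slides.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> src1 = env.socketTextStream("localhost", 9000);   // source node (Src1)
DataStream<String> src2 = env.socketTextStream("localhost", 9001);   // source node (Src2)

src1.union(src2)                        // the two streams are merged
    .filter(line -> !line.isEmpty())    // operator node (e.g. OP1)
    .print();                           // sink node (e.g. Snk1)

env.execute("Dataflow Graph Example");  // compiles the program into a dataflow graph and runs it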
Intermediate Data Streams
• core abstraction for data exchange
• may or may not be materialized on disk
• pipelined execution: data is continuously produced, buffered and consumed
  [Figure: pipelined word count on the stream "a b a a b a" – while map and shuffle are still running, the reducer already emits partial counts such as (a,1), (a,2), (a,3)]
• blocking data exchange: the complete output is generated and stored before it is handed to the consumer (=> the complete intermediate result of the stream must be stored)
  [Figure: blocking word count on the same stream – the reducer receives the full groups (a,{1,1,1,1}) and (b,{1,1}) and emits only the final counts (a,4) and (b,2)]
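In the DataSet API the exchange mode can be influenced globally. A minimal sketch, assuming an ExecutionEnvironment env as in the later batch examples; the default is pipelined exchange.

// default: pipelined data exchange
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

// prefer blocking (materialized) data exchange for network shuffles
env.getConfig().setExecutionMode(ExecutionMode.BATCH);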
Latency and Throughput
Data exchange is based on buffers:
• a ready data record is serialized into one or many buffers
• a buffer is sent to the consumer when it is full or when a timeout expires
⇒ large buffers increase throughput (less per-buffer overhead)
⇒ a low timeout enables low latency (real-time processing = data is processed within a guaranteed time limit)
[Figure: throughput (average in millions of events/sec) and 99th-percentile latency (milliseconds) plotted against the buffer timeout (0, 5, 10, 50, 100 milliseconds)]
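The throughput/latency trade-off can be tuned per job via the buffer timeout. A minimal sketch, assuming a StreamExecutionEnvironment env; the concrete timeout values are only illustrative.

env.setBufferTimeout(10);   // flush buffers at the latest after 10 ms (low latency)
env.setBufferTimeout(100);  // larger timeout: fuller buffers, higher throughput
env.setBufferTimeout(0);    // flush after every record (minimal latency, lowest throughput)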
Control Events and Fault Tolerance
• Exemplary types of control events:
  • checkpoint barriers: coordinate checkpoints by dividing the stream into a pre-checkpoint and a post-checkpoint part
  • watermarks: signal the progress of event time within a stream partition
  • iteration barriers: signal the end of a superstep in iterative processing
• control events are injected into the stream and mark positions in the data stream for the operators
• reliable execution with exactly-once semantics:
  • consistency is guaranteed (rather than availability on all nodes)
  • checkpointing and partial re-execution
  • based on the assumption that the data sources are persistent and replayable (e.g. files, Apache Kafka)
  • regular snapshots prevent unbounded recomputation
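Checkpointing is enabled per job; the snapshot interval bounds how much work must be replayed after a failure. A minimal sketch, assuming a StreamExecutionEnvironment env; the interval value is only illustrative.

// take a distributed snapshot every 5 seconds; exactly-once processing is the default mode
env.enableCheckpointing(5000);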
Asynchronous Barrier Snapshotting
• a barrier corresponds to a logical time => it separates the stream and marks the part that belongs to the snapshot
• barriers are injected into the stream at the sources
• an operator waits until the barriers from all of its inputs have been received
• it then writes its state to durable storage (e.g. disk)
• checkpoint barriers are forwarded downstream after the operator has taken its snapshot
• recovery: restart the computation from the last successful snapshot
[Figure: data stream with checkpoint barriers at t1, t2, t3; the state between consecutive barriers is snapshotted over time (snap t1, snap t2)]
Iterative Data Flows
• iterative algorithms are common in data mining, machine learning and graph processing
• in other cloud-based computation frameworks (e.g. Hadoop, Spark):
  • a loop runs in the client program
  • each iteration starts a new parallel execution (compare k-Means on Hadoop)
• Flink provides integrated iteration processing:
  • an iteration step is a special operator that contains an execution graph
  • iteration head and iteration tail are connected via a feedback stream (carries what is kept between iterations)
  • loop control events coordinate the iterations
[Figure: iteration step between source (Src) and sink (Snk); feedback stream from iteration tail back to iteration head; data records in loop transit vs. data records outside the loop]
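A minimal sketch of a feedback stream in the DataStream API, assuming a StreamExecutionEnvironment env; the sequence source and the decrement logic are only illustrative.

DataStream<Long> input = env.generateSequence(1, 10);

IterativeStream<Long> iteration = input.iterate();        // iteration head
DataStream<Long> step = iteration.map(new MapFunction<Long, Long>() {
    @Override
    public Long map(Long v) { return v - 1; }             // iteration step
});
iteration.closeWith(step.filter(v -> v > 0));             // feedback stream (iteration tail)
DataStream<Long> out = step.filter(v -> v <= 0);          // records leaving the loop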
Stream Processing with Dataflows
• Flink manages time: out-of-order events, windows, user-defined state
• two notions of time:
  • event time: the time when the event originated (e.g. its timestamp)
  • processing time: the wall-clock time at which worker X processes the event
• skew between both is possible in distributed environments: records may arrive out of order with respect to event time
• low watermarks mark global progress (e.g. all events with a timestamp lower than t have entered an operator)
• watermarks originate at the sources of the graph
• operators decide how to react to a watermark
• operators with multiple inputs forward the minimum of their input watermarks
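A minimal sketch of assigning event-time timestamps and watermarks, assuming a DataStream<SensorEvent> events whose elements carry a millisecond timestamp; the SensorEvent type, its accessor and the 5-second out-of-orderness bound are assumptions for illustration.

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<SensorEvent> withTimestamps = events.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<SensorEvent>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(SensorEvent event) {
            return event.getTimestampMillis();   // event time taken from the record (assumed accessor)
        }
    });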
Stateful Stream Processing
• stateless operators: the operator works independently on each input record
  • for example the simple map function in word count: lambda x: (x, 1)
  • no memory, not dependent on the input order
• stateful operators: the operator keeps an internal state
  • for example a regression function a⋅x + t (a and t are trained over the stream of input data)
  • the state stores the model parameters
• state is incorporated into the API by:
  • operator interfaces for registering local variables
  • operator-state abstractions for declaring partitioned key-value state and its associated operations
• state can be checkpointed
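A minimal sketch of partitioned key-value state, assuming a keyed stream of (word, count) pairs: a running count per key is kept in a ValueState that is checkpointed together with the stream. The class and field names are assumptions for illustration.

public static class RunningCount
        extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    private transient ValueState<Integer> count;   // partitioned (per-key) state

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void flatMap(Tuple2<String, Integer> in,
                        Collector<Tuple2<String, Integer>> out) throws Exception {
        Integer current = count.value();                       // null before the first update
        int updated = (current == null ? 0 : current) + in.f1;
        count.update(updated);                                 // state is included in checkpoints
        out.collect(new Tuple2<>(in.f0, updated));
    }
}

// usage: stream.keyBy(0).flatMap(new RunningCount());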
Stream Windows
• windows are stateful operators configured via:
  • assigner: assigns each record to one or many logical windows
  • trigger (optional): defines when the operation on the window is performed
  • evictor (optional): defines which records to retain in each window
• predefined operators are available, e.g. sliding time windows
• user-defined functions allow flexible customization
Examples:

stream
  .window(SlidingTimeWindows.of(Time.of(6, SECONDS), Time.of(2, SECONDS)))
  .trigger(EventTimeTrigger.create())

stream
  .window(GlobalWindow.create())
  .trigger(Count.of(1000))
  .evict(Count.of(100))
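Combined with an aggregation, a similar configuration can be written against the Java DataStream API. A minimal sketch, assuming a DataStream<Tuple2<String, Integer>> counts of (word, count) pairs; the window sizes are only illustrative.

DataStream<Tuple2<String, Integer>> windowed = counts
    .keyBy(0)                                       // partition the stream by the word
    .timeWindow(Time.seconds(6), Time.seconds(2))   // sliding window: 6 s length, 2 s slide
    .sum(1);                                        // aggregate the counts per window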
Batch Processing
• batch processing can be considered a special case of stream processing (bounded streams)
• the syntax for batch processing can be defined in a simpler way
• additional options for optimizing the processing become possible
⇒ Flink offers additional functionality for batch processing:
  ⇒ blocked execution: break up large computations into isolated stages
  ⇒ no periodic snapshotting when its overhead is large; instead recover from the last materialized intermediate stream
  ⇒ blocking is implemented as an operator that explicitly waits until its complete input is consumed => the runtime environment does not need to distinguish batch from streaming
  ⇒ spilling to disk may become necessary
  ⇒ Flink provides a dedicated DataSet API with familiar functions, e.g. map
  ⇒ query optimization is used to transform API programs into efficient execution graphs
Query Optimization
• the query optimizer builds on techniques from parallel databases:
  • plan equivalence
  • cost modeling
  • interesting-property propagation
• problem: the operators have no predefined semantics (user-defined functions!)
  • cardinality and cost estimation are hard to perform for the same reason
• supported execution strategies include:
  • repartitioning and broadcasting
  • sort-based grouping
  • sort- and hash-based joins
• the optimizer evaluates physical plans using interesting-property propagation
• costs include disk I/O and CPU cost
• to handle user-defined functions, optimizer hints are allowed
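A minimal sketch of such a hint in the DataSet API, assuming two data sets large (DataSet<Tuple2<Integer, String>>) and small (DataSet<Tuple2<Integer, Double>>); the hint tells the optimizer to broadcast the small input instead of repartitioning both sides.

DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, Double>>> joined = large
    .join(small, JoinHint.BROADCAST_HASH_SECOND)   // hint: broadcast and hash the second input
    .where(0)                                      // join key of the first input
    .equalTo(0);                                   // join key of the second input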
Memory Management
• Flink serializes data into memory segments instead of storing objects on the JVM heap
• operations work as much as possible directly on the binary data => reduces the serialization/deserialization overhead
• for arbitrary objects, Flink uses type inference and custom serialization
• the binary representation and storing data off-heap reduce garbage collection overhead
• spilling data to disk remains the fallback when memory runs out
[Figure: memory layout – unmanaged heap, Flink-managed heap with memory pages (including empty pages) and the network buffer pool; a user type such as public class WC { public String word; public int count; } is stored in serialized binary form across memory pages]
Batch Iterations
• iterative methods are common in data analytics:
  • parallel gradient descent
  • expectation maximization
• parallelization models for iterative methods:
  • Bulk Synchronous Parallel (BSP)
  • Stale Synchronous Parallel (SSP)
• Flink supports various iteration models by providing iteration control events
  • for example, in BSP they mark the begin and end of supersteps
• Flink includes novel optimization concepts:
  • delta iterations: exploit sparse computational dependencies (see the sketch below)
[Figure: execution plan of a delta iteration consisting of map, join and reduce operators]
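A minimal sketch of a delta iteration in the DataSet API, assuming (id, value) tuples for the solution set and the workset; initialSolutionSet, initialWorkset and the ComputeDelta join function are assumptions for illustration.

DeltaIteration<Tuple2<Long, Double>, Tuple2<Long, Double>> iteration =
    initialSolutionSet.iterateDelta(initialWorkset, 20, 0);  // at most 20 supersteps, key field 0

DataSet<Tuple2<Long, Double>> delta = iteration.getWorkset()
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .with(new ComputeDelta());                                // assumed user-defined join function

DataSet<Tuple2<Long, Double>> result =
    iteration.closeWith(delta, delta);                        // solution-set updates and next workset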
API Examples
Word Count in Java

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> counts = text
    .map(l -> l.split("\\W+"))
    .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
        Arrays.stream(tokens)
            .filter(t -> t.length() > 0)
            .forEach(t -> out.collect(new Tuple2<>(t, 1)));
    })
    .groupBy(0)
    .sum(1);

counts.writeAsText(output);  // a sink is required before execute(); output path assumed
env.execute("Word Count Example");
API Examples
k-Means in Java

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Point> points = getPointDataSet(params, env);
DataSet<Centroid> centroids = getCentroidDataSet(params, env);

IterativeDataSet<Centroid> loop = centroids.iterate(params.getInt("iterations", 10));

DataSet<Centroid> newCentroids = points
    .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
    .map(new CountAppender())
    .groupBy(0).reduce(new CentroidAccumulator())
    .map(new CentroidAverager());

DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids);

DataSet<Tuple2<Integer, Point>> clusteredPoints = points
    .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");
References
• https://flink.apache.org/
• Carbone et al.: Apache Flink: Stream and Batch Processing in a Single Engine, IEEE Bulletin of the Technical Committee on Data Engineering, 2015
• Christian Boden: Introduction to Apache Flink, Technologie-Workshop "Big Data", FZI Karlsruhe, 22 June 2015