Streaming in Apache Flinkup an environment to develop Flink programs • Implement streaming data processing pipelines • Flink managed state • Event time Streaming in Apache Flink • Streams are natural • Events of any type0 码力 | 45 页 | 3.00 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and FlinkScalable Stream Processing - Spark Streaming and Flink Amir H. Payberah payberah@kth.se 05/10/2018 The Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Spark streaming ▶ Flink 4 / 79 Spark Streaming 5 / 79 Contribution ▶ Design issues • Continuous vs. micro-batch processing • Record-at-a-Time vs. declarative APIs 6 / 79 Spark Streaming ▶ Run Run a streaming computation as a series of very small, deterministic batch jobs. • Chops up the live stream into batches of X seconds. • Treats each batch as RDDs and processes them using RDD operations0 码力 | 113 页 | 1.22 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020??? Vasiliki Kalavri | Boston University 2020 2 • Costs of streaming operator execution • state, parallelism, selectivity • Dataflow optimizations • plan translation alternatives • Runtime optimizations the basics 3 source sink input port output port dataflow graph ??? Vasiliki Kalavri | Boston University 2020 Revisiting the basics 4 Dataflow graph • operators are nodes, data channels are edges Pipeline: A || B Task: B || C Data: A || A ??? Vasiliki Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient0 码力 | 54 页 | 2.83 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/28: Graph Streaming ??? Vasiliki Kalavri | Boston University 2020 Modeling the world as a graph 2 Social networks a vertex and all of its neighbors. Although this model can enable a theoretical analysis of streaming algorithms, it cannot adequately model real-world unbounded streams, as the neighbors cannot be continuously generated as a stream of edges? • How can we perform iterative computation in a streaming dataflow engine? How can we propagate watermarks? • Do we need to run the computation from scratch0 码力 | 72 页 | 7.77 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Kalavri vkalavri@bu.edu CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/04: Streaming languages and operator semantics Vasiliki Kalavri | Boston University 2020 Vasiliki Kalavri | Boston interval of 5–15 s) by an item of type C with Z < 5. 8 Vasiliki Kalavri | Boston University 2020 Streaming Operators 9 Vasiliki Kalavri | Boston University 2020 Operator types (I) • Single-Item Operators println!("seen: {:?}", x)) .connect_loop(handle); }); t (t, l1) (t, (l1, l2)) Streaming Iteration Example Terminate after 100 iterations Create the feedback loop 13 Vasiliki Kalavri0 码力 | 53 页 | 532.37 KB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020relatively static and historical data • batched updates during downtimes, e.g. every night Streaming Data Warehouse • low-latency materialized view updates • pre-aggregated, pre-processed streams streams and historical data Data Management Approaches 4 storage analytics static data streaming data Vasiliki Kalavri | Boston University 2020 DBMS vs. DSMS DBMS DSMS Data persistent relations stream can be viewed as a massive, dynamic, one-dimensional vector A[1…N]. The size N of the streaming vector is defined as the product of the attribute domain size(s). Note that N might be unknown0 码力 | 45 页 | 1.22 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 20204/02: Elasticity policies and state migration ??? Vasiliki Kalavri | Boston University 2020 Streaming applications are long-running • Workload will change • Conditions might change • State is resources, spawn new processes or release unused resources, safely terminate processes • Adjust dataflow channels and network connections • Re-partition and migrate state in a consistent manner • Block problem 5 Given a logical dataflow with sources S1, S2, … Sn and rates λ1, λ2, … λn identify the minimum parallelism πi per operator i, such that the physical dataflow can sustain all source rates0 码力 | 93 页 | 2.42 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam / Google Cloud Dataflow 2 Vasiliki Kalavri | Boston University <#StarWars, 300> Any non-trivial streaming computation maintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 3 Vasiliki Kalavri <#Brexit> Any non-trivial streaming computation maintains state: • rolling aggregations • window contents • input offsets • machine learning models State in dataflow computations 3 Vasiliki Kalavri0 码力 | 49 页 | 2.08 MB | 1 年前3
Apache Flink的过去、现在和未来柏林工业大学博士生项目 • 基于流式 runtime 的批处理引擎 • 2014 年 8 月份 发布 Flink 0.6.0 Flink 0.7 Runtime Distributed Streaming Dataflow DataStream API Stream Processing DataSet API Batch Processing 2014 年 12 月份 发布 – 开始正式支持 6. Schedule Task YARN RM K8S RM 增量 Checkpoint 时间 全量状态 增量状态 增量 snapshot 基于 credit 的流控机制 Streaming SQL ------------------------- | USER_SCORES | ------------------------- | User | Score | Batch Processing Continuous Processing & Streaming Analytics Event-driven Applications ✔ 现在 Flink 1.9 的架构变化 Runtime Distributed Streaming Dataflow Query Processor DAG & StreamOperator Local0 码力 | 33 页 | 3.36 MB | 1 年前3
這些年,我們一起追的HadoopHortonworks Hadoop 簡史 5 / 74 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It 成是一個中立的機制,方便支援各種不同 Framework。 23 / 74 YARN - Yet Another Resource Negotiator A General-Purpose Distributed Application Management Framework Data Operating System for Enterprise Hadoop 24 / 74 Resource 26 / 74 Hadoop 家族 27 / 74 Windows 家族 28 / 74 Hadoop 動物園 29 / 74 HDFS:Distributed File System MapReduce:Distributed Data Analysis Engine Avro:Language-Neutral Data Serialization System (2010-050 码力 | 74 页 | 45.76 MB | 1 年前3
共 323 条
- 1
- 2
- 3
- 4
- 5
- 6
- 33













