Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020J34WetX1aOXxSJ4Kok I6q3oxWMLxhbaUDbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaca9SdWvuDGSZeAWp QoFGr/LV7 J34WetX1aOXxSJ4Kok I6q3oxWMLxhbaUDbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaca9SdWvuDGSZeAWp QoFGr/LV7 J34WetX1aOXxSJ4Kok I6q3oxWMLxhbaUDbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaca9SdWvuDGSZeAWp QoFGr/LV70 码力 | 81 页 | 13.18 MB | 1 年前3
Flink如何实时分析Iceberg数据湖的CDC数据+arquet、Avro、Orcn。 t点 A3a/21 Kudu 维护 CDC 数据p 、支持L时更新数据,时效性佳。 2、CK加速,适合OLAP分析。 方案评估 优点 、cedKudup群,a较小众。维护 O本q。 2、H HDFS / S3 / OSS 等D裂。数据c e,且KAO本不如S3 / OSS。 3、Kudud批量P描不如3ar4u1t。 4、不支持增量SF。 h点 直接D入CDC到Hi2+分析 、流程能E作 2、Hi2+存量数据不受增量数据H响。 方案评估 优点 、数据不是CR写入; 2、每次数据D致都要 MERGE 存量数据 。T+ 方GT新3R效性差。 3、不M持CR1ps+rt。 缺点 SCaDk + )=AFa IL()(数据 MER,E .NTO GE=DE US.N, chan>=E ON GE=DE.GE=D.< = chan>=E THEN .NSERT (GE=D.<, a<=E.GE=D.<, chan>=E.a< t S1a2k+D+/4a,架构简e。 2、无在k服务。l护和运nS本低。 2、D存存储,Ca速O快。 3、方便上S3 OSS,超高性价比。 方案s估 优点 1、增量和全量表割p,时效性不足。 2、r计和l护额外hChang+ 0 码力 | 36 页 | 781.69 KB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020failure Precise t1 t2 t3 t4 t5 t6 … Gap t1 t2 t3 t5 t6 … Rollback-repeating t1 t2 t3 t2 t3 t4 … Rollback-convergent t1 t2 t3 t’2 t’3 t4 … Rollback-divergent t1 t2 t3 t’2 t’3 t’4 … The output secondary receives tuples from upstream and processes them in parallel with the primary but it doesn’t output results • Watermarks are used to identify duplicate output tuples and trim the secondary’s definitely isn’t Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter: if true, the element is probably in the set if false, it definitely isn’t Separate bloom0 码力 | 49 页 | 2.08 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020events? 48 t t+1 t+3 t+4 t+5 t+6 t+7 t+2 3 events 4 events 2 events? How would you compute… • the maximum every 100 events? • clicks per user session? 49 t t+1 t+3 t+4 t+5 t+6 t+7 t+2 logged in per user session? • faster than the batch size? • alerts when patterns occur? 50 t t+1 t+3 t+4 t+5 t+6 t+7 t+2 How would you compute… ??? Vasiliki Kalavri | Boston University 2020 51 • TaskManagers0 码力 | 54 页 | 2.83 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 • ValueState[T]: a single value of type T • ValueState.value() • ValueState.update(value: T) • ListState[T]: a list of elements of type T • ListState.add(value: T) • ListState.addAll(values: addAll(values: java.util.List[T]). • List State.get(): Iterable[T] • ListState.update(values: java.util.List[T]) Flink’s state primitives 13 Vasiliki Kalavri | Boston University 2020 • MapState[K, V]: a over the contained entries, keys, and values • ReducingState[T]: aggregates values using a ReduceFunction • ReducingState.add(value: T) • ReducingState.get() • AggregatingState[I, O]: aggregates0 码力 | 24 页 | 914.13 KB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 7 Let G(t) = (V(t), E(t)) be the graph observed up to timestamp t. For t=0, V(t) = E(t) = {} For every t > 0, we receive one event: • Insert-only edge stream: or deletions A t+1, the graph is obtained by inserting a new edge or deleting an existing edge (u, v) to E(t+1). If any of u, v do not already exist in V(t), they are added to V(t+1). Preliminaries • Edge endpoints must have different signs • When merging components, if flipping all signs doesn’t work => the graph is not bipartite Bipartite graph checking ??? Vasiliki Kalavri | Boston University0 码力 | 72 页 | 7.77 MB | 1 年前3
PyFlink 1.15 Documentationjob, it could choose one of these Python virtual environments to use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m requirements where the pre-installed Python environments could not meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m deployment. You could execute PyFlink jobs in application mode as following: ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationjob, it could choose one of these Python virtual environments to use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m requirements where the pre-installed Python environments could not meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m deployment. You could execute PyFlink jobs in application mode as following: ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m0 码力 | 36 页 | 266.80 KB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020alternatively, events can have validity intervals. • The contents of the relation at time t are all events with Vs ≤ t . Vasiliki Kalavri | Boston University 2020 Types of streams • Base stream: produced 64K <t1, 16.2.3.7, 10.1.0.2, 20K> <t2, 13.5.6.7, 12.4.0.3, 32K> <t3, 16.2.3.7, 11.8.6.2, 28K> 17 append… … … … new events old events R(t1) R(t2) R(t3) R(tk) concatenation of serializations of the relations. • as a list of tuple-index pairs, where <t, j> indicates that t ∈ rj • as a serialization of r1 followed by a series of delta tuples that indicate updates 0 码力 | 45 页 | 1.22 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020.inspect(|x| println!("seen: {:?}", x)) .connect_loop(handle); }); t (t, l1) (t, (l1, l2)) Streaming Iteration Example Terminate after 100 iterations Create the feedback Vasiliki Kalavri | Boston University 2020 Window types (II) • Fixed windows have bound which don’t move • events received between 1/1/2019 and 12/1/2019 • Landmark windows have a fixed lower bound Sequence: Let t1, … ,tn be tuples from a relation R. The list S = [t1, … ,tn] is called a sequence, of length n, of tuples from R. The empty sequence [ ] has length 0. We use t ∈ S to denote that0 码力 | 53 页 | 532.37 KB | 1 年前3
共 20 条
- 1
- 2













