Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
• What can be optimized? Operator re-ordering, scheduling and placement decisions, and the choice among different algorithms, e.g. a hash-based vs. a broadcast join.
• What does performance depend on? The input data, the intermediate data, and operator properties.
• Re-ordering: applying B and then A instead of A and then B holds if both operators are stateless.
• Split and merge can also be re-ordered (split then merge vs. merge then split). When might this be beneficial?
• Theta-join operations are commutative; natural joins are associative.
• Move projections early to reduce data item size; pick join orderings that minimize the size of intermediate results.
(54 pages, 2.83 MB, 1 year ago)
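A minimal Python sketch of the re-ordering argument in the entry above, assuming both operators are stateless and the filter only inspects an attribute the map preserves. The names, data, and selectivity are illustrative, not from the slides:

```python
# Hedged sketch: re-ordering a stateless filter ahead of a stateless map
# yields the same output while invoking the expensive map on fewer records.

calls = {"A": 0, "B": 0}

def enrich(x, plan):
    """Stand-in for an expensive stateless map: x -> (x, x*x)."""
    calls[plan] += 1
    return (x, x * x)

def selective(x):
    """Stateless filter on the preserved attribute; selectivity = 0.1."""
    return x % 10 == 0

records = list(range(100))

# Plan A: map first, then filter (the filter reads only the preserved field).
plan_a = [p for p in (enrich(x, "A") for x in records) if selective(p[0])]

# Plan B: filter pushed ahead of the map.
plan_b = [enrich(x, "B") for x in records if selective(x)]

assert plan_a == plan_b  # the operators commute: same result
print(calls)             # Plan B calls the expensive map 10x less often
```

With a selectivity of 0.1, plan B performs 10 expensive calls instead of 100, which is exactly why pushing selective operators early pays off.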
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
• Queries are represented graphically using boxes and arrows, e.g. two Tumble Window operators feeding a Join(S1.A = S2.A) over streams S1 and S2.
• Composite subscription patterns ...
• Flow management operators (I): join operators merge two streams by matching elements that satisfy a condition; they are commonly applied on windows.
(53 pages, 532.37 KB, 1 year ago)
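A toy Python sketch of the windowed join described above: pair up the elements of two window buffers whose join attribute matches, as in Join(S1.A = S2.A). The field names and data are made up for illustration:

```python
# Hedged sketch (not the lecture's code): a window join emits every pair
# (e1, e2) from the two window buffers whose join keys are equal.

def window_join(window_s1, window_s2, key):
    """Nested-loop join over two finite window buffers."""
    return [(e1, e2)
            for e1 in window_s1
            for e2 in window_s2
            if e1[key] == e2[key]]

# Toy tumbling-window contents for streams S1 and S2.
s1 = [{"A": 1, "v": "x"}, {"A": 2, "v": "y"}]
s2 = [{"A": 2, "w": 10}, {"A": 3, "w": 20}]

pairs = window_join(s1, s2, "A")
print(pairs)  # only the elements with A == 2 match
```

Real engines replace the nested loop with hashing on the join key, but the semantics per window are the same.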
Scalable Stream Processing - Spark Streaming and Flink
• Windowed counts: a DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
• Join operations (Spark Streaming):
  ▶ Stream-stream joins: in each batch interval, the RDD generated by stream1 is joined with the RDD generated by stream2.

    val stream1: DStream[(String, String)] = ...
    val stream2: DStream[(String, String)] = ...
    val joinedStream = stream1.join(stream2)

  ▶ Joins over windows of the streams:

    val windowedStream1 = stream1.window(...)
    val windowedStream2 = stream2.window(Minutes(1))
    val joinedStream = windowedStream1.join(windowedStream2)

  ▶ Stream-dataset joins:

    val dataset: RDD[(String, String)] = ...

(113 pages, 1.22 MB, 1 year ago)
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
The AggregateFunction interface defines an accumulator type, an output type, initialization, accumulating one element, computing the result, and merging two partial accumulators:

  // compute the result from the accumulator and return it.
  OUT getResult(ACC accumulator);
  // merge two accumulators and return the result.
  ACC merge(ACC a, ACC b);
}

Example implementation for a per-key average over (key, sum, count) accumulators:

  override def getResult(acc: (String, Double, Int)) = {
    (acc._1, acc._2 / acc._3)
  }
  override def merge(acc1: (String, Double, Int), acc2: (String, Double, Int)) = {
    (acc1._1, acc1._2 + acc2._2, acc1._3 + acc2._3)
  }

Use the ProcessWindowFunction ...
(35 pages, 444.84 KB, 1 year ago)
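The same accumulator lifecycle can be sketched in plain Python, which makes the role of merge obvious: two partial windows for the same key are combined before the average is emitted. Function names mirror the interface; the key and values are made up:

```python
# Hedged Python sketch of the AggregateFunction lifecycle with a
# (key, sum, count) accumulator computing a per-key average.

def create_accumulator(key):
    return (key, 0.0, 0)

def add(acc, value):
    key, s, n = acc
    return (key, s + value, n + 1)

def get_result(acc):
    key, s, n = acc
    return (key, s / n)

def merge(acc1, acc2):
    # combine two partial accumulators for the same key
    return (acc1[0], acc1[1] + acc2[1], acc1[2] + acc2[2])

# Two partial accumulators for the same key, merged before emitting.
left = add(add(create_accumulator("sensor1"), 2.0), 4.0)
right = add(create_accumulator("sensor1"), 6.0)
print(get_result(merge(left, right)))  # ('sensor1', 4.0)
```

Keeping (sum, count) instead of a running average is what makes merge associative, and hence safe for session windows and pre-aggregation.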
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
• Connected components on an edge stream:
  • if a vertex appears for the 1st time, create a component with ID the min of the vertex IDs
  • if the endpoints are in different components, merge them and update the component ID to the min of the component IDs
  • if only one of the endpoints ...
• Distributed variant:
  1. partition the edge stream, e.g. by source ID
  2. maintain a disjoint set in each partition
  3. periodically merge the partial disjoint sets into a global one
• Spanners: can we do that for every incoming edge? Can we compute the distances in separate partitions and then merge them? Data-parallel streaming spanners on Flink?
(72 pages, 7.77 MB, 1 year ago)
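The three-step distributed variant above can be sketched with a small union-find structure: each partition maintains its own disjoint set, and the global set is obtained by replaying each partition's (vertex, root) pairs as unions. The edge data is illustrative:

```python
# Hedged sketch of partitioned connected components: partial disjoint sets
# are periodically merged into a global one.

class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo  # keep the min ID as the component ID

def merge_partials(partials):
    """Fold each partition's (vertex, root) pairs into a global disjoint set."""
    global_ds = DisjointSet()
    for ds in partials:
        for v in ds.parent:
            global_ds.union(v, ds.find(v))
    return global_ds

# Partition 1 saw edges (1,2),(2,3); partition 2 saw (3,4) and (5,6).
p1, p2 = DisjointSet(), DisjointSet()
p1.union(1, 2); p1.union(2, 3)
p2.union(3, 4); p2.union(5, 6)

g = merge_partials([p1, p2])
print(g.find(4), g.find(6))  # 1 5
```

Vertex 4 joins component 1 only after the merge, since the 3-4 edge lived in a different partition than the 1-2-3 chain.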
State management - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
• State operations and types: copy, checkpoint, restore, merge, split, query, subscribe, ... Consider you are designing a state interface: what are the advantages and disadvantages of each approach?
• RocksDB:
  • Iterator/RangeScan: seek to a specified key and then scan one key at a time from that point (keys are sorted)
  • Merge: a lazy read-modify-write
• In conf/flink.conf ...
(24 pages, 914.13 KB, 1 year ago)
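A toy Python model of the "lazy read-modify-write" idea behind the merge operation above (this mimics the semantics only; it is not RocksDB): a merge appends an operand without reading the current value, and operands are folded into the base value on the next read.

```python
# Hedged sketch of merge-operator semantics: O(1) writes, lazy folds on read.

class MergeStore:
    def __init__(self, merge_fn):
        self.base = {}       # materialized values
        self.pending = {}    # unapplied merge operands per key
        self.merge_fn = merge_fn

    def put(self, key, value):
        self.base[key] = value
        self.pending.pop(key, None)

    def merge(self, key, operand):
        # write path: record the operand, never read the current value
        self.pending.setdefault(key, []).append(operand)

    def get(self, key):
        # read path: fold pending operands into the base value
        value = self.base.get(key)
        for op in self.pending.pop(key, []):
            value = self.merge_fn(value, op)
        if value is not None:
            self.base[key] = value
        return value

# Counter-style merge function: treat a missing base value as 0.
store = MergeStore(lambda old, op: (old or 0) + op)
store.merge("clicks", 1)
store.merge("clicks", 1)
store.merge("clicks", 3)
print(store.get("clicks"))  # 5
```

This is why merge suits counter- and list-valued state: hot keys absorb many cheap writes and pay the combine cost once per read (or compaction).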
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri, Boston University, 2020
• Selectivity: how many records does an operator produce per record in its input?
  • map: 1 in, 1 out
  • filter: 1 in, 1 or 0 out
  • flatMap, join: 1 in; 0, 1, or more out
• Cost: how many records can an operator process in a unit of time? #records_in ...
(43 pages, 2.42 MB, 1 year ago)
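The selectivity definition above can be measured directly if every operator is modeled as a function returning zero or more output records per input record. The operators and numbers below are my own illustration:

```python
# Hedged sketch: selectivity = records_out / records_in, measured per operator.

def selectivity(operator, records):
    out = []
    for r in records:
        out.extend(operator(r))  # each operator returns 0..n output records
    return len(out) / len(records)

records = list(range(10))

as_map = lambda r: [r * 2]                       # 1 in, 1 out
as_filter = lambda r: [r] if r % 2 == 0 else []  # 1 in, 1 or 0 out
as_flatmap = lambda r: [r] * (r % 3)             # 1 in; 0, 1, or more out

print(selectivity(as_map, records))      # 1.0
print(selectivity(as_filter, records))   # 0.5
print(selectivity(as_flatmap, records))  # 0.9 on this input; can exceed 1.0
```

Combined with per-operator cost, such selectivity estimates tell a load shedder where dropping a record saves the most downstream work.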
How Flink analyzes CDC data in an Iceberg data lake in real time
• ... 4. incremental ... is not supported.
• Option evaluated: import CDC data directly into Hive for analysis.
  • Pros: 1. the pipeline is simple to operate; 2. the existing Hive data is not affected by the incremental data.
  • Cons: 1. the data is not written in real time; 2. every sync must MERGE the changes into the existing data, so this T+1 approach has poor freshness; 3. upsert is not supported.
• The merge step is expressed as a statement of the form MERGE INTO ... USING changes ON ...
(36 pages, 781.69 KB, 1 year ago)
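What that MERGE step does logically can be sketched as folding a batch of CDC change records (upserts and deletes keyed by a primary key) into the existing snapshot. The field names, op names, and rows below are made up for illustration:

```python
# Hedged sketch of the T+1 merge: apply a CDC batch to the current snapshot.

def apply_changes(snapshot, changes):
    """Return a new snapshot with the change batch applied (upsert/delete)."""
    merged = dict(snapshot)
    for change in changes:
        if change["op"] in ("insert", "update"):   # upsert on the primary key
            merged[change["id"]] = change["row"]
        elif change["op"] == "delete":
            merged.pop(change["id"], None)
    return merged

snapshot = {1: {"name": "a"}, 2: {"name": "b"}}
changes = [
    {"op": "update", "id": 1, "row": {"name": "a2"}},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "row": {"name": "c"}},
]
print(apply_changes(snapshot, changes))
# {1: {'name': 'a2'}, 3: {'name': 'c'}}
```

Because the whole snapshot is rewritten per batch, freshness is bounded by the batch cadence, which is the T+1 weakness the entry above describes.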
PyFlink 1.15 Documentation - A Table belongs to a specific TableEnvironment. It is not possible to combine tables from different TableEnvironments in the same query, e.g. to join or union them. Firstly, you can create a Table from a Python list object [3]: table = table_env...
(36 pages, 266.77 KB, 1 year ago)
PyFlink 1.16 Documentation - A Table belongs to a specific TableEnvironment. It is not possible to combine tables from different TableEnvironments in the same query, e.g. to join or union them. Firstly, you can create a Table from a Python list object [3]: table = table_env...
(36 pages, 266.80 KB, 1 year ago)