Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… disadvantages of each representation? … Vasiliki Kalavri | Boston University 2020 … Reconstitution functions. Insert (append-only): The reconstitution function ins starts with an empty bag and then inserts … single pass over streaming tuples in their arrival order • Small space: memory footprint poly-logarithmic in the stream size • Low time: fast update and query times • Delete-proof: synopses can handle …
0 credits | 45 pages | 1.22 MB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… in the input • k independent and uniformly distributed hash functions, where k << n … [Figure: the Bloom filter, an n-bit array with k hash functions h1, h2, …, hk] … for i=1 to … filter … Adding an element to the filter: the empty filter is initialized to all 0s [Figure: stream element x hashed by h1, h2, …, hk into the n-bit array] … Testing if an element is in the filter [Figure: test element x hashed by h1, h2, …, hk into the n-bit array]: if all bits are set, the element may exist in the set. If at least one element …
0 credits | 74 pages | 1.06 MB | 1 year ago
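The add and test steps described in this snippet can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the class name `BloomFilter`, the method names, and the trick of simulating k hash functions by salting SHA-256 with an index are all assumptions made for the example.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Illustrative Bloom filter: an n-bit array and k hash functions."""

    def __init__(self, n_bits: int, k: int) -> None:
        self.n_bits = n_bits
        self.k = k
        # The empty filter is initialized to all 0s.
        self.bits: List[int] = [0] * n_bits

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: the k hash functions are simulated by salting
        # SHA-256 with the index i of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        # Adding an element sets the bit selected by each hash function.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # All bits set: the element may be in the set.
        # Any bit unset: the element is definitely not in the set.
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

A lookup can return a false positive when other elements happen to set the same bits, but never a false negative for an element that was added.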
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… to use multiple hash functions and combine their estimates: • Using many hash functions for a high-rate stream is expensive • Finding many random and independent hash functions is difficult … unset the corresponding bit if the counter falls to 0 • A single array of counters for all hash functions increases the collision probability • Counter overestimation is almost certain for very large … update different subsets of counters, one per hash table • Many independent trials by using p hash functions with an array of m counters for each of them … The Count-Min Sketch …
0 credits | 69 pages | 630.01 KB | 1 year ago
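The idea of p hash functions, each with its own array of m counters, can be sketched as follows. This is a hedged illustration rather than the lecture's implementation: the class name `CountMinSketch`, its methods, and the salted SHA-256 hashing are assumptions introduced for the example.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Illustrative Count-Min sketch: p hash functions, m counters each."""

    def __init__(self, p: int, m: int) -> None:
        self.p = p
        self.m = m
        # One separate array of m counters per hash function.
        self.tables: List[List[int]] = [[0] * m for _ in range(p)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: independent hash functions are simulated by
        # salting SHA-256 with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def update(self, item: str, count: int = 1) -> None:
        # Each arrival increments one counter in every row.
        for row in range(self.p):
            self.tables[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum over
        # the rows is the tightest available upper bound.
        return min(self.tables[row][self._index(row, item)]
                   for row in range(self.p))
```

With non-negative updates the estimate never undercounts; it overcounts only when an item collides with others in every one of the p rows.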
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… Window functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a … the aggregated value as the result. • ReduceFunction and AggregateFunction • Full window functions collect all elements of a window and iterate over the list of all collected elements when evaluated: • They require more space but support more complex logic. • ProcessWindowFunction … val minTempPerWindow: DataStream[(String, Double)] …
0 credits | 35 pages | 444.84 KB | 1 year ago
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… nodes. In practice, each node is mapped to multiple points on the ring using multiple hash functions. Why? … [Figure: consistent hashing ring over the key space 0 to 2^128 with nodes n1, n2, n3, n4] … When a … • It ensures state is … flink.apache.org/features/2017/07/04/flink-rescalable-state.html • Buğra Gedik. Partitioning functions for stateful data parallelism in stream processing. (VLDB Journal 23, 4, 2014). … Lecture …
0 credits | 41 pages | 4.09 MB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
… checkpoint("path/to/persistent/storage") … Stateful Stream Operations ▶ Spark API proposes two functions for stateful processing: ▶ updateStateByKey • It is executed on the whole range of keys in DStream … proportional to the size of the batch. …
0 credits | 113 pages | 1.22 MB | 1 year ago
PyFlink 1.15 Documentation
PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & … GenericJdbcSinkFunction.open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators …
0 credits | 36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & … GenericJdbcSinkFunction.open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators …
0 credits | 36 pages | 266.80 KB | 1 year ago
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… workers at random and send the item to the least loaded of those two • the system uses two hash functions, $H_1$ and $H_2$, and checks the load of the two sampled workers: $P(k) = \arg\min_i \left( L_i(t) : H_1(k) = i \lor H_2(k) = i \right)$ … state needs to be merged to produce the final result: the computation must consist of combinable functions • workers need to be able to compute their current load locally …
0 credits | 31 pages | 1.47 MB | 1 year ago
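The routing rule in this snippet, which sends each key to the less loaded of the two workers chosen by two hash functions, can be sketched as below. The function names `candidates` and `route`, the salted SHA-256 stand-ins for the two hash functions, and the use of a plain list of item counts as the load estimate are assumptions for illustration only.

```python
import hashlib
from typing import List, Tuple


def _hash(salt: str, key: str, n_workers: int) -> int:
    # Assumption: the two hash functions are simulated by salting SHA-256.
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def candidates(key: str, n_workers: int) -> Tuple[int, int]:
    # The two candidate workers for a key, one per hash function.
    return _hash("H1", key, n_workers), _hash("H2", key, n_workers)


def route(key: str, loads: List[int]) -> int:
    # Pick the candidate with the smaller current load, then record
    # the new item against that worker.
    c1, c2 = candidates(key, len(loads))
    target = c1 if loads[c1] <= loads[c2] else c2
    loads[target] += 1
    return target
```

Because the same key can land on either candidate, each worker holds only partial state per key, which is why the snippet requires the computation to consist of combinable functions.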
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… distributed across all parallel tasks of the function’s operator. Keyed state can only be used by functions that are applied on a KeyedStream: • When the processing method of a function with keyed input …
0 credits | 24 pages | 914.13 KB | 1 year ago
13 results in total













