Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
… to keep all users in memory? … We can use a hash function h to hash the user name (or IP) and select queries only when h(user) = 0. In general: We can … example, to get a 30% sample: • use 10 buckets, b0, b1, …, b9. • select the query if the user hash value is in b0, b1, or b2. How can we limit the sample size from growing indefinitely? …
0 credits | 74 pages | 1.06 MB | 1 year ago

Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
Example use-case: Distinct users visiting one or multiple webpages. Naive solution: maintain a hash table. … How can we count the number of distinct elements … Convert the stream into a multi-set of uniformly distributed random numbers using a hash function. … The more different elements we encounter in the stream, the more different hash values we shall see. …
0 credits | 69 pages | 630.01 KB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
Consistent hashing: Nodes and data are mapped to a ring using the same hash function. [Figure: hash ring over the range 0 to 2^128 with nodes n1, n2, n3, n4 and elements ei, ek placed by h] … nodes. In practice, each node is mapped to multiple points on the ring using multiple hash functions. … When …
0 credits | 41 pages | 4.09 MB | 1 year ago

Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
Key partitioning (round-robin, hash-based) [Figure: workers w1, w2, w3] • Items are perfectly balanced among workers • No routing table required • Key semantics … workers at random and send the item to the least loaded of those two • the system uses two hash functions, H1 and H2, and checks the load of the two sampled workers: P(k) = arg min_i (L_i(t) : H1(k) = i ∨ … state needs to be merged to produce the final result: the computation must consist of combinable functions • workers need to be able to compute their current load locally …
0 credits | 31 pages | 1.47 MB | 1 year ago

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… plans, e.g. order of operators • scheduling and placement decisions • different algorithms, e.g. hash-based vs. broadcast join • What does performance depend on? • input data, intermediate data …
0 credits | 54 pages | 2.83 MB | 1 year ago

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
Window functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a … the aggregated value as the result. • ReduceFunction and AggregateFunction • Full window functions collect all elements of a window and iterate over the list of all collected elements when evaluated: • They require more space but support more complex logic. • ProcessWindowFunction … val minTempPerWindow: DataStream[(String, Double)] …
0 credits | 35 pages | 444.84 KB | 1 year ago

Scalable Stream Processing - Spark Streaming and Flink
… checkpoint("path/to/persistent/storage") … Stateful Stream Operations ▶ Spark API proposes two functions for stateful processing: ▶ updateStateByKey • It is executed on the whole range of keys in DStream … proportional to the size of the batch. …
0 credits | 113 pages | 1.22 MB | 1 year ago

PyFlink 1.15 Documentation
PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & … GenericJdbcSinkFunction.open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators…
0 credits | 36 pages | 266.77 KB | 1 year ago

PyFlink 1.16 Documentation
PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & … GenericJdbcSinkFunction.open(GenericJdbcSinkFunction.java:52) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:34) at org.apache.flink.streaming.api.operators…
0 credits | 36 pages | 266.80 KB | 1 year ago

State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… distributed across all parallel tasks of the function’s operator. Keyed state can only be used by functions that are applied on a KeyedStream: • When the processing method of a function with keyed input …
0 credits | 24 pages | 914.13 KB | 1 year ago

14 results in total