Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020table Convert the stream into a multi-set of uniformly distributed random numbers using a hash function. ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements multi-set of uniformly distributed random numbers using a hash function. ??? Vasiliki Kalavri | Boston University 2020 Let h be a hash function that maps each stream element into M = log2N bits, where N number of distinct elements is equal to: ??? Vasiliki Kalavri | Boston University 2020 The hash function h hashes x to any of N values with probability 1/N. Out of all x we hash: • around 50% will0 码力 | 69 页 | 630.01 KB | 1 年前3
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020produces a WindowedStream (or All WindowedStream if applied on a non-keyed DataStream). • A window function is applied on a WindowedStream (or AllWindowedStream) and processes the elements assigned to a specify the window function // define a non-keyed window-all operator stream .windowAll(...) // specify the window assigner .reduce/aggregate/process(...) // specify the window function 6 Keyed vs. non-keyed seconds(1)) .process(new TemperatureAverager) 9 Tumbling window example Window assigner Window function Vasiliki Kalavri | Boston University 2020 overlapping buckets of fixed size fixed length slide0 码力 | 35 页 | 444.84 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flink(2/4) ▶ map • Returns a new DStream by passing each element of the source DStream through a given function. ▶ flatMap • Similar to map, but each input item can be mapped to 0 or more output items. ▶ filter (2/4) ▶ map • Returns a new DStream by passing each element of the source DStream through a given function. ▶ flatMap • Similar to map, but each input item can be mapped to 0 or more output items. ▶ filter (2/4) ▶ map • Returns a new DStream by passing each element of the source DStream through a given function. ▶ flatMap • Similar to map, but each input item can be mapped to 0 or more output items. ▶ filter0 码力 | 113 页 | 1.22 MB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020need to keep all users in memory? ??? Vasiliki Kalavri | Boston University 2020 We can use a hash function h to hash the user name (or IP) and select queries only when h(user) = 0. 13 In general: We hash value is in b0, b1, or b2. ??? Vasiliki Kalavri | Boston University 2020 We can use a hash function h to hash the user name (or IP) and select queries only when h(user) = 0. 13 In general: We in the input • k independent and uniformly distributed hash functions, where k << n The Bloom filter n bits h1 h2 hk … k hash functions ??? Vasiliki Kalavri | Boston University 2020 25 for i=1 to0 码力 | 74 页 | 1.06 MB | 1 年前3
PyFlink 1.15 Documentationlisting Tables, Conversion between Table and DataStream, etc. • User-defined function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries • Job configuration data 0 1 Hi Applying a Function on Table PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations RecordBatch.from_arrays(arrays, schema) [21]: _c0 0 2 1 3 [ ]: # use the Python function in SQL API table_env.create_temporary_function("plus_one", plus_one) table_env.sql_query("SELECT plus_one(id) FROM {}"0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationlisting Tables, Conversion between Table and DataStream, etc. • User-defined function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries • Job configuration data 0 1 Hi Applying a Function on Table PyFlink supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations RecordBatch.from_arrays(arrays, schema) [21]: _c0 0 2 1 3 [ ]: # use the Python function in SQL API table_env.create_temporary_function("plus_one", plus_one) table_env.sql_query("SELECT plus_one(id) FROM {}"0 码力 | 36 页 | 266.80 KB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020communication: load in terms of flow size in the input channel of each parallel task • Partitioning function performance • space required to implement routing • lookup cost • Migration performance using the same hash function. Consistent hashing ??? Vasiliki Kalavri | Boston University 2020 n1 n3 n2 0 2128 Nodes and data are mapped to a ring using the same hash function. ei:h Consistent Boston University 2020 n1 n3 n2 0 2128 Nodes and data are mapped to a ring using the same hash function. ei: h ek: h Consistent hashing ??? Vasiliki Kalavri | Boston University 2020 0 码力 | 41 页 | 4.09 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020the state via a Flink program. • The keys are ordered according to a user-specified comparator function. Basic operations • Get(key): fetch a single key-value from the DB • Put(key, val): insert the state and the data types of the state: • The state name is scoped to the operator so that a function can have more than one state object by registering multiple state descriptors. • The data types instance. • The keyed state instances of a function are distributed across all parallel tasks of the function’s operator. Keyed state can only be used by functions that are applied on a KeyedStream:0 码力 | 24 页 | 914.13 KB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Boston University 2020 Window types (I) • Time-based (logical) windows define their contents as a function of time. • average price of items bought within the last 5 minutes • Count-based (physical) downstream operators. • Group by / Partition Operators split a stream into sub-streams according to a function or the event contents. • one stream per customer Id • round-robin assignment 19 Vasiliki Kalavri 28 Vasiliki Kalavri | Boston University 2020 What functions on streams can be expressed using non-blocking operators? Proposition: A function F(S) on a sequence S can be computed using a non- blocking0 码力 | 53 页 | 532.37 KB | 1 年前3
Streaming in Apache FlinkInteger.toString(this.startCell) + "," + Integer.toString(this.endCell); } } Map Function DataStreamrides = env.addSource(new TaxiRideSource(...)); DataStream map(TaxiRide taxiRide) throws Exception { return new EnrichedRide(taxiRide); } } FlatMap Function public static class NYCEnrichment implements FlatMapFunction { @Override your cluster grows and shrinks • queryable: Flink state can be queried via a REST API Rich Functions • open(Configuration c) • close() • getRuntimeContext() DataStream > 0 码力 | 45 页 | 3.00 MB | 1 年前3
共 15 条
- 1
- 2













