Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 20204 Vasiliki Kalavri | Boston University 2020 Select IStream(*) From S1 [Rows 5], S2 [Rows 10] Where S1.A = S2.A Last 5 elements of stream S1 and last 10 elements of S2 stream-to-relation relation-to-relation Boston University 2020 Model and formalization (I) A stream is a sequence of unbounded length, where tuples are ordered by their arrival time. Sequence: Let t1, … ,tn be tuples from a relation R. that produced till step k. 27 Vasiliki Kalavri | Boston University 2020 A null operator N is one where N(S) = [ ] for every S. A non-null operator G is • blocking, when for every sequence S of length0 码力 | 53 页 | 532.37 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and FlinkH. Payberah payberah@kth.se 05/10/2018 The Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. ▶ countByValue • Returns a new DStream of (K, Long) pairs where the value of each key is its new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. ▶ countByValue • Returns a new DStream of (K, Long) pairs where the value of each key is its0 码力 | 113 页 | 1.22 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020When to shed load? • detect overload quickly to avoid latency increase • monitor input rates • Where in the query plan? • dropping at the sources vs. dropping at bottleneck operators • How much load Kalavri | Boston University 2020 Reacting to overload • Where in the query plan to drop tuples, which tuples, and how many • The question of where is equivalent to placing special drop operators in the known aggregation functions, results can be scaled using approximate query processing techniques, where accuracy is measured in terms of relative error in the computed query answers. 17 ??? Vasiliki0 码力 | 43 页 | 2.42 MB | 1 年前3
PyFlink 1.15 Documentationsummarizes the basic steps required to setup and get started with PyFlink. There are live notebooks where you can try PyFlink out without any other step: • Live Notebook: Table • Live Notebook: DataStream PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible and useful when it’s not the Python environments in advance on the cluster nodes or when there are some special requirements where the pre-installed Python environments could not meet. ./bin/flink run \ --jobmanager:8081 0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationsummarizes the basic steps required to setup and get started with PyFlink. There are live notebooks where you can try PyFlink out without any other step: • Live Notebook: Table • Live Notebook: DataStream PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible and useful when it’s not the Python environments in advance on the cluster nodes or when there are some special requirements where the pre-installed Python environments could not meet. ./bin/flink run \ --jobmanager:8081 0 码力 | 36 页 | 266.80 KB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020IP connections that are currently active 10 The vector is updated by a continuous stream events where the jth update has the general form (k, c[j]) and modifies the kth entry of A with the operation e.g. a sequence of (finite) relation states over a common schema R: [r1(R), r2(R), ..., ], where the individual relations are unordered sets. src dest bytes 1 2 20K 2 5 32K 1 2 28K {(1, 2, 20K) • as the concatenation of serializations of the relations. • as a list of tuple-index pairs, whereindicates that t ∈ rj • as a serialization of r1 followed by a series of delta tuples that 0 码力 | 45 页 | 1.22 MB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020At all times, we want the following property to hold: an element is in S with probability s/n, where n is the total number of stream elements seen so far. ??? Vasiliki Kalavri | Boston University 2020 At all times, we want the following property to hold: an element is in S with probability s/n, where n is the total number of stream elements seen so far. As if we could keep all n elements and at 24 • A bit array of size n, where n is generally higher than the expected number of elements in the input • k independent and uniformly distributed hash functions, where k << n The Bloom filter n0 码力 | 74 页 | 1.06 MB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020stocks priced between $20 and $200, where the spread between the high tick and the low tick over the past 30 minutes is greater than 3% of the last price, and where in the last 5 minutes the average0 码力 | 34 页 | 2.53 MB | 1 年前3
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Boston University 2020 Let h be a hash function that maps each stream element into M = log2N bits, where N is the domain of input elements: For each element x, let rank(x) be the number of 0s in the (i0i1 . . . iM−1)2, ik ∈ {0,1} j = (i0i1 . . . ip−1)2 For we select one of m counters COUNT[j], where ??? Vasiliki Kalavri | Boston University 2020 11 Stochastic averaging: example Let M = 5, p = 20 码力 | 69 页 | 630.01 KB | 1 年前3
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 20204 Vasiliki Kalavri | Boston University 2020 • Processing time • the time of the local clock where an event is being processed • a processing-time window wouldn’t account for game activity while0 码力 | 22 页 | 2.22 MB | 1 年前3
共 14 条
- 1
- 2













