Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  does efficient mean in the context of streaming? • queries run continuously • streams are unbounded • In traditional ad-hoc database queries, the query plan is generated on-the-fly. Different plans • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from a small set of sources • Operator elimination (SOSP ’13). • Fabian Hueske, and Vasiliki Kalavri. Stream Processing with Apache Flink. (O’Reilly Media ’19). Lecture references Vasiliki Kalavri | Boston University 2020 • Re-ordering • Shivnath
  54 pages | 2.83 MB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  • Queries are executed against the synopsis rather than the entire dataset. Synopsis: a lossy, compact summary of the input stream input stream synopsis maintenance component user queries approximate compute the statistical variance of this series? Can this synopsis be used to answer general queries? • the sum of all the values • the sum of the squares of the values • the number of observations Vasiliki Kalavri | Boston University 2020 Synopses provide accurate estimations • For many queries, an exact answer would require storing and analyzing the entire dataset • Instead, we can relax
  74 pages | 1.06 MB | 1 year ago
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Three classes of operators: • relation-to-relation: similar to standard SQL and define queries over tables. • stream-to-relation: define tables by selecting portions of a stream. • relation-to-stream: in R at time τ. Vasiliki Kalavri | Boston University 2020 Imperative language: Aurora SQuAl Queries are represented in graphical representation using boxes and arrows Tumble Window Tumble Window What kind of queries can we express and support on data streams? Non-blocking (monotonic) queries are the only continuous queries that can be supported
  53 pages | 532.37 KB | 1 year ago
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Scheduler QoS Monitor Load Shedder Query Execution Engine Qm Q2 Q1 Ad-hoc or continuous queries Input streams … Vasiliki Kalavri | Boston University 2020 Load shedding decisions • When the source avoids wasting work but it might affect results of multiple queries if the source is connected to multiple queries. Load Shedding Road Map can lead to system instability or unnecessary load shedding. • In window-aware load shedding, queries need to define a batch size: an application-specific maximum tolerance to gaps. • This parameter
  43 pages | 2.42 MB | 1 year ago
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous queries • sequential data access relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively high low Vasiliki Kalavri | Boston University 2020 Traditional stream on integers? • The number of distinct users who have visited a website? • The top-10 queries inserted in a search engine? • The connected components of accounts in a stream of financial
  45 pages | 1.22 MB | 1 year ago
Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  the resources of the targeted system by sending a large number of queries from a botnet • Group queries by their top-level domain and investigate most popular domains • Alert if we detect many different
  69 pages | 630.01 KB | 1 year ago
PyFlink 1.15 Documentation
  function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries • Job configuration • Python dependency management • Job submission For more details of how to
  36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation
  function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries • Job configuration • Python dependency management • Job submission For more details of how to
  36 pages | 266.80 KB | 1 year ago
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
  from the following sources: • Martin Kleppmann. Designing data-intensive applications (O’Reilly Media) • Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. The many
  33 pages | 700.14 KB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
  Summary References ▶ M. Zaharia et al., “Spark: The Definitive Guide”, O’Reilly Media, 2018 - Chapters 20-23. ▶ M. Zaharia et al., “Discretized Streams: An Efficient and Fault-Tolerant
  113 pages | 1.22 MB | 1 year ago
11 results in total













