Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (54 pages, 2.83 MB)
Query optimization (I): identify the most efficient way to execute a query.
• There may exist several ways to execute a computation: query plans (e.g., the order of operators) and scheduling and placement decisions.
• How can we estimate the cost of different strategies, before execution or during runtime?
Query optimization (II), optimization strategies:
• enumerate …
• decrease latency, increase throughput
• minimize monetary costs (if running in the cloud)
Cost-based optimization: Parsed …
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (43 pages, 2.42 MB)
Load shedding as an optimization problem:
• N: query network
• I: set of input streams with known arrival rates
• C: system processing capacity
• H: headroom
The load shedder continuously monitors input rates or other system metrics and can access information about the running query plan. It detects overload and decides what actions to take in order to maintain acceptable latency. Fast approximate answers …
[Figure: DSMS architecture with input streams S1…Sr feeding an Input Manager and a Query Execution Engine, coordinated by a Scheduler, a QoS Monitor, and a Load Shedder, serving ad-hoc or continuous queries Q1…Qm.]
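The variables above suggest the classic constrained-optimization view of load shedding; the following is a hedged sketch of that formulation (the plan space plans(N) and the utility function U follow the Aurora-style model and are not spelled out in the snippet above):

\[
  N^{*} \;=\; \underset{N' \,\in\, \mathrm{plans}(N)}{\arg\max}\; U(N', I)
  \qquad \text{subject to} \qquad \mathrm{Load}(N', I) \;\le\; H \cdot C
\]

Here plans(N) denotes the variants of N obtained by inserting drop operators, Load(N', I) is the processing demand of N' under the known arrival rates of I, and H · C is the fraction of the system's capacity the shedder is willing to use.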
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (45 pages, 1.22 MB)
Traditional DBMS vs. DSMS:
• data access: … vs. single-pass
• updates: arbitrary vs. append-only
• update rates: relatively low vs. high, bursty
• processing model: query-driven (pull-based) vs. data-driven (push-based)
• queries: ad-hoc vs. continuous
• latency: relatively high vs. low
Derived stream: produced by a continuous query and its operators, e.g., the total traffic from a source every minute.
ins_r(P:i) = insert(i, {j | j ∈ ins_r(P) ∧ j.A ≠ i.A}), i.e., inserting tuple i removes (replaces) any existing tuple that agrees with i on attribute A.
Query processing challenges:
• Memory requirements: we cannot store the whole stream history.
• Data rate: …
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (33 pages, 700.14 KB)
… subscription:
• A logical producer/consumer can be implemented by multiple physical tasks running in parallel.
• If a producer generates events at a high rate, we can balance the load by spawning several …
Communication patterns (II), fan-out: several logical consumers (possibly implemented by several parallel physical processes) can subscribe to the same topic, so that the message broker delivers each message to all of them.
Message brokers vs. databases: … search, while MBs only offer topic-based subscription; DB query results depend on a snapshot, and clients are not notified if their query result changes later.
Message delivery and ordering: acknowledgements …
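As a concrete illustration of the two patterns above, here is a minimal consumer sketch using Apache Kafka (introduced later in the course) as the message broker; the broker address, topic name, group id, and class name are made-up placeholders. Consumers that share a group.id split the topic's partitions among themselves (load balancing), while consumers with different group ids each receive the full stream (fan-out).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClickConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        // Consumers sharing this group id split the topic's partitions among themselves
        // (load balancing); a consumer with a different group id receives every message (fan-out).
        props.put("group.id", "clicks-analytics");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("clicks"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}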
State management - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (24 pages, 914.13 KB)
… approach?
State operations and types: copy, checkpoint, restore, merge, split, query, subscribe, … Consider you are designing a state interface: what …
Operator state is scoped to an operator task, i.e., records processed by the same parallel task have access to the same state; it cannot be accessed by other parallel tasks of the same or of a different operator.
Keyed state: for each key, the operator maintains one state instance. The keyed state instances of a function are distributed across all parallel tasks of the function's operator. Keyed state can only be used by functions that are applied on a keyed stream.
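A minimal sketch of keyed state in Flink's DataStream API, along the lines described above; the class name, the state name "count", and the running count-per-key logic are illustrative choices, not taken from the slides.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits a running count per key. The ValueState is keyed state: each distinct key
// gets its own counter, and the instances are distributed across the parallel tasks.
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // Register the state under a name and type; Flink scopes it to the current key.
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();           // state of the current record's key
        long updated = (current == null) ? 1L : current + 1L;
        count.update(updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}
// Usage (keyed state requires a keyed stream): events.keyBy(t -> t.f0).flatMap(new CountPerKey())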
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (41 pages, 4.09 MB)
… Restart it with an adjusted parallelism:
• The state is automatically redistributed to the new set of parallel tasks.
• For exactly-once results, we need to prevent a checkpoint from completing after the savepoint …
When scaling stateful operators, state needs to be repartitioned and assigned to more or fewer parallel tasks. Scaling different types of state:
• Operators with keyed state are scaled by repartitioning …
• Existing state for a particular key and all future events with this key must be routed to the same parallel instance; some kind of hashing is typically used.
• Maintaining routing tables or an index for …
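A hedged sketch of the savepoint-and-restart procedure above using Flink's command-line client; the job id, savepoint directory and path, jar name, and the new parallelism are placeholders, and flag details can differ across Flink versions.

# 1. Take a savepoint of the running job (job id and target directory are placeholders).
./bin/flink savepoint a1b2c3d4e5f6 hdfs:///flink/savepoints

# 2. Cancel the job once the savepoint has completed.
./bin/flink cancel a1b2c3d4e5f6

# 3. Resubmit the same program from the savepoint with an adjusted parallelism;
#    the state is redistributed to the new set of parallel tasks on restore.
./bin/flink run -s hdfs:///flink/savepoints/savepoint-a1b2c3-0123456789ab -p 8 my-streaming-job.jar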
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (72 pages, 7.77 MB)
• Similar challenges exist for a data-parallel implementation of spanners.
• How to represent the spanner? As an adjacency list? In which state … every incoming edge? Can we compute the distances in separate partitions and then merge them?
• Data-parallel streaming spanners on Flink?
• McGregor, Andrew …
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (26 pages, 3.33 MB)
… using env.setParallelism() in your application.
taskmanager.numberOfTaskSlots: the number of parallel operator or user function instances that a single TaskManager can run. This value is typically …
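A short sketch of where the two settings above apply, with arbitrary example values: taskmanager.numberOfTaskSlots is a cluster-side entry in flink-conf.yaml, while the application can set a default parallelism on the execution environment and override it per operator.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        // Cluster side, in flink-conf.yaml:  taskmanager.numberOfTaskSlots: 4
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);                       // default parallelism for all operators

        DataStream<String> words = env
                .fromElements("stream", "processing", "with", "flink")
                .map(String::toUpperCase)
                .setParallelism(2);                  // per-operator override

        words.print();
        env.execute("parallelism example");
    }
}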
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (35 pages, 444.84 KB)
… applied on a keyed or a non-keyed stream:
• Window operators on keyed windows are evaluated in parallel.
• Non-keyed windows are processed in a single thread.
To create a window operator, you need to …
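A minimal sketch of a keyed window operator in Flink's DataStream API; the Tuple2 input, the 10-second tumbling processing-time window, and the summing reduce function are illustrative choices (a non-keyed variant would use windowAll and run in a single thread, as noted above).

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedSum {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> sums = env
                .fromElements(Tuple2.of("sensor-1", 3L), Tuple2.of("sensor-2", 7L),
                              Tuple2.of("sensor-1", 5L))
                // keyBy produces keyed windows, evaluated in parallel per key
                .keyBy(t -> t.f0)
                // window assigner: tumbling 10-second processing-time windows
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                // window function: sum the values of each key within the window
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

        sums.print();
        env.execute("windowed sum");
    }
}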
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (49 pages, 2.08 MB)
Active standby:
• The secondary receives tuples from upstream and processes them in parallel with the primary, but it doesn't output results.
• Watermarks are used to identify duplicate …
共 19 条
- 1
- 2
相关搜索词













