 State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/25: State Management Vasiliki Kalavri | Boston University 2020 Logic State State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/25: State Management Vasiliki Kalavri | Boston University 2020 Logic State- <#Brexit, 520> <#WorldCup, 480> Keyed state is scoped to a key defined in the operator’s input records • Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains maintains the state for this key • State access is automatically scoped to the key of the current record so that all records with the same key access the same state State management in Apache Flink 5 Vasiliki 0 码力 | 24 页 | 914.13 KB | 1 年前3
 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020DBMS SDW DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous materialized view updates • pre-aggregated, pre-processed streams and historical data Data Management Approaches 4 storage analytics static data streaming data Vasiliki Kalavri | Boston University make to obtain r2, r3, ..., etc. • as a replacement sequence where some attribute A denotes a key and an arriving tuple t replaces any existing tuple with the same t(A) value to form a new relation0 码力 | 45 页 | 1.22 MB | 1 年前3 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020DBMS SDW DSMS Database Management System • ad-hoc queries, data manipulation tasks • insertions, updates, deletions of single row or groups of rows Data Stream Management System • continuous materialized view updates • pre-aggregated, pre-processed streams and historical data Data Management Approaches 4 storage analytics static data streaming data Vasiliki Kalavri | Boston University make to obtain r2, r3, ..., etc. • as a replacement sequence where some attribute A denotes a key and an arriving tuple t replaces any existing tuple with the same t(A) value to form a new relation0 码力 | 45 页 | 1.22 MB | 1 年前3
 Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020migration if the state is large • Progressive • move state to be migrated in smaller pieces, e.g. key-by-key • can be used to interleave state transfer with processing • migration duration might increase State is scoped to a single task • Each stateful task is responsible for processing and state management 31 ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart state migration • State State is scoped to a single task • Each stateful task is responsible for processing and state management 31 block channels and upstream operators ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart0 码力 | 93 页 | 2.42 MB | 1 年前3 Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020migration if the state is large • Progressive • move state to be migrated in smaller pieces, e.g. key-by-key • can be used to interleave state transfer with processing • migration duration might increase State is scoped to a single task • Each stateful task is responsible for processing and state management 31 ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart state migration • State State is scoped to a single task • Each stateful task is responsible for processing and state management 31 block channels and upstream operators ??? Vasiliki Kalavri | Boston University 2020 Pause-and-restart0 码力 | 93 页 | 2.42 MB | 1 年前3
 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020optimizations • plan translation alternatives • Runtime optimizations • load management, scheduling, state management • Optimization semantics, correctness, profitability Topics covered in this lecture have seen • windows, continuous aggregations, distinct… • State is commonly partitioned by key • State can be cleared based on watermarks or punctuations • window fires, post becomes inactive map(String key, String value): // key: document name // value: document contents for each URL u in value: EmitIntermediate(u, "1"); reduce(String key, Iterator values): // key: a URL0 码力 | 54 页 | 2.83 MB | 1 年前3 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020optimizations • plan translation alternatives • Runtime optimizations • load management, scheduling, state management • Optimization semantics, correctness, profitability Topics covered in this lecture have seen • windows, continuous aggregations, distinct… • State is commonly partitioned by key • State can be cleared based on watermarks or punctuations • window fires, post becomes inactive map(String key, String value): // key: document name // value: document contents for each URL u in value: EmitIntermediate(u, "1"); reduce(String key, Iterator values): // key: a URL0 码力 | 54 页 | 2.83 MB | 1 年前3
 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020minimize performance disruption, e.g. latency spikes • avoid introducing load imbalance • Resource management • utilization, isolation • Automation • continuous monitoring • bottleneck detection location for each key in the checkpoint, so that tasks locate and read the matching keys only • Avoids reading irrelevant data • Requires a materialized index for all keys, i.e. a key-to-read-offset Reconfiguring keyed stateful operators requires preserving the key semantics: • Existing state for a particular key and all future events with this key must be routed to the same parallel instance • Some0 码力 | 41 页 | 4.09 MB | 1 年前3 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020minimize performance disruption, e.g. latency spikes • avoid introducing load imbalance • Resource management • utilization, isolation • Automation • continuous monitoring • bottleneck detection location for each key in the checkpoint, so that tasks locate and read the matching keys only • Avoids reading irrelevant data • Requires a materialized index for all keys, i.e. a key-to-read-offset Reconfiguring keyed stateful operators requires preserving the key semantics: • Existing state for a particular key and all future events with this key must be routed to the same parallel instance • Some0 码力 | 41 页 | 4.09 MB | 1 年前3
 Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020in a period during which a user was active 17 Vasiliki Kalavri | Boston University 2020 Flow Management Operators (I) • Join operators merge two streams by matching elements satisfying a condition blocking and must be defined over a window 18 Vasiliki Kalavri | Boston University 2020 Flow Management Operators (II) • Duplicate/Copy Operator replicates a stream, commonly to be used as input to Z.Event = ‘cancel’ AND Z.ItemID = Y.ItemID Partitions the stream into substreams according to a key A sequence of events that immediately follow one another AS PATTERN (X V* Y W* Z) • Match zero0 码力 | 53 页 | 532.37 KB | 1 年前3 Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020in a period during which a user was active 17 Vasiliki Kalavri | Boston University 2020 Flow Management Operators (I) • Join operators merge two streams by matching elements satisfying a condition blocking and must be defined over a window 18 Vasiliki Kalavri | Boston University 2020 Flow Management Operators (II) • Duplicate/Copy Operator replicates a stream, commonly to be used as input to Z.Event = ‘cancel’ AND Z.ItemID = Y.ItemID Partitions the stream into substreams according to a key A sequence of events that immediately follow one another AS PATTERN (X V* Y W* Z) • Match zero0 码力 | 53 页 | 532.37 KB | 1 年前3
 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020publisher subscriber notify() subscriber notify() subscriber notify() Subscription management Event Service publish publish notify() subscribe() unsubscribe() subscribe notify unsubscribe background) process that searches for log records with the same key and merges the records by only keeping the most recent value for each key. • A key can also be completely removed if it is assigned a special0 码力 | 33 页 | 700.14 KB | 1 年前3 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020publisher subscriber notify() subscriber notify() subscriber notify() Subscription management Event Service publish publish notify() subscribe() unsubscribe() subscribe notify unsubscribe background) process that searches for log records with the same key and merges the records by only keeping the most recent value for each key. • A key can also be completely removed if it is assigned a special0 码力 | 33 页 | 700.14 KB | 1 年前3
 PyFlink 1.15 DocumentationIt’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes0 码力 | 36 页 | 266.77 KB | 1 年前3 PyFlink 1.15 DocumentationIt’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes0 码力 | 36 页 | 266.77 KB | 1 年前3
 PyFlink 1.16 DocumentationIt’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes0 码力 | 36 页 | 266.80 KB | 1 年前3 PyFlink 1.16 DocumentationIt’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details. Create a virtual environment using virtualenv To create a virtual environment Submitting PyFlink jobs for more details. 1.1.1.4 YARN Apache Hadoop YARN is a cluster resource management framework for managing the resources and scheduling jobs in a Hadoop cluster. It’s supported to popular container-orchestration system for automating computer application deployment, scaling, and management. This page shows you how to set up Python environment and exeucte PyFlink jobs in a Kubernetes0 码力 | 36 页 | 266.80 KB | 1 年前3
 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Algorithms Architecture and design Scheduling and load management Scalability and elasticity Fault-tolerance and guarantees State management Operator semantics Window optimizations Filtering experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 2020 Important dates Deliverable recommendations of products, articles, people 26 Vasiliki Kalavri | Boston University 2020 Online traffic management • Analysis of real-time vehicle locations to improve traffic conditions • Provide real-time scheduling0 码力 | 34 页 | 2.53 MB | 1 年前3 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Algorithms Architecture and design Scheduling and load management Scalability and elasticity Fault-tolerance and guarantees State management Operator semantics Window optimizations Filtering experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 2020 Important dates Deliverable recommendations of products, articles, people 26 Vasiliki Kalavri | Boston University 2020 Online traffic management • Analysis of real-time vehicle locations to improve traffic conditions • Provide real-time scheduling0 码力 | 34 页 | 2.53 MB | 1 年前3
共 21 条
- 1
- 2
- 3













