State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020 (24 pages, 914.13 KB, 1 year ago)
State backends are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system. Available state backends in Flink: • in-memory accesses and fault tolerance, but limited to the TaskManager's memory and might suffer from GC pauses. Which backend to choose? • RocksDBStateBackend: stores all state … for applications with very large state. RocksDB is an LSM-tree storage engine with a key/value interface, where keys …
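A minimal sketch, assuming the Flink 1.x DataStream API in Scala, of how a job might select the RocksDB backend described above; the checkpoint URI and the incremental-checkpointing flag are placeholders, not values from the lecture.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    object BackendChoice {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Keep working state in embedded RocksDB instances on the TaskManagers (on disk,
        // so state can exceed memory) and checkpoint to a remote, persistent filesystem.
        // The URI is a placeholder; `true` enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))
      }
    }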
 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support • Data streaming from various processes or devices • a residential sensor can stream data to backend servers hosted in the cloud. 24 A publisher application creates a topic and sends messages0 码力 | 33 页 | 700.14 KB | 1 年前3 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support • Data streaming from various processes or devices • a residential sensor can stream data to backend servers hosted in the cloud. 24 A publisher application creates a topic and sends messages0 码力 | 33 页 | 700.14 KB | 1 年前3
 Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion University 2020 32 Epoch-Based Stream Execution Logged Input Streams Committed Output Streams Stable Storage ⇧epi Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion University 2020 32 Epoch-Based Stream Execution Logged Input Streams Committed Output Streams Stable Storage ⇧epi- AB8nicbVA9T8MwEL2Ur1K+CowsFi0SU5V0AbYKFsYiEajURJHjOq1 onPkoUvUQHeoiXxE0SN6Rq/ozZHOi/PufCxaC04+c4z+wPn8AedOjc= Local State Backend External State Backend ⇧ep2- AB8nicbVA9T8MwEHXKVylf 0 码力 | 81 页 | 13.18 MB | 1 年前3
 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020the spanner? As an adjacency list? which state primitives are suitable? Is RocksDB a suitable backend for graph state? • How to compute the distance between edges? Do we need to do that for every0 码力 | 72 页 | 7.77 MB | 1 年前3 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020the spanner? As an adjacency list? which state primitives are suitable? Is RocksDB a suitable backend for graph state? • How to compute the distance between edges? Do we need to do that for every0 码力 | 72 页 | 7.77 MB | 1 年前3
 Scalable Stream Processing - Spark Streaming and Flinkstateful streams. val ssc = new StreamingContext(conf, Seconds(1)) ssc.checkpoint("path/to/persistent/storage") 45 / 79 Stateful Stream Operations ▶ Spark API proposes two functions for statful processing: since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result result table since the last trigger will be changed in the external storage. • This mode works for output sinks that can be updated in place, such as a MySQL table. 59 / 79 Output Modes ▶ Three output0 码力 | 113 页 | 1.22 MB | 1 年前3 Scalable Stream Processing - Spark Streaming and Flinkstateful streams. val ssc = new StreamingContext(conf, Seconds(1)) ssc.checkpoint("path/to/persistent/storage") 45 / 79 Stateful Stream Operations ▶ Spark API proposes two functions for statful processing: since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result result table since the last trigger will be changed in the external storage. • This mode works for output sinks that can be updated in place, such as a MySQL table. 59 / 79 Output Modes ▶ Three output0 码力 | 113 页 | 1.22 MB | 1 年前3
 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots0 码力 | 41 页 | 4.09 MB | 1 年前3 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots0 码力 | 41 页 | 4.09 MB | 1 年前3
 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University Continuously arriving, possibly unbounded data f read write Complete data accessible in persistent storage 30 Vasiliki Kalavri | Boston University 2020 Consider a set of 1000 sensors deployed in different0 码力 | 34 页 | 2.53 MB | 1 年前3 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University Continuously arriving, possibly unbounded data f read write Complete data accessible in persistent storage 30 Vasiliki Kalavri | Boston University 2020 Consider a set of 1000 sensors deployed in different0 码力 | 34 页 | 2.53 MB | 1 年前3
 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020to address non-determinism • Each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • Retries simply replay the output that has been checkpointed false the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage 20 Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter:0 码力 | 49 页 | 2.08 MB | 1 年前3 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020to address non-determinism • Each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • Retries simply replay the output that has been checkpointed false the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage 20 Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter:0 码力 | 49 页 | 2.08 MB | 1 年前3
 Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a0 码力 | 26 页 | 3.33 MB | 1 年前3 Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a0 码力 | 26 页 | 3.33 MB | 1 年前3
 Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri0 码力 | 43 页 | 2.42 MB | 1 年前3 Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri0 码力 | 43 页 | 2.42 MB | 1 年前3
共 11 条
- 1
- 2
相关搜索词













