State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020 (24 pages, 914.13 KB, 1 year ago)
State backends are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system. Available state backends in Flink: • in-memory accesses and fault tolerance, but limited to the TaskManager's memory and might suffer from GC pauses. Which backend to choose? • RocksDBStateBackend: stores all state … for applications with very large state. RocksDB is an LSM-tree storage engine with a key/value interface, where keys …
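A minimal sketch, assuming the Flink 1.x DataStream API in Scala, of how a job might select the RocksDB backend described above; the checkpoint URI and the incremental-checkpointing flag are placeholders, not values from the lecture.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    object BackendChoice {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Keep working state in embedded RocksDB instances on the TaskManagers (on disk,
        // so state can exceed memory) and checkpoint to a remote, persistent filesystem.
        // The URI is a placeholder; `true` enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))
      }
    }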
 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support • Data streaming from various processes or devices • a residential sensor can stream data to backend servers hosted in the cloud. 24 A publisher application creates a topic and sends messages0 码力 | 33 页 | 700.14 KB | 1 年前3 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support • Data streaming from various processes or devices • a residential sensor can stream data to backend servers hosted in the cloud. 24 A publisher application creates a topic and sends messages0 码力 | 33 页 | 700.14 KB | 1 年前3
 Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion University 2020 32 Epoch-Based Stream Execution Logged Input Streams Committed Output Streams Stable Storage ⇧epi Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion University 2020 32 Epoch-Based Stream Execution Logged Input Streams Committed Output Streams Stable Storage ⇧epi- AB8nicbVA9T8MwEL2Ur1K+CowsFi0SU5V0AbYKFsYiEajURJHjOq1 onPkoUvUQHeoiXxE0SN6Rq/ozZHOi/PufCxaC04+c4z+wPn8AedOjc= Local State Backend External State Backend ⇧ep2- AB8nicbVA9T8MwEHXKVylf 0 码力 | 81 页 | 13.18 MB | 1 年前3
 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020the spanner? As an adjacency list? which state primitives are suitable? Is RocksDB a suitable backend for graph state? • How to compute the distance between edges? Do we need to do that for every0 码力 | 72 页 | 7.77 MB | 1 年前3 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020the spanner? As an adjacency list? which state primitives are suitable? Is RocksDB a suitable backend for graph state? • How to compute the distance between edges? Do we need to do that for every0 码力 | 72 页 | 7.77 MB | 1 年前3
 Scalable Stream Processing - Spark Streaming and Flinkstateful streams. val ssc = new StreamingContext(conf, Seconds(1)) ssc.checkpoint("path/to/persistent/storage") 45 / 79 Stateful Stream Operations ▶ Spark API proposes two functions for statful processing: since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result result table since the last trigger will be changed in the external storage. • This mode works for output sinks that can be updated in place, such as a MySQL table. 59 / 79 Output Modes ▶ Three output0 码力 | 113 页 | 1.22 MB | 1 年前3 Scalable Stream Processing - Spark Streaming and Flinkstateful streams. val ssc = new StreamingContext(conf, Seconds(1)) ssc.checkpoint("path/to/persistent/storage") 45 / 79 Stateful Stream Operations ▶ Spark API proposes two functions for statful processing: since the last trigger will be written to the external storage. 2. Complete: the entire updated result table will be written to external storage. 3. Update: only the rows that were updated in the result result table since the last trigger will be changed in the external storage. • This mode works for output sinks that can be updated in place, such as a MySQL table. 59 / 79 Output Modes ▶ Three output0 码力 | 113 页 | 1.22 MB | 1 年前3
 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots0 码力 | 41 页 | 4.09 MB | 1 年前3 Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots0 码力 | 41 页 | 4.09 MB | 1 年前3
 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University Continuously arriving, possibly unbounded data f read write Complete data accessible in persistent storage 30 Vasiliki Kalavri | Boston University 2020 Consider a set of 1000 sensors deployed in different0 码力 | 34 页 | 2.53 MB | 1 年前3 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University Continuously arriving, possibly unbounded data f read write Complete data accessible in persistent storage 30 Vasiliki Kalavri | Boston University 2020 Consider a set of 1000 sensors deployed in different0 码力 | 34 页 | 2.53 MB | 1 年前3
 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020to address non-determinism • Each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • Retries simply replay the output that has been checkpointed false the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage 20 Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter:0 码力 | 49 页 | 2.08 MB | 1 年前3 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020to address non-determinism • Each output is checkpointed together with its unique ID to stable storage before being delivered to the next stage • Retries simply replay the output that has been checkpointed false the record is not a duplicate • if it returns true, the worker sends a lookup to stable storage 20 Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter:0 码力 | 49 页 | 2.08 MB | 1 年前3
 Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a0 码力 | 26 页 | 3.33 MB | 1 年前3 Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a0 码力 | 26 页 | 3.33 MB | 1 年前3
 Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri0 码力 | 43 页 | 2.42 MB | 1 年前3 Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri0 码力 | 43 页 | 2.42 MB | 1 年前3
共 11 条
- 1
- 2
相关搜索词













