Scalable Stream Processing - Spark Streaming and Flinkapache.spark.streaming._ // Create a local StreamingContext with two working threads and batch interval of 1 second. val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val awaitTermination() 40 / 79 Word Count in Spark Streaming (6/6) val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = start() ssc.awaitTermination() 41 / 79 Word Count with Window val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines =0 码力 | 113 页 | 1.22 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 20202. store in local buffer and possibly update state 3. produce output 5 mi mo Vasiliki Kalavri | Boston University 2020 What is a failure? op 1. receive an event 2. store in local buffer and possibly Vasiliki Kalavri | Boston University 2020 What is a failure? op 1. receive an event 2. store in local buffer and possibly update state 3. produce output 5 mi mo Was mi fully processed? Was mo delivered Vasiliki Kalavri | Boston University 2020 What is a failure? op 1. receive an event 2. store in local buffer and possibly update state 3. produce output What can go wrong: • lost events • duplicate0 码力 | 49 页 | 2.08 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 All data maintained by a task and used to compute results: a local or instance variable that is accessed by a task’s business logic Operator state is scoped to an and maintained. State backends are responsible for: • local state management • checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system • Available Stores state on TaskManager’s heap but checkpoints it to a remote file system • In-memory speed for local accesses and fault tolerance • Limited to TaskManager’s memory and might suffer from GC pauses Which0 码力 | 24 页 | 914.13 KB | 1 年前3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020in-flight data to be completely processed 3. Copy the state of each task to a remote, persistent storage 4. Wait until all tasks have finished their copies 5. Resume processing and stream ingestion consistency A consistent cut satisfies causality: • An event is pre-snapshot if it occurs before the local snapshot on a process, otherwise it is post- snapshot • If event A happens causally before B and University 2020 Completing a snapshot When all processes have received a marker and recorded their local state and ll processes have received markers on all incoming channels and have recorded all channel0 码力 | 81 页 | 13.18 MB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020Flink programs are defined in regular Scala/Java methods Set up the execution environment: local, cluster, I/O, time semantics, parallelism, … Example: Sensor Readings 9 Vasiliki Kalavri | Boston distributed and fault-tolerant publish-subscribe messaging system and serves as the ingestion, storage, and messaging layer for large production streaming pipelines. Kafka is commonly deployed on a0 码力 | 26 页 | 3.33 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020sources. • To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage is required. 21 o1 src o2 back-pressure target: 40 rec/s 10 rec/s 100 rec/s ??? Vasiliki Kalavri they have been consumed and can be re-used. ??? Vasiliki Kalavri | Boston University 2020 24 Local exchange: If both producer and consumer run on the same node the buffer is recycled as soon as it0 码力 | 43 页 | 2.42 MB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020JobGraph and all required metadata, such as the application’s JAR file, into a remote persistent storage system • Zookeeper also holds state handles and checkpoint locations 5 JobManager failures following steps: 1. It requests the storage locations from ZooKeeper to fetch the JobGraph, the JAR file, and the state handles of the last checkpoint from remote storage. 2. It requests processing slots0 码力 | 41 页 | 4.09 MB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Message queues • Asynchronous point-to-point communication • Lightweight buffer for temporary storage • Messages stored on the queue until they are processed and deleted • transactional, timing, and explicitly deleted while MBs delete messages once consumed. • Use a database for long-term data storage! • MBs assume a small working set. If consumers are slow, throughput might degrade. • DBs support0 码力 | 33 页 | 700.14 KB | 1 年前3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020Netbeans with appropriate plugins installed. • gsutil for accessing datasets in Google Cloud Storage. More details: vasia.github.io/dspa20/exercises.html 14 Vasiliki Kalavri | Boston University Continuously arriving, possibly unbounded data f read write Complete data accessible in persistent storage 30 Vasiliki Kalavri | Boston University 2020 Consider a set of 1000 sensors deployed in different0 码力 | 34 页 | 2.53 MB | 1 年前3
PyFlink 1.15 DocumentationPreparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.1.2 Local . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.1.3 Standalone which contains its own Python executable files and the installed Python packages. It is useful for local development to create a standalone Python environment and also useful when deploying a PyFlink job previous page) # -rw-r--r-- 1 dianfu staff 45K 10 18 20:54 flink-dianfu-python-B-7174MD6R-1908. ˓→local.log Besides, you could also check if the files of the PyFlink package are consistent. It may happen0 码力 | 36 页 | 266.77 KB | 1 年前3
共 17 条
- 1
- 2













