Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020Window operators 2 Vasiliki Kalavri | Boston University 2020 object MaxSensorReadings { def main(args: Array[String]) { val env = StreamExecutionEnvironment.getExecutionEnvironment val StreamExecutionEnvironment: Configuring a time characteristic 4 object AverageSensorReadings { def main(args: Array[String]) { // set up the streaming execution environment val env = StreamExecutionEnvironment reduce/aggregate/process(...) // specify the window function // define a non-keyed window-all operator stream .windowAll(...) // specify the window assigner .reduce/aggregate/process(...) // specify the0 码力 | 35 页 | 444.84 KB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinktranslates to operations on the underlying RDDs. 9 / 79 StreamingContext ▶ StreamingContext is the main entry point of all Spark Streaming functionality. ▶ The second parameter, Seconds(1), represents val ssc = new StreamingContext(sc, Seconds(1)) 10 / 79 StreamingContext ▶ StreamingContext is the main entry point of all Spark Streaming functionality. ▶ The second parameter, Seconds(1), represents Applies a function to each RDD generated from the stream. • The function is executed in the driver process. 31 / 79 Output Operations (2/4) ▶ What’s wrong with this code? ▶ This requires the connection0 码力 | 113 页 | 1.22 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020that might be unbounded • we cannot store the entire stream in an accessible way • we have to process stream elements on-the-fly using limited memory 2 Vasiliki Kalavri | Boston University 2020 Properties ETL process complex fast and light-weight ETL: Extract-Transform-Load e.g. unzipping compressed files, data cleaning and standardization 6 Vasiliki Kalavri | Boston University 2020 1. Process events or more base and/or derived streams • Each query (operator) maintains its own state • Queries process raw streams, not synopses => results are typically exact • Challenges: computation progress,0 码力 | 45 页 | 1.22 MB | 1 年前3
PyFlink 1.15 DocumentationDependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.3.1.2 O2: Java gateway process exited before sending its port number . . . . . . . . . . . 22 1.3.2 Usage issues . . . . . . . use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec /pat meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn.ship-files=/path/to/shipfiles 0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 DocumentationDependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.3.1.2 O2: Java gateway process exited before sending its port number . . . . . . . . . . . 22 1.3.2 Usage issues . . . . . . . use. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name=\ -pyclientexec /pat meet. ./bin/flink run-application -t yarn-application \ -Djobmanager.memory.process.size=1024m \ -Dtaskmanager.memory.process.size=1024m \ -Dyarn.application.name= \ -Dyarn.ship-files=/path/to/shipfiles 0 码力 | 36 页 | 266.80 KB | 1 年前3
Streaming in Apache Flink.window() .reduce|aggregate|process( ) stream. .windowAll( ) .reduce|aggregate|process( ) ◦TumblingEventTimeWindows.of(Time input = ... input .keyBy(“key”) .window(TumblingEventTimeWindows.of(Time.minutes(1))) .process(new MyWastefulMax()); public static class MyWastefulMax extends ProcessWindowFunction< SensorReading key type TimeWindow> { // window type @Override public void process( String key, Context context, Iterable events, Collector 0 码力 | 45 页 | 3.00 MB | 1 年前3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020consumers can process events. 2 ??? Vasiliki Kalavri | Boston University 2020 Keeping up with the producers • Producers can generate events in a higher rate than the rate consumers can process events. with the producers • Producers can generate events in a higher rate than the rate consumers can process events. • What happens if consumers cannot keep up with the event rate? • drop messages 2 ?? with the producers • Producers can generate events in a higher rate than the rate consumers can process events. • What happens if consumers cannot keep up with the event rate? • drop messages • buffer0 码力 | 43 页 | 2.42 MB | 1 年前3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020satisfies causality: • An event is pre-snapshot if it occurs before the local snapshot on a process, otherwise it is post- snapshot • If event A happens causally before B and B is pre-snapshot, duplicate messages • Strongly connected execution graph: each process can reach every other process in the system • Single initiating process 18 The Chandy-Lamport Algorithm A snapshot algorithm that interfere with processing • processing and messages do not stop • Each process cast locally record its own state • Any process can initiate the algorithm 19 The Chandy-Lamport Algorithm ??? Vasiliki0 码力 | 81 页 | 13.18 MB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020re-ordering of messages • Re-delivery complicates stream processing and fault-tolerance • might process a message out-of-order or twice 14 How can we avoid this? 15 Publish/Subscribe Systems publisher parallelism: the number of the topic's partitions • Processing delays: If a message is slow to process, this delays processing of subsequent messages, as each partition is read by a single thread throughput and ordering? 31 How long to keep the log? • Log compaction: a (usually background) process that searches for log records with the same key and merges the records by only keeping the most0 码力 | 33 页 | 700.14 KB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .process(new UpdateDisjointSet()) // ephemeral partial state .flatMap(new Merger()) // global state > cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .process(new UpdateDisjointSet()) // ephemeral partial state .flatMap(new Merger()) // global state > cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .process(new UpdateDisjointSet()) // ephemeral partial state .flatMap(new Merger()) // global state0 码力 | 72 页 | 7.77 MB | 1 年前3
共 17 条
- 1
- 2













