Scalable Stream Processing - Spark Streaming and Flink
▶ Continuous vs. micro-batch processing
▶ Record-at-a-time vs. declarative APIs
Outline
▶ Spark Streaming
▶ Flink
Spark Streaming
▶ Design issues
• Continuous vs. micro-batch processing
• Record-at-a-time vs. declarative APIs
▶ Runs a streaming computation as a series of very small, deterministic batch jobs:
• it chops up the live stream into small batches and processes each batch as a regular Spark job.
Output Operations
▶ The function passed to an output operation is executed in the driver process.
▶ What's wrong with creating a connection object in the driver and using it in the output function? It requires the connection object to be serialized and sent from the driver to the workers.
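The serialization problem above can be sketched without Spark: a connection wraps OS-level resources that pickle cannot ship from the driver to a worker, so the connection must be created inside the task instead. The names below are illustrative, not Spark's API.

```python
import pickle
import threading

class Connection:
    """Stands in for a DB/network connection: real connections wrap
    sockets or locks and cannot be pickled."""
    def __init__(self):
        self._lock = threading.Lock()  # an unpicklable OS-level resource
    def send(self, record):
        return f"sent:{record}"

def good_output(partition):
    # Correct pattern: create the connection inside the task, on the worker,
    # instead of closing over one created in the driver.
    conn = Connection()
    return [conn.send(r) for r in partition]

# What the broken pattern forces the framework to do: serialize a
# driver-side connection so it can be shipped to the workers.
driver_conn = Connection()
try:
    pickle.dumps(driver_conn)
    serializable = True
except TypeError:
    serializable = False

print(serializable)         # False: the connection cannot leave the driver
print(good_output([1, 2]))  # ['sent:1', 'sent:2']
```

The same reasoning explains Spark's recommended pattern of opening connections per partition inside the worker-side function.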
PyFlink 1.15 Documentation
Q: 'int' object has no attribute 'encode'
1.3.6.3 Q3: Types.BIG_INT() vs. Types.LONG()
1.4 API reference
Example traceback from the testing FAQ:
File "/Users/dianfu/code/src/github/pyflink-faq/testing/test_utils.py", line 122, in setUp
    self.t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
File "/Users/dianfu/code/src/github…", in in_streaming_mode
    get_gateway().jvm.EnvironmentSettings.inStreamingMode())
File "/Users/dianfu/code/src/github/pyflink-faq/testing/.venv/lib/python3.8/site-packages/apache_flink-1.14.4-py3.8-macosx-10…
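Q3 above concerns Types.BIG_INT() vs. Types.LONG(): in PyFlink, Types.LONG() corresponds to a 64-bit signed Java long, while Types.BIG_INT() corresponds to java.math.BigInteger, which has no fixed range. A pure-Python sketch of the practical difference (pyflink not required; the helper name is illustrative):

```python
# Range of Types.LONG(), i.e. a 64-bit signed Java long.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def fits_long(value):
    """Would this Python int survive a round trip through Types.LONG()?"""
    return LONG_MIN <= value <= LONG_MAX

print(fits_long(2**62))  # True:  representable as a Java long
print(fits_long(2**63))  # False: needs Types.BIG_INT() (arbitrary precision)
```

Python ints are arbitrary precision, so the mismatch only surfaces when a value is handed to a fixed-width type on the JVM side.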
PyFlink 1.16 Documentation
Course introduction - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri
Outcomes
At the end of the course, you will hopefully:
• know when to use stream processing vs. other technology
• be able to comprehensively compare features and processing guarantees of streaming systems
Deliverables
• One (1) written report of at most 5 pages (10%)
• Code, including pre-processing, deployment, and testing (40%)
• Code deliverables must be accompanied by documentation
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri | Boston University 2020
• static data with storage vs. streaming data with analytics
DBMS vs. DSMS
• Data: persistent relations (DBMS) vs. streams (DSMS)
• Data access: random (DBMS) vs. sequential, single-pass (DSMS)
• Updates: continuous (DSMS)
• Latency: relatively high (DBMS) vs. low (DSMS)
Traditional DW vs. SDW
• Update frequency: low (traditional DW) vs. high (SDW)
• Update propagation: synchronized (traditional DW) vs. asynchronous (SDW)
▶ A stream is interpreted as describing a changing relation.
• Stream elements bear a valid timestamp, Vs, after which they are considered valid and they can contribute to the result.
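The single-pass access pattern that separates a DSMS from a DBMS can be sketched in a few lines: the DBMS-style query runs once over stored data, while the DSMS-style query makes one forward pass and emits an updated answer per arriving element (function names are illustrative).

```python
def one_shot_average(relation):
    # DBMS style: data at rest, random access, the query runs once.
    return sum(relation) / len(relation)

def continuous_average(stream):
    # DSMS style: a single forward pass over the stream, emitting an
    # updated answer as each element arrives.
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
        yield total / count

print(one_shot_average([1, 2, 3]))                # 2.0
print(list(continuous_average(iter([1, 2, 3]))))  # [1.0, 1.5, 2.0]
```

Note that the generator never revisits past elements, matching the "sequential, single-pass" row of the comparison above.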
[05, Compute Platform, Rong Rong] Flink Batch Processing and Its Applications
Why can Flink do batch processing?
• [architecture figure: Table and Stream APIs over both bounded and unbounded data, sharing one SQL layer and runtime; high throughput, low latency]
Hive vs. Spark vs. Flink Batch
• Model: MapReduce (Hive/Hadoop); MapReduce with memory/disk (Spark); pipeline (Flink)
• Throughput: TB-PB (Hive); TB-PB (Spark); not yet validated at large production scale (Flink)
• SQL: HiveSQL; SparkSQL; ANSI SQL
• Ease of use: fair; good; fair
• Tooling/ecosystem: fair; rich; fair
Flink batch applications: data lakes
• Data lake vs. data warehouse
• [architecture figure: Blink SQL + UDFs, queues, storage and compute layers]
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
• A message broker delivers messages to all subscribed consumers in a broadcast fashion.
Brokers vs. databases
• DBs keep data until it is explicitly deleted, while message brokers delete messages once they are consumed.
Decoupling properties of communication paradigms (message-passing, asynchronous RPC, futures, message queues, pub/sub): can you fill this in?
• Pub/sub provides space decoupling, time decoupling, and synchronization decoupling (yes on all three).
• What if a consumer crashed after reading messages beyond its last recorded offset? How can we avoid re-processing?
Logs vs. in-memory brokers
• Multiple consumers with different processing speeds: reading a message from a log doesn't remove it, so each consumer can proceed at its own pace.
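The log-based design sketched above can be made concrete in a few lines: an append-only log keeps every message, each consumer tracks only an offset, and replay after a crash is just re-reading from the recorded offset (class and method names are illustrative, not any broker's API).

```python
class Log:
    """Minimal append-only log in the spirit of a log-based broker:
    reading does not delete, so consumers at different speeds each keep
    their own offset and can replay after a crash."""
    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1  # offset of the new entry

    def read(self, offset):
        # Return everything at or after 'offset'; nothing is removed.
        return self.entries[offset:]

log = Log()
for m in ["a", "b", "c"]:
    log.append(m)

fast_offset, slow_offset = 3, 1  # two independent consumers
print(log.read(slow_offset))     # ['b', 'c']: the slow consumer catches up
print(log.read(fast_offset))     # []: the fast consumer is up to date
print(log.read(0))               # ['a', 'b', 'c']: full replay is possible
```

Contrast this with an in-memory broker that deletes on consumption: there, a slow consumer and a replaying consumer cannot coexist over the same data.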
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Plan alternatives
• order of operators
• scheduling and placement decisions
• different algorithms, e.g. hash-based vs. broadcast join
What does performance depend on?
• input data and intermediate data
• operator plans
Objectives
• optimize into an equivalent computation
• minimize intermediate data and communication
• minimize disk access
Alternatives
• data structures
• sorting vs. hashing
• indexing, pre-fetching
• scheduling
Redundancy elimination
• Eliminate redundant operations, aka subgraph sharing.
• Ensure mergeable state: even a simple counter might differ on a combined stream vs. on separate streams.
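The mergeable-state caveat can be made concrete with a small sketch: per-partition counters merge by plain summation, but an average kept as a single number does not, unless the state is widened to a (sum, count) pair.

```python
# Per-partition counts merge with a plain sum, so a counter is mergeable state.
part1, part2 = [3.0, 5.0, 2.0], [4.0, 6.0]
count1, count2 = len(part1), len(part2)
print(count1 + count2 == len(part1 + part2))  # True

# A running average stored as a single number is NOT mergeable:
avg1, avg2 = sum(part1) / len(part1), sum(part2) / len(part2)
naive = (avg1 + avg2) / 2                     # ignores partition sizes
true_avg = sum(part1 + part2) / len(part1 + part2)
print(naive == true_avg)                      # False

# Keeping (sum, count) pairs restores mergeability:
merged_sum = sum(part1) + sum(part2)
merged_count = count1 + count2
print(merged_sum / merged_count == true_avg)  # True
```

This is why optimizers that split or share subgraphs must check that the operator's state admits a correct merge function.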
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
▶ A window function is applied on a WindowedStream (or AllWindowedStream) and processes the elements assigned to a window.
Keyed vs. non-keyed windows
// define a keyed window operator
stream
  .keyBy(...)
  .window(...)                    // specify the window assigner
  .reduce/aggregate/process(...)  // specify the window function
Time-based window assigners for the …
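A time-based (tumbling) window assigner can be sketched as a pure function from an element's timestamp to the bounds of the single window it belongs to. This is a conceptual sketch, not Flink's WindowAssigner API.

```python
def tumbling_window(timestamp_ms, size_ms):
    """Map an element's timestamp to the [start, end) bounds of the
    single tumbling window containing it."""
    start = timestamp_ms - (timestamp_ms % size_ms)
    return (start, start + size_ms)

events = [(1_000, "a"), (4_500, "b"), (5_000, "c"), (9_999, "d")]
for ts, value in events:
    print(value, tumbling_window(ts, 5_000))
# "a" and "b" fall in [0, 5000); "c" and "d" fall in [5000, 10000)
```

Sliding windows differ only in that one timestamp maps to several overlapping windows rather than exactly one.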
Streaming in Apache Flink
Assigning event-time timestamps:
public long extractTimestamp(MyEvent event) {
    return event.getCreationTime();
}
Windows (not the OS)
Global vs. keyed windows
stream
  .keyBy(...)
  .window(...)
  .reduce|aggregate|process(...)
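Event-time timestamps like the one extracted above typically drive watermark generation: the watermark trails the maximum timestamp seen so far by a fixed out-of-orderness bound. A minimal Python sketch (illustrative class and method names, not Flink's API):

```python
class BoundedOutOfOrdernessWatermarks:
    """Watermark = max event timestamp seen so far, minus a fixed bound
    on how late events are allowed to arrive."""
    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = float("-inf")

    def on_event(self, timestamp_ms):
        # Track the largest event-time timestamp observed.
        self.max_ts = max(self.max_ts, timestamp_ms)

    def current_watermark(self):
        return self.max_ts - self.max_delay_ms

wm = BoundedOutOfOrdernessWatermarks(max_delay_ms=2_000)
for ts in [1_000, 5_000, 4_000]:  # note the out-of-order third event
    wm.on_event(ts)
print(wm.current_watermark())     # 3000: no event older than this is expected
```

A window operator would use such a watermark to decide when a window's results can safely be emitted.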
共 16 条
- 1
- 2













