Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. In traditional data processing applications we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available in full before processing starts; we do not know when the stream ends. The slides contrast a DBMS with a DSMS (and, analogously, a data warehouse, DW, with a streaming data warehouse, SDW): a Database Management System serves ad-hoc queries and data manipulation tasks such as insertions, updates, and deletions, whereas a Data Stream Management System evaluates continuous queries over streams. Synopsis maintenance: a stream query processing engine keeps a synopsis for each input stream R1, …, Rr and uses these summaries to produce approximate answers to a query Q(R1, …, Rr).
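The synopsis idea can be illustrated with a toy sketch, not the lecture's actual technique: keep one constant-size summary per input stream and answer a query approximately from the summaries alone; real systems use richer synopses such as samples, histograms, or sketches.

```python
# Toy illustration of synopsis maintenance (hypothetical, not from the slides):
# keep a tiny per-stream summary and answer queries from the summaries only.

class RunningSynopsis:
    """Constant-size summary of a numeric stream: element count and running sum."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

# One synopsis per input stream R1, ..., Rr.
synopses = {"R1": RunningSynopsis(), "R2": RunningSynopsis()}

for stream_id, value in [("R1", 4.0), ("R2", 10.0), ("R1", 6.0)]:
    synopses[stream_id].update(value)

# Approximate answer to a query Q(R1, R2) computed from the synopses alone.
print({name: s.mean() for name, s in synopses.items()})
```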
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. Connecting producers to consumers indirectly: the producer writes to a file or database and the consumer periodically polls it to retrieve new data, which introduces polling overhead and latency. Message brokers vs. databases: databases keep data until it is explicitly deleted, while message brokers delete messages once they are consumed, so use a database for long-term data storage. Message brokers also assume a small working set; if consumers are slow, throughput suffers. … Typical pub/sub use cases include balancing workloads across clusters, where tasks can be efficiently distributed among multiple workers such as Google Compute Engine instances, and distributing event notifications, e.g. a service that accepts user signups can send …
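A toy sketch of the indirect, polling-based connection described above; the shared list stands in for a file or database table, and the interval and event names are made up. It makes the overhead and latency point visible: an event can sit unread for up to a full polling interval.

```python
import threading
import time

events = []          # stands in for a file or database table
consumed_upto = 0    # consumer's read offset

def producer():
    for i in range(5):
        events.append(f"event-{i}")   # "write to the database"
        time.sleep(0.3)

def polling_consumer(poll_interval=1.0):
    global consumed_upto
    for _ in range(4):
        time.sleep(poll_interval)       # polling overhead: wake up even if nothing is new
        new = events[consumed_upto:]    # latency: events wait up to poll_interval before being seen
        consumed_upto = len(events)
        print("polled batch:", new)

t = threading.Thread(target=producer)
t.start()
polling_consumer()
t.join()
```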
Scalable Stream Processing - Spark Streaming and Flink. Output operations push a DStream's data out to external systems, e.g. a database or a file system; foreachRDD is the most generic output operator and applies a function to each RDD generated from the stream. … Structured Streaming treats a live data stream as a table that is being continuously appended; it is built on the Spark SQL engine and can perform database-like query optimizations. Its programming model has two main steps to develop … References: "… Clusters", HotCloud '12; P. Carbone et al., "Apache Flink: Stream and batch processing in a single engine", 2015. Some slides were derived from Heather Miller's slides: http://heather.miller.am/teac…
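A short PySpark Streaming sketch of foreachRDD as the generic output operator; the socket source, host and port, and the write_partition sink are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ForeachRDDExample")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

def write_partition(records):
    # In a real job, open a connection to the external system (database,
    # file system, ...) here and write the records; printing stands in for that.
    for record in records:
        print(record)

# foreachRDD: apply a function to each batch's RDD to push data out.
counts.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()
```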
PyFlink 1.15 Documentation. TableEnvironment is the context for creating Table and SQL API programs, and StreamExecutionEnvironment is the central concept for creating DataStream API programs; Flink is a unified streaming and batch computing engine and provides unified streaming and batch APIs for creating both environments. Example error from the documentation: …api.ValidationException: Unable to create a source for reading table 'default_catalog.default_database.sourceKafka'. Table options are: 'connector'='kafka', 'format'='json', 'properties.bootstrap.servers'='192…'
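The TableEnvironment and Kafka-source pieces of this excerpt can be tied together in a short PyFlink sketch. The schema, topic name, and bootstrap server below are placeholders, and running it requires the Flink Kafka connector jar on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Entry point for Table API and SQL programs (streaming mode here).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A Kafka-backed source table; schema, topic, and server address are placeholders.
# A ValidationException like the one in the excerpt can be raised when such a DDL
# is missing required options or the Kafka connector jar is not available.
t_env.execute_sql("""
    CREATE TABLE sourceKafka (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

t_env.execute_sql("SELECT user_id, amount FROM sourceKafka").print()
```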
PyFlink 1.16 Documentation. The excerpt mirrors the 1.15 documentation: TableEnvironment is the context for creating Table and SQL API programs, StreamExecutionEnvironment is the central concept for creating DataStream API programs, and the same Kafka-source ValidationException example appears ('connector'='kafka', 'format'='json', 'properties.bootstrap.servers'='192…').
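For the DataStream API side mentioned in both PyFlink excerpts, a minimal sketch; the input collection and the map transformation are placeholders for illustration.

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

# Entry point for DataStream API programs.
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

ds = env.from_collection(
    [("alice", 1), ("bob", 2), ("alice", 3)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

# Placeholder transformation: scale the second field.
ds.map(lambda t: (t[0], t[1] * 10),
       output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
  .print()

env.execute("datastream_sketch")
```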
State management - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. State management includes checkpointing state to remote and persistent storage, e.g. a distributed filesystem or a database system. The available state backends in Flink are in-memory, file system, and RocksDB; which one should you choose? RocksDB is an LSM-tree storage engine with a key/value interface, where keys and values are arbitrary byte streams (https://rocksdb.org/).
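A sketch of selecting the RocksDB state backend from PyFlink, assuming the EmbeddedRocksDBStateBackend class exposed under pyflink.datastream.state_backend in recent releases (older releases name it RocksDBStateBackend); checkpoint storage configuration is only indicated in a comment because the exact API varies by version.

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Keep working state in RocksDB on local disk instead of the JVM heap.
env.set_state_backend(EmbeddedRocksDBStateBackend())

# Take a checkpoint every 60 seconds; checkpoints should go to remote,
# persistent storage such as a distributed filesystem (the exact
# checkpoint-storage configuration call depends on the Flink version).
env.enable_checkpointing(60_000)
```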
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. ESL, the Expressive Stream Language, supports ad-hoc SQL queries, updates on database tables, and continuous queries on data streams; new (derived) streams are defined as virtual views. … References include a paper in the Proceedings of the 10th International Conference on Database Theory (ICDT '05), and Yan-Nei Law, Haixun Wang, and Carlo Zaniolo, "Query languages and data models for database sequences and data streams", in Proceedings …
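ESL itself is not shown in the excerpt; as a rough present-day analogue of "a derived stream defined as a virtual view", the sketch below registers a continuous Flink SQL query as a view over a base stream. The datagen-backed table, its schema, and the filter are placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Base stream with a placeholder schema, fed by the datagen connector.
t_env.execute_sql("""
    CREATE TABLE purchases (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# Derived stream defined as a virtual view: a continuous query over the base stream.
t_env.execute_sql("""
    CREATE TEMPORARY VIEW big_purchases AS
    SELECT user_id, amount FROM purchases WHERE amount > 0.5
""")

t_env.execute_sql("SELECT * FROM big_purchases").print()
```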
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. [DSMS architecture figure] Input streams S1, S2, …, Sr enter through an Input Manager; a Scheduler, a QoS Monitor, and a Load Shedder work alongside the Query Execution Engine, which evaluates ad-hoc or continuous queries Q1, Q2, …, Qm and produces answers.
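A toy random load shedder, not the lecture's specific algorithm: when the observed input rate exceeds the configured capacity, tuples are dropped at random with just enough probability to bring the kept rate back down to capacity.

```python
import random

class RandomLoadShedder:
    """Drop tuples with probability p chosen so that the kept rate <= capacity."""

    def __init__(self, capacity_per_sec: float):
        self.capacity = capacity_per_sec

    def drop_probability(self, observed_rate: float) -> float:
        if observed_rate <= self.capacity:
            return 0.0
        return 1.0 - self.capacity / observed_rate

    def process(self, tuples, observed_rate: float):
        p = self.drop_probability(observed_rate)
        return [t for t in tuples if random.random() >= p]

shedder = RandomLoadShedder(capacity_per_sec=1000)
batch = list(range(2000))                       # pretend 2000 tuples arrived this second
kept = shedder.process(batch, observed_rate=2000)
print(len(kept), "tuples kept, roughly 1000 expected")
```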
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. What if the graph is generated as a stream of edges? How can we perform iterative computation in a streaming dataflow engine, and how can we propagate watermarks? Do we need to run the computation from scratch for every new …
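One way to see why the computation need not restart for every new edge is an incremental sketch: connected components maintained with a union-find structure that is updated per arriving edge. This is a single-machine illustration, not the streaming-dataflow formulation the slides discuss.

```python
# Incremental connected components over a stream of edges (toy union-find sketch).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def add_edge(u, v):
    ru, rv = find(u), find(v)
    if ru != rv:
        parent[ru] = rv                 # merge the two components

edge_stream = [(1, 2), (3, 4), (2, 3), (5, 6)]
for u, v in edge_stream:
    add_edge(u, v)                      # state is updated per edge, never rebuilt

components = {}
for node in parent:
    components.setdefault(find(node), []).append(node)
print(components)   # two components: {1, 2, 3, 4} and {5, 6}
```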
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics, Spring 2020. … the queries in advance; we can store a fixed proportion of the stream, e.g. 1/10th. Example use case: a web-search user behavior study over a search engine's query stream. Q: How …
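Storing a fixed 1/10th of a query stream is often done by hashing a key into buckets and keeping a single bucket, so that a sampled user's entire history is retained rather than a random 1/10th of individual tuples. The key choice and bucket count below are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 10   # keep 1/10th of the stream

def keep(user_id: str) -> bool:
    # Hash the user (not the whole tuple) so a sampled user's entire
    # query history is kept, which a per-tuple coin flip would not do.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % NUM_BUCKETS == 0

query_stream = [("alice", "flink windows"), ("bob", "rocksdb"), ("alice", "watermarks")]
sample = [(u, q) for u, q in query_stream if keep(u)]
print(sample)
```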