distributed SQL query engines - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

or more streams of possibly different type A series of transformations on streams in Stream SQL, Scala, Python, Rust, Java… ??? Vasiliki Kalavri | Boston University 2020 Logic State Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient way to execute a query • There may exist several ways to to execute a computation • query plans, e.g. order of operators • scheduling and placement decisions • different algorithms, e.g. hash-based vs. broadcast join • What does performance depend on?

0 码力 | 54 页 | 2.83 MB | 1 年前
3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively high low Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this • Derived stream: produced by a continuous query and its operators, e.g. total traffic from a source every minute

0 码力 | 45 页 | 1.22 MB | 1 年前
3
PyFlink 1.15 Documentation

environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible

0 码力 | 36 页 | 266.77 KB | 1 年前
3
PyFlink 1.16 Documentation

environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible

0 码力 | 36 页 | 266.80 KB | 1 年前
3
Apache Flink的过去、现在和未来

2014 • 柏林工业大学博士生项目 • 基于流式 runtime 的批处理引擎 • 2014 年 8 月份发布 Flink 0.6.0 Flink 0.7 Runtime Distributed Streaming Dataflow DataStream API Stream Processing DataSet API Batch Processing 2014 年 12 Schedule Task YARN RM K8S RM 增量 Checkpoint 时间全量状态增量状态增量 snapshot 基于 credit 的流控机制 Streaming SQL ------------------------- | USER_SCORES | ------------------------- | User | Score | Time Flink 1.9 的架构变化 Runtime Distributed Streaming Dataflow Query Processor DAG & StreamOperator Local Single JVM Cloud GCE, EC2 Cluster Standalone, YARN Runtime Distributed Streaming Dataflow DataStream

0 码力 | 33 页 | 3.36 MB | 1 年前
3
Scalable Stream Processing - Spark Streaming and Flink

Built on the Spark SQL engine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers

0 码力 | 113 页 | 1.22 MB | 1 年前
3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Boston University 2020 Three classes of operators: • relation-to-relation: similar to standard SQL and define queries over tables. • stream-to-relation: define tables by selecting portions of a • A Blocking query operator can only return answers when it detects the end of its input. • NOT IN, set difference and division, traditional SQL aggregates • A Non-blocking query operator can produce operator, iff F is monotonic with respect to the partial ordering ⊆. A query Q on a stream S can be implemented by a non-blocking query operator iff Q(S) is monotonic with respect to ⊆. The traditional

0 码力 | 53 页 | 532.37 KB | 1 年前
3
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Anis Uddin Nasir et. al. The power of both choices: Practical load balancing for distributed stream processing engines. ICDE 2015. • Mitzenmacher, Michael. The power of two choices in randomized load

0 码力 | 31 页 | 1.47 MB | 1 年前
3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network search while MBs only offer topic-based subscription. • DB query results depend on a snapshot and clients are not notified if their query result changes later. 13 Message delivery and ordering Acknowledgements applications. 23 Use-cases • Balancing workloads in network clusters • tasks can be efficiently distributed among multiple workers, such as Google Compute Engine instances. • Distributing event notifications

0 码力 | 33 页 | 700.14 KB | 1 年前
3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

5 ??? Vasiliki Kalavri | Boston University 2020 Load shedding as an optimization problem N: query network I: set of input streams with known arrival rates C: system processing capacity H: headroom continuously monitors input rates or other system metrics and can access information about the running query plan • It detects overload and decides what actions to take in order to maintain acceptable latency Fast approximate answers … S1 S2 Sr Input Manager Scheduler QoS Monitor Load Shedder Query Execution Engine Qm Q2 Q1 Ad-hoc or continuous queries Input streams … ??? Vasiliki Kalavri

0 码力 | 43 页 | 2.42 MB | 1 年前
3

共 23 条前往

页

分类

语言

格式

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

PyFlink 1.15 Documentation

PyFlink 1.16 Documentation

Apache Flink的过去、现在和未来

Scalable Stream Processing - Spark Streaming and Flink

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020