Scalable Stream Processing - Spark Streaming and Flink
Streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources, e.g., user-provided sources. Input Operations: every input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.
113 pages | 1.22 MB | 1 year ago
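The excerpt above describes Spark Streaming's Receiver model: each input DStream has a receiver that pulls data from an external source and buffers it in Spark's memory until the engine processes it. A minimal stand-in for that pattern in plain Python (the `Receiver` class and `take` helper are hypothetical, not Spark's actual API):

```python
import threading
import queue

class Receiver:
    """Toy stand-in for Spark's Receiver: pulls records from an external
    source on a background thread and buffers them in memory until the
    processing engine consumes them."""
    def __init__(self, source):
        self.source = source              # any iterable playing the external source
        self.buffer = queue.Queue()       # stands in for Spark's in-memory blocks
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        for record in self.source:
            self.buffer.put(record)       # "store it in memory for processing"

    def start(self):
        self._thread.start()

    def take(self, n):
        # block until n records have been received
        return [self.buffer.get() for _ in range(n)]

recv = Receiver(iter(["a", "b", "c"]))
recv.start()
out = recv.take(3)
print(out)  # → ['a', 'b', 'c']
```

The blocking queue decouples the source's arrival rate from the consumer, which is the same role Spark's receiver-side block storage plays.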
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu. 1/28: Stream ingestion and pub/sub systems. Streaming sources: files (e.g., transaction logs), sockets, IoT devices and sensors, databases and KV stores, message … Where do stream processors read data from? Challenges: sources can be distributed; out-of-sync sources may produce out-of-order streams; sources can be connected to the network, with latency and unpredictable … the processor should be able to make progress; sources might fail (or seem as if they failed). Producers and consumers: an event is typically generated by a producer (or publisher or sender) …
33 pages | 700.14 KB | 1 year ago
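The producer/consumer model sketched in the excerpt (producers publish events, consumers subscribe and receive them) can be illustrated with a toy in-process broker; the `Broker` class below is illustrative only, not a real messaging API:

```python
from collections import defaultdict

class Broker:
    """Toy in-process pub/sub broker: producers publish events to topics,
    consumers subscribe with a callback and receive every matching event."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # fan out the event to every subscriber of this topic
        for cb in self.subscribers[topic]:
            cb(event)

broker = Broker()
received = []
broker.subscribe("clicks", received.append)
broker.publish("clicks", {"user": 1, "page": "/home"})
broker.publish("views", {"user": 2})   # no subscriber on this topic: dropped
print(received)  # → [{'user': 1, 'page': '/home'}]
```

Real brokers such as Kafka add durability and per-consumer offsets on top of this basic fan-out, which is what lets out-of-sync or failed consumers catch up later.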
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Avoid latency increase; monitor input rates. Where in the query plan? Dropping at the sources vs. dropping at bottleneck operators. How much load to shed? Enough for the system to keep up. … applies shedding to entire windows instead of individual tuples. When discarding tuples at the sources or another point in a query with multiple window aggregations, it is unclear how shedding will affect … In the dataflow graph, back-pressure propagates to upstream operators, eventually reaching the data stream sources. To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage …
43 pages | 2.42 MB | 1 year ago
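One simple answer to "how much load to shed?" is probabilistic shedding: keep each tuple with probability capacity/rate, so the surviving rate roughly matches what the bottleneck operator can sustain. A minimal sketch, assuming a shedder placed in front of the bottleneck (function name and parameters are illustrative):

```python
import random

def shed(stream, capacity, input_rate, seed=42):
    """Keep each tuple with probability capacity/input_rate so the
    surviving rate roughly matches what the operator can sustain."""
    keep_prob = min(1.0, capacity / input_rate)
    rng = random.Random(seed)              # seeded for reproducibility
    return [t for t in stream if rng.random() < keep_prob]

out = shed(range(1000), capacity=500, input_rate=1000)
print(len(out))  # roughly half of the 1000 tuples survive
```

As the excerpt notes, random per-tuple dropping interacts badly with window aggregates, which is why window-aware shedders discard whole windows instead.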
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… a priori. They bear an arrival and/or a generation timestamp. They are produced by external sources, i.e., the DSMS has no control over their arrival order or the data rate. They have unknown … Lecture references: some material in this lecture was assembled from the following sources: Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Data Stream Management: Processing …
45 pages | 1.22 MB | 1 year ago
Monitoring Apache Flink Applications: Getting Started (监控Apache Flink应用程序(入门))
Latency Tracking: when enabled, Flink will insert so-called latency markers periodically at all sources. For each sub-task, a latency distribution from each source to this operator will be reported. Latency tracking can significantly impact the performance of the cluster. It is recommended to only enable it to locate sources of latency during debugging. 4.12.1 Key Metrics: Metric | Scope | Description | latency …
23 pages | 148.62 KB | 1 year ago
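The latency-marker mechanism described above can be mimicked with a tiny simulation: the source interleaves timestamped marker records into the stream, and a downstream operator measures each marker's age on arrival. All names below are illustrative, not Flink's actual API:

```python
import time

class LatencyMarker:
    """A special record carrying the wall-clock time it left the source."""
    def __init__(self, emitted_at):
        self.emitted_at = emitted_at

def source(records, marker_every):
    """Yield records, inserting a latency marker every `marker_every` records."""
    for i, record in enumerate(records):
        if i % marker_every == 0:
            yield LatencyMarker(time.time())
        yield record

def operator(stream):
    """Collect source-to-operator delays from markers; pass records through."""
    latencies, output = [], []
    for item in stream:
        if isinstance(item, LatencyMarker):
            latencies.append(time.time() - item.emitted_at)
        else:
            output.append(item)
    return latencies, output

latencies, output = operator(source(range(10), marker_every=5))
print(len(latencies), len(output))  # → 2 10
```

Because every marker traverses the same path as data records, the collected delays approximate the per-operator latency distribution that Flink reports.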
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… dataflow graph for several queries, when applications analyze data streams from a small set of sources. Operator elimination: remove a no-op, e.g., a projection that keeps all attributes; remove … Further reading: Stratis Viglas and Jeffrey Naughton. Rate-based Query Optimization for Streaming Information Sources. SIGMOD 2002.
54 pages | 2.83 MB | 1 year ago
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
The automatic scaling problem: given a logical dataflow with sources S1, S2, …, Sn and rates λ1, λ2, …, λn, identify the minimum parallelism πi per operator i such … Assign an increasing sequential id to all operators in topological order, starting from the sources; represent the dataflow as an adjacency matrix A, with Aij = 1 iff operator i is an upstream neighbor of j.
93 pages | 2.42 MB | 1 year ago
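The dataflow encoding described in the excerpt (operators numbered in topological order, edges stored in an adjacency matrix A with Aij = 1 for upstream neighbors) is straightforward to reproduce; the helper below is a sketch, not the paper's implementation:

```python
def adjacency_matrix(num_ops, edges):
    """Build A where A[i][j] = 1 iff operator i is an upstream
    neighbor of operator j (ids assigned in topological order,
    sources first, so edges always go from lower to higher ids)."""
    A = [[0] * num_ops for _ in range(num_ops)]
    for i, j in edges:
        A[i][j] = 1
    return A

# source 0 feeds operators 1 and 2, which both feed sink 3
A = adjacency_matrix(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(A[0][1], A[1][0])  # → 1 0
```

With topological ids, A is strictly upper-triangular, which makes it easy to propagate rates λi from sources toward sinks when solving for the minimum parallelism πi.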
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… is satisfied if the initiator can reach all tasks (possible in DAGs via multiple initiators, e.g., sources). … checkpoint. 3. Resume processing. Re-settable sources: all input streams are reset to the position up to which they were consumed when the checkpoint … previous offset of the stream.
81 pages | 13.18 MB | 1 year ago
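A re-settable source only needs to expose its current read position, so a checkpoint can capture it and recovery can rewind to it and replay. A minimal sketch (hypothetical class, not Flink's source API):

```python
class ResettableSource:
    """A source that tracks its read offset so it can be reset to the
    position captured in the last checkpoint and replay from there."""
    def __init__(self, records):
        self.records = records
        self.offset = 0

    def next(self):
        record = self.records[self.offset]
        self.offset += 1
        return record

    def checkpoint(self):
        return self.offset            # persisted alongside operator state

    def reset(self, offset):
        self.offset = offset          # rewind on recovery

src = ResettableSource(["a", "b", "c", "d"])
src.next(); src.next()                # "a", "b" consumed
saved = src.checkpoint()              # offset 2 captured in the checkpoint
src.next()                            # "c" processed, then a failure occurs
src.reset(saved)                      # recovery: rewind to the checkpoint
replayed = src.next()
print(replayed)  # → c
```

Replaying "c" after recovery is exactly why downstream state must also be rolled back to the same checkpoint: otherwise the record would be counted twice.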
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Watermarks (in Flink) flow along dataflow edges. They are special records generated by the sources or assigned by the application. A watermark for time T states that event time has progressed to …
22 pages | 2.22 MB | 1 year ago
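A common way to assign watermarks is bounded out-of-orderness: the watermark trails the maximum event timestamp seen so far by a fixed delay, asserting that event time has progressed at least that far. A small sketch mirroring the idea (not Flink's actual watermark API):

```python
class BoundedOutOfOrdernessWatermarks:
    """Watermark = max event timestamp seen so far minus a fixed bound;
    it asserts that no events earlier than the watermark are expected."""
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        self.max_ts = max(self.max_ts, event_ts)
        return self.max_ts - self.max_delay   # current watermark

wm = BoundedOutOfOrdernessWatermarks(max_delay=5)
w1 = wm.on_event(10)   # watermark 5
w2 = wm.on_event(8)    # out-of-order event: watermark does not regress
w3 = wm.on_event(20)   # watermark 15
print(w1, w2, w3)  # → 5 5 15
```

The monotone `max` is what makes watermarks a statement of progress: they only ever move event time forward, even when individual events arrive out of order.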
Course introduction - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri | Boston University 2020. Can you give me some examples of streaming data sources? Location-based services …
34 pages | 2.53 MB | 1 year ago
12 items total