Scalable Stream Processing - Spark Streaming and Flink
Streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources, e.g., user-provided sources. Input Operations: every input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.
113 pages | 1.22 MB | 1 year ago
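The excerpt above describes Spark Streaming's Receiver model: each input DStream has a receiver that pulls data from an external source and buffers it in Spark's memory until the engine processes it. A minimal stand-in for that pattern in plain Python (the `Receiver` class and `take` helper are hypothetical, not Spark's actual API):

```python
import threading
import queue

class Receiver:
    """Toy stand-in for Spark's Receiver: pulls records from an external
    source on a background thread and buffers them in memory until the
    processing engine consumes them."""
    def __init__(self, source):
        self.source = source              # any iterable playing the external source
        self.buffer = queue.Queue()       # stands in for Spark's in-memory blocks
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        for record in self.source:
            self.buffer.put(record)       # "store it in memory for processing"

    def start(self):
        self._thread.start()

    def take(self, n):
        # block until n records have been received
        return [self.buffer.get() for _ in range(n)]

recv = Receiver(iter(["a", "b", "c"]))
recv.start()
out = recv.take(3)
print(out)  # → ['a', 'b', 'c']
```

The blocking queue decouples the source's arrival rate from the consumer, which is the same role Spark's receiver-side block storage plays.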
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki (Vasia) Kalavri, vkalavri@bu.edu. 1/28: Stream ingestion and pub/sub systems. Streaming sources: files (e.g., transaction logs), sockets, IoT devices and sensors, databases and KV stores, message … Where do stream processors read data from? Challenges: sources can be distributed; out-of-sync sources may produce out-of-order streams; sources can be connected to the network, with latency and unpredictable … the processor should be able to make progress; sources might fail (or seem as if they failed). Producers and consumers: an event is typically generated by a producer (or publisher or sender) …
33 pages | 700.14 KB | 1 year ago
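The producer/consumer model sketched in the excerpt (producers publish events, consumers subscribe and receive them) can be illustrated with a toy in-process broker; the `Broker` class below is illustrative only, not a real messaging API:

```python
from collections import defaultdict

class Broker:
    """Toy in-process pub/sub broker: producers publish events to topics,
    consumers subscribe with a callback and receive every matching event."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # fan out the event to every subscriber of this topic
        for cb in self.subscribers[topic]:
            cb(event)

broker = Broker()
received = []
broker.subscribe("clicks", received.append)
broker.publish("clicks", {"user": 1, "page": "/home"})
broker.publish("views", {"user": 2})   # no subscriber on this topic: dropped
print(received)  # → [{'user': 1, 'page': '/home'}]
```

Real brokers such as Kafka add durability and per-consumer offsets on top of this basic fan-out, which is what lets out-of-sync or failed consumers catch up later.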
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Avoid latency increase; monitor input rates. Where in the query plan? Dropping at the sources vs. dropping at bottleneck operators. How much load to shed? Enough for the system to keep up. … applies shedding to entire windows instead of individual tuples. When discarding tuples at the sources or another point in a query with multiple window aggregations, it is unclear how shedding will affect … In the dataflow graph, back-pressure propagates to upstream operators, eventually reaching the data stream sources. To ensure no data loss, a persistent input message queue, such as Kafka, and enough storage …
43 pages | 2.42 MB | 1 year ago
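One simple answer to "how much load to shed?" is probabilistic shedding: keep each tuple with probability capacity/rate, so the surviving rate roughly matches what the bottleneck operator can sustain. A minimal sketch, assuming a shedder placed in front of the bottleneck (function name and parameters are illustrative):

```python
import random

def shed(stream, capacity, input_rate, seed=42):
    """Keep each tuple with probability capacity/input_rate so the
    surviving rate roughly matches what the operator can sustain."""
    keep_prob = min(1.0, capacity / input_rate)
    rng = random.Random(seed)              # seeded for reproducibility
    return [t for t in stream if rng.random() < keep_prob]

out = shed(range(1000), capacity=500, input_rate=1000)
print(len(out))  # roughly half of the 1000 tuples survive
```

As the excerpt notes, random per-tuple dropping interacts badly with window aggregates, which is why window-aware shedders discard whole windows instead.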
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… a priori. They bear an arrival and/or a generation timestamp. They are produced by external sources, i.e., the DSMS has no control over their arrival order or the data rate. They have unknown … Lecture references: some material in this lecture was assembled from the following sources: Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Data Stream Management: Processing …
45 pages | 1.22 MB | 1 year ago
Monitoring Apache Flink Applications: Getting Started (监控Apache Flink应用程序(入门))
Latency Tracking: when enabled, Flink will insert so-called latency markers periodically at all sources. For each sub-task, a latency distribution from each source to this operator will be reported. Latency tracking can significantly impact the performance of the cluster. It is recommended to only enable it to locate sources of latency during debugging. 4.12.1 Key Metrics: Metric | Scope | Description | latency …
23 pages | 148.62 KB | 1 year ago
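The latency-marker mechanism described above can be mimicked with a tiny simulation: the source interleaves timestamped marker records into the stream, and a downstream operator measures each marker's age on arrival. All names below are illustrative, not Flink's actual API:

```python
import time

class LatencyMarker:
    """A special record carrying the wall-clock time it left the source."""
    def __init__(self, emitted_at):
        self.emitted_at = emitted_at

def source(records, marker_every):
    """Yield records, inserting a latency marker every `marker_every` records."""
    for i, record in enumerate(records):
        if i % marker_every == 0:
            yield LatencyMarker(time.time())
        yield record

def operator(stream):
    """Collect source-to-operator delays from markers; pass records through."""
    latencies, output = [], []
    for item in stream:
        if isinstance(item, LatencyMarker):
            latencies.append(time.time() - item.emitted_at)
        else:
            output.append(item)
    return latencies, output

latencies, output = operator(source(range(10), marker_every=5))
print(len(latencies), len(output))  # → 2 10
```

Because every marker traverses the same path as data records, the collected delays approximate the per-operator latency distribution that Flink reports.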
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… dataflow graph for several queries, when applications analyze data streams from a small set of sources. Operator elimination: remove a no-op, e.g., a projection that keeps all attributes; remove … Further reading: Stratis Viglas and Jeffrey Naughton. Rate-based Query Optimization for Streaming Information Sources. SIGMOD 2002.
54 pages | 2.83 MB | 1 year ago
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
The automatic scaling problem: given a logical dataflow with sources S1, S2, …, Sn and rates λ1, λ2, …, λn, identify the minimum parallelism πi per operator i such … Assign an increasing sequential id to all operators in topological order, starting from the sources; represent the dataflow as an adjacency matrix A, with Aij = 1 iff operator i is an upstream neighbor of j.
93 pages | 2.42 MB | 1 year ago
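The dataflow encoding described in the excerpt (operators numbered in topological order, edges stored in an adjacency matrix A with Aij = 1 for upstream neighbors) is straightforward to reproduce; the helper below is a sketch, not the paper's implementation:

```python
def adjacency_matrix(num_ops, edges):
    """Build A where A[i][j] = 1 iff operator i is an upstream
    neighbor of operator j (ids assigned in topological order,
    sources first, so edges always go from lower to higher ids)."""
    A = [[0] * num_ops for _ in range(num_ops)]
    for i, j in edges:
        A[i][j] = 1
    return A

# source 0 feeds operators 1 and 2, which both feed sink 3
A = adjacency_matrix(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(A[0][1], A[1][0])  # → 1 0
```

With topological ids, A is strictly upper-triangular, which makes it easy to propagate rates λi from sources toward sinks when solving for the minimum parallelism πi.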
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
… is satisfied if the initiator can reach all tasks (possible in DAGs via multiple initiators, e.g., sources). … checkpoint. 3. Resume processing. Re-settable sources: all input streams are reset to the position up to which they were consumed when the checkpoint … previous offset of the stream.
81 pages | 13.18 MB | 1 year ago
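A re-settable source only needs to expose its current read position, so a checkpoint can capture it and recovery can rewind to it and replay. A minimal sketch (hypothetical class, not Flink's source API):

```python
class ResettableSource:
    """A source that tracks its read offset so it can be reset to the
    position captured in the last checkpoint and replay from there."""
    def __init__(self, records):
        self.records = records
        self.offset = 0

    def next(self):
        record = self.records[self.offset]
        self.offset += 1
        return record

    def checkpoint(self):
        return self.offset            # persisted alongside operator state

    def reset(self, offset):
        self.offset = offset          # rewind on recovery

src = ResettableSource(["a", "b", "c", "d"])
src.next(); src.next()                # "a", "b" consumed
saved = src.checkpoint()              # offset 2 captured in the checkpoint
src.next()                            # "c" processed, then a failure occurs
src.reset(saved)                      # recovery: rewind to the checkpoint
replayed = src.next()
print(replayed)  # → c
```

Replaying "c" after recovery is exactly why downstream state must also be rolled back to the same checkpoint: otherwise the record would be counted twice.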
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Watermarks (in Flink) flow along dataflow edges. They are special records generated by the sources or assigned by the application. A watermark for time T states that event time has progressed to …
22 pages | 2.22 MB | 1 year ago
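A common way to assign watermarks is bounded out-of-orderness: the watermark trails the maximum event timestamp seen so far by a fixed delay, asserting that event time has progressed at least that far. A small sketch mirroring the idea (not Flink's actual watermark API):

```python
class BoundedOutOfOrdernessWatermarks:
    """Watermark = max event timestamp seen so far minus a fixed bound;
    it asserts that no events earlier than the watermark are expected."""
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        self.max_ts = max(self.max_ts, event_ts)
        return self.max_ts - self.max_delay   # current watermark

wm = BoundedOutOfOrdernessWatermarks(max_delay=5)
w1 = wm.on_event(10)   # watermark 5
w2 = wm.on_event(8)    # out-of-order event: watermark does not regress
w3 = wm.on_event(20)   # watermark 15
print(w1, w2, w3)  # → 5 5 15
```

The monotone `max` is what makes watermarks a statement of progress: they only ever move event time forward, even when individual events arrive out of order.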
Course introduction - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Vasiliki Kalavri | Boston University 2020. Can you give me some examples of streaming data sources? Location-based services …
34 pages | 2.53 MB | 1 year ago
12 items total