Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (45 pages, 1.22 MB)
We cannot store the entire stream in an accessible way, so we have to process stream elements on-the-fly using limited memory.
Properties of data streams: they arrive continuously and are read in a single pass. Compared with a traditional DBMS, updates are append-only rather than arbitrary; update rates are high and bursty rather than relatively low; the processing model is data-driven (push-based) rather than query-driven (pull-based); queries are continuous rather than ad-hoc; and expected latency is relatively low rather than relatively high.
Time-series model: the j-th update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams; a small single-pass sketch follows below.
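The single-pass, limited-memory constraint can be made concrete with a small sketch. The following Python example is illustrative only (it is not from the course material, and the sample values are made up): it consumes time-series updates (j, A[j]) in index order and maintains a count, sum, and maximum in constant memory, never storing the stream itself.

def updates():
    # Stand-in for an unbounded source emitting (j, A[j]) in increasing order of j.
    for j, value in enumerate([4.0, 7.5, 1.2, 9.9, 3.3]):
        yield j, value

count, total, maximum = 0, 0.0, float("-inf")
for j, a_j in updates():
    count += 1                      # O(1) state: number of entries seen so far
    total += a_j                    # O(1) state: running sum
    maximum = max(maximum, a_j)     # O(1) state: running maximum

print(f"count={count}, mean={total / count:.2f}, max={maximum}")
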
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (72 pages, 7.77 MB)
Graph streams model interactions as events that update an underlying graph structure, which may be undirected or directed. Edge events are single interactions, such as a purchase; a small sketch of maintaining the graph from edge events follows below.
Some algorithms model graph streams as a sequence of vertex events. A vertex stream consists of events that contain a vertex and all of its neighbors. Although this model can enable a theoretical analysis of streaming algorithms, it cannot adequately model real-world unbounded streams, as the neighbors cannot be known in advance. Vertex streams are not covered in this lecture.
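To make edge events concrete, here is a minimal Python sketch (not from the slides; the sample edges are invented) that consumes a stream of edge events and incrementally maintains the underlying graph as adjacency sets together with degree counts.

from collections import defaultdict

# Each event adds one edge (src, dst) to the underlying directed graph.
edge_events = [("alice", "bookA"), ("bob", "bookA"), ("alice", "bookB")]

adjacency = defaultdict(set)     # vertex -> set of out-neighbors
out_degree = defaultdict(int)    # vertex -> number of distinct out-edges

for src, dst in edge_events:
    if dst not in adjacency[src]:
        adjacency[src].add(dst)
        out_degree[src] += 1

print(dict(out_degree))  # {'alice': 2, 'bob': 1}
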
Scalable Stream Processing - Spark Streaming and Flink (113 pages, 1.22 MB)
Every input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing. There are three categories of streaming sources: basic sources directly available in the StreamingContext API, advanced sources (e.g. Kafka) that require extra dependencies, and custom sources. A custom source is defined by extending the Receiver class, as in the following abridged Scala fragment:

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    // Start a background thread that connects to host:port and calls store().
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }
  // onStop() and the private receive() helper are omitted here.
}

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (49 pages, 2.08 MB)
After a failure we have to reason about each in-flight message: was it processed? Was it delivered downstream?
[Figure: a simple system model in which stream sources feed primary processing nodes N1 ... NK, each with an input queue and an output queue, backed by secondary nodes and delivering results to other applications.]
Overhead: how is performance affected by the fault-tolerance mechanism under normal, failure-free operation? How much memory or disk space is required to maintain input tuples and state? Recovery speed: how long does it ... A sketch of the output-queue bookkeeping implied by this model follows below.
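The primary/secondary layout with per-node output queues is the setting for classic upstream backup: a node keeps every result it has produced until the downstream consumer acknowledges it, so that after a failure the unacknowledged portion can be replayed. The Python sketch below is purely illustrative; the class and method names are invented and not taken from the lecture or any particular system.

from collections import deque

class UpstreamBackupNode:
    """Keeps produced records until they are acknowledged downstream."""

    def __init__(self):
        self.output_queue = deque()   # (seq_no, record) pairs awaiting acks
        self.next_seq = 0

    def emit(self, record):
        self.output_queue.append((self.next_seq, record))
        self.next_seq += 1
        return self.next_seq - 1      # downstream acknowledges this sequence number

    def ack(self, seq_no):
        # Everything up to and including seq_no is now safe to drop.
        while self.output_queue and self.output_queue[0][0] <= seq_no:
            self.output_queue.popleft()

    def replay(self):
        # On failover, the secondary re-sends all unacknowledged output.
        return [record for _, record in self.output_queue]

node = UpstreamBackupNode()
node.emit("a"); node.emit("b"); node.emit("c")
node.ack(0)
print(node.replay())  # ['b', 'c']
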
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (93 pages, 2.42 MB)
Control-theoretic approaches (pole placement, sampling period, damping) cannot identify individual bottlenecks, nor can they model two-input operators. Heuristic models instead rely on per-operator metrics collected at runtime.
The DS2 model collects metrics per configurable observation window W: activity (busy) durations per worker, records processed R_prc, and records pushed to output R_psd. A small sketch of the rate computation these metrics enable follows below.
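The metrics above are enough to estimate how fast an operator could run if it were never idle. A minimal sketch, simplified from the DS2 idea rather than taken from its implementation, with made-up numbers: the true processing rate of an instance is records processed divided by its busy time within W, the true output rate is records pushed divided by the same busy time, and comparing aggregate upstream output against per-instance capacity suggests a parallelism.

import math

# Per-worker observations within one window W (made-up values).
# Each tuple: (busy_seconds, records_processed, records_pushed)
workers = [(0.8, 4_000, 1_000), (0.6, 3_600, 900)]

true_proc_rate = sum(processed / busy for busy, processed, _ in workers)
true_out_rate = sum(pushed / busy for busy, _, pushed in workers)

# Assume the upstream operator's aggregate true output rate is 20,000 records/s.
upstream_true_out_rate = 20_000
per_instance_capacity = true_proc_rate / len(workers)
suggested_parallelism = math.ceil(upstream_true_out_rate / per_instance_capacity)

print(f"true processing rate ~ {true_proc_rate:.0f} rec/s, "
      f"true output rate ~ {true_out_rate:.0f} rec/s, "
      f"suggested parallelism = {suggested_parallelism}")
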
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (53 pages, 532.37 KB)
Which queries can be expressed using only non-blocking operators?
Model and formalization (I): a stream is a sequence of unbounded length, where tuples are ordered by their timestamps. We write t ∈ S to denote that, for some 1 ≤ i ≤ n, t_i = t.
Model and formalization (II): pre-sequence (prefix): let S = [t_1, ..., t_n] be a sequence and 0 < k ≤ n; then the k-prefix of S is [t_1, ..., t_k]. A small sketch of what prefixes buy non-blocking operators follows below.
Requirements for querying streaming and static data (or why SQL is not enough): a push-based model, as opposed to the pull-based model of SQL in which an application or client asks for the query results when it needs them.
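The prefix notion is what separates non-blocking from blocking operators: a non-blocking operator can emit correct results after seeing only a prefix of its input, while a blocking operator needs the entire (unbounded) input before producing anything. The Python generators below are illustrative only and not from the lecture.

def non_blocking_filter(stream, predicate):
    # Emits output tuple by tuple: every prefix of the input yields a prefix of the output.
    for t in stream:
        if predicate(t):
            yield t

def blocking_sum(stream):
    # Produces its single output only after the whole input is consumed,
    # which never happens on an unbounded stream.
    total = 0
    for t in stream:
        total += t
    yield total

print(list(non_blocking_filter(range(10), lambda x: x % 2 == 0)))  # [0, 2, 4, 6, 8]
print(list(blocking_sum(range(10))))                               # [45]
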
Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (22 pages, 2.22 MB)
Further reading: ...g-102; Watermarks, Tables, Event Time, and the Dataflow Model: https://www.confluent.jp/blog/watermarks-tables-event-time-dataflow-model/
Monitoring Apache Flink Applications (Getting Started) (23 pages, 148.62 KB)
The metrics you want to look at are the memory consumption and CPU load of your Task- and JobManager JVMs. Section 4.13.1, Memory: Flink reports the usage of Heap, NonHeap, Direct & Mapped memory for JobManagers and TaskManagers; a sketch of reading these metrics over the REST API follows below. Referenced links include .../ops/config.html#configuring-the-network-buffers and https://www.da-platform.com/blog/manage-rocksdb-memory-size-apache-flink
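These JVM memory and CPU figures are standard Flink system metrics and can be read from the JobManager's REST API as well as through metric reporters. A minimal sketch, assuming a cluster whose REST endpoint is reachable at localhost:8081 (adjust the host, port, and metric names to your setup):

import json
import urllib.request

BASE = "http://localhost:8081"  # assumed JobManager REST endpoint

def get(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

# Heap usage summed across all TaskManagers (bytes).
tm_heap = get("/taskmanagers/metrics?get=Status.JVM.Memory.Heap.Used&agg=sum")
print("TaskManager heap used:", tm_heap)

# JobManager heap usage and CPU load.
jm = get("/jobmanager/metrics?get=Status.JVM.Memory.Heap.Used,Status.JVM.CPU.Load")
print("JobManager:", jm)
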
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (74 pages, 1.06 MB)
To sample a tenth of the users, draw a random integer r_u between 0 and 9 for each user and add the user to the sample if r_u = 0. Do we need to keep all users in memory? No: we can use a hash function h to hash the user name to one of ten buckets and keep only the users that land in bucket 0.
Bloom filter sizing exercise: assume we expect around 1 billion elements and we have a fixed memory budget of 512 MB. How many hash functions should we use? What would be the false positive rate? A worked computation follows below.
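A worked computation for the exercise above, using the standard Bloom filter formulas rather than anything specific to the slides: with m bits of memory and n expected elements, the optimal number of hash functions is k = (m/n) * ln 2, and the false positive rate is approximately (1 - e^(-k*n/m))^k.

import math

n = 1_000_000_000            # expected number of elements
m = 512 * 1024 * 1024 * 8    # 512 MB budget expressed in bits

k_optimal = (m / n) * math.log(2)
k = round(k_optimal)         # the number of hash functions must be an integer

false_positive_rate = (1 - math.exp(-k * n / m)) ** k

print(f"bits per element m/n = {m / n:.2f}")         # about 4.29
print(f"optimal k = {k_optimal:.2f} -> use k = {k}")  # about 2.98 -> 3
print(f"false positive rate = {false_positive_rate:.3f}")  # roughly 0.13
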
PyFlink 1.15 Documentation (36 pages, 266.77 KB)
Shows how to submit PyFlink jobs to a YARN cluster in application mode and which Python environments to use. A representative command from the preview (cut off after a few options; the bracketed application name is a placeholder):

./bin/flink run-application -t yarn-application \
  -Djobmanager.memory.process.size=1024m \
  -Dtaskmanager.memory.process.size=1024m \
  -Dyarn.application.name= \
  -pyclientexec ...

The preview repeats the same command in two further variants that are cut off at -Dyarn.shi... A minimal local PyFlink program is sketched below.
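For contrast with the cluster submission commands, here is a minimal PyFlink DataStream program that runs locally. It is a sketch assuming PyFlink 1.15 is installed (e.g. via pip install apache-flink==1.15.*), not an excerpt from the linked documentation.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# A tiny bounded "stream" used purely for illustration.
words = env.from_collection(["stream", "processing", "with", "pyflink"])

# Per-record (non-blocking) transformation.
upper = words.map(lambda w: w.upper())

upper.print()
env.execute("pyflink_minimal_example")
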
19 documents in total.