Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 Sampling streams 5 ??? Vasiliki Kalavri | Boston University 2020 6 A sample is a set of data elements selected via some random process Samples: the most fundamental synopses input stream 1/10th of the stream, we select a stream element i with probability 10%. • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i if ri=0. 8 Q: 1/10th of the stream, we select a stream element i with probability 10%. • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i if ri=0. 8 Will0 码力 | 74 页 | 1.06 MB | 1 年前3
PyFlink 1.15 DocumentationLocal This page shows you how to set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires pre-installed to be available in your local environment. It’s suggested to use Python virtual environments to set up your local Python environment. See Create a Python virtual environment for more details on how to barebone way of deploying Flink. This page shows you how to set up Python envi- ronment and execute PyFlink jobs in a standalone Flink cluster. Set up Python environment It requires Python 3.6 or above with0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 DocumentationLocal This page shows you how to set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires pre-installed to be available in your local environment. It’s suggested to use Python virtual environments to set up your local Python environment. See Create a Python virtual environment for more details on how to barebone way of deploying Flink. This page shows you how to set up Python envi- ronment and execute PyFlink jobs in a standalone Flink cluster. Set up Python environment It requires Python 3.6 or above with0 码力 | 36 页 | 266.80 KB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020Boston University 2020 Operator types (II) • Sequence Operators capture the arrival of an ordered set of events. • common in pattern languages • events must have associated timestamps • Iteration Blocking query operator can only return answers when it detects the end of its input. • NOT IN, set difference and division, traditional SQL aggregates • A Non-blocking query operator can produce answers GROUP BY DeptNo HAVING SUM(empl.Sal) > 10000 The introduction of a new empl can only expand the set of departments that satisfy this query However this sum query cannot be expressed without the use0 码力 | 53 页 | 532.37 KB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 Streaming Connected Components • State: a disjoint set (union-find) data structure for the components • it stores a set of elements partitioned in disjoint subsets • Single-pass computation: Connected Components 36 1. partition the edge stream, e.g. by source Id 2. maintain a disjoint set in each partition 3. periodically merge the partial disjoint sets into a global one ??? Vasiliki 2 5 6 7 8 ??? Vasiliki Kalavri | Boston University 2020 A graph is bipartite if its vertex set can be divided into two disjoint independent sets U, V, such that every edge connects a vertex in0 码力 | 72 页 | 7.77 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020Kalavri | Boston University 2020 15 Safety • Attribute availability: the set of attributes B reads from must be disjoint from the set of attributes A writes to. • Commutativity: the results of applying Kalavri | Boston University 2020 18 Safety • attribute availability: the set of attributes B reads from must be disjoint from the set of attributes A writes to. • commutativity: the results of applying build one dataflow graph for several queries • when applications analyze data streams from a small set of sources • Operator elimination • remove a no-op, e.g. a projection that keeps all attributes0 码力 | 54 页 | 2.83 MB | 1 年前3
Scalable Stream Processing - Spark Streaming and Flinkfrequency in each RDD of the source DStream. 23 / 79 Window Operations (1/3) ▶ Spark provides a set of transformations that apply to a over a sliding window of data. ▶ A window is defined by two parameters: these operation is proportional to the size of the state. ▶ mapWithState • It is executed only on set of keys that are available in the last micro batch. • The performance is proportional to the size these operation is proportional to the size of the state. ▶ mapWithState • It is executed only on set of keys that are available in the last micro batch. • The performance is proportional to the size0 码力 | 113 页 | 1.22 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020probably in the set if false, it definitely isn’t Vasiliki Kalavri | Boston University 2020 21 http://streamingbook.net/fig/5-5 Bloom filter: if true, the element is probably in the set if false, it 2020 21 http://streamingbook.net/fig/5-5 Bloom filter: if true, the element is probably in the set if false, it definitely isn’t Separate bloom filters for every 10-minute range to avoid saturation 2020 21 http://streamingbook.net/fig/5-5 Bloom filter: if true, the element is probably in the set if false, it definitely isn’t Separate bloom filters for every 10-minute range to avoid saturation0 码力 | 49 页 | 2.08 MB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 Basic API Concept Source Data Stream Operator Data Stream Sink Source Data Set Operator Data Set Sink Writing a Flink Program 1.Bootstrap Sources 2.Apply Operators 3.Output to Sinks execute("Compute max sensor temperature”) } } Flink programs are defined in regular Scala/Java methods Set up the execution environment: local, cluster, I/O, time semantics, parallelism, … Example: Sensor0 码力 | 26 页 | 3.33 MB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020• Restart it with an adjusted parallelism • The state is automatically redistributed to the new set of parallel tasks • For exactly-once results, we need to prevent a checkpoint to complete after Kalavri | Boston University 2020 val env = StreamExecutionEnvironment.getExecutionEnvironment // set the maximum parallelism for this application env.setMaxParallelism(512) val alerts: DataStream[(String DataStream[(String, Double, Double)] = keyedSensorData .flatMap(new TemperatureAlertFunction(1.1)) // set the maximum parallelism for this operator .setMaxParallelism(1024) 33 Setting the max parallelism ??? Vasiliki0 码力 | 41 页 | 4.09 MB | 1 年前3
共 20 条
- 1
- 2













