Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... logical conjunction • if A is a projection on multiple attributes • if A is an idempotent aggregation ...
Operator separation [diagram: A, A1, A2] Separate operators into smaller computational steps • beneficial ...
... • Cost of Merge = 0.5 • Cost of A = 0.5 • Splitting A allows a pre-aggregation similar to what combiners do in MapReduce
Operator separation [diagram: merge, X, A] ...
... micro-batches ... D-Streams • During an interval, the input data received is stored using RDDs • A D-Stream is a group of such RDDs which can be processed using common operators
Example • pageViews is a D-Stream ...
0 credits | 54 pages | 2.83 MB | 1 year ago

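The operator-separation snippet above compares splitting an aggregation to MapReduce combiners. A minimal sketch of that idea, assuming a per-key count as the aggregation; the function names `pre_aggregate` and `merge_and_finalize` and the sample partitions are hypothetical and not taken from the slides:

```python
from collections import Counter

def pre_aggregate(partition):
    """First half of the split operator: partial count per key,
    computed locally on one partition before any data is merged."""
    return Counter(partition)

def merge_and_finalize(partials):
    """Merge step plus second half of the split operator:
    combine the partial counts into the final per-key counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

# Two hypothetical input partitions of keyed events.
partitions = [["a", "b", "a"], ["b", "b", "c"]]

partials = [pre_aggregate(p) for p in partitions]
result = merge_and_finalize(partials)

# Records reaching the merge step with and without pre-aggregation.
raw_records = sum(len(p) for p in partitions)
merged_records = sum(len(p) for p in partials)
```

In this sketch the merge step receives one entry per distinct key per partition instead of one entry per event, which is the saving the combiner analogy points at. It only works because counting can be computed from partial results.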
Scalable Stream Processing - Spark Streaming and Flink
Kinesis, ...
TwitterUtils.createStream(ssc, None)
KafkaUtils.createStream(ssc, [ZK quorum], [consumer group id], [number of partitions])
Input Operations - Custom Sources (1/3) ▶ To create a custom ...
... express it as the following SQL query.
SELECT action, WINDOW(time, "1 hour"), COUNT(*) FROM events GROUP BY action, WINDOW(time, "1 hour")
Structured Streaming Example (3/3)
val inputDF = spark ...
... where("id > 10") // using untyped APIs
ds.filter(_.id > 10).map(_.action) // using typed APIs
// Aggregation
df.groupBy("action") // using untyped API
ds.groupByKey(_.action) // using typed API
// SQL commands ...
0 credits | 113 pages | 1.22 MB | 1 year ago

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
val sensorData: DataStream[SensorReading] = ...
val avgTemp = sensorData
  .keyBy(_.id)
  // group readings in 1s event-time windows
  .window(TumblingEventTimeWindows.of(Time.seconds(1)))
  .process(new ...
... functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a window: • They maintain a single value as ...
0 credits | 35 pages | 444.84 KB | 1 year ago

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... measurements analysis • Monitoring applications • Complex filtering and alarm activation • Aggregation of multiple sensors and joins • Examples • Real-time statistics, e.g. weather maps • Monitor ...
... activity analysis • Visualization and aggregation • impressions, clicks, transactions, likes, comments • Analytics on user activity • Filtering, aggregation, joins with static data (e.g. user profile ...
0 credits | 34 pages | 2.53 MB | 1 year ago

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... given the constraint that system throughput matches the data input rate • In the case of known aggregation functions, results can be scaled using approximate query processing techniques, where accuracy ...
... a data stream manager. (VLDB ’03) • N. Tatbul and S. Zdonik. Window-aware load shedding for aggregation queries over data streams. (VLDB ’06) • N. Tatbul, U. Çetintemel, and S. Zdonik. Staying fit: ...
0 credits | 43 pages | 2.42 MB | 1 year ago

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020
... Operator replicates a stream, commonly to be used as input to multiple downstream operators. • Group by / Partition Operators split a stream into sub-streams according to a function or the event contents ...
CQL GroupBy Example
Select IStream(Count(*)) From S1 [Rows 1000] Group By S1.B
Count the number of events in the last 1000 rows for each value of B ...
Some queries expressed using aggregates are monotonic:
SELECT DeptNo FROM empl GROUP BY DeptNo HAVING SUM(empl.Sal) > 10000
The introduction of a new empl can only expand the set ...
0 credits | 53 pages | 532.37 KB | 1 year ago

Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines. If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer ...
0 credits | 26 pages | 3.33 MB | 1 year ago

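The delivery rule in the snippet above (each record goes to one consumer instance within each subscribing group) can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not Kafka's client API: the `deliver` function and instance names are hypothetical, and records are handed out round-robin per group, whereas Kafka itself assigns topic partitions to the consumers of a group.

```python
from collections import defaultdict
from itertools import cycle

def deliver(records, subscriptions):
    """Simulate delivery of records to consumer instances.

    subscriptions maps a consumer instance name to its consumer group name.
    Returns a dict mapping each instance name to the records it received.
    """
    members = defaultdict(list)
    for instance, group in subscriptions.items():
        members[group].append(instance)

    received = {instance: [] for instance in subscriptions}
    # Assumption: round-robin inside a group stands in for partition assignment.
    pickers = {group: cycle(instances) for group, instances in members.items()}

    for record in records:
        # Every subscribing group sees the record, but only one member of it does.
        for group in pickers:
            received[next(pickers[group])].append(record)
    return received

records = ["r1", "r2", "r3", "r4"]

# Same group: records are load balanced over the two instances.
same_group = deliver(records, {"c1": "g", "c2": "g"})

# Different groups: every instance receives every record.
distinct_groups = deliver(records, {"c1": "g1", "c2": "g2"})
```

With a shared group the two instances split the records between them; with one group per instance the behaviour degenerates to a broadcast, which is how the Kafka documentation describes the case where all consumer instances have different consumer groups.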
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... topic T can be viewed as becoming a member of a group T. • Publishing an event on topic T can be viewed as broadcasting the event to all members of group T. • Topic hierarchies allow topic organization ...
0 credits | 33 pages | 700.14 KB | 1 year ago

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
... as ranges • On restore, reads are sequential within each key-group, and often across multiple key-groups • The metadata of key-group-to-subtask assignments is small. No need to maintain explicit ...
0 credits | 41 pages | 4.09 MB | 1 year ago

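Apache Flink: Past, Present, and Future (Apache Flink的过去、现在和未来)
... Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
Flink at Alibaba • Cluster size: over 10,000 machines • State data: petabytes • Events processed: 10 trillion per day • Peak throughput: 1.7 billion per second
Flink's past • offline • Real-time ...
0 credits | 33 pages | 3.36 MB | 1 year ago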