Group Aggregation - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

logical conjunction • if A is a projection on multiple attributes • if A is an idempotent aggregation Operator separation A A2 A1 Separate operators into smaller computational steps • beneficial Boston University 2020 24 • Cost of Merge = 0.5 • Cost of A = 0.5 • Splitting A allows a pre-aggregation similar to what combiners do in MapReduce Operator separation merge X merge A A X merge micro-batches D-Streams • During an interval, input data received is stored using RDDs • A D-Stream is a group of such RDDs which can be processed using common operators 45 Example • pageViews is a D-Stream

0 码力 | 54 页 | 2.83 MB | 1 年前
3
Scalable Stream Processing - Spark Streaming and Flink

Kinesis, ... TwitterUtils.createStream(ssc, None) KafkaUtils.createStream(ssc, [ZK quorum], [consumer group id], [number of partitions]) 15 / 79 Input Operations - Custom Sources (1/3) ▶ To create a custom express it as the following SQL query. SELECT action, WINDOW(time, "1 hour"), COUNT * FROM events GROUP BY action, WINDOW(time, "1 hour") 61 / 79 Structured Streaming Example (3/3) val inputDF = spark where("id > 10") // using untyped APIs ds.filter(_.id > 10).map(_.action) // using typed APIs // Aggregation df.groupBy("action") // using untyped API ds.groupByKey(_.action) // using typed API // SQL commands

0 码力 | 113 页 | 1.22 MB | 1 年前
3
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020

val sensorData: DataStream[SensorReading] = ... val avgTemp = sensorData .keyBy(_.id) // group readings in 1s event-time windows .window(TumblingEventTimeWindows.of(Time.seconds(1))) .process(new functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a window: • They maintain a single value as

0 码力 | 35 页 | 444.84 KB | 1 年前
3
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

measurements analysis • Monitoring applications • Complex filtering and alarm activation • Aggregation of multiple sensors and joins • Examples • Real-time statistics, e.g. weather maps • Monitor activity analysis • Visualization and aggregation • impressions, clicks, transactions, likes, comments • Analytics on user activity • Filtering, aggregation, joins with static data (e.g. user profile

0 码力 | 34 页 | 2.53 MB | 1 年前
3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

given the constraint that system throughput matches the data input rate • In the case of known aggregation functions, results can be scaled using approximate query processing techniques, where accuracy a data stream manager. (VLDB ’03) • N. Tatbul and S. Zdonik. Window-aware load shedding for aggregation queries over data streams. (VLDB’06) • N. Tatbul, U. Çetintemel, and S. Zdonik. Staying fit:

0 码力 | 43 页 | 2.42 MB | 1 年前
3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Operator replicates a stream, commonly to be used as input to multiple downstream operators. • Group by / Partition Operators split a stream into sub-streams according to a function or the event contents Kalavri | Boston University 2020 CQL GroupBy Example Select IStream(Count(*)) From S1 [Rows 1000] Group By S1.B Count the number or events in the last 1000 rows for each value of B 20 Vasiliki Kalavri University 2020 Some queries expressed using aggregates are monotonic: SELECT DeptNo FROM empl GROUP BY DeptNo HAVING SUM(empl.Sal) > 10000 The introduction of a new empl can only expand the set

0 码力 | 53 页 | 532.37 KB | 1 年前
3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate separate processes or on separate machines. If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances. If all the consumer

0 码力 | 26 页 | 3.33 MB | 1 年前
3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

topic T can be viewed as becoming a member of a group T. • Publishing an event on topic T can be viewed as broadcasting the event to all members of group T. • Topic hierarchies allow topic organization

0 码力 | 33 页 | 700.14 KB | 1 年前
3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

as ranges • On restore, reads are sequential within each key-group, and often across multiple key-groups • The metadata of key-group-to-subtask assignments are small. No need to maintain explicit

0 码力 | 41 页 | 4.09 MB | 1 年前
3
Apache Flink的过去、现在和未来

---------------------------- Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; Flink 在阿里的服务情况集群规模超万台状态数据 PetaBytes 事件处理十万亿/天峰值能力 17亿/秒 Flink 的过去 offline Real-time

0 码力 | 33 页 | 3.36 MB | 1 年前
3

共 12 条前往

页

分类

语言

格式

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Scalable Stream Processing - Spark Streaming and Flink

Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Apache Flink的过去、现在和未来