Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Motivating problem: counting distinct users visiting one or multiple webpages. Naive solution: maintain a hash table. Sketch-based solution: convert the stream into a multi-set of uniformly distributed random numbers using a hash function; the more distinct elements we encounter in the stream, the more different hash values we shall see. Example: to count cardinalities up to 1 billion, or 2^30, with an accuracy of 4%, the hash value needs to map elements to M = log2(2^30) = 30 bits. 69 pages | 630.01 KB | 1 year ago
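The hash-based idea above can be made concrete with a small probabilistic counter. The sketch below is a simplified HyperLogLog-style estimator written for illustration, not the course's implementation; the register count (m = 64) and bias constant are assumptions chosen to keep it short.

```python
import hashlib

def estimate_cardinality(items, p=6):
    """Simplified HyperLogLog-style estimate: hash each element to a
    uniformly distributed 64-bit number and, in each of m = 2**p register
    buckets, track the longest run of leading zeros observed."""
    m = 1 << p
    registers = [0] * m
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining bits give the rank
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.709  # bias-correction constant for m = 64
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(estimate_cardinality(range(10_000))))
```

With m = 64 registers the relative error is roughly 1.04/sqrt(64) ≈ 13%; hitting the slides' 4% target needs more registers, while each hash value still only needs about 30 bits to cover 2^30 distinct elements.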
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Operator selectivity: the number of output elements produced per input element; a map operator has a selectivity of 1, i.e., it produces one output element for each input element it processes. Safety conditions for reordering operators A and B: attribute availability (the set of attributes B reads from must be disjoint from the set of attributes A writes to) and commutativity (applying the two operators in either order must produce the same results). 54 pages | 2.83 MB | 1 year ago
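Selectivity can be measured empirically by counting outputs per input. A minimal sketch, where an operator is modeled as a function from one element to a (possibly empty) list of outputs; the operators and sample data are made up for illustration:

```python
def selectivity(op, sample):
    """Average number of output elements produced per input element."""
    outputs = [y for x in sample for y in op(x)]
    return len(outputs) / len(sample)

double = lambda x: [2 * x]                       # map-like: always 1-in / 1-out
keep_even = lambda x: [x] if x % 2 == 0 else []  # filter-like: 1-in / 0-or-1-out

sample = list(range(10))
print(selectivity(double, sample))     # 1.0
print(selectivity(keep_even, sample))  # 0.5
```

When reordering is safe, pushing the operator with the lowest selectivity earlier in the chain reduces the number of elements every downstream operator must process.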
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Basic API concept: Source -> DataStream -> Operator -> DataStream -> Sink (and analogously Source -> DataSet -> Operator -> DataSet -> Sink). Writing a Flink program: 1. bootstrap sources, 2. apply operators, 3. output to sinks. Streaming word count: textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).sum(1).print() — the input "live and let live" is split into the tokens "live", "and", "let", "live". Temperature example: val sensorData = env.addSource(new SensorSource); val maxTemp = sensorData.map(r => Reading(r.id, r.time, (r.temp-32)*(5.0/9.0))).keyBy(_.id).max("temp"). 26 pages | 3.33 MB | 1 year ago
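The Scala word-count pipeline can be mimicked in plain Python to show what the flatMap / map / keyBy / sum stages compute; this is a sketch of the semantics, not Flink API code (it splits on whitespace rather than the Scala example's \W+ regex):

```python
from collections import Counter

def word_count(lines):
    # flatMap: split each line into words; map: pair each word with 1;
    # keyBy + sum: accumulate the running count per word.
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return dict(counts)

print(word_count(["live and let live"]))  # {'live': 2, 'and': 1, 'let': 1}
```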
PyFlink 1.15 Documentation. Local: this page shows you how to set up a PyFlink development environment on your local machine, usually used for local execution or development in an IDE. Set up Python environment: it requires Python 3.6 or above to be pre-installed in your local environment; it's suggested to use Python virtual environments to set up your local Python environment (see "Create a Python virtual environment" for more details). Standalone: the barebone way of deploying Flink; this page shows you how to set up the Python environment and execute PyFlink jobs in a standalone Flink cluster. 36 pages | 266.77 KB | 1 year ago
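The local setup described above amounts to creating a virtual environment and installing the apache-flink package. A minimal sketch — the environment name is arbitrary, and the pip line is left as a comment because it downloads a large package:

```shell
# create and enter an isolated Python environment for PyFlink
python3 -m venv pyflink-env
. pyflink-env/bin/activate
# inside the venv, install the PyFlink version matching the docs:
#   pip install "apache-flink==1.15.*"
python -c "import sys; print(sys.prefix)"
```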
PyFlink 1.16 Documentation. Local: this page shows you how to set up a PyFlink development environment on your local machine, usually used for local execution or development in an IDE. Set up Python environment: it requires Python 3.6 or above to be pre-installed in your local environment; it's suggested to use Python virtual environments to set up your local Python environment (see "Create a Python virtual environment" for more details). Standalone: the barebone way of deploying Flink; this page shows you how to set up the Python environment and execute PyFlink jobs in a standalone Flink cluster. 36 pages | 266.80 KB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink. Transformations: map returns a new DStream by passing each element of the source DStream through a given function; flatMap is similar to map, but each input item can be mapped to zero or more output items. 113 pages | 1.22 MB | 1 year ago
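The map / flatMap distinction can be shown with ordinary Python lists standing in for DStream micro-batches; this is a sketch of the semantics, not the Spark API:

```python
batch = ["live and let live", "stream on"]

# map: exactly one output per input element
# (here each line maps to its list of tokens)
mapped = [line.split() for line in batch]

# flatMap: each input element expands into zero or more output elements,
# and the results are concatenated into one flat stream
flat_mapped = [tok for line in batch for tok in line.split()]

print(mapped)       # [['live', 'and', 'let', 'live'], ['stream', 'on']]
print(flat_mapped)  # ['live', 'and', 'let', 'live', 'stream', 'on']
```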
Streaming in Apache Flink (Committer @ Apache Flink, @SenorCarbone). Contents: DataSet API, DataStream API, concepts, setting up an environment to develop Flink programs, implementing streaming data processing pipelines. Map function example: DataStream<TaxiRide> rides = env.addSource(new TaxiRideSource(...)); DataStream<EnrichedRide> enrichedNYCRides = rides.filter(new RideCleansing.NYCFilter()).map(new Enrichment()); class Enrichment implements MapFunction<TaxiRide, EnrichedRide> { @Override public EnrichedRide map(TaxiRide taxiRide) throws Exception { return new EnrichedRide(taxiRide); } } 45 pages | 3.00 MB | 1 year ago
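The filter-then-map chain above can be sketched in Python with plain classes playing the roles of Flink's FilterFunction and MapFunction. The bounding-box coordinates and record fields are assumptions for illustration, not the course's TaxiRide schema:

```python
class NYCFilter:
    """Keeps only rides whose coordinates fall inside a rough NYC box."""
    def filter(self, ride):
        return -74.3 < ride["lon"] < -73.7 and 40.5 < ride["lat"] < 41.0

class Enrichment:
    """Adds a derived field to each ride that passes the filter."""
    def map(self, ride):
        return {**ride, "enriched": True}

rides = [{"lon": -74.0, "lat": 40.7}, {"lon": 0.0, "lat": 0.0}]
f, m = NYCFilter(), Enrichment()
enriched = [m.map(r) for r in rides if f.filter(r)]
print(enriched)  # [{'lon': -74.0, 'lat': 40.7, 'enriched': True}]
```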
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Windowed maximum: val sensorData = env.addSource(new SensorSource); val maxTemp = sensorData.map(r => Reading(r.id, r.time, (r.temp-32)*(5.0/9.0))).keyBy(_.id).timeWindow(Time.minutes(1)). Setting the time characteristic: object AverageSensorReadings { def main(args: Array[String]) { // set up the streaming execution environment val env = StreamExecutionEnvironment.getExecutionEnvironment ... } }. Windowed minimum: val minTempPerWindow: DataStream[(String, Double)] = sensorData.map(r => (r.id, r.temperature)).keyBy(_._1).timeWindow(Time.seconds(15)).reduce((r1, r2) => (r1._1, r1._2.min(r2._2))). 35 pages | 444.84 KB | 1 year ago
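A tumbling time window assigns each record to the window whose start is its timestamp rounded down to a multiple of the window length. A plain-Python sketch of the 15-second min-temperature computation on synthetic readings (not Flink API code):

```python
from collections import defaultdict

def min_temp_per_window(readings, window_ms=15_000):
    """readings: (sensor_id, timestamp_ms, temperature) tuples.
    Returns {(sensor_id, window_start_ms): min_temperature}."""
    mins = defaultdict(lambda: float("inf"))
    for sensor_id, ts, temp in readings:
        window_start = (ts // window_ms) * window_ms  # tumbling-window assignment
        key = (sensor_id, window_start)
        mins[key] = min(mins[key], temp)              # incremental reduce
    return dict(mins)

readings = [("s1", 1_000, 20.0), ("s1", 14_000, 18.5), ("s1", 16_000, 21.0)]
print(min_temp_per_window(readings))
# {('s1', 0): 18.5, ('s1', 15000): 21.0}
```

Like Flink's reduce on a window, the minimum is folded in incrementally per (key, window) pair rather than buffering all elements.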
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Load shedding as an optimization problem: N is the query network, I the set of input streams with known arrival rates, C the system processing capacity, and H a headroom factor, i.e., the fraction of capacity that can be used at steady state. Selectivity: how many records does the operator produce per record in its input? map: 1 in, 1 out; filter: 1 in, 1 or 0 out; flatMap, join: 1 in, 0, 1, or more out. Cost: how much processing time the operator spends per record. The Load Shedding Road Map (LSRM) is a pre-computed table that contains materialized load shedding plans, ordered by how much load they shed. 43 pages | 2.42 MB | 1 year ago
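The optimization view above boils down to: if the offered load exceeds the usable capacity H·C, drop just enough tuples to bring it back under. A sketch of that arithmetic with symbols loosely matching the slide's C/H notation; the function and its default headroom are assumptions for illustration:

```python
def required_drop_fraction(arrival_rate, cost_per_tuple, capacity, headroom=0.9):
    """Fraction of input tuples to shed so that offered load fits within
    the usable capacity headroom * capacity.

    arrival_rate:   tuples per second on the input stream
    cost_per_tuple: processing seconds consumed per tuple
    capacity:       total processing seconds available per second
    """
    load = arrival_rate * cost_per_tuple
    usable = headroom * capacity
    if load <= usable:
        return 0.0            # system keeps up; shed nothing
    return 1.0 - usable / load

# 1000 tuples/s at 2 ms each = 2.0 seconds of work demanded per second,
# against 0.9 usable seconds -> shed about 55% of the input
print(required_drop_fraction(1000, 0.002, 1.0))
```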
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University). Batch processing assumes we know the entire dataset in advance, e.g., tables stored in a database. A data stream is a data set that is produced incrementally over time, rather than being available in full before its processing begins. Distributed dataflow model: a job is a graph of sources, operators, and sinks; e.g., a Twitter source feeding hashtag extraction, topic counting, and trend detection. Example: val env = StreamExecutionEnvironment.getExecutionEnvironment; val sensorData = env.addSource(new SensorSource); val maxTemp = sensorData.map(r => Reading(r.id, r.time, (r.temp-32)*(5.0/9.0))).keyBy(_.id).max("temp"). 45 pages | 1.22 MB | 1 year ago
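The maxTemp example keeps a running maximum per sensor after converting Fahrenheit to Celsius. A plain-Python sketch of that stateful keyed computation over an in-memory stream (not Flink API code):

```python
def running_max_celsius(readings):
    """readings: (sensor_id, temp_fahrenheit) pairs, in stream order.
    Yields (sensor_id, max_celsius_so_far) after each incoming reading,
    mirroring keyBy(id).max(temp) over a converted stream."""
    state = {}  # per-key state, as a keyed operator would hold it
    for sensor_id, temp_f in readings:
        celsius = (temp_f - 32) * 5.0 / 9.0
        state[sensor_id] = max(state.get(sensor_id, float("-inf")), celsius)
        yield sensor_id, state[sensor_id]

stream = [("s1", 212.0), ("s1", 32.0), ("s2", 98.6)]
print(list(running_max_celsius(stream)))
```

Each output is emitted incrementally: the second "s1" reading (0 °C) does not lower the stored maximum of 100 °C, just as Flink's max would retain the larger value.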
共 21 条
- 1
- 2
- 3













