 Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020| Boston University 2020 Counting distinct elements 2 ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: webpages ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages hash table ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages0 码力 | 69 页 | 630.01 KB | 1 年前3 Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020| Boston University 2020 Counting distinct elements 2 ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: webpages ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages hash table ??? Vasiliki Kalavri | Boston University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages0 码力 | 69 页 | 630.01 KB | 1 年前3
 Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020Sampling streams 5 ??? Vasiliki Kalavri | Boston University 2020 6 A sample is a set of data elements selected via some random process Samples: the most fundamental synopses input stream add to sample of fixed size, e.g. s elements. 14 ??? Vasiliki Kalavri | Boston University 2020 Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously 2020 Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously maintain a representative fixed-size sample of the stream so far0 码力 | 74 页 | 1.06 MB | 1 年前3 Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020Sampling streams 5 ??? Vasiliki Kalavri | Boston University 2020 6 A sample is a set of data elements selected via some random process Samples: the most fundamental synopses input stream add to sample of fixed size, e.g. s elements. 14 ??? Vasiliki Kalavri | Boston University 2020 Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously 2020 Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. 14 How can we continuously maintain a representative fixed-size sample of the stream so far0 码力 | 74 页 | 1.06 MB | 1 年前3
 Scalable Stream Processing - Spark Streaming and Flinksingle-element RDDs by counting the number of elements in each RDD of the source DStream. ▶ union • Returns a new DStream that contains the union of the elements in two DStreams. 22 / 79 Transformations single-element RDDs by counting the number of elements in each RDD of the source DStream. ▶ union • Returns a new DStream that contains the union of the elements in two DStreams. 22 / 79 Transformations Transformations (4/4) ▶ reduce • Returns a new DStream of single-element RDDs by aggregating the elements in each RDD using a given function. ▶ reduceByKey • Returns a new DStream of (K, V) pairs where the values0 码力 | 113 页 | 1.22 MB | 1 年前3 Scalable Stream Processing - Spark Streaming and Flinksingle-element RDDs by counting the number of elements in each RDD of the source DStream. ▶ union • Returns a new DStream that contains the union of the elements in two DStreams. 22 / 79 Transformations single-element RDDs by counting the number of elements in each RDD of the source DStream. ▶ union • Returns a new DStream that contains the union of the elements in two DStreams. 22 / 79 Transformations Transformations (4/4) ▶ reduce • Returns a new DStream of single-element RDDs by aggregating the elements in each RDD using a given function. ▶ reduceByKey • Returns a new DStream of (K, V) pairs where the values0 码力 | 113 页 | 1.22 MB | 1 年前3
 Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020window operator, you need to specify two window components: • A window assigner determines how the elements of the input stream are grouped into windows. A window assigner produces a WindowedStream (or • A window function is applied on a WindowedStream (or AllWindowedStream) and processes the elements assigned to a window. 5 Keyed vs. non-keyed windows Vasiliki Kalavri | Boston University 2020 Kalavri | Boston University 2020 Window functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a window:0 码力 | 35 页 | 444.84 KB | 1 年前3 Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020window operator, you need to specify two window components: • A window assigner determines how the elements of the input stream are grouped into windows. A window assigner produces a WindowedStream (or • A window function is applied on a WindowedStream (or AllWindowedStream) and processes the elements assigned to a window. 5 Keyed vs. non-keyed windows Vasiliki Kalavri | Boston University 2020 Kalavri | Boston University 2020 Window functions define the computation that is performed on the elements of a window • Incremental aggregation functions are applied when an element is added to a window:0 码力 | 35 页 | 444.84 KB | 1 年前3
 Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 20202020 Select IStream(*) From S1 [Rows 5], S2 [Rows 10] Where S1.A = S2.A Last 5 elements of stream S1 and last 10 elements of S2 stream-to-relation relation-to-relation relation-to-stream CQL Example Vasiliki Kalavri | Boston University 2020 Operator types (I) • Single-Item Operators process stream elements one-by-one. • selection, filtering, projection, renaming. • Logic Operators define rules for two streams by matching elements satisfying a condition • commonly applied on windows • Union operators combine two or more streams without ordering guarantees • elements have to be of the same0 码力 | 53 页 | 532.37 KB | 1 年前3 Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 20202020 Select IStream(*) From S1 [Rows 5], S2 [Rows 10] Where S1.A = S2.A Last 5 elements of stream S1 and last 10 elements of S2 stream-to-relation relation-to-relation relation-to-stream CQL Example Vasiliki Kalavri | Boston University 2020 Operator types (I) • Single-Item Operators process stream elements one-by-one. • selection, filtering, projection, renaming. • Logic Operators define rules for two streams by matching elements satisfying a condition • commonly applied on windows • Union operators combine two or more streams without ordering guarantees • elements have to be of the same0 码力 | 53 页 | 532.37 KB | 1 年前3
 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020operators are nodes, data channels are edges • channels have FIFO semantics • streams of data elements flow continuously along edges Operators • receive one or more input streams • perform tuple-at-a-time Kalavri | Boston University 2020 Operator selectivity 6 • The number of output elements produced per number of input elements • a map operator has a selectivity of 1, i.e. it produces one output element state based on a key attribute • Ensure ordering constraints: if downstream operator expects elements in a particular order, merging should handle that • Avoid deadlocks: if split cannot push data0 码力 | 54 页 | 2.83 MB | 1 年前3 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020operators are nodes, data channels are edges • channels have FIFO semantics • streams of data elements flow continuously along edges Operators • receive one or more input streams • perform tuple-at-a-time Kalavri | Boston University 2020 Operator selectivity 6 • The number of output elements produced per number of input elements • a map operator has a selectivity of 1, i.e. it produces one output element state based on a key attribute • Ensure ordering constraints: if downstream operator expects elements in a particular order, merging should handle that • Avoid deadlocks: if split cannot push data0 码力 | 54 页 | 2.83 MB | 1 年前3
 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020unbounded • we cannot store the entire stream in an accessible way • we have to process stream elements on-the-fly using limited memory 2 Vasiliki Kalavri | Boston University 2020 Properties of data Streams as evolving relations • A stream is interpreted as describing a changing relation. • Stream elements bear a valid timestamp, Vs, after which they are considered valid and they can contribute to the synopsis solution • They are purpose-built and query-specific • different synopsis to count distinct elements than to keep track of top-K events 33 Vasiliki Kalavri | Boston University 2020 Dataflow Streaming0 码力 | 45 页 | 1.22 MB | 1 年前3 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020unbounded • we cannot store the entire stream in an accessible way • we have to process stream elements on-the-fly using limited memory 2 Vasiliki Kalavri | Boston University 2020 Properties of data Streams as evolving relations • A stream is interpreted as describing a changing relation. • Stream elements bear a valid timestamp, Vs, after which they are considered valid and they can contribute to the synopsis solution • They are purpose-built and query-specific • different synopsis to count distinct elements than to keep track of top-K events 33 Vasiliki Kalavri | Boston University 2020 Dataflow Streaming0 码力 | 45 页 | 1.22 MB | 1 年前3
 Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Find all items x in a data stream such that: • freq(x) > δ*N, where N is the number of stream elements • The solution will not contain any item y with frequency: • freq(y) < (δ - ε)*N, for a user-chosen we remove infrequent elements. 6 ??? Vasiliki Kalavri | Boston University 2020 Lossy counting algorithm D = {} // empty list wcur = 1 // first window id N = 0 // elements seen so far Insert step = N + 1 Delete step Iterate over D and remove every element x with fx + εx ≤ wcur Output: elements in D with fx ≥ (δ - ε) * N 7 ??? Vasiliki Kalavri | Boston University 2020 Example 8 1 2 20 码力 | 31 页 | 1.47 MB | 1 年前3 Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Find all items x in a data stream such that: • freq(x) > δ*N, where N is the number of stream elements • The solution will not contain any item y with frequency: • freq(y) < (δ - ε)*N, for a user-chosen we remove infrequent elements. 6 ??? Vasiliki Kalavri | Boston University 2020 Lossy counting algorithm D = {} // empty list wcur = 1 // first window id N = 0 // elements seen so far Insert step = N + 1 Delete step Iterate over D and remove every element x with fx + εx ≤ wcur Output: elements in D with fx ≥ (δ - ε) * N 7 ??? Vasiliki Kalavri | Boston University 2020 Example 8 1 2 20 码力 | 31 页 | 1.47 MB | 1 年前3
 PyFlink 1.15 Documentationunion them. Firstly, you can create a Table from a Python List Object [3]: table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')]) table.get_schema() [3]: root |-- _1: BIGINT |-- _2: STRING Create a pyflink-docs, Release release-1.15 [4]: from pyflink.table import DataTypes table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], DataTypes.ROW([DataTypes.FIELD("id", DataTypes. ˓→TINYINT()), DataTypes [9]: # prepare the catalog # register Table API tables in the catalog old_table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], ['id', 'data']) table_env.create_temporary_view('source_table', old_table)0 码力 | 36 页 | 266.77 KB | 1 年前3 PyFlink 1.15 Documentationunion them. Firstly, you can create a Table from a Python List Object [3]: table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')]) table.get_schema() [3]: root |-- _1: BIGINT |-- _2: STRING Create a pyflink-docs, Release release-1.15 [4]: from pyflink.table import DataTypes table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], DataTypes.ROW([DataTypes.FIELD("id", DataTypes. ˓→TINYINT()), DataTypes [9]: # prepare the catalog # register Table API tables in the catalog old_table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], ['id', 'data']) table_env.create_temporary_view('source_table', old_table)0 码力 | 36 页 | 266.77 KB | 1 年前3
 PyFlink 1.16 Documentationunion them. Firstly, you can create a Table from a Python List Object [3]: table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')]) table.get_schema() [3]: root |-- _1: BIGINT |-- _2: STRING Create a pyflink-docs, Release release-1.16 [4]: from pyflink.table import DataTypes table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], DataTypes.ROW([DataTypes.FIELD("id", DataTypes. ˓→TINYINT()), DataTypes [9]: # prepare the catalog # register Table API tables in the catalog old_table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], ['id', 'data']) table_env.create_temporary_view('source_table', old_table)0 码力 | 36 页 | 266.80 KB | 1 年前3 PyFlink 1.16 Documentationunion them. Firstly, you can create a Table from a Python List Object [3]: table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')]) table.get_schema() [3]: root |-- _1: BIGINT |-- _2: STRING Create a pyflink-docs, Release release-1.16 [4]: from pyflink.table import DataTypes table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], DataTypes.ROW([DataTypes.FIELD("id", DataTypes. ˓→TINYINT()), DataTypes [9]: # prepare the catalog # register Table API tables in the catalog old_table = table_env.from_elements([(1, 'Hi'), (2, 'Hello')], ['id', 'data']) table_env.create_temporary_view('source_table', old_table)0 码力 | 36 页 | 266.80 KB | 1 年前3
共 13 条
- 1
- 2













