Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020structure but its order • FIFO, priority producer consumer queue 6 Message brokers Message broker: a system that connects event producers with event consumers. • It receives messages from the connects multiple producers to multiple consumers by organizing messages into topics. 7 Message Broker producer producer producer consumer consumer consumer - messages not removed after consumption handling, crashes or disconnects • Broker responsible for message durability • Asynchronous communication, i.e. producer only needs to receive ack from broker 9 Communication patterns (I) Load0 码力 | 33 页 | 700.14 KB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020| Boston University 2020 All data maintained by a task and used to compute results: a local or instance variable that is accessed by a task’s business logic Operator state is scoped to an operator Keyed state is scoped to a key defined in the operator’s input records • Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains of a key attribute: • For each distinct value of the key attribute, Flink maintains one state instance. • The keyed state instances of a function are distributed across all parallel tasks of the0 码力 | 24 页 | 914.13 KB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020by an operator instance in deserialization, processing, and serialization activities. • excludes any time spent waiting on input or on output • amounts to the time an operator instance runs for if executed drops DS2 converges in 2 steps for both operators 1 2 Transient underpovisioning by 1 instance 28 DS2 scaling actions on Apache Flink wordcount 28 DS2 scaling actions on Apache Flink wordcount0 码力 | 93 页 | 2.42 MB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on sent first, then M1 will have a lower offset than M2 and appear earlier in the log. • A consumer instance sees records in the order they are stored in the log. • For a topic with replication factor N0 码力 | 26 页 | 3.33 MB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020processed • Ensure each worker is qualified: if load balancing is applied after fission, each instance must be capable of processing each item and have access to necessary state • Establish placement skew, e.g. when there exist popular keys • if there is skew, throughput is bounded by the instance that receives the highest load Load balancing Profitability A2 A1 split A2 A1 split ???0 码力 | 54 页 | 2.83 MB | 1 年前3
PyFlink 1.15 Documentationand simply selecting a column does not trigger the computation but it returns a Column Expression instance. [15]: from pyflink.table.expressions import col type(table.id)==type(col('id')) [15]: True 1 1 2 [17]: table.select(col('id')).to_pandas() [17]: id 0 1 1 2 Assign new Column Expression instance. [18]: table.add_columns(col('data').upper_case.alias('upper_data')).to_pandas() [18]: id data0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationand simply selecting a column does not trigger the computation but it returns a Column Expression instance. [15]: from pyflink.table.expressions import col type(table.id)==type(col('id')) [15]: True 1 1 2 [17]: table.select(col('id')).to_pandas() [17]: id 0 1 1 2 Assign new Column Expression instance. [18]: table.add_columns(col('data').upper_case.alias('upper_data')).to_pandas() [18]: id data0 码力 | 36 页 | 266.80 KB | 1 年前3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020state for a particular key and all future events with this key must be routed to the same parallel instance • Some kind of hashing is typically used • Maintaining routing tables or an index for all key0 码力 | 41 页 | 4.09 MB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020on streams • Operators are data-parallel • distributed workers (threads) execute one parallel instance of one of more operators on disjoint data partitions 36 Vasiliki Kalavri | Boston University0 码力 | 45 页 | 1.22 MB | 1 年前3
共 9 条
- 1













