PyFlink 1.15 Documentation
Python environment: Python 3.6 or above with PyFlink pre-installed is required on the nodes of the standalone cluster. It is suggested to use Python virtual environments to set up the Python environment. Install Python virtual environments on all the cluster nodes in advance: you can install a Python virtual environment, with PyFlink pre-installed, on all the cluster nodes before submitting PyFlink jobs, so that a Python virtual environment is already available at /path/to/venv on all the cluster nodes of the standalone cluster. It should be noted that the options -pyclientexec and -pyexec are also required.
(36 pages, 266.77 KB, 1 year ago)
PyFlink 1.16 Documentation
Python environment: Python 3.6 or above with PyFlink pre-installed is required on the nodes of the standalone cluster. It is suggested to use Python virtual environments to set up the Python environment. Install Python virtual environments on all the cluster nodes in advance: you can install a Python virtual environment, with PyFlink pre-installed, on all the cluster nodes before submitting PyFlink jobs, so that a Python virtual environment is already available at /path/to/venv on all the cluster nodes of the standalone cluster. It should be noted that the options -pyclientexec and -pyexec are also required.
(36 pages, 266.80 KB, 1 year ago)
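Both PyFlink entries above mention the -pyexec and -pyclientexec options for pointing a job at a pre-installed virtual environment. A minimal, hedged configuration sketch of the equivalent config keys (the /path/to/venv path is the placeholder from the excerpt; this assumes a PyFlink 1.15+ installation):

```python
# Sketch only: the config keys below correspond to the -pyexec and
# -pyclientexec CLI options mentioned in the excerpts.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Python interpreter used by UDF workers on the cluster nodes (-pyexec):
t_env.get_config().set("python.executable", "/path/to/venv/bin/python")
# Python interpreter used on the client side when compiling the job (-pyclientexec):
t_env.get_config().set("python.client.executable", "/path/to/venv/bin/python")
```

Submitting with the CLI flags instead of config keys achieves the same effect.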
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics, Spring 2020 (Vasiliki Kalavri, Boston University)
A simple system model: stream sources feed primary nodes N1…NK, each with an input queue and an output queue; secondary nodes N'1…N'K stand by to take over, and outputs feed other applications. What is the main disadvantage of this approach? Upstream Backup: upstream nodes act as backups for their downstream operators by logging tuples in their output queues until the downstream operators have completely processed them.
(49 pages, 2.08 MB, 1 year ago)
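The excerpt above describes upstream backup: an upstream operator retains sent tuples until the downstream acknowledges them, so they can be replayed to a recovery node. A minimal, hypothetical sketch (class and method names are illustrative, not from any real system):

```python
from collections import deque

class UpstreamBackup:
    """Sketch: an upstream operator logs emitted tuples in its output
    queue until the downstream acknowledges full processing."""
    def __init__(self):
        self.output_queue = deque()
        self.next_seq = 0

    def send(self, record):
        seq = self.next_seq
        self.next_seq += 1
        self.output_queue.append((seq, record))  # logged for possible replay
        return seq

    def ack(self, seq):
        # Downstream fully processed everything up to `seq`: trim the log.
        while self.output_queue and self.output_queue[0][0] <= seq:
            self.output_queue.popleft()

    def replay(self):
        # On downstream failure, re-send the still-unacknowledged tuples.
        return [r for _, r in self.output_queue]
```

The disadvantage hinted at in the excerpt follows directly: unacknowledged tuples consume memory at the upstream node, and recovery time grows with the replay log.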
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Buffers are granted according to the rate at which the consumer recycles them. Remote exchange: if tasks run on different worker nodes, a buffer can be recycled as soon as it is on the TCP channel. If there is no buffer on …
(43 pages, 2.42 MB, 1 year ago)
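The excerpt above alludes to credit-based flow control, where the sender may only emit as many buffers as the receiver has free. A simplified sketch of the idea (not Flink's actual implementation; names are illustrative):

```python
class CreditBasedChannel:
    """Sketch: one credit per free receive buffer; the sender blocks
    when credits run out, which is how backpressure propagates."""
    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers
        self.in_flight = []

    def try_send(self, buffer):
        if self.credits == 0:
            return False              # backpressure: sender must wait
        self.credits -= 1
        self.in_flight.append(buffer)
        return True

    def recycle(self):
        # Receiver processed a buffer and returns it to the pool,
        # announcing a new credit to the sender.
        self.in_flight.pop(0)
        self.credits += 1
```

Credits are replenished at exactly the rate the consumer recycles buffers, matching the excerpt's description.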
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Revisiting the basics: in a dataflow graph, operators are nodes and data channels are edges; channels have FIFO semantics, and streams of data elements flow continuously. Safety: avoid starvation (every data item is eventually processed) and ensure each worker is qualified (if load balancing is applied after fission, each instance must be capable of processing …)
(54 pages, 2.83 MB, 1 year ago)
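The safety condition above, that each worker must be "qualified" after fission, is commonly met by key-based routing: replicating an operator is safe for keyed state as long as all items with the same key reach the same replica. A minimal illustrative sketch (the function name is hypothetical):

```python
def fission_route(key, num_replicas):
    """Sketch: key-based routing after fission. All items with the same
    key hash to the same replica, so per-key state stays on one worker."""
    return hash(key) % num_replicas

# Distribute a small stream across 3 replicas of a keyed operator:
replicas = {i: [] for i in range(3)}
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
for key, value in stream:
    replicas[fission_route(key, 3)].append((key, value))
```

Every occurrence of a given key lands in exactly one replica's list, so each instance only ever needs the state for its own keys.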
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Dataflow worker activities: each worker receives a message, deserializes it, processes it, serializes the result, and sends it downstream. The DS2 model: collect metrics per configurable observation window W (activity durations per worker, records processed Rprc, and records pushed to output Rpsd) and capture dependencies through the …
(93 pages, 2.42 MB, 1 year ago)
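From the metrics listed above, DS2 derives a "true" processing rate by normalizing observed throughput by the time a worker was actually busy (deserialization + processing + serialization), and scales parallelism to match a target rate. A simplified sketch of that rule (function names are illustrative, not DS2's API):

```python
import math

def true_rate(records_processed, useful_time):
    """Throughput normalized by busy (useful) time, per instance."""
    return records_processed / useful_time

def optimal_parallelism(target_rate, records_processed, useful_time):
    # Number of instances whose aggregate true capacity covers the target.
    return math.ceil(target_rate / true_rate(records_processed, useful_time))
```

For example, an instance that processed 1000 records while busy for 0.5 s has a true rate of 2000 records/s; sustaining 5000 records/s then needs 3 instances.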
Scalable Stream Processing - Spark Streaming and Flink
Moving the connection object from the driver to the worker:

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection()  // executed at the driver
      rdd.foreach { record =>
        connection.send(record)               // executed at the worker
      }
    }

(113 pages, 1.22 MB, 1 year ago)
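The Scala snippet above shows the pitfall the slide's comments point at: the connection is created on the driver but used on the workers, so it would have to be serialized and shipped. The usual remedy is to create the connection inside foreachPartition, once per partition, on the worker. A hedged pure-Python illustration of that pattern (Connection and send_partition are hypothetical stand-ins; plain lists play the role of RDD partitions):

```python
class Connection:
    """Hypothetical stand-in for a non-serializable network connection."""
    def __init__(self):
        self.sent = []
    def send(self, record):
        self.sent.append(record)
    def close(self):
        pass

created = []

def send_partition(partition):
    conn = Connection()      # built on the worker, once per partition
    created.append(conn)
    for record in partition:
        conn.send(record)
    conn.close()

# Stand-in for rdd.foreachPartition; partitions are plain lists here.
for partition in [[1, 2], [3, 4, 5]]:
    send_partition(partition)
```

One connection per partition amortizes setup cost without ever shipping a live connection across the driver/worker boundary.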
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
With key-based partitioning, no routing history is required and key semantics are preserved (values of the same key are always processed by the same worker), but popular keys cause imbalance across workers. Addressing this: dynamic resource allocation chooses one among n workers by checking the load of each worker and sending the item to the least loaded one, but load checking for every item can be expensive. Partial key grouping maps each key to both choices: the partitioner sends the item to the worker with the currently lowest load. No routing history is required, but state needs to be merged to produce …
(31 pages, 1.47 MB, 1 year ago)
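The partial key grouping described above gives each key two candidate workers (via two hash functions) and routes each item to the currently less loaded of the two. A minimal sketch under those assumptions (class name and load bookkeeping are illustrative):

```python
import hashlib

def _h(key, seed):
    # Deterministic hash with a seed, so each key gets two candidates.
    return int(hashlib.md5(f"{seed}:{key}".encode()).hexdigest(), 16)

class PartialKeyGrouping:
    """Sketch: each key maps to two candidate workers; an item goes to
    the currently less loaded of the two (load tracked locally)."""
    def __init__(self, num_workers):
        self.load = [0] * num_workers

    def route(self, key):
        a = _h(key, 1) % len(self.load)
        b = _h(key, 2) % len(self.load)
        worker = a if self.load[a] <= self.load[b] else b
        self.load[worker] += 1
        return worker
```

A hot key is thus split across at most two workers, which is why, as the excerpt notes, partial per-key state must later be merged.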
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Uniform hash partitioning of state has a high migration cost: when a new node is added, state is shuffled across existing and new nodes, causing random I/O and heavy network communication, so it is not suitable for adaptive applications. Consistent hashing: nodes and data are mapped to a ring (0 to 2^128) using the same hash function.
(41 pages, 4.09 MB, 1 year ago)
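The consistent-hashing ring described above limits migration: when a node joins, only the keys that now fall under the new node move, and every other key keeps its owner. A compact sketch (class name is illustrative; MD5 stands in for the shared hash function):

```python
import bisect
import hashlib

def _hash(x):
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

class HashRing:
    """Sketch of consistent hashing: nodes and keys share one hash space;
    a key is owned by the first node clockwise from its position."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def owner(self, key):
        h = _hash(key)
        # First node with hash >= h, wrapping around the ring.
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Contrast with uniform hashing (hash(key) % n), where changing n remaps almost every key.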
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics, Spring 2020
Consider an undirected, unweighted graph G = (V, E). We want to estimate the distance between any pair of nodes u, v as the length of the shortest path between them. A spanner H of graph G is a subgraph of G. A k-spanner is a graph synopsis that preserves the distances between any pair of nodes up to a factor of k: ∀u, v ∈ V, dG(u, v) ≤ dH(u, v) ≤ k · dG(u, v).
(72 pages, 7.77 MB, 1 year ago)
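A standard way to build such a synopsis, which fits the streaming setting because edges are inspected one at a time, is the greedy spanner: keep an edge only if the spanner built so far cannot already connect its endpoints within k hops. A hedged sketch for unweighted graphs (function names are illustrative):

```python
from collections import deque

def bfs_dist(adj, s, t):
    """Unweighted shortest-path distance in adj; infinity if unreachable."""
    if s == t:
        return 0
    seen, q = {s}, deque([(s, 0)])
    while q:
        u, d = q.popleft()
        for v in adj.get(u, ()):
            if v == t:
                return d + 1
            if v not in seen:
                seen.add(v)
                q.append((v, d + 1))
    return float("inf")

def greedy_spanner(edges, k):
    """Sketch of the greedy k-spanner: keep an edge only if the spanner
    built so far cannot connect its endpoints within k hops."""
    adj, kept = {}, []
    for u, v in edges:
        if bfs_dist(adj, u, v) > k:
            kept.append((u, v))
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return kept
```

Any skipped edge (u, v) already had dH(u, v) ≤ k, which is exactly the stretch bound in the definition above.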
共 12 条













