Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020Kalavri | Boston University 2020 Let h be a hash function that maps each stream element into M = log2N bits, where N is the domain of input elements: For each element x, let rank(x) be the number maximum value of rank(.) seen so far. ̂n = 2R Claim: The maximum observed rank is a good estimate of log2n. In other words, the estimated number of distinct elements is equal to: ??? Vasiliki Kalavri | Kalavri | Boston University 2020 10 Stochastic averaging Use one hash function to simulate many by splitting the hash value into two parts ??? Vasiliki Kalavri | Boston University 2020 10 We split the input0 码力 | 69 页 | 630.01 KB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020?? Vasiliki Kalavri | Boston University 2020 24 • Cost of Merge = 0.5 • Cost of A = 0.5 • Splitting A allows a pre-aggregation similar to what combiners do in MapReduce Operator separation merge0 码力 | 54 页 | 2.83 MB | 1 年前3
PyFlink 1.15 Documentationthere are any problems, you could perform the following checks. Check the logging messages in the log file to see if there are any problems: # Get the installation directory of PyFlink python3 -c "import /path/to/python/site-packages/pyflink # Check the logging under the log directory ls -lh /path/to/python/site-packages/pyflink/log # You will see the log file as following: (continues on next page) 1.1. Getting previous page) # -rw-r--r-- 1 dianfu staff 45K 10 18 20:54 flink-dianfu-python-B-7174MD6R-1908. ˓→local.log Besides, you could also check if the files of the PyFlink package are consistent. It may happen that0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationthere are any problems, you could perform the following checks. Check the logging messages in the log file to see if there are any problems: # Get the installation directory of PyFlink python3 -c "import /path/to/python/site-packages/pyflink # Check the logging under the log directory ls -lh /path/to/python/site-packages/pyflink/log # You will see the log file as following: (continues on next page) 1.1. Getting previous page) # -rw-r--r-- 1 dianfu staff 45K 10 18 20:54 flink-dianfu-python-B-7174MD6R-1908. ˓→local.log Besides, you could also check if the files of the PyFlink package are consistent. It may happen that0 码力 | 36 页 | 266.80 KB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020subscription's queue of messages. 25 Log-structured brokers Logs as message brokers • In typical message brokers, once a message is consumed it is deleted • Log-based message brokers take a different partitioned) log • A log is an append-only sequence of records on disk • a producer generates messages by simply appending them to the log and a consumer receives messages by reading the log sequentially sequentially 27 Partitions and offsets • A log can be partitioned, so that each partition can be read and written independently of others • a topic is a set of partitions • Within each partition, every0 码力 | 33 页 | 700.14 KB | 1 年前3
Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. An offset is a sequential id number time to free up disk space. 22 Vasiliki Kalavri | Boston University 2020 23 Partitions allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on will have a lower offset than M2 and appear earlier in the log. • A consumer instance sees records in the order they are stored in the log. • For a topic with replication factor N, we will tolerate0 码力 | 26 页 | 3.33 MB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020identifiers of the last tuples they received. • In upstream backup, operators need to track and log tuple provenance / result lineage. Can such techniques be efficiently implemented? What if more0 码力 | 49 页 | 2.08 MB | 1 年前3
Flink如何实时分析Iceberg数据湖的CDC数据CDsBPHNRHML, UDHFGR 1RO6 mysOl_AHLlMF; FHRGSA.BMm/TDPTDPHBa/ElHLI-BCB-BMLLDBRMPs Flink 原生支持 Change Log Stream A C D E F G INSERT DELETE UPDATE INSERT DELETE UPDATE INSERT F3152 + Icebe7g0 码力 | 36 页 | 781.69 KB | 1 年前3
Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020algorithm terminates D contains an item x if its actual frequency is fx > ε*N Worst case: O( 1 ε * log(εN)) counters ??? Vasiliki Kalavri | Boston University 2020 The power of two choices • Instead,0 码力 | 31 页 | 1.47 MB | 1 年前3
共 9 条
- 1













