Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020right then. The next morning, in the shower, I came up with the solution. When I arrived at Chandy's office, he was waiting for me with the same solution. http://lamport.azurewebsites.net/pubs/pubsAB6HicbVBNSwMxEJ34WetX1aOXYBE 9lV0R1FvRi8cqri20S8m2TY0yS5JVihL/4EXDype/Une/Dem7R609cHA470ZuZFqeDGet43WlpeWV1bL2UN7e2d3Yre/uPJsk0ZQFNRKJbETFMcMUCy61g AB6HicbVBNSwMxEJ34WetX1aOXYBE 9lV0R1FvRi8cqri20S8m2TY0yS5JVihL/4EXDype/Une/Dem7R609cHA470ZuZFqeDGet43WlpeWV1bL2UN7e2d3Yre/uPJsk0ZQFNRKJbETFMcMUCy61g 0 码力 | 81 页 | 13.18 MB | 1 年前3
Flink如何实时分析Iceberg数据湖的CDC数据数据p 、支持L时更新数据,时效性佳。 2、CK加速,适合OLAP分析。 方案评估 优点 、cedKudup群,a较小众。维护 O本q。 2、H HDFS / S3 / OSS 等D裂。数据c e,且KAO本不如S3 / OSS。 3、Kudud批量P描不如3ar4u1t。 4、不支持增量SF。 h点 直接D入CDC到Hi2+分析 、流程能E作 2、Hi2+存量数据不受增量数据H响。 2、每次数据D致都要 MERGE 存量数据 。T+ 方GT新3R效性差。 3、不M持CR1ps+rt。 缺点 SCaDk + )=AFa IL()(数据 MER,E .NTO GE=DE US.N, chan>=E ON GE=DE.GE=D.< = chan>=E.GE=D.< WHEN MAT(HE) AN) +LA,=H)H THEN )ELETE WHEN MAT(HE) AN) +LA,<>H)H (chan>=E.GE=D.<, chan>=E.a<S1a2k + D+/4a CaCDC数据 1、仅依t S1a2k+D+/4a,架构简e。 2、无在k服务。l护和运nS本低。 2、D存存储,Ca速O快。 3、方便上S3 OSS,超高性价比。 方案s估 优点 1、增量和全量表割p,时效性不足。 2、r计和l护额外hChang+ S+4表。 3、计算引擎并非原g支UCDC。 4、不支U实时U13+24。 0 码力 | 36 页 | 781.69 KB | 1 年前3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020availability, recovery semantics, and guarantees Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics much input do we need to re-play? How expensive is it to re-construct the state? How fast can we de-duplicate output? Vasiliki Kalavri | Boston University 2020 Gap Recovery • Restart the operator output results • Watermarks are used to identify duplicate output tuples and trim the secondary’s output queue • Negligible recovery time • High overhead since all processing is duplicated Ni0 码力 | 49 页 | 2.08 MB | 1 年前3
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020“conservative” “liberal” If you like “Inside job” you might also like “The Bourne Identity” What’s the cheapest way to reach Zurich from London through Berlin? These are the top-10 relevant results Stefani, Lorenzo De, et al. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. TKDD 2017. https://www.kdd.org/ kdd2016/papers/files/rfp0465-de-stefaniA.pdf Further0 码力 | 72 页 | 7.77 MB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020maintained by a task and used to compute results: a local or instance variable that is accessed by a task’s business logic Operator state is scoped to an operator task, i.e. records processed by the same parallel parallel tasks of the same or different operators Keyed state is scoped to a key defined in the operator’s input records • Flink maintains one state instance per key value and partitions all records with the state as regular objects on TaskManager’s heap • Low read/write latencies • OutOfMemoryError if large grows too large, GC pauses • Checkpoints sent to JobManager's heap memory, i.e. the state is lost in0 码力 | 24 页 | 914.13 KB | 1 年前3
Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020unsubscribe advertise(): information reg. future events Publish/Subscribe Systems 17 Pub/Sub levels of de-coupling • Space: interacting parties do not need to know each other • Publishers do not know who service. When a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages. 25 Log-structured brokers Logs as message brokers • In typical message brokers different partitions to different consumers • Limits on maximum parallelism: the number of the topic's partitions • Processing delays: If a message is slow to process, this delays processing of subsequent0 码力 | 33 页 | 700.14 KB | 1 年前3
PyFlink 1.15 Documentationalso useful when deploying a PyFlink job to production when there are massive Python dependencies. It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management requires Python 3.6 or above with PyFlink pre-installed to be available in your local environment. It’s suggested to use Python virtual environments to set up your local Python environment. See Create a Python 3.6 or above with PyFlink pre-installed to be available on the nodes of the standalone cluster. It’s sug- gested to use Python virtual environments to set up the Python environment. See Create a Python0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationalso useful when deploying a PyFlink job to production when there are massive Python dependencies. It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management requires Python 3.6 or above with PyFlink pre-installed to be available in your local environment. It’s suggested to use Python virtual environments to set up your local Python environment. See Create a Python 3.6 or above with PyFlink pre-installed to be available on the nodes of the standalone cluster. It’s sug- gested to use Python virtual environments to set up the Python environment. See Create a Python0 码力 | 36 页 | 266.80 KB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020last month: • s of those are unique • d of those are duplicates • no query was issued more than twice 9 How many of Ted’s queries will be in the 1/10th sample, S? Each of the s unique queries has selected: • an expected number of s/10 of those queries will be in S. ??? Vasiliki Kalavri | Boston University 2020 10 How many of Ted’s queries will be in the 1/10th sample, S? What about the duplicates How many of Ted’s queries will be in the 1/10th sample, S? What about the duplicates, d ? Probability that both occurrences are in S: Pa = 1/10 * 1/10 = 1/100 => d/100 will appear in S twice. ??? Vasiliki0 码力 | 74 页 | 1.06 MB | 1 年前3
Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020Conditions might change • State is accumulated over time 2 events/s time rate decrease events/s time throughput degradation events/s time rate increase : input rate : throughput ??? Vasiliki Kalavri with sources S1, S2, … Sn and rates λ1, λ2, … λn identify the minimum parallelism πi per operator i, such that the physical dataflow can sustain all source rates. S1 S2 λ1 λ2 S1 S2 π=2 π=3 Kalavri | Boston University 2020 12 ??? Vasiliki Kalavri | Boston University 2020 12 effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow 1 2 3 6 5 4 ??? Vasiliki0 码力 | 93 页 | 2.42 MB | 1 年前3
共 25 条
- 1
- 2
- 3













