distributed systems - IT文库 (IT Library): free e-books and documents on programming and IT for developers

Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Analytics Vasiliki (Vasia) Kalavri  vkalavri@bu.edu Spring 2020 1/28: Stream ingestion and pub/sub systems Streaming sources Files, e.g. transaction logs Sockets IoT devices and sensors Databases Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network might process a message out-of-order or twice 14 How can we avoid this? 15 Publish/Subscribe Systems publisher publisher publisher publisher subscriber notify() subscriber notify() subscriber

0 credits | 33 pages | 700.14 KB | 1 year ago
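The publish/subscribe pattern named in the excerpt above (publishers on one side, subscribers that receive a `notify()` call on the other) can be illustrated with a minimal in-memory sketch. The `Broker` class, the topic names, and the `demo` helper are hypothetical names chosen for illustration; they do not come from the slides or from any real broker's API, and the sketch ignores delivery guarantees, ordering across publishers, and persistence.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker (hypothetical): routes messages to topic subscribers."""

    def __init__(self) -> None:
        # topic name -> list of notify callbacks registered for that topic
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, notify: Callable[[str], None]) -> None:
        """Register a callback to be invoked for every message on `topic`."""
        self._subscribers[topic].append(notify)

    def publish(self, topic: str, message: str) -> int:
        """Deliver `message` to all subscribers of `topic`; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for notify in callbacks:
            notify(message)
        return len(callbacks)


def demo() -> List[str]:
    """Two subscribers on 'sensors', one on 'logs'; publishers never see who receives."""
    received: List[str] = []
    broker = Broker()
    broker.subscribe("sensors", lambda m: received.append(f"A:{m}"))
    broker.subscribe("sensors", lambda m: received.append(f"B:{m}"))
    broker.subscribe("logs", lambda m: received.append(f"C:{m}"))
    broker.publish("sensors", "t=21.5")
    broker.publish("logs", "tx committed")
    return received
```

The point of the indirection is that a publisher only names a topic, so subscribers can be added or removed without changing publisher code.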
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this University 2020 Dataflow Streaming Model Vasiliki Kalavri | Boston University 2020 Dataflow Systems Distributed execution Partitioned state Exact results Out-of-order support Single-node execution Synopses Synopses and sketches Approximate results In-order data processing Stream Database Systems 2000 1992 2013 MapReduce 2004 Tapestry NiagaraCQ Aurora TelegraphCQ STREAM Naiad Spark Streaming Samza

0 credits | 45 pages | 1.22 MB | 1 year ago
Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020

about? The design and architecture of modern distributed streaming 4 Fundamental for representing, summarizing, and analyzing data streams Systems Algorithms Architecture and design Scheduling streaming systems • be proficient in using Apache Flink and Kafka to build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 2020

0 credits | 34 pages | 2.53 MB | 1 year ago
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Pipeline: A || B Task: B || C Data: A || A Vasiliki Kalavri | Boston University 2020 ... Distributed execution in Flink ... Identify the most efficient D A B C D ... • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from ... Operator Placement for Stream-Processing Systems. ICDE 2006. • Brian Babcock et al. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003. • Donald Carney et al.

0 credits | 54 pages | 2.83 MB | 1 year ago
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results after University 2020 Further resources • Jeong-Hyon Hwang et al. High-Availability Algorithms for Distributed Stream Processing. (ICDE ’05). • http://cs.brown.edu/research/aurora/hwang.icde05.ha.pdf •

0 credits | 49 pages | 2.08 MB | 1 year ago
PyFlink 1.15 Documentation

environment when submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where the PyFlink jobs run when the job starts up. This is more flexible ... the above example, the Python virtual environment is specified via the option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that the option -pyexec is also required ...

0 credits | 36 pages | 266.77 KB | 1 year ago
PyFlink 1.16 Documentation

environment when submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where the PyFlink jobs run when the job starts up. This is more flexible ... the above example, the Python virtual environment is specified via the option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that the option -pyexec is also required ...

0 credits | 36 pages | 266.80 KB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink

The Course Web Page https://id2221kth.github.io ... Where Are We? ... Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs ... streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources

0 credits | 113 pages | 1.22 MB | 1 year ago
Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Vasiliki Kalavri | Boston University 2020 Batch Graph Processing: Batch graph processing systems, such as Apache Giraph, GraphX, Pregel, operate offline. They are built to analyze a snapshot of ... How would you implement this in Flink? ... Distributed Stream Connected Components 1. partition the edge stream, e.g. by source Id 2. maintain a ... 1145/2627692.2627694 • Stanton, Isabelle, and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. ACM SIGKDD, 2012. https://www.microsoft.com/en-us/research/wp-content/uploads/2012/08/kdd325-stanton

0 credits | 72 pages | 7.77 MB | 1 year ago
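The excerpt above lists the first step of distributed stream connected components (partition the edge stream, e.g. by source id) and is cut off in the middle of the second. The sketch below is an illustration under stated assumptions, not the slide's method: it assumes each partition keeps a disjoint-set (union-find) summary of the edges it has seen and that the partial summaries are merged at the end. `DisjointSet`, `connected_components`, and `num_partitions` are hypothetical names, and the partitions are simulated in a single process rather than run on Flink.

```python
from typing import Dict, Iterable, List, Tuple


class DisjointSet:
    """Union-find summary of the components seen so far (assumed state per partition)."""

    def __init__(self) -> None:
        self.parent: Dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Use the smaller id as the root so results are deterministic.
            self.parent[max(ra, rb)] = min(ra, rb)

    def merge(self, other: "DisjointSet") -> None:
        """Fold another partition's summary into this one."""
        for node in list(other.parent):
            self.union(node, other.find(node))

    def components(self) -> List[List[int]]:
        groups: Dict[int, List[int]] = {}
        for node in list(self.parent):
            groups.setdefault(self.find(node), []).append(node)
        return sorted(sorted(g) for g in groups.values())


def connected_components(edges: Iterable[Tuple[int, int]], num_partitions: int = 2) -> List[List[int]]:
    # Step 1: partition the edge stream by source id (simulated with a modulo).
    partials = [DisjointSet() for _ in range(num_partitions)]
    for src, dst in edges:
        # Step 2 (assumed): each partition updates its local summary per edge.
        partials[src % num_partitions].union(src, dst)
    # Final step (assumed): merge the partial summaries into a global one.
    merged = DisjointSet()
    for partial in partials:
        merged.merge(partial)
    return merged.components()
```

For example, `connected_components([(1, 2), (2, 3), (4, 5), (6, 6), (5, 7)])` groups the vertices into `[[1, 2, 3], [4, 5, 7], [6]]`, even though the edges `(1, 2)` and `(2, 3)` land in different partitions.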
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020

and stream ingestion ... Vasiliki Kalavri | Boston University 2020 –Leslie Lamport The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University ... We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination is eventually captured A snapshot algorithm attempts to capture a coherent global state of a distributed system ... Snapshotting Protocols p1 p2 p3 C m

0 credits | 81 pages | 13.18 MB | 1 year ago
