 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/28: Stream ingestion and pub/sub systems Streaming sources Files, e.g. transaction logs Sockets IoT devices and sensors Databases Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network might process a message out-of-order or twice 14 How can we avoid this? 15 Publish/Subscribe Systems publisher publisher publisher publisher subscriber notify() subscriber notify() subscriber0 码力 | 33 页 | 700.14 KB | 1 年前3 Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/28: Stream ingestion and pub/sub systems Streaming sources Files, e.g. transaction logs Sockets IoT devices and sensors Databases Message queues and brokers Where do stream processors read data from? 2 Challenges • can be distributed • out-of-sync sources may produce out-of-order streams • can be connected to the network might process a message out-of-order or twice 14 How can we avoid this? 15 Publish/Subscribe Systems publisher publisher publisher publisher subscriber notify() subscriber notify() subscriber0 码力 | 33 页 | 700.14 KB | 1 年前3
 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this University 2020 Dataflow Streaming Model Vasiliki Kalavri | Boston University 2020 Dataflow Systems Distributed execution Partitioned state Exact results Out-of-order support Single-node execution Synopses Synopses and sketches Approximate results In-order data processing Stream Database Systems 2000 1992 2013 MapReduce 2004 Tapestry NiagaraCQ Aurora TelegraphCQ STREAM Naiad Spark Streaming Samza0 码力 | 45 页 | 1.22 MB | 1 年前3 Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020Useful in theory for the development of streaming algorithms With limited practical value in distributed, real-world settings Vasiliki Kalavri | Boston University 2020 Cash-Register Model: In this University 2020 Dataflow Streaming Model Vasiliki Kalavri | Boston University 2020 Dataflow Systems Distributed execution Partitioned state Exact results Out-of-order support Single-node execution Synopses Synopses and sketches Approximate results In-order data processing Stream Database Systems 2000 1992 2013 MapReduce 2004 Tapestry NiagaraCQ Aurora TelegraphCQ STREAM Naiad Spark Streaming Samza0 码力 | 45 页 | 1.22 MB | 1 年前3
 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020about? The design and architecture of modern distributed streaming 4 Fundamental for representing, summarizing, and analyzing data streams Systems Algorithms Architecture and design Scheduling streaming systems • be proficient in using Apache Flink and Kafka to build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 20200 码力 | 34 页 | 2.53 MB | 1 年前3 Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020about? The design and architecture of modern distributed streaming 4 Fundamental for representing, summarizing, and analyzing data streams Systems Algorithms Architecture and design Scheduling streaming systems • be proficient in using Apache Flink and Kafka to build end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and industry • Learn from experts with decades of hands-on experience in building and using distributed systems and data management platforms • Have fun! 10 Vasiliki Kalavri | Boston University 20200 码力 | 34 页 | 2.53 MB | 1 年前3
 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020Pipeline: A || B Task: B || C Data: A || A ??? Vasiliki Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient D A B C D ??? Vasiliki Kalavri | Boston University 2020 22 • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from Operator Placement for Stream-Processing Systems. ICDE 2006. • Brian Babcock et. al. Chain : Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003. • Donald Carney et. al.0 码力 | 54 页 | 2.83 MB | 1 年前3 Streaming optimizations	- CS 591 K1: Data Stream Processing and Analytics Spring 2020Pipeline: A || B Task: B || C Data: A || A ??? Vasiliki Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient D A B C D ??? Vasiliki Kalavri | Boston University 2020 22 • Multi-tenancy • in streaming systems that build one dataflow graph for several queries • when applications analyze data streams from Operator Placement for Stream-Processing Systems. ICDE 2006. • Brian Babcock et. al. Chain : Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003. • Donald Carney et. al.0 码力 | 54 页 | 2.83 MB | 1 年前3
 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results after University 2020 Further resources • Jeong-Hyon Hwang et al. High-Availability Algorithms for Distributed Stream Processing. (ICDE ’05). • http://cs.brown.edu/research/aurora/hwang.icde05.ha.pdf •0 码力 | 49 页 | 2.08 MB | 1 年前3 High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020Vasiliki Kalavri | Boston University 2020 Today’s topics • High-availability and fault-tolerance in distributed stream processing • Recovery semantics and guarantees • Exactly-once processing in Apache Beam models State in dataflow computations 3 Vasiliki Kalavri | Boston University 2020 4 Distributed streaming systems will fail • how can we guard state against failures and guarantee correct results after University 2020 Further resources • Jeong-Hyon Hwang et al. High-Availability Algorithms for Distributed Stream Processing. (ICDE ’05). • http://cs.brown.edu/research/aurora/hwang.icde05.ha.pdf •0 码力 | 49 页 | 2.08 MB | 1 年前3
 PyFlink 1.15 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.77 KB | 1 年前3 PyFlink 1.15 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.77 KB | 1 年前3
 PyFlink 1.16 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.80 KB | 1 年前3 PyFlink 1.16 Documentationenvironment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible the above example, the Python virtual environment is specified via option -pyarch. It will be distributed to the cluster nodes during job execution. It should be noted that option -pyexec is also required environment during submitting PyFlink jobs. In this way, the Python virtual environment will be distributed to the cluster nodes where PyFlink jobs are running on during job starting up. This is more flexible0 码力 | 36 页 | 266.80 KB | 1 年前3
 Scalable Stream Processing - Spark Streaming and FlinkThe Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources0 码力 | 113 页 | 1.22 MB | 1 年前3 Scalable Stream Processing - Spark Streaming and FlinkThe Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources streaming sources: 1. Basic sources directly available in the StreamingContext API, e.g., file systems, socket connections. 2. Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter. 3. Custom sources0 码力 | 113 页 | 1.22 MB | 1 年前3
 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020?? Vasiliki Kalavri | Boston University 2020 Batch Graph Processing 9 Batch graph processing systems, such as Apache Graph, GraphX, Pregel, operate offline. They are built to analyze a snapshot of 8 35 How would you implement this in Flink? ??? Vasiliki Kalavri | Boston University 2020 Distributed Stream Connected Components 36 1. partition the edge stream, e.g. by source Id 2. maintain a 1145/2627692.2627694 • Stanton, Isabelle, and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. ACM SIGKDD, 2012. https://www.microsoft.com/en-us/ research/wp-content/uploads/2012/08/kdd325-stanton0 码力 | 72 页 | 7.77 MB | 1 年前3 Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020?? Vasiliki Kalavri | Boston University 2020 Batch Graph Processing 9 Batch graph processing systems, such as Apache Graph, GraphX, Pregel, operate offline. They are built to analyze a snapshot of 8 35 How would you implement this in Flink? ??? Vasiliki Kalavri | Boston University 2020 Distributed Stream Connected Components 36 1. partition the edge stream, e.g. by source Id 2. maintain a 1145/2627692.2627694 • Stanton, Isabelle, and Gabriel Kliot. Streaming graph partitioning for large distributed graphs. ACM SIGKDD, 2012. https://www.microsoft.com/en-us/ research/wp-content/uploads/2012/08/kdd325-stanton0 码力 | 72 页 | 7.77 MB | 1 年前3
 Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020and stream ingestion 12 ??? Vasiliki Kalavri | Boston University 2020 –Leslie Lamport The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University RolOIQjOAEPLqAOt9AHxhweIZXeHOk8+K8Ox+z1iWnmDmAP3A+fwCD9I4G We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination is eventually captured A snapshot algorithm attempts to capture a coherent global state of a distributed system ??? Vasiliki Kalavri | Boston University 2020 Snapshotting Protocols p1 p2 p3 C m0 码力 | 81 页 | 13.18 MB | 1 年前3 Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020and stream ingestion 12 ??? Vasiliki Kalavri | Boston University 2020 –Leslie Lamport The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University RolOIQjOAEPLqAOt9AHxhweIZXeHOk8+K8Ox+z1iWnmDmAP3A+fwCD9I4G We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination is eventually captured A snapshot algorithm attempts to capture a coherent global state of a distributed system ??? Vasiliki Kalavri | Boston University 2020 Snapshotting Protocols p1 p2 p3 C m0 码力 | 81 页 | 13.18 MB | 1 年前3
共 21 条
- 1
- 2
- 3













