Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages ??? Vasiliki Kalavri | Boston University 2020 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages Naive solution: maintain a hash table ??? Vasiliki University 2020 How can we count the number of distinct elements seen so far in a stream? 3 Example use-case: Distinct users visiting one or multiple webpages Naive solution: maintain a hash table Convert0 码力 | 69 页 | 630.01 KB | 1 年前3
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020proportion of the stream, e.g. 1/10th 7 search enginequery stream Example use-case: Web search user behavior study Q: How many queries did users repeat last month? ??? Vasiliki we can store 1/10th of the stream, we select a stream element i with probability 10%. • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i we can store 1/10th of the stream, we select a stream element i with probability 10%. • We can use a random generator that produces an integer ri between 0 and 9. We then select an input element i 0 码力 | 74 页 | 1.06 MB | 1 年前3
监控Apache Flink应用程序(入门)13 4.7 仪表盘示例 Figure 4: Event Time Lag per Subtask of a single operator in the topology. In this case, the watermark is lagging a few seconds behind for each subtask. caolei – 监控Apache Flink应用程序(入门) network buffers, which can be configured7. • Mapped memory is usually close to zero as Flink does not use memory-mapped files. In a containerized environment you should additionally monitor the overall memory RocksDB allocates a considerable amount of memory off heap. To understand how much memory RocksDB might use, you can checkout this blog post8 by Stefan Richter. 4.13.1.1 Key Metrics Metric Scope Description0 码力 | 23 页 | 148.62 KB | 1 年前3
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020large, GC pauses • Checkpoints sent to JobManager's heap memory, i.e. the state is lost in case of failure • Use only for development and debugging purposes! FsStateBackend • Stores state on TaskManager’s de/serialization • Checkpoints state to a remote file system and supports incremental checkpoints • Use for applications with very large state Which backend to choose? 9 Vasiliki Kalavri | Boston University state This is the state of the current key (sensor id) Vasiliki Kalavri | Boston University 2020 Use keyed state to store and access state in the context of a key attribute: • For each distinct value0 码力 | 24 页 | 914.13 KB | 1 年前3
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020Window sensor readings Vasiliki Kalavri | Boston University 2020 In the DataStream API, you can use the time characteristic to tell Flink how to define time when you are creating windows. The time characteristic streaming execution environment val env = StreamExecutionEnvironment.getExecutionEnvironment // use event time for the application env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) Vasiliki Kalavri | Boston University 2020 Time-based window assigners for the most common windowing use cases: • They assign an element based on its event-time timestamp or the current processing time0 码力 | 35 页 | 444.84 KB | 1 年前3
PyFlink 1.15 Documentationdeploying a PyFlink job to production when there are massive Python dependencies. It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details --python /path/to/python/executable venv The virtual environment needs to be activated before to use it. To activate the virtual environment, run: source venv/bin/activate That is, execute the activate conda create --name venv python=3.8 -y The conda virtual environment needs to be activated before to use it. To activate the conda virtual environment, run: 4 Chapter 1. How to build docs locally pyflink-docs0 码力 | 36 页 | 266.77 KB | 1 年前3
PyFlink 1.16 Documentationdeploying a PyFlink job to production when there are massive Python dependencies. It’s supported to use Python virtual environment in your PyFlink jobs, see PyFlink Dependency Management for more details --python /path/to/python/executable venv The virtual environment needs to be activated before to use it. To activate the virtual environment, run: source venv/bin/activate That is, execute the activate conda create --name venv python=3.8 -y The conda virtual environment needs to be activated before to use it. To activate the conda virtual environment, run: 4 Chapter 1. How to build docs locally pyflink-docs0 码力 | 36 页 | 266.80 KB | 1 年前3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020split merge split When might this be beneficial? ??? Vasiliki Kalavri | Boston University 2020 • Use equivalence transformation rules if the language allows • selection operations are commutative Fused operators can share the address space but use separate threads of control • avoid communication cost without losing pipeline parallelism • use a shared buffer for communication • Fused filters small time intervals • Keep intermediate state in memory • Use Spark's RDDs instead of replication • Parallel recovery mechanism in case of failures 44 input stream time-based micro-batches D-Streams0 码力 | 54 页 | 2.83 MB | 1 年前3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020is called a sequence, of length n, of tuples from R. The empty sequence [ ] has length 0. We use t ∈ S to denote that, for some 1 ≤ i ≤ n, ti = t. 23 Vasiliki Kalavri | Boston University 2020 Model of departments that satisfy this query However this sum query cannot be expressed without the use of aggregates! 31 Non-blocking SQL Vasiliki Kalavri | Boston University 2020 SQL extensions and University 2020 SQL extensions for streams Why SQL-based approaches? • Ideally, we would like to use the same language for querying both streaming and static data. Requirements (or why SQL is not0 码力 | 53 页 | 532.37 KB | 1 年前3
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020addresses • the collection of IP addresses accessing a web server 12 With some practical value for use-cases with append-only data It preserves all history without the option to discard old events Vasiliki connected components of accounts in a stream of financial transactions? What synopsis would you use to compute: 32 Vasiliki Kalavri | Boston University 2020 Issues with synopses • They are lossy punctuations • window fires, post becomes inactive 41 Vasiliki Kalavri | Boston University 2020 case class Reading(id: String, time: Long, temp: Double) object MaxSensorReadings { def main(args:0 码力 | 45 页 | 1.22 MB | 1 年前3
共 21 条
- 1
- 2
- 3













