Prometheus Deep Dive - Monitoring. At scale.
Introduction | Intro | 2.0 to 2.2.1 | 2.4 - 2.6 | Beyond | Outro. Storage results: 15x reduction in memory usage, 6x reduction in CPU usage, 80-100x reduction in disk writes, 5x reduction in on-disk size, 4x reduction in ... has not yet been committed, or has been rolled back, is ignored at query time. We keep write IDs in memory; if we restart or crash, the atomicity of the write ahead log will protect us. Richard Hartmann & ...
34 pages | 370.20 KB | 1 year ago
White Paper on Building an Alert On-Call Event Center (告警OnCall事件中心建设方法白皮书)
incident. For example, take the most basic alert event: host1 raises a cpu_usage_idle alert at timestamp1, and we call this an event. If the condition does not recover, then after some time, say at timestamp1 + 60min, another alert is usually sent, again for host1 and again for the cpu_usage_idle metric. Clearly these two alert events are related and refer to the same problem; only their timestamps differ. Such two ... the unique identifier of the alert. In the example above, suppose the alert rule ID is 32 and the label set is ["__name__=cpu_usage_idle", "host=host1"]; the alert events produced at the two timestamps then have the same hash value. It is computed as: hash(32 + ["__name__=cpu_usage_idle", "host=host1"]) This convergence logic from event to alert, ... events are aggregated into alerts, alerts are aggregated into incidents, and the notification that finally goes out is for the incident. So how exactly does the aggregation work? Alert aggregation: aggregating events into alerts is relatively easy. An algorithm like the following is typically used to work out how different events relate to each other: hash(32 + ["__name__=cpu_usage_idle", "host=host1"]) We will call this value the event hash; events with the same hash are aggregated into a single alert. Merging alerts into incidents is more complex. We currently support rule-based aggregation, and algorithm-based aggregation will follow later: ...
23 pages | 1.75 MB | 1 year ago
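The event-to-alert step described in this excerpt, hash(32 + ["__name__=cpu_usage_idle", "host=host1"]), can be sketched roughly as follows. This is only an illustrative sketch: the white paper does not say which hash function or serialization it uses, so SHA-256, the `|` and `,` separators, and the names `event_hash` and `aggregate_events` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def event_hash(rule_id: int, labels: Dict[str, str]) -> str:
    """Hypothetical event hash: alert rule ID plus the sorted label set.

    SHA-256 and the separator characters are assumptions; the source only
    states that the rule ID and the label set are hashed together.
    """
    # Sort the labels so the same label set always serializes identically.
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(f"{rule_id}|{canonical}".encode("utf-8")).hexdigest()


def aggregate_events(events: List[dict]) -> Dict[str, List[int]]:
    """Group events by their hash; each group stands for one alert."""
    alerts: Dict[str, List[int]] = {}
    for event in events:
        key = event_hash(event["rule_id"], event["labels"])
        # The timestamp is deliberately not part of the key, so repeated
        # firings of the same problem land in the same alert.
        alerts.setdefault(key, []).append(event["timestamp"])
    return alerts


if __name__ == "__main__":
    labels = {"__name__": "cpu_usage_idle", "host": "host1"}
    events = [
        {"rule_id": 32, "labels": labels, "timestamp": 0},
        {"rule_id": 32, "labels": labels, "timestamp": 3600},  # 60 min later
    ]
    for key, timestamps in aggregate_events(events).items():
        print(key[:12], timestamps)
```

Because the timestamp is left out of the key, the two firings for host1 collapse into a single alert, while an event with a different host or rule ID would get its own.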
Intro to Prometheus - With a dash of operations & observability
... not an alert. Important but non-urgent incidents are handled during business hours. Predict your usage so you add capacity during business hours. If there's no playbook, it does not go into production ... observability | Outro. Leverage: One combined system allows for correlation and combination: power usage against service load, optical networks against outside temperature, datacenter power feed load against state ... Dashboards for drill-down, auto-generated PDFs for customers, global SLO statements for sales, usage exports for accounting. If all you have is a hammer... choose your hammer well. Richard Hartmann & ...
19 pages | 63.73 KB | 1 year ago
3 results in total