Prometheus Deep Dive - Monitoring. At scale.
Prometheus Deep Dive: Introduction | Intro | 2.0 to 2.2.1 | 2.4 - 2.6 | Beyond | Outro
Prometheus 101: Inspired by Google’s Borgmon. Time series database: int64 timestamp, float64 value. Ecosystem of instrumentation & exporters.
Cloudy with a chance of buzzwords: So it’s built with highly dynamic environments in mind. It’s the second project to ever join CNCF and the de facto standard in cloud-native monitoring. Kubelets, sidecars, microservices, ALL the cloud-native. But it’s a monolithic application ...why?
Richard Hartmann & Frederic Branczyk, @TwitchiH & @fredbrancz
34 pages | 370.20 KB | 1 year ago
OpenMetrics - Standing on the shoulders of Titans
Richard Hartmann, RichiH@{freenode,OFTC,IRCnet} ... debian,richih}.org, @TwitchiH
OpenMetrics: Introduction | Quick intro | OpenMetrics | Outro
Prometheus: What’s Prometheus? You can’t talk about OpenMetrics without mentioning Prometheus. Show of hands: Who has heard of Prometheus?
Problem statement, after Prometheus: Prometheus has become a de facto standard in cloud-native metric monitoring. Ease of exposing data has led to an explosion in ...
21 pages | 84.83 KB | 1 year ago
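Design, Evolution, and Practice of Bilibili's (B站) Unified Monitoring System
Has the characteristics of a modern time series database • Active project with a mature ecosystem. Conclusion • prometheus • Supports labels of arbitrary dimensions • CNCF foundation. metric • 400k+/s metric collection • 10k+ monitoring targets • 10+ prometheus nodes. Current state: • Performance • High availability • Distributed • Cost of use. Problems: performance problems • Local SSD.
[Diagrams: prometheus HA within an IDC (server1, server2, server3); prometheus federation across IDC1 and IDC2, with filtered data and reduced precision]
Suggestion: reduce the cost of use (agent, prometheus).
Make alerts meaningful first. Readable: • Time • Source • Rule • Impact • Status. Correct. Valuable: • Detects problems • Correctly reflects reality.
Case 1. Alert rule: service A slow request volume > 10k/s. This is a fixed threshold, so the alert threshold has to be adjusted as traffic changes (marked "wrong"). Suggestion: alert rule: service A slow request ratio > 80%.
Case 2. Alert rule: disk available capacity < 10%. Alert rule: disk capacity is predicted to saturate in 3 hours.
34 pages | 650.25 KB | 1 year ago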
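The two alerting cases in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's implementation: the function names `slow_ratio_alert` and `extrapolate_linear` and all sample values are hypothetical, and the extrapolation is a plain least-squares line fit (PromQL offers `predict_linear` for this kind of rule).

```python
from typing import List, Tuple


def slow_ratio_alert(slow_per_s: float, total_per_s: float,
                     threshold: float = 0.8) -> bool:
    """Case 1 (hypothetical helper): alert on the slow-request ratio,
    so the rule does not need retuning when traffic grows or shrinks."""
    if total_per_s <= 0:
        return False  # no traffic, nothing to judge
    return slow_per_s / total_per_s > threshold


def extrapolate_linear(samples: List[Tuple[float, float]],
                       horizon_s: float) -> float:
    """Case 2 (hypothetical helper): fit a least-squares line through
    (timestamp_s, value) samples and return the value predicted
    horizon_s seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return samples[-1][1]  # a single timestamp gives no trend
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


# Made-up data: free disk space in GB, sampled once per hour.
free_gb = [(0.0, 100.0), (3600.0, 70.0), (7200.0, 40.0)]
predicted = extrapolate_linear(free_gb, horizon_s=3 * 3600)
disk_alert = predicted <= 0  # predicted to be full within 3 hours
```

With a fixed threshold of 10k/s, 12k slow requests out of 1M per second would fire even though only 1.2% of requests are slow; the ratio rule stays quiet in that situation and fires only when most requests are slow.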
Intro to Prometheus - With a dash of operations & observability
Prometheus: Introduction | Background | Operations & observability | Outro
Prometheus 101: Inspired by Google’s Borgmon. Time series database: uint64 millisecond timestamp, float64 value. Instrumentation & exporters.
Sanity & sleep: If it’s not actionable, it’s not an alert. If it’s not urgent, it’s not an alert. Important but non-urgent incidents are handled during business hours. Predict your usage so you add capacity during business hours. If there’s no playbook, it does not go into production. If a service does not have proper SLOs and alerts, it does not go into ...
19 pages | 63.73 KB | 1 year ago
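PromQL: From Beginner to Expert (PromQL 从入门到精通)
... is 10s, while the other machines are collected every 30s.
With a range query in the Table view, we can see directly the raw monitoring data as reported and the exact time each sample was reported (especially useful when troubleshooting data-collection problems). In the Graph view, the returned data depends on the step parameter: if the query passes step = 10 to the time series database, the graph has one point every 10s; with step = 20, one point every 20s. The time interval of the returned data depends on ...
... 0)*60= 23595160.8
In the example above, my test data has no missing data points. If data points are missing, the extrapolation becomes more complicated; for details, see this article: https://mp.weixin.qq.com/s/9aiqrtLTnzysV9olMx-rzA
rate: While we are at it, let's look at the rate function. The increase function computes ...
... there is an estimation algorithm that assumes the data falling into each bucket is uniformly distributed. That is, of the 150 requests in the 10~20 interval, the request with the lowest latency took 10s and the one with the highest latency took 20s. The 900th request overall is the 50th request in this interval, so its latency is approximately: (20-10)*(50/150)+10 ≈ 13.3s. This assumes the data is uniformly distributed within each bucket. Suppose that, of the 150 requests in the 10~20 bucket, the request with the highest latency actually took 11 seconds, whereas here ...
16 pages | 2.77 MB | 1 year ago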
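The bucket estimate quoted in this excerpt can be reproduced with a short Python sketch. It is a simplified illustration of linear interpolation inside a histogram bucket, not Prometheus's actual `histogram_quantile` code: the function name `estimate_quantile` is hypothetical, and the cumulative counts (850 requests at or below 10s, 1000 in total, quantile 0.9) are assumptions chosen so that the 900th request is the 50th of 150 in the 10~20 bucket, as in the excerpt.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound. Values inside a bucket are assumed to be uniformly
    distributed, and the first bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    rank = q * total  # e.g. 0.9 * 1000 -> the 900th request
    lower, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            # Linear interpolation between the bucket boundaries.
            return lower + (upper - lower) * (rank - prev_count) / in_bucket
        lower, prev_count = upper, cumulative
    return buckets[-1][0]


# Assumed data: 850 requests took <= 10s, 1000 requests took <= 20s.
p90 = estimate_quantile(0.9, [(10.0, 850), (20.0, 1000)])
print(round(p90, 2))
```

The result matches (20-10)*(50/150)+10. If the slowest request in that bucket really took 11 seconds, the estimate would overshoot, which is the weakness of the uniform-distribution assumption that the excerpt goes on to describe.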
4 [Wang Qiong (王琼)] Evolution of Container Monitoring Architecture, Wang Qiong, YY Live (YY直播)
... com/prometheus-vs-victoriametrics-benchmark-on-node-exporter-metrics-4ca29c75590f
Overall architecture. THANKS!
23 pages | 2.17 MB | 1 year ago