Common Prometheus Alerting Rules

1. Detecting server restarts

  - alert: CentosServiceRestart
    expr: time() - node_boot_time_seconds < 180
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Instance restarted"
      description: "Instance was restarted, uptime < 3 min"

  - alert: WindowsServiceRestart
    expr: time() - windows_system_system_up_time < 180
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Instance restarted"
      description: "Instance was restarted, uptime < 3 min"

2. High memory usage

  - alert: InstanceMemUsageHigh
    # used % = 100 - (available / total) * 100
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 98
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Memory usage high"
      description: "Memory usage above 98% (current usage: {{ $value }}%)"

  - alert: WinInstanceMemUsageHigh
    expr: 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 > 98
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Instance memory usage high"
      description: "Instance memory usage above 98% (current usage: {{ $value }}%)"

3. High CPU usage

  - alert: CPUUsageHigh
    # average idle fraction per instance; 100 minus that is the busy percentage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance,region) * 100) > 90
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage high"
      description: "CPU usage above 90% (current usage: {{ $value }}%)"

  - alert: WinCpuUsage
    expr: 100 - (avg by (instance,region) (irate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 90
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Instance CPU usage high"
      description: "Instance CPU usage above 90% (current usage: {{ $value }}%)"

4. High disk usage

  - alert: DiskUsageHigh
    expr: 100 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 > 95
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Disk usage high"
      description: "Disk {{ $labels.mountpoint }} usage above 95% (current usage: {{ $value }}%)"

  - alert: WinDiskUsageHigh
    expr: 100 - (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) * 100 > 95
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance disk usage high"
      description: "Instance disk {{ $labels.volume }} usage above 95% (current usage: {{ $value }}%)"

5. Network throughput

  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance,device,region) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput in"
      description: "Host network interfaces are receiving too much data (> 30 MB/s) (current speed: {{ $value }} MB/s)"

  - alert: WinHostUnusualNetworkThroughputIn
    expr: sum by (instance,nic,region) (irate(windows_net_bytes_received_total{nic=~".*VirtIO.*"}[2m])) / 1024 / 1024 > 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput in"
      description: "Host network interfaces are probably receiving too much data (> 30 MB/s) (current speed: {{ $value }} MB/s)"

  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance,device,region) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput out"
      description: "Host network interfaces are sending too much data (> 30 MB/s) (current speed: {{ $value }} MB/s)"
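
The list above covers inbound traffic on both Linux and Windows but outbound traffic only on Linux. A Windows outbound rule could be sketched by mirroring the inbound one. The alert name is made up here, and the metric name windows_net_bytes_sent_total is an assumption about what your windows_exporter version exposes, so check it against the exporter's /metrics output before relying on it.

```yaml
  # Sketch only: alert name is hypothetical; the metric name
  # windows_net_bytes_sent_total is assumed, verify it on your exporter.
  - alert: WinHostUnusualNetworkThroughputOut
    expr: sum by (instance,nic,region) (irate(windows_net_bytes_sent_total{nic=~".*VirtIO.*"}[2m])) / 1024 / 1024 > 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput out"
      description: "Host network interfaces are probably sending too much data (> 30 MB/s) (current speed: {{ $value }} MB/s)"
```

The nic matcher is copied from the inbound rule and only fits hosts whose adapters are named like VirtIO devices; adjust or drop it for other hardware.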

6. TCP connections

  - alert: TCPEstablishedNum
    expr: node_netstat_Tcp_CurrEstab > 2000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Too many established TCP connections"
      description: "Established TCP connection count exceeds 2000 (current count: {{ $value }})"

7. Network transmit errors

  - alert: HostNetworkTransmitErrors
    expr: increase(node_network_transmit_errs_total[5m]) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host Network Transmit Errors"
      #description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%v" $value }} transmit errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      description: "Interface {{ $labels.device }} had transmit errors in the last five minutes (current error count: {{ $value }})"
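
The transmit-error rule has a natural receive-side counterpart. A minimal sketch, assuming node_exporter also exports node_network_receive_errs_total on your hosts; the alert name and the threshold of 2 are simply copied from the transmit rule and are not part of the original list.

```yaml
  # Sketch only: receive-side mirror of HostNetworkTransmitErrors.
  # The alert name is hypothetical; the threshold is reused from above.
  - alert: HostNetworkReceiveErrors
    expr: increase(node_network_receive_errs_total[5m]) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host Network Receive Errors"
      description: "Interface {{ $labels.device }} had receive errors in the last five minutes (current error count: {{ $value }})"
```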

8. Disk read/write latency

  - alert: HostUnusualDiskReadLatency
    # seconds spent reading / reads completed = average seconds per read; * 1000 converts to ms
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) * 1000 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk read latency"
      description: "Disk read latency is growing (read operations > 100 ms) (current latency: {{ $value }} ms)"

  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) * 1000 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk write latency"
      description: "Disk write latency is growing (write operations > 100 ms) (current latency: {{ $value }} ms)"

9. High disk I/O

  - alert: DiskIOTimePerSec
    expr: irate(node_disk_io_time_seconds_total[1m]) * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host disk I/O time high"
      description: "Disk {{ $labels.device }} I/O time above 60% (current rate: {{ $value }}%)"
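
Each entry above is a list item, so to be loaded by Prometheus it has to sit under the rules key of a rule group. A minimal sketch of such a file, where the file name alerts.yml and the group name host-alerts are placeholders chosen for illustration:

```yaml
# alerts.yml (hypothetical file name)
groups:
  - name: host-alerts            # hypothetical group name
    rules:
      # paste the "- alert:" entries from this article here, indented to this level
      - alert: TCPEstablishedNum
        expr: node_netstat_Tcp_CurrEstab > 2000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Too many established TCP connections"
```

The file is then referenced from the rule_files list in prometheus.yml. Running promtool check rules alerts.yml before reloading should catch indentation and expression syntax errors.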
