
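  • 0. Data ingestion and alerting flow
  • 1. Prometheus
    • 1.1 Prometheus server
      • 1.1.1 Edit the configuration file: prometheus.yml
      • 1.1.2 Validate the configuration, then start the service (on Windows, double-click the exe file)
      • 1.1.3 Open the web UI at `http://localhost:9090`
      • 1.1.4 PromQL queries at `http://localhost:9090/graph`
    • 1.2 Exporters (collectors)
      • 1.2.1 Edit the Prometheus configuration file prometheus.yml to add a scrape job
    • 1.3 Alertmanager
      • 1.3.1 Edit the configuration file alertmanager.yml
      • 1.3.2 Edit the Prometheus configuration file prometheus.yml to add alerting rules
      • 1.3.3 Validate the configuration, then start the service (on Windows, double-click the exe file)
      • 1.3.4 Open the web UI at `http://localhost:9093`
      • 1.3.5 Verify the received email
  • 2. Grafana
    • 2.1 Download and install
    • 2.2 Setup and usage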




  • Monitoring and alerting: Prometheus is an open-source systems monitoring and alerting toolkit.
  • Time-series storage with optional labels: Prometheus collects and stores its metrics as time series data (each metric is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels).

1.1 Prometheus server


1.1.1 Edit the configuration file: prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
#  alertmanagers:
#  - static_configs:
#    - targets:
#      - ""

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
#rule_files:
#- "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: [""]

  - job_name: "alertmanager"
    static_configs:
      - targets: [""]

  - job_name: "linux-c7"
    static_configs:
      - targets: [""]

  - job_name: "redis"
    static_configs:
      - targets: [""]

  - job_name: "es"
    static_configs:
      - targets: [""]

1.1.2 Validate the configuration, then start the service (on Windows, double-click the exe file)
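The validate-then-start step can also be scripted. The sketch below assumes the `promtool` and `prometheus` binaries from the Prometheus release archive are on `PATH` and that the config file is named prometheus.yml; the helper names are illustrative only. It builds the `promtool check config` command and runs it when the tool is available.

```python
import shutil
import subprocess


def check_config_command(config_path="prometheus.yml"):
    """Build the promtool command that validates a Prometheus config file."""
    return ["promtool", "check", "config", config_path]


def start_command(config_path="prometheus.yml"):
    """Build the command that starts Prometheus with an explicit config file."""
    return ["prometheus", f"--config.file={config_path}"]


def validate(config_path="prometheus.yml"):
    """Run promtool if it is on PATH and report whether the config is valid.

    Returns None when promtool cannot be found, so callers can tell
    "not checked" apart from "checked and invalid".
    """
    if shutil.which("promtool") is None:
        return None
    result = subprocess.run(check_config_command(config_path))
    return result.returncode == 0
```

Calling `validate()` before launching the server catches indentation mistakes early, which matters because a malformed prometheus.yml prevents the server from starting.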

1.1.3 Open the web UI at http://localhost:9090

Check the status of each exporter (the scrape endpoint addresses): Status --> Targets
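The same information shown on the Targets page is available from the Prometheus HTTP API at `/api/v1/targets`. A minimal sketch, assuming the server runs at `http://localhost:9090`; the function names are illustrative:

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Download the raw target list from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Map each (job, scrape URL) pair to its health ("up", "down" or "unknown")."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "")
        summary[(job, target["scrapeUrl"])] = target["health"]
    return summary
```

`summarize_targets(fetch_targets())` gives a quick way to spot jobs whose endpoints are unreachable without opening the browser.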

View the metrics collected by an exporter (for example, the disk metric node_filesystem_free_bytes).

1.1.4 PromQL queries at http://localhost:9090/graph
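Expressions typed into the graph page can also be sent to the instant-query endpoint `/api/v1/query`. The sketch below assumes the default address `http://localhost:9090`; the helper names are made up for illustration. It builds the request URL and extracts the values from an instant-vector response.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(expr, base_url="http://localhost:9090"):
    """Build the URL for an instant query; the expression is URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr, base_url="http://localhost:9090"):
    """Send the query to a running Prometheus server and return the JSON body."""
    with urllib.request.urlopen(instant_query_url(expr, base_url)) as resp:
        return json.load(resp)


def vector_values(payload):
    """Return (labels, value) pairs from an instant-vector result.

    Prometheus encodes each sample as [timestamp, "value-as-string"].
    """
    return [
        (series["metric"], float(series["value"][1]))
        for series in payload["data"]["result"]
    ]
```

For example, `vector_values(run_query('node_filesystem_free_bytes{mountpoint="/"}'))` would list the free bytes of the root filesystem for every scraped host.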

1.2 Exporters (collectors)

The following uses node_exporter as an example; it listens on port 9100 by default.

[root@c71 ~ ]# tar -xf node_exporter-1.4.0.linux-amd64.tar.gz -C /opt
[root@c71 ~ ]# cd /opt/node_exporter-1.4.0.linux-amd64/
[root@c71 node_exporter-1.4.0.linux-amd64]# ./node_exporter --help
      --web.listen-address=":9100"
                                 Address on which to expose metrics and web interface.
      --web.telemetry-path="/metrics"
                                 Path under which to expose metrics.
      --web.disable-exporter-metrics
                                 Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
      --web.max-requests=40      Maximum number of parallel scrape requests. Use 0 to disable.
      --collector.disable-defaults
                                 Set all collectors to disabled by default.
      --web.config=""            [EXPERIMENTAL] Path to config yaml file that can enable TLS or authentication.
      --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt        Output format of log messages. One of: [logfmt, json]
      --version                  Show application version.
[root@c71 node_exporter-1.4.0.linux-amd64]# ./node_exporter &
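Once the exporter is running, its `/metrics` endpoint serves plain text in the Prometheus exposition format. As a rough illustration of what Prometheus reads on each scrape, here is a deliberately simplified parser (it ignores escape sequences inside label values and any trailing timestamp); the function name is hypothetical.

```python
import re

# metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body of `http://<host>:9100/metrics` returns one tuple per sample, which makes it easy to confirm that a metric such as node_filesystem_free_bytes is actually exposed.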

1.2.1 Edit the Prometheus configuration file prometheus.yml to add a scrape job

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "linux-c7"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: [""]

1.3 Alertmanager

1.3.1 Edit the configuration file alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  resolve_timeout: 5m # Resolve timeout; the default is 5m
  smtp_smarthost: 'smtp.163.com:25' # SMTP server (host:port)
  smtp_from: 'user@163.com' # Sender address
  smtp_auth_username: 'user@163.com' # SMTP account name
  smtp_auth_password: 'xxx' # Mailbox authorization code (log in to 163 Mail, enable the SMTP service, and obtain the code)
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'

# Template definitions
#  - 'template/*.tmpl'

receivers:
  - name: 'email'
    email_configs: # Email settings
      - to: 'user@163.com' # Address that receives the alerts
        #html: '{{ template "test.html" . }}' # Template for the email body
        #headers: { Subject: "[WARN] Alert email"} # Subject of the alert email

1.3.2 Edit the Prometheus configuration file prometheus.yml to add alerting rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - ""

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  - "rules/*.yml"

Edit rules/node_alert.yml

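# groups: the list of alerting rule groups
groups:
  # name: name of the rule group
  - name: Host monitoring
    # rules: the rules in this group
    rules:
      # alert: name of the alert
      - alert: Disk usage alert
        # expr: the alert condition. Fires when usage of the root filesystem exceeds 10%
        expr: 100 - (node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 10
        # for: how long the condition must hold before the alert fires (1 minute here); 0 means fire immediately
        for: 1m
        # labels: extra labels attached to the alert
        labels:
          # severity: alert level, for example warning or critical
          severity: warning
        # annotations: descriptive text included in the notification
        annotations:
          # Label values can be referenced through templates
          summary: "Instance {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is too high" # custom summary
          description: "{{ $labels.instance }} : {{ $labels.job }} : partition {{ $labels.mountpoint }} is more than 10% used (current value: {{ $value }})"

  - name: Exporter status monitoring
    rules:
      - alert: Node down alert # alert name
        expr: up == 0 # alert condition; see the PromQL documentation for the syntax
        for: 2m # how long the condition must hold before the alert is sent
        labels: # labels
          team: node
        annotations: # detailed explanation of the alert
          summary: "{{$labels.instance}}: has been down"
          description: "Job: {{ $labels.job }}, node IP: {{ $labels.instance }}, status: down" # custom description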
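To make the arithmetic of the disk rule concrete, the following sketch reproduces the expression `100 - (free / size * 100) > 10` for a single filesystem. The function names and the sample byte counts are illustrative; Prometheus itself evaluates the expression per label set and additionally applies the `for` duration.

```python
def disk_usage_percent(free_bytes, size_bytes):
    """Percentage of the filesystem that is in use."""
    return 100 - (free_bytes / size_bytes * 100)


def should_fire(free_bytes, size_bytes, threshold=10):
    """True when usage is above the threshold used in the rule."""
    return disk_usage_percent(free_bytes, size_bytes) > threshold
```

With a 10% threshold the alert fires on almost any real system, which is convenient for testing the mail path; a production rule would typically use a higher value such as 80.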

Restart the Prometheus server, then check that the alerting rules were loaded (click Alerts, or Status --> Rules).

1.3.3 Validate the configuration, then start the service (on Windows, double-click the exe file)

1.3.4 Open the web UI at http://localhost:9093

1.3.5 Verify the received email
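If no rule is firing yet, the mail path can be exercised by posting an alert directly to the Alertmanager v2 API (`POST /api/v2/alerts`). This is a sketch under the assumption that Alertmanager listens on `http://localhost:9093`; the alert name, instance, and helper names are invented for the test.

```python
import json
import urllib.request


def build_test_alert(alertname="TestAlert", instance="example:9100", severity="warning"):
    """Build the JSON body for the alerts endpoint: a list of alert objects."""
    return [{
        "labels": {"alertname": alertname, "instance": instance, "severity": severity},
        "annotations": {"summary": f"{instance}: test alert sent by hand"},
    }]


def post_alerts(alerts, base_url="http://localhost:9093"):
    """Send the alerts to a running Alertmanager and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return resp.status
```

After `post_alerts(build_test_alert())`, the alert should appear on the Alertmanager page, and with the route shown earlier an email should follow once `group_wait` (30s) has elapsed.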



  • Metric and log queries: Grafana enables you to query, visualize, alert on, and explore your metrics, logs, and traces wherever they are stored.
  • Visualization: turn your time-series database (TSDB) data into insightful graphs and visualizations.

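2.1 Download and install

On Windows, run the exe installer; the service starts automatically after installation (default port 3000; open ip:port in a browser and set a password).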

  • Configuration file: D:\soft\GrafanaLabs\grafana\conf\defaults.ini

# Protocol (http, https, h2, socket)
protocol = http

# The ip address to bind to, empty will bind to all interfaces
http_addr =

# The http port to use
http_port = 3000

2.2 Setup and usage

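    1. Add Prometheus as a data source (it already receives data from the exporters)
    2. Create a dashboard (generated from a Grafana template)

       Dashboard --> Import --> enter a dashboard ID or import a JSON file, then click Load

    3. Adjust the displayed panels and metrics as needed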
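The data source step can also be done through Grafana's HTTP API (`POST /api/datasources`) instead of the UI. The sketch below assumes Grafana at `http://localhost:3000`, Prometheus at `http://localhost:9090`, and basic authentication with an admin account; the data source name and helper names are illustrative.

```python
import base64
import json
import urllib.request


def datasource_payload(name="Prometheus", prometheus_url="http://localhost:9090"):
    """JSON body describing a Prometheus data source proxied by Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def create_datasource(user, password, grafana_url="http://localhost:3000"):
    """Register the data source with a running Grafana instance."""
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(datasource_payload()).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Scripting this step is mainly useful when the same setup has to be repeated on several machines; for a single installation the UI is just as quick.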

