Prometheus Monitoring System

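I. Prometheus Overview
II. Time Series
- 1. What is time series data
- 2. Characteristics of time series databases
- 3. Prometheus features
- 4. Prometheus architecture
III. Lab Environment Setup
IV. Installing Prometheus
- 1. Download the software
- 2. Install Prometheus
- 3. The Prometheus web interface
V. Monitoring a Remote Linux Host
VI. Monitoring a Remote MySQL Server
VII. Adding Graphs to Prometheus
- 1. Install the Grafana visualization tool
- 2. Add a data source
- 3. Build graphs from the data source
VIII. Displaying MySQL Monitoring in Grafana
IX. Alerting with Grafana + onealert
X. Learning PromQL
- 10.1 Data model
- - 1. Data types
- - 2. Selectors
XI. Clustering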

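Learning Objectives

Install a Prometheus server
Monitor a remote Linux host by installing node_exporter
Monitor a remote MySQL database by installing mysqld_exporter
Install Grafana
Add a Prometheus data source in Grafana
Add a graph of CPU load in Grafana
Display MySQL monitoring data in Grafana graphs
Set up alerting with Grafana + onealert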

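Task Background
A certain company runs an e-commerce website. Because the business is growing quickly, the company requires business-level monitoring of its existing machines and has assigned the project to the operations team.

Task Requirements
1) Deploy a monitoring server that provides 7x24 real-time monitoring
2) Design the monitoring system around the needs of the business and R&D departments, and propose sensible monitoring items and triggers
3) Build an early-warning mechanism: alert promptly on potential problems and establish a strict handling process
4) Build the alerting system with support for alert levels
Level 1 alerts: phone call
Level 2 alerts: WeChat message
Level 3 alerts: email
5) Solve centralized monitoring of servers located in different regions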

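Task Analysis
Why monitor?
Answer: to collect data in real time, detect problems promptly through alerts, and handle them quickly. The collected data also provides a basis for optimization.
The four elements of monitoring:

What to monitor [host status, services, resources, pages, URLs]
What to monitor with [zabbix-server, zabbix-agent]
When to monitor [7x24, 5x8]
Who to alert [administrators]

Choosing a tool:

zabbix: cross-platform, with graphing, multi-condition alerting, and several API interfaces. It has a very large user base.
prometheus: a monitoring solution for containers, built on numeric time series data.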

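I. Prometheus Overview

Prometheus (written in Go) is an open-source combination of a monitoring system, an alerting system, and a time series database. It is well suited to monitoring Docker containers, and the popularity of Kubernetes (commonly abbreviated k8s) has driven its adoption.
https://prometheus.io/docs/introduction/overview/

We use Prometheus for business monitoring because it has the following advantages:

It supports PromQL (a query language), which aggregates metric data flexibly
It is simple to deploy: a single binary is enough to run it, with no dependency on distributed storage
It is written in Go, so its components integrate easily into projects that are also written in Go
It ships with a built-in web UI that renders time series onto panels through PromQL
It has a rich ecosystem of components: Alertmanager, Pushgateway, exporters, and more

In the Prometheus workflow, the server uses the service discovery method specified in its configuration file to determine which targets to scrape. It then sends HTTP requests to a specific endpoint (the metrics path) on each target (application containers and Pushgateway) and persists the metrics in its own TSDB, which eventually compresses the in-memory time series and writes them to disk. In addition, Prometheus periodically evaluates the configured alerting rules with PromQL and decides whether to send alerts to Alertmanager, which then delivers notifications by email or to an internal company chat group.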

Prometheus metric names may contain only ASCII letters, digits, underscores, and colons, and they follow a set of naming conventions:

Use base units (for example, seconds rather than milliseconds)
Example metric names:
- process_cpu_seconds_total
- http_request_duration_seconds
- node_memory_usage_bytes
- http_requests_total
- foobar_build_info

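Prometheus provides the following basic metric types:

Counter: a metric whose sample value increases monotonically (it only goes up). It is typically used to count things such as requests served or errors.
Gauge: a metric whose sample value can change arbitrarily (up or down). It is typically used for values such as CPU usage or memory usage.
Histogram and Summary: metrics that represent sampled observations over a period of time together with their quantile statistics. They are typically used for request latency or response size.

For example, to get the increase in requests over the last 5 minutes, you can use the following PromQL:

increase(http_requests{host="host1",service="web",code="200",env="test"}[5m])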

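II. Time Series

1. What is time series data

Time series data: data that records changes in the state of a system or device in chronological order.
It has many applications, for example:

A self-driving vehicle must record longitude, latitude, speed, heading, distance to nearby objects, and so on while it runs. The data has to be recorded at every moment for later analysis.
Trajectory data for the vehicles in a given region
Real-time trading data in the traditional securities industry
Real-time operations monitoring data

Prometheus stores data as time series, so first consider what a time series is. Its format resembles (timestamp, value): each point in time has one corresponding value. An everyday example is a weather forecast, such as [(14:00, 27℃), (15:00, 28℃), (16:00, 26℃)], which is a one-dimensional time series. A sequence stored as timestamps and values like this is also called a vector.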

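2. Characteristics of time series databases

Good performance
Relational databases perform poorly on large-scale data. NoSQL stores handle large-scale data better, but still cannot match a time series database.
Low storage cost
Efficient compression algorithms save storage space and reduce I/O.
Prometheus stores time series data very efficiently: each sample takes only about 3.5 bytes. With more than a million time series, a 30-second interval, and 60 days of retention, the data takes a little over 200 GB (according to official figures).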
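
The storage figures quoted above can be sanity-checked with simple arithmetic. The sketch below is illustrative Python, and the function names are made up for this example. It multiplies the number of series by the samples per series and the bytes per sample. With 3.5 bytes per sample the result is roughly 600 GB, noticeably more than 200 GB, so the two quoted figures probably do not describe the same setup (a total near 200 GB corresponds to a little over 1 byte per sample). Both are best read as order-of-magnitude guidance.

```python
# Rough disk-usage estimate for a Prometheus server (illustrative only;
# the helper names are hypothetical, not part of any Prometheus tooling).

def samples_per_series(retention_days: int, scrape_interval_s: int) -> int:
    """Number of samples one series accumulates over the retention period."""
    return retention_days * 24 * 3600 // scrape_interval_s


def estimated_bytes(series: int, retention_days: int,
                    scrape_interval_s: int, bytes_per_sample: float) -> float:
    """Total bytes = series x samples per series x bytes per sample."""
    per_series = samples_per_series(retention_days, scrape_interval_s)
    return series * per_series * bytes_per_sample


if __name__ == "__main__":
    total = estimated_bytes(1_000_000, 60, 30, 3.5)
    print(f"{total / 1e9:.1f} GB")  # 604.8 GB
```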

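3. Prometheus features

A multi-dimensional data model
A flexible query language
No dependency on distributed storage; each server node is autonomous
Time series are collected over HTTP using a pull model
A push model is also supported through an intermediary gateway
Targets are discovered through service discovery or static configuration
Support for many kinds of graphs and dashboards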

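4. Prometheus architecture

Prometheus is also flexible in how it collects data. To collect monitoring data from a target, you first install a collection component on it, called an exporter. The exporter gathers monitoring data on the target and exposes an HTTP endpoint for Prometheus to query. Prometheus collects the data by pulling, which differs from the traditional push model.

Prometheus does, however, offer a way to support the push model: you push your data to the Pushgateway, and Prometheus then pulls it from there. Existing exporters already cover the vast majority of third-party systems, such as Docker, HAProxy, StatsD, and JMX; the official website maintains a list of exporters.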
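
To make the exporter idea concrete, here is a minimal sketch of one, written with only the Python standard library. It is not how node_exporter is implemented. The metric name demo_load1, the port, and the function names are assumptions chosen for illustration. It exposes one gauge at /metrics in the Prometheus text format, which is all a scrape target has to do.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(load1: float) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_load1 1-minute load average (hypothetical demo metric).\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        # os.getloadavg() is available on Unix-like systems only.
        body = render_metrics(os.getloadavg()[0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve(8000) and adding the host with port 8000 as a target in prometheus.yml would let Prometheus pull this gauge like any other exporter.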

III. Lab Environment Setup

Grafana server 10.0.100.128
Prometheus server 10.0.100.129
Monitored server 10.0.100.130

No.	Hostname	IP address	Role
1	grafana.cluster.com	10.0.100.128	grafana
2	prometheus.cluster.com	10.0.100.129	prometheus
3	agent.cluster.com	10.0.100.130	agent

① Prometheus server
② Monitored servers (LB, Web01/Web02, MyCAT, MySQL01/MySQL02)
③ Grafana server (visualization for operations)

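1. Configure hostnames

Set the hostname on each machine (the Prometheus server is shown here)
# hostnamectl set-hostname --static prometheus.cluster.com
Map IP addresses to hostnames on all three machines
# vim /etc/hosts
10.0.100.128 grafana.cluster.com
10.0.100.129 prometheus.cluster.com
10.0.100.130 agent.cluster.com

2. Synchronize time

yum install ntpdate -y
ntpdate cn.ntp.org.cn

3. Disable the firewall and SELinux

systemctl stop firewalld
systemctl disable firewalld
iptables -F
setenforce 0          # switches SELinux to permissive mode until the next reboot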

IV. Installing Prometheus

1. Download the software

https://prometheus.io/download/
https://github.com/prometheus/prometheus/releases/download/v2.21.0/prometheus-2.21.0.linux-amd64.tar.gz

2. Install Prometheus

Step 1: upload the archive to the Linux server
Step 2: extract and install the software

# tar xf prometheus-2.21.0.linux-amd64.tar.gz -C /usr/local/
# mv /usr/local/prometheus-2.21.0.linux-amd64  /usr/local/prometheus

Step 3: start Prometheus

# /usr/local/prometheus/prometheus --config.file="/usr/local/prometheus/prometheus.yml" &
Note: the trailing & runs the process in the background so that it does not occupy the terminal.

Confirm that port 9090 is in use (to verify that Prometheus really started)
lsof -i:9090
ss -naltp | grep 9090

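3. The Prometheus web interface

Open http://<server IP>:9090 in a browser to reach the Prometheus main page.

By default Prometheus monitors its own host. The monitoring endpoint is http://<server IP>:9090/metrics, where the collected data can be viewed (developers can consume it programmatically).

On the main page you can search for the built-in metrics by keyword.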

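V. Monitoring a Remote Linux Host

Different targets require different exporter components (for example, for MySQL or HAProxy).
1. On the remote Linux host (the monitored agent), install the node_exporter component so that Prometheus can collect the system data it gathers
Download: https://prometheus.io/download/

[root@agent ~]# tar xf node_exporter-1.0.1.linux-amd64.tar.gz -C /usr/local/
[root@agent ~]# mv /usr/local/node_exporter-1.0.1.linux-amd64/ /usr/local/node_exporter
The directory contains a single launcher, node_exporter, which can be run directly
[root@agent ~]# nohup /usr/local/node_exporter/node_exporter &
Confirm the port (9100)
[root@agent ~]# lsof -i:9100
or: ss -naltp | grep 9100

2. Open http://<monitored host IP>:9100/metrics in a browser to see the data that node_exporter collects on the monitored host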

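3. Back on the Prometheus server, add a configuration section for the monitored machine

Append the following three lines to the end of the main configuration file
[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml
  - job_name: 'agent'                      # a job name that identifies the monitored machine
    static_configs:
    - targets: ['10.0.100.130:9100']       # the IP of the monitored machine, followed by port 9100
After changing the configuration file, restart the service
[root@prometheus ~]# pkill prometheus
[root@prometheus ~]# lsof -i:9090          # confirm that no process is using the port
[root@prometheus ~]# /usr/local/prometheus/prometheus --config.file="/usr/local/prometheus/prometheus.yml" &
[root@prometheus ~]# lsof -i:9090          # the port is in use again, so the restart succeeded

4. Back in the web interface, click Status, then Targets: one more monitored target now appears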

VI. Monitoring a Remote MySQL Server

In addition to node_exporter, you can collect other information as needed

1. Install and configure the mysqld_exporter component on the monitored agent
Download: https://prometheus.io/download/
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz

Install the mysqld_exporter component
[root@agent ~]# tar xf mysqld_exporter-0.12.1.linux-amd64.tar.gz -C /usr/local/
[root@agent ~]# mv /usr/local/mysqld_exporter-0.12.1.linux-amd64/ /usr/local/mysqld_exporter
Install the MariaDB database and grant privileges
[root@agent ~]# yum -y install mariadb\*
The \* wildcard installs every matching package
[root@agent ~]# systemctl restart mariadb
[root@agent ~]# systemctl enable mariadb
[root@agent ~]# mysql
The exporter collects data through a database account, so create a MySQL account for it
MariaDB [(none)]> grant select,replication client,process ON *.* to 'mysql_monitor'@'localhost' identified by '123';
MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)
Create a MariaDB client configuration file containing the connection username and password (they must match the account granted above)
[root@agent ~]# vim /usr/local/mysqld_exporter/.my.cnf     (create this file by hand)
[client]
user=mysql_monitor
password=123
Start mysqld_exporter
[root@agent ~]# nohup /usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf &
Confirm the port (9104)
lsof -i:9104

2. Back on the Prometheus server, add a configuration section for the monitored MariaDB

Append the following three lines to the end of the main configuration file
[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml
  - job_name: 'agent1_mariadb'             # a job name that identifies the monitored machine
    static_configs:
    - targets: ['10.0.100.130:9104']       # the IP of the monitored machine, followed by port 9104
[root@prometheus ~]# pkill prometheus
[root@prometheus ~]# lsof -i:9090
[root@prometheus ~]# /usr/local/prometheus/prometheus --config.file="/usr/local/prometheus/prometheus.yml" &
[root@prometheus ~]# lsof -i:9090

3. Back in the web interface, click Status, then Targets: MariaDB is now being monitored

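VII. Adding Graphs to Prometheus

1. Install the Grafana visualization tool

Grafana is an open-source metrics analytics and visualization tool. It queries and analyzes collected data, displays it visually, and can also raise alerts.
Website: https://grafana.com

① Install Grafana on the grafana server
Download: https://grafana.com/grafana/download

The rpm package is used here; after downloading, install it directly with yum
[root@grafana ~]# wget https://dl.grafana.com/oss/release/grafana-7.2.0-1.x86_64.rpm
[root@grafana ~]# yum -y install grafana-7.2.0-1.x86_64.rpm
Installing with yum resolves the rpm dependencies
[root@grafana ~]# systemctl start grafana-server
[root@grafana ~]# systemctl enable grafana-server
Confirm the port (3000)
[root@grafana ~]# lsof -i:3000
ss -naltp | grep 3000

② Open http://<Grafana server IP>:3000 in a browser to reach the login page, and sign in with the default username and password (admin / admin)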

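2. Add a data source

① Add the data collected by the Prometheus server to Grafana as a data source, so that Grafana can read the Prometheus data

② Configure the data source: its name, IP address, port, and the GET request method

Prometheus serves data over GET requests

③ Click the gear icon to view the data sources that have been added

3. Build graphs from the data source

① Click the plus sign, then click Dashboard

② Click Graph to add a graph

③ Choose the data to display (small triangle, then Edit)

④ Select the data source and set the query (for example, the 1-minute, 5-minute, and 15-minute load averages)

⑤ Save the graph

Extension: ⑥ Display only the series that match specific conditions
node_load1{instance="10.0.100.130:9100"} filters by IP address
node_load15{job="agent"} filters by the job defined in the yml file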

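VIII. Displaying MySQL Monitoring in Grafana

① On the grafana server, edit the configuration file, then download and install the MySQL monitoring dashboards (a set of json files that can be thought of as monitoring templates written by developers)
Reference: https://github.com/percona/grafana-dashboards

Append the following three lines to the end of the Grafana configuration file
[root@grafana ~]# vim /etc/grafana/grafana.ini
[dashboards.json]
enabled=true
path=/var/lib/grafana/dashboards
[root@grafana ~]# cd /var/lib/grafana/
[root@grafana grafana]# git clone https://github.com/percona/grafana-dashboards.git
[root@grafana grafana]# cp -r grafana-dashboards/dashboards /var/lib/grafana/
Restart the Grafana service
[root@grafana grafana]# systemctl restart grafana-server

② Import the relevant json file in the Grafana web interface
Download: https://github.com/percona/grafana-dashboards

③ Select MySQL.Overview, click Import, and choose the Prometheus data source
Note: these json files look for a data source named Prometheus by default. If you followed the steps above, the data source you created already has that name. If the import reports that the Prometheus data source cannot be found, rename your data source under Configuration, then Data Sources; the first letter, P, must be uppercase.

④ View the result

Error: Panel plugin not found: pmm-singlestat-panel
The required files are all available from the GitHub repository above

mv  pmm-singlestat-panel /var/lib/grafana/plugins/pmm-singlestat-panel
Add the following to Grafana's defaults.ini or grafana.ini to specify the plugin path
vim /etc/grafana/grafana.ini
[plugin.singlestat]
path = /var/lib/grafana/plugins/pmm-singlestat-panel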

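IX. Alerting with Grafana + onealert

Alerting in Prometheus itself requires the Alertmanager component, and the alerting rules have to be written by hand, which is less convenient for operations staff. Grafana + onealert is used here instead.

Note: before setting up alerting, check once more that the clocks on all machines are synchronized.

① First add a Grafana application in onealert (睿象云); applying for a onealert account was covered in the zabbix material

Step 1: Configure the Webhook URL in Grafana
1. In Grafana, create a Notification channel and choose the type Webhook;
2. Selecting Send on all alerts and Include image is recommended for a better Cloud Alert experience;
3. Enter the Webhook URL generated when the application was added into Webhook settings Url;
URL format: http://api.aiops.com/alert/api/event/grafana/v1/xxxxxxxxx/
4. Set Http Method to POST;
5. Click Send Test, then Save;

Step 2: Set the Webhook information

Step 3: Add the configured Webhook Notification Channel to a Grafana Alert
Edit, then Alert

Trigger when the load exceeds 0.5, and set the notification recipients in 睿象云
CPU load test: cat /dev/urandom | md5sum
for i in $(seq 1 $(cat /proc/cpuinfo | grep "physical id" | wc -l)); do dd if=/dev/zero of=/dev/null & done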

X. Learning PromQL

Prometheus provides a web UI for interactive use. Open http://localhost:9090/ and it redirects to the Graph page by default.

10.1 Data model

1. Data types

To learn PromQL, first understand the Prometheus data model. A Prometheus sample consists of a metric name and N labels (N >= 0), as in this example:

promhttp_metric_handler_requests_total{code="200",instance="192.168.0.107:9090",job="prometheus"} 106

The metric name of this sample is promhttp_metric_handler_requests_total, it carries three labels (code, instance, and job), and its value is 106.
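
The structure of such a line can be made explicit with a small parser. This is a simplified sketch rather than a full implementation of the exposition format. It ignores the optional trailing timestamp, does not unescape label values, and the names parse_sample, SAMPLE_RE, and LABEL_RE are invented for this example.

```python
import re

# metric name, optional {label block}, whitespace, value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# one label pair: name="value" (escaped characters are kept as written)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one sample line into (metric name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


if __name__ == "__main__":
    print(parse_sample(
        'promhttp_metric_handler_requests_total'
        '{code="200",instance="192.168.0.107:9090",job="prometheus"} 106'
    ))
```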

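As mentioned above, Prometheus is a time series database: samples with the same metric name and the same labels form one time series. In terms of a traditional database, the metric name is like a table name, the labels are columns, the timestamp is the primary key, and one more float64 column holds the value (Prometheus stores every value as a float64).

Prometheus data falls into four types:

Counter
A Counter counts things, such as requests, completed tasks, or errors. Its value only increases and never decreases.
Gauge
A Gauge is an ordinary number that can go up or down, such as a temperature or memory usage.
Histogram
A Histogram tracks the size of events, such as request latency or response size. Its distinguishing feature is that it groups the recorded observations into buckets and also provides a count and a sum.
Summary
A Summary is very similar to a Histogram and also tracks the size of events. The difference is that it provides quantiles, which divide the observations by percentage. For example, the 0.95 quantile is the value below which 95% of the sampled observations fall.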

2. Selectors

(1) Instant vector selectors
Besides =, you can also use !=, =~, and !~, as shown below:

up{job!="prometheus"}
up{job=~"server|mysql"}
up{job=~"192\.168\.0\.107.+"}
# =~ matches against a regular expression, which must follow RE2 syntax.
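
The semantics of the four matchers can be sketched in a few lines. Two details are easy to miss: regular expression matchers are anchored to the whole label value, and a label that is absent behaves like an empty string. The function name matches is hypothetical, and Python's re module stands in for RE2 here, which is adequate for simple patterns but not identical.

```python
import re


def matches(labels: dict, name: str, op: str, pattern: str) -> bool:
    """Evaluate one label matcher against a label set.

    A missing label is treated as the empty string, and regex matchers
    must match the entire value (hence re.fullmatch, not re.search).
    """
    value = labels.get(name, "")
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")
```

Because of the anchoring, job=~"server|mysql" selects a job named mysql but not one named mysql-1; a pattern such as "mysql.*" would be needed for the latter.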

(2) Range vector selectors, which return all the data within a period of time

http_requests_total[5m]

This expression returns every scraped sample of the HTTP request counter from the last 5 minutes. Note that the result type is a range vector, which cannot be drawn as a line in the Graph view. Range vectors are normally used on Counter metrics together with the rate() or irate() function (note the difference between the two).

# per-second average rate of increase, for slow-moving counters
rate(http_requests_total[5m])

# per-second instant rate of increase, for volatile and fast-moving counters
irate(http_requests_total[5m])
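
The difference between the two functions is easier to see on concrete numbers. The sketch below is a simplified model, not Prometheus's actual implementation: the real rate() also extrapolates toward the window boundaries, which is omitted here. Samples are assumed to be (timestamp in seconds, counter value) pairs, oldest first.

```python
def rate(samples):
    """Average per-second increase over the whole window.

    A value lower than its predecessor is treated as a counter reset,
    i.e. the counter restarted from zero.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def irate(samples):
    """Per-second increase between the last two samples only."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


if __name__ == "__main__":
    # A counter that grows slowly, then jumps in the final interval.
    window = [(0, 100), (15, 130), (30, 160), (45, 400)]
    print(rate(window))   # 300 / 45, about 6.67 per second
    print(irate(window))  # 240 / 15 = 16.0 per second
```

The burst in the last interval dominates irate but is averaged out by rate, which is why irate suits fast-moving counters and rate suits alerting and slow-moving ones.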

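In addition, PromQL supports aggregation operators such as count, sum, min, max, and topk, as well as many built-in functions such as rate, abs, ceil, and floor. The official website has many more examples. It can also be instructive to compare PromQL with SQL: for time series queries, PromQL expressions are usually more concise.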

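XI. Clustering

In Prometheus, a series (a metric with a unique set of labels) together with a (timestamp, value) pair forms a sample. Prometheus keeps scraped samples in memory and, by default, compacts them into a block every 2 hours and persists it to disk. The more samples there are, the more memory Prometheus uses, so in practice labels with high cardinality, such as user IP, ID, or URL, are generally discouraged. Otherwise the number of time series grows multiplicatively, as the product of the number of distinct values of each label.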
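
A quick calculation shows how fast that product grows. The label names and value counts below are invented purely for illustration.

```python
from math import prod


def series_count(distinct_values_per_label: dict) -> int:
    """Upper bound on the series of one metric: the product of the
    number of distinct values each label can take."""
    return prod(distinct_values_per_label.values())


if __name__ == "__main__":
    modest = {"code": 5, "method": 4, "instance": 10}
    print(series_count(modest))                           # 200
    # Adding one high-cardinality label multiplies everything.
    print(series_count({**modest, "user_id": 100_000}))   # 20000000
```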

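Besides keeping the number and size of samples reasonable, you can also control Prometheus memory usage by lowering storage.tsdb.min-block-duration so that data is flushed to disk sooner, and by increasing the scrape interval so that targets are scraped less often.

The scrape_configs section of the configuration file specifies the targets that Prometheus scrapes at run time. Each target instance must implement an endpoint that Prometheus can poll. A standalone program that implements such an endpoint and supplies monitoring samples to Prometheus is generally called an exporter. For example, Node Exporter collects hardware and operating system metrics from the host for Prometheus to scrape.

In a development environment, a single Prometheus instance is often enough to collect hundreds of thousands of metrics. In production, however, with large numbers of applications and service instances, one instance is usually not enough. A better approach is to deploy several Prometheus instances and have each scrape only part of the targets through partitioning. For example, the hashmod action in the Prometheus relabel configuration hashes each target address, and each instance keeps only the targets whose result matches its own ID: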

relabel_configs:
- source_labels: [__address__]
  modulus:       3
  target_label:  __tmp_hash
  action:        hashmod
- source_labels: [__tmp_hash]
  regex:         $(PROM_ID)
  action:        keep
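
The effect of this hashmod-plus-keep pair can be modelled as follows. The sketch assumes an MD5 digest of the address reduced to its low-order 8 bytes before taking the modulus. That detail is an assumption about Prometheus internals and should not be relied on to reproduce its exact shard assignment. The partitioning property itself does not depend on the hash used: every target lands on exactly one instance.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Hash a label value into the range [0, modulus).

    Assumption: MD5, low-order 8 bytes read as a big-endian integer.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def keep_target(address: str, modulus: int, prom_id: int) -> bool:
    """Mimic the 'keep' rule: an instance keeps a target only when the
    target's hash bucket equals the instance's own ID."""
    return hashmod(address, modulus) == prom_id


if __name__ == "__main__":
    targets = ["10.0.100.130:9100", "10.0.100.130:9104", "10.0.100.129:9090"]
    for prom_id in range(3):
        kept = [t for t in targets if keep_target(t, 3, prom_id)]
        print(prom_id, kept)
```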

Alternatively, if each Prometheus instance should scrape the metrics of one cluster, relabeling can do that as well:

relabel_configs:
- source_labels:  ["__meta_consul_dc"]
  regex: "dc1"
  action: keep

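Now each Prometheus instance has its own data. How can they be tied together into a global view? The official answer is federation: Prometheus servers are arranged in a tree, and a Prometheus instance closer to the root queries the leaf instances and returns the aggregated metrics.

Clearly, though, federation still does not solve everything. First, a single point of failure remains: if the root node goes down, queries become unavailable. Configuring several parent nodes introduces redundant data and inconsistencies caused by differing scrape timing. When the leaf nodes have too many targets, the load on the parent node rises, possibly until it is overwhelmed and crashes. On top of that, managing rule configuration is a major hassle.

Fortunately the community has produced a clustering solution for Prometheus: Thanos. It provides a global query view that can query and aggregate data from multiple Prometheus servers, because all of that data is available from a single endpoint.

When the Querier receives a request, it sends requests to the relevant Sidecars and obtains time series data from their Prometheus servers.
It aggregates the responses and evaluates the PromQL query over them. It can aggregate disjoint data and can also deduplicate data from Prometheus high-availability groups.

Now for storage. High availability for Prometheus queries can be solved by horizontal scaling plus a unified query view, but what about high availability for storage? By design, Prometheus persists data to local storage. Local persistence is convenient, but it also causes trouble: if a node goes down, or Prometheus is rescheduled onto another node, the monitoring data on the original node disappears from the query interface. Local storage therefore prevents Prometheus from scaling elastically. For this reason Prometheus provides Remote Read and Remote Write, which write time series to a remote store and read them back from it at query time.

One example is M3DB, a distributed time series database that provides remote read and write interfaces for Prometheus. When a time series is written to an M3DB cluster, the data is replicated to other nodes according to the shard and replication factor settings, which makes the storage highly available. Besides M3DB, Prometheus currently also supports InfluxDB, OpenTSDB, and others as remote write endpoints.

With high availability covered, consider how Prometheus collects from its targets. When there are only a few monitored nodes, the list of target hosts can be written into the scrape configuration with Static Config. Once there are many target nodes, this becomes hard to manage, and in production the IP addresses of service instances are usually not fixed, so static configuration cannot manage the targets effectively. The service discovery feature of Prometheus solves the problem of changing targets: in this mode, Prometheus watches a service registry for the list of nodes and scrapes metrics from them periodically.

If you need more flexibility, Prometheus also supports file-based service discovery. You can fetch node lists from several registries, filter them as needed, and write the result to a file. When Prometheus detects that the file has changed, it replaces the monitored nodes dynamically and scrapes the new targets.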
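
The file that file-based service discovery reads is a JSON (or YAML) list of target groups, each with a targets list and optional labels. A small generator for it might look like the sketch below. The function name and the way targets are grouped by job are assumptions for illustration. Fetching and filtering nodes from real registries is left out.

```python
import json


def build_file_sd(targets_by_job: dict) -> str:
    """Serialize {job: [host:port, ...]} as a file_sd target-group list."""
    groups = [
        {"targets": sorted(targets), "labels": {"job": job}}
        for job, targets in sorted(targets_by_job.items())
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # In practice this list would come from one or more registries.
    print(build_file_sd({
        "agent": ["10.0.100.130:9100"],
        "agent1_mariadb": ["10.0.100.130:9104"],
    }))
```

Writing the returned text to a file referenced from a file_sd_configs entry in prometheus.yml is enough for Prometheus to pick up the targets; the file path is whatever that entry names.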

Building a Prometheus monitoring system on Kubernetes
To deploy the Prometheus instances, declare a Prometheus StatefulSet. Each Pod contains three containers: Prometheus, the Thanos Sidecar bound to it, and a watch container that monitors the Prometheus configuration file for changes, so that modifying the ConfigMap automatically calls the Prometheus Reload API to load the new configuration. Following the data partitioning approach described earlier, an environment variable PROM_ID is set before Prometheus starts and serves as the hashmod identifier during relabeling, while POD_NAME is used by the Thanos Sidecar as the external_labels.replica value assigned to Prometheus:


apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  serviceName: "prometheus"
  updateStrategy:
    type: RollingUpdate
  replicas: 3
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: prometheus
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-data
        hostPath:
          path: /data/prometheus
      - name: prometheus-config-shared
        emptyDir: {}
      containers:
      - name: prometheus
        image: prom/prometheus:v2.11.1
        args:
        - --config.file=/etc/prometheus-shared/prometheus.yml
        - --web.enable-lifecycle
        - --storage.tsdb.path=/data/prometheus
        - --storage.tsdb.retention=2w
        - --storage.tsdb.min-block-duration=2h
        - --storage.tsdb.max-block-duration=2h
        - --web.enable-admin-api
        ports:
        - name: http
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config-shared
          mountPath: /etc/prometheus-shared
        - name: prometheus-data
          mountPath: /data/prometheus
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: http
      - name: watch
        image: watch
        args: ["-v", "-t", "-p=/etc/prometheus-shared", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
        volumeMounts:
        - name: prometheus-config-shared
          mountPath: /etc/prometheus-shared
      - name: thanos
        image: improbable/thanos:v0.6.0
        command: ["/bin/sh", "-c"]
        args:
        - PROM_ID=`echo $POD_NAME| rev | cut -d '-' -f1` /bin/thanos sidecar
          --prometheus.url=http://localhost:9090
          --reloader.config-file=/etc/prometheus/prometheus.yml.tmpl
          --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - name: http-sidecar
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-config-shared
          mountPath: /etc/prometheus-shared

By default Prometheus cannot access cluster resources in Kubernetes, so it needs RBAC permissions:


apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
  namespace: default
  labels:
    app: prometheus
rules:
- apiGroups: [""]
  resources: ["services", "pods", "nodes", "nodes/proxy", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["prometheus-config"]
  verbs: ["get", "update", "delete"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
  namespace: default
  labels:
    app: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
roleRef:
  kind: ClusterRole
  name: prometheus
  apiGroup: rbac.authorization.k8s.io

Deploying the Thanos Querier is fairly simple. At startup, set the store parameter to dnssrv+thanos-store-gateway.default.svc so that it can discover the Sidecars:


apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: thanos-query
  name: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  minReadySeconds: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - args:
        - query
        - --log.level=debug
        - --query.timeout=2m
        - --query.max-concurrent=20
        - --query.replica-label=replica
        - --query.auto-downsampling
        - --store=dnssrv+thanos-store-gateway.default.svc
        - --store.sd-dns-interval=30s
        image: improbable/thanos:v0.6.0
        name: thanos-query
        ports:
        - containerPort: 10902
          name: http
        - containerPort: 10901
          name: grpc
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: http
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-query
  name: thanos-query
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 10901
    targetPort: http
  selector:
    app: thanos-query
---
apiVersion: v1
kind: Service
metadata:
  labels:
    thanos-store-api: "true"
  name: thanos-store-gateway
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  selector:
    thanos-store-api: "true"

Deploy Thanos Ruler:


apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: thanos-rule
  name: thanos-rule
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-rule
  template:
    metadata:
      labels:
        app: thanos-rule
    spec:
      containers:
      - name: thanos-rule
        image: improbable/thanos:v0.6.0
        args:
        - rule
        - --web.route-prefix=/rule
        - --web.external-prefix=/rule
        - --log.level=debug
        - --eval-interval=15s
        - --rule-file=/etc/rules/thanos-rule.yml
        - --query=dnssrv+thanos-query.default.svc
        - --alertmanagers.url=dns+http://alertmanager.default
        ports:
        - containerPort: 10902
          name: http
        volumeMounts:
        - name: thanos-rule-config
          mountPath: /etc/rules
      volumes:
      - name: thanos-rule-config
        configMap:
          name: thanos-rule-config

Deploy Pushgateway:


apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: pushgateway
  name: pushgateway
spec:
  replicas: 15
  selector:
    matchLabels:
      app: pushgateway
  template:
    metadata:
      labels:
        app: pushgateway
    spec:
      containers:
      - image: prom/pushgateway:v1.0.0
        name: pushgateway
        ports:
        - containerPort: 9091
          name: http
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: pushgateway
  name: pushgateway
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 9091
    targetPort: http
  selector:
    app: pushgateway

Deploy Alertmanager:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:latest
        args:
        - --web.route-prefix=/alertmanager
        - --config.file=/etc/alertmanager/config.yml
        - --storage.path=/alertmanager
        - --cluster.listen-address=0.0.0.0:8001
        - --cluster.peer=alertmanager-peers.default:8001
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: alertmanager-config
        configMap:
          name: alertmanager-config
      - name: alertmanager
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    protocol: TCP
    port: 9093
    targetPort: 9093

Finally, deploy the ingress, and the setup is complete:


apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: pushgateway-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/upstream-hash-by: "$request_uri"
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  rules:
  - host: $(DOMAIN)
    http:
      paths:
      - backend:
          serviceName: pushgateway
          servicePort: 9091
        path: /metrics
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: $(DOMAIN)
    http:
      paths:
      - backend:
          serviceName: thanos-query
          servicePort: 10901
        path: /
      - backend:
          serviceName: alertmanager
          servicePort: 9093
        path: /alertmanager
      - backend:
          serviceName: thanos-rule
          servicePort: 10092
        path: /rule
      - backend:
          serviceName: grafana
          servicePort: 3000
        path: /grafana

Open the Prometheus address and confirm that the monitored targets are healthy.

Source: https://zhuanlan.zhihu.com/p/101184971
