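Part 0: What Is Prometheus

Prometheus is an open-source system monitoring and alerting toolkit. Its main features are:

A multi-dimensional data model (time series identified by a metric name and a set of key/value pairs)
PromQL, a flexible query language that takes advantage of this dimensionality
No reliance on distributed storage; each server node is autonomous
Time series collection through a pull model over HTTP
Support for pushing time series through an intermediary gateway (Pushgateway)
Targets discovered through service discovery or static configuration
Support for multiple kinds of graphs and dashboards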

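Pull mode

Prometheus collects data with a pull model: it scrapes metrics over HTTP. Any application that can expose an HTTP endpoint can be plugged into the monitoring system, which is simpler to develop against than a proprietary or binary protocol.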
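To make the pull model concrete, here is a minimal sketch of an application exposing a /metrics endpoint using only the Python standard library. The metric name demo_queue_length, the port 8000, and the helper names are assumptions made for illustration; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render {metric_name: number} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical application state; a real service would update this.
    values = {"demo_queue_length": 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.values).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics for Prometheus to scrape."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a target under static_configs would let Prometheus scrape it on its usual interval.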

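Push mode

For short-lived jobs such as scheduled tasks, the pull model can miss data: the job may finish before Prometheus gets around to scraping it. In that case add an intermediate layer: the client pushes its metrics to the Pushgateway, which caches them, and Prometheus then pulls the metrics from the Pushgateway. (This requires deploying the Pushgateway and adding a scrape job that collects from it.)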
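As a rough sketch of the push path, a short-lived job can send its metrics to the Pushgateway over plain HTTP. The job name, instance name, and metric below are hypothetical; the URL layout /metrics/job/<job>/instance/<instance> follows the Pushgateway HTTP API, and the sketch assumes the names contain no characters that need escaping.

```python
import urllib.request


def build_push_request(gateway, job, instance, metrics, replace=False):
    """Build (but do not send) a Pushgateway request.

    POST updates metrics with the same name in the group (like pushAdd in the
    Java client); PUT replaces every metric of the group.
    """
    url = f"http://{gateway}/metrics/job/{job}/instance/{instance}"
    # One "name value" line per metric; the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url, data=body.encode("utf-8"), method="PUT" if replace else "POST"
    )


def push(request, timeout=5):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

A nightly backup script could, for instance, call push(build_push_request("192.168.60.100:9091", "nightly_backup", "db1", {"backup_duration_seconds": 42.5})) just before it exits.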

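Components and architecture

Prometheus server: scrapes and stores the data and provides the PromQL query language
Client SDKs: official client libraries exist for Go, Java, Scala, Python, and Ruby; many third-party libraries cover Node.js, PHP, Erlang, and others
Pushgateway: an intermediate gateway that lets short-lived jobs push their metrics
PromDash: a dashboard written in Rails for visualizing metrics (since deprecated; Grafana is the usual choice today)
Exporters: import metrics from other sources into Prometheus, including databases, hardware, message queues, storage systems, HTTP servers, and JMX
Alertmanager: handles alerting (described as experimental in early releases; it is now a standard component)
prometheus_cli: a command-line tool (from early releases)
Other auxiliary tools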

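Prometheus is an open-source monitoring and alerting solution originally developed at SoundCloud. Architecturally, it consists mainly of the following parts:

Prometheus Server: scrapes and stores time series data
Exporters: expose metrics from existing systems over HTTP so that Prometheus can pull them
Pushgateway: accepts metrics pushed by clients and holds them until Prometheus pulls them
Alertmanager: the alert-notification module
Prometheus web UI: the built-in interface; Grafana can also be attached for dashboards and alert notifications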

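Prometheus itself is started as a single server process, which then collects, evaluates, queries, updates, and stores monitoring data concurrently. The question below is worth understanding before moving on.

1. How does Prometheus store data?

Prometheus stores data as time series on the local disk.

The local time series database groups data into blocks of two hours each. Each block is a directory containing chunk files with the collected samples, a metadata file, and an index file.
The index file maps metric names and labels to the time series stored in the chunks; the chunk is the basic unit of storage.
Freshly scraped data is first held in memory. This consumes a good deal of memory, but it speeds up lookups and access.
To protect against crashes, Prometheus uses a write-ahead log (WAL): incoming samples are also written to a log on disk, and on restart the log is replayed to restore the in-memory data.
When series are deleted through the API, the deletion records are stored in a separate tombstone file instead of the data being removed from the block files immediately.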
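The two-hour blocking can be pictured with a small calculation. This is a simplified sketch that assumes block windows are aligned to multiples of two hours since the Unix epoch; the function name is made up for illustration and is not part of Prometheus.

```python
BLOCK_SECONDS = 2 * 60 * 60  # the default two-hour block range


def block_range(unix_ts):
    """Return (start, end) of the two-hour window containing unix_ts.

    The end is exclusive. Assumes windows aligned to the Unix epoch.
    """
    start = unix_ts - unix_ts % BLOCK_SECONDS
    return start, start + BLOCK_SECONDS
```

Two samples land in the same block exactly when block_range() returns the same pair for both timestamps.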

Part 1: Base Environment

Environment/component	Version	Download URL
Operating system	CentOS-8.4.2105-x86_64-dvd1.iso	http://mirror.facebook.net/centos/8.4.2105/isos/x86_64/
Prometheus	2.29.2	https://github.com/prometheus/prometheus/releases
Grafana	8.0.3	https://dl.grafana.com/oss/release/grafana-8.0.3-1.x86_64.rpm

Part 2: Installing Prometheus

tar -zxvf prometheus-2.29.2.linux-amd64.tar.gz

/mine/prometheus-2.29.2.linux-amd64/prometheus --config.file=/mine/prometheus-2.29.2.linux-amd64/prometheus.yml --web.enable-admin-api &

Pages to visit:

http://192.168.113.4:9090

http://192.168.113.4:9090/alerts

http://192.168.113.4:9090/targets

http://192.168.113.4:9090/service-discovery

Prometheus help output:

./prometheus --help

usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help
        Show context-sensitive help (also try --help-long and --help-man).
      --version
        Show application version.
      --config.file="prometheus.yml"
        Prometheus configuration file path.
      --web.listen-address="0.0.0.0:9090"
        Address to listen on for UI, API, and telemetry.
      --web.config.file=""
        [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
      --web.read-timeout=5m
        Maximum duration before timing out read of the request, and closing idle connections.
      --web.max-connections=512
        Maximum number of simultaneous connections.
      --web.external-url=<URL>
        The URL under which Prometheus is externally reachable (for example,
        if Prometheus is served via a reverse proxy). Used for generating
        relative and absolute links back to Prometheus itself. If the URL has
        a path portion, it will be used to prefix all HTTP endpoints served by
        Prometheus. If omitted, relevant URL components will be derived
        automatically.
      --web.route-prefix=<path>
        Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
      --web.user-assets=<path>
        Path to static asset directory, available at /user.
      --web.enable-lifecycle
        Enable shutdown and reload via HTTP request.
      --web.enable-admin-api
        Enable API endpoints for admin control actions.
      --web.console.templates="consoles"
        Path to the console template directory, available at /consoles.
      --web.console.libraries="console_libraries"
        Path to the console library directory.
      --web.page-title="Prometheus Time Series Collection and Processing Server"
        Document title of Prometheus instance.
      --web.cors.origin=".*"
        Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
      --storage.tsdb.path="data/"
        Base path for metrics storage.
      --storage.tsdb.retention=STORAGE.TSDB.RETENTION
        [DEPRECATED] How long to retain samples in storage. This flag has been
        deprecated, use "storage.tsdb.retention.time" instead.
      --storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
        How long to retain samples in storage. When this flag is set it
        overrides "storage.tsdb.retention". If neither this flag nor
        "storage.tsdb.retention" nor "storage.tsdb.retention.size" is set, the
        retention time defaults to 15d. Units Supported: y, w, d, h, m, s, ms.
      --storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
        Maximum number of bytes that can be stored for blocks. A unit is
        required, supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB".
      --storage.tsdb.no-lockfile
        Do not create lockfile in data directory.
      --storage.tsdb.allow-overlapping-blocks
        Allow overlapping blocks, which in turn enables vertical compaction
        and vertical query merge.
      --storage.remote.flush-deadline=<duration>
        How long to wait flushing sample on shutdown or config reload.
      --storage.remote.read-sample-limit=5e7
        Maximum overall number of samples to return via the remote read
        interface, in a single query. 0 means no limit. This limit is ignored
        for streamed response types.
      --storage.remote.read-concurrent-limit=10
        Maximum number of concurrent remote read calls. 0 means no limit.
      --storage.remote.read-max-bytes-in-frame=1048576
        Maximum number of bytes in a single frame for streaming remote read
        response types before marshalling. Note that client might have limit
        on frame size as well. 1MB as recommended by protobuf by default.
      --rules.alert.for-outage-tolerance=1h
        Max time to tolerate prometheus outage for restoring "for" state of alert.
      --rules.alert.for-grace-period=10m
        Minimum duration between alert and restored "for" state. This is
        maintained only for alerts with configured "for" time greater than
        grace period.
      --rules.alert.resend-delay=1m
        Minimum amount of time to wait before resending an alert to Alertmanager.
      --alertmanager.notification-queue-capacity=10000
        The capacity of the queue for pending Alertmanager notifications.
      --query.lookback-delta=5m
        The maximum lookback duration for retrieving metrics during expression
        evaluations and federation.
      --query.timeout=2m
        Maximum time a query may take before being aborted.
      --query.max-concurrency=20
        Maximum number of queries executed concurrently.
      --query.max-samples=50000000
        Maximum number of samples a single query can load into memory. Note
        that queries will fail if they try to load more samples than this into
        memory, so this also limits the number of samples a query can return.
      --enable-feature= ...
        Comma separated feature names to enable. Valid options:
        promql-at-modifier, promql-negative-offset, remote-write-receiver,
        exemplar-storage, expand-external-labels. See
        https://prometheus.io/docs/prometheus/latest/feature_flags/ for more
        details.
      --log.level=info
        Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt
        Output format of log messages. One of: [logfmt, json]

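Part 3: Installing Grafana

The default Prometheus pages are not especially intuitive, so we install Grafana to make the monitoring data easier to read.

yum install grafana-8.0.3-1.x86_64.rpm

systemctl enable grafana-server
systemctl start grafana-server

Open port 3000 on the server's IP in a browser to reach the Grafana page. The default username and password are both admin, and the first login asks you to change the default password.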

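Part 4: Pushgateway

The Pushgateway is a standalone service that sits between the applications sending metrics and the Prometheus server.

It receives metrics and then exposes them as a target for the Prometheus server to scrape. You can think of it as a proxy service, or as the opposite of the blackbox exporter: it receives metrics instead of probing for them.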

nohup /mine/pushgateway-1.4.1.linux-amd64/pushgateway > /mine/pushgateway-1.4.1.linux-amd64/pushgateway.log 2>&1 &

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Optional remote write; the URL is the host where prometheusbeat listens. Uncomment to enable.
# remote_write:
#   - url: "http://192.168.60.100:8090/prometheus"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'pushgateway'
    scrape_interval: 10s # scrape every 10 seconds
    honor_labels: true
    static_configs:
    - targets: ['192.168.60.100:9091']
      labels:
        instance: pushgateway

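Global configuration:
The global section holds the defaults for the whole file. Its four main settings are:

scrape_interval: how often targets are scraped.
scrape_timeout: the timeout for scraping a single target.
evaluation_interval: how often rules are evaluated.
external_labels: extra labels attached to time series and alerts when Prometheus communicates with external systems (federation, remote storage, Alertmanager).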

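Prometheus configuration

Commonly used flags

[root@localhost ~]# prometheus -h

--config.file="prometheus.yml" # configuration file to load

--web.listen-address="0.0.0.0:9090" # address and port to listen on

--web.max-connections=512 # maximum number of simultaneous connections (default 512)

--storage.tsdb.path="data/" # base path for metrics storage (default: the data directory)

--storage.tsdb.retention.time=15d # how long samples are retained (default 15 days)

--alertmanager.timeout=10s # timeout for sending alerts to Alertmanager; this flag comes from older releases and is not in the 2.29 help output above

--query.timeout=2m # maximum time a query may run before it is aborted (default 2 minutes); it can be combined with a time limit on the Grafana side, for example 60s

--query.max-concurrency=20 # maximum number of concurrent queries; the built-in metric prometheus_engine_queries_concurrent_max reports this limit

The part of the Prometheus configuration that changes most often is the list of targets, because monitored objects come and go frequently.

Prometheus offers many ways to configure targets: static_configs and a range of service discovery mechanisms, including: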

static_configs
file_sd_configs
serverset_sd_configs
azure_sd_configs
consul_sd_configs
dns_sd_configs
ec2_sd_configs
openstack_sd_configs
gce_sd_configs
kubernetes_sd_configs
marathon_sd_configs
nerve_sd_configs
triton_sd_configs

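Part 5: node_exporter

1. In the Prometheus architecture, the Prometheus server does not monitor specific targets directly. Its job is to collect and store data and to serve queries. To monitor something concrete, such as a host's CPU usage, we need an exporter. Prometheus periodically pulls monitoring samples from the HTTP endpoint the exporter exposes (usually /metrics).

As this description suggests, "exporter" is a fairly open concept: it can be a standalone program that runs apart from the monitored target, or it can be built into the target itself. All that matters is that it serves monitoring samples (time series) to Prometheus in the standard format.

With the background out of the way, on to the practical part. To collect a host's runtime metrics such as CPU, memory, and disk, we use node_exporter.

2. Download node_exporter

https://github.com/prometheus/node_exporter/releases/tag/v1.1.2

tar -zxvf node_exporter-1.1.2.linux-amd64.tar.gz

nohup /mine/node_exporter-1.1.2.linux-amd64/node_exporter > /mine/node_exporter-1.1.2.linux-amd64/node_exporter.log 2>&1  &

3. Verify in a browser (node_exporter listens on port 9100, so the collected data can be viewed at that port)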

Note: every metric is preceded by information of the following form:

# HELP node_schedstat_waiting_seconds_total Number of seconds spent by processing waiting for this CPU.
# TYPE node_schedstat_waiting_seconds_total counter
node_schedstat_waiting_seconds_total{cpu="0"} 4.057652479
node_schedstat_waiting_seconds_total{cpu="1"} 3.695532762
node_schedstat_waiting_seconds_total{cpu="2"} 3.681306284
node_schedstat_waiting_seconds_total{cpu="3"} 6.591063094

HELP: explains what the metric means
TYPE: states the metric's type
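To show how the HELP, TYPE, and sample lines fit together, here is a minimal parser sketch for a small subset of the text exposition format. It is an illustration under simplifying assumptions (no timestamps, no escaped quotes spanning odd cases), and the function name is made up; it is not the parser Prometheus itself uses.

```python
import re

# name, optional {labels}, then a value; lines with a trailing timestamp are skipped.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return (meta, samples) for a subset of the text format.

    meta maps metric name -> {"help": ..., "type": ...};
    samples is a list of (name, labels_dict, float_value).
    """
    meta, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            parts = line.split(" ", 3)
            if len(parts) == 4:
                _, kind, name, rest = parts
                meta.setdefault(name, {})[kind.lower()] = rest
        elif line.startswith("#"):
            continue  # ordinary comment
        else:
            m = SAMPLE_RE.match(line)
            if m:
                name, raw_labels, value = m.groups()
                labels = dict(LABEL_RE.findall(raw_labels or ""))
                samples.append((name, labels, float(value)))
    return meta, samples
```

Feeding it the node_schedstat_waiting_seconds_total lines shown above yields one metadata entry of type counter and one sample per CPU.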

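4. Metric overview (names differ between versions, but the core meaning changes little; current names are given in parentheses where they differ)

node_boot_time (node_boot_time_seconds): system boot time
node_cpu (node_cpu_seconds_total): CPU usage
node_disk_*: disk I/O
node_filesystem_*: filesystem usage
node_load1: system load
node_memory_*: memory usage
node_network_*: network traffic
node_time (node_time_seconds): current system time
go_*: Go runtime metrics of node_exporter
process_*: metrics about the node_exporter process itself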

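Part 6: Querying Prometheus

Prometheus provides a functional expression language, PromQL (Prometheus Query Language), that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph, viewed as a table in the expression browser, or consumed by external systems through the HTTP API.

1. Expression language data types

In the Prometheus expression language, every expression or subexpression evaluates to one of four types:

Instant vector: a set of time series containing a single sample for each series, all sharing the same timestamp
Range vector: a set of time series containing a range of samples over time for each series
Scalar: a simple floating-point value
String: a simple string value; currently unused

Depending on the use case (for example, graphing versus displaying the output of an expression), only some of these types are legal as the result of a user-specified expression. For instance, an expression that returns an instant vector is the only type that can be graphed directly.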

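2. Literals

2.1 String literals

Strings may be specified as literals in single quotes, double quotes, or backticks. PromQL follows the same escaping rules as Go. In single or double quotes, a backslash begins an escape sequence, which may be followed by a, b, f, n, r, t, v, or \. Specific characters can be given in octal (\nnn) or hexadecimal (\xnn, \unnnn, and \Unnnnnnnn). No escaping is processed inside backticks. Unlike Go, Prometheus does not discard newlines inside backticks. For example:

"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t"'`

2.2 Float literals

Scalar float values can be written literally as numbers of the form [-](digits)[.(digits)].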

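3. Time series selectors

3.1 Instant vector selectors

Instant vector selectors select a set of time series and a single sample value for each at a given timestamp (instant). In the simplest form, only a metric name is specified, which yields an instant vector containing elements for all time series with that metric name.

This example selects all time series that have the metric name http_requests_total:

http_requests_total

These time series can be filtered further by appending a set of label matchers in curly braces ({}).

This example selects only those time series with the metric name http_requests_total that also have the job label set to prometheus and the group label set to canary:

http_requests_total{job="prometheus",group="canary"}

It is also possible to match label values negatively, or to match them against regular expressions. The following label matching operators exist:

= : selects labels that are exactly equal to the given string
!= : selects labels that are not equal to the given string
=~ : selects labels that match the given regular expression
!~ : selects labels that do not match the given regular expression

For example, this selects all http_requests_total time series for the staging, testing, and development environments and for HTTP methods other than GET:

http_requests_total{environment=~"staging|testing|development",method!="GET"}

Label matchers that match empty label values also select all time series that do not have that label set at all. Regular expression matches are fully anchored. Several matchers may be given for the same label name.

Vector selectors must specify either a name or at least one label matcher that does not match the empty string. The following expression is illegal:

{job=~".*"} # Bad!

In contrast, these expressions are valid because each has a selector that does not match empty label values:

{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!

Label matchers can also be applied to metric names by matching against the internal __name__ label. For example, the expression http_requests_total is equivalent to {__name__="http_requests_total"}. Matchers other than = (!=, =~, !~) may also be used. The following expression selects all time series whose metric name starts with "job:":

{__name__=~"job:.*"}

All regular expressions in Prometheus use RE2 syntax.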
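The four label matching operators can be emulated in a few lines. The sketch below is only an illustration: it uses Python's re module instead of RE2 (the two agree on simple patterns like the ones here), treats a missing label as the empty string, and uses a made-up function name.

```python
import re


def matches(labels, matchers):
    """Check a series' labels against a list of (name, op, value) matchers.

    Missing labels count as the empty string, and regular expressions are
    fully anchored (re.fullmatch).
    """
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True
```

Because the regex is anchored, environment=~"staging|testing|development" does not match a value such as staging-eu, and a matcher like environment="" also accepts series that lack the label entirely.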

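3.2 Range vector selectors

Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) to the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element. Durations are given as a number followed by one of these units:

ms - milliseconds
s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years

In this example, we select all the values recorded within the last 5 minutes for all time series that have the metric name http_requests_total and a job label set to prometheus:

http_requests_total{job="prometheus"}[5m]

Durations must use integers. Several units can be chained together, ordered from the largest unit to the smallest: 1h30m is valid, but 1.5h is not.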
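The duration rules (integers only, units chained from largest to smallest) can be sketched as a small parser. This is an illustration and not Prometheus's own implementation; the function name is hypothetical, and it assumes a week is 7 days and a year is 365 days.

```python
import re

UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 3600 * 1000,
    "d": 24 * 3600 * 1000,
    "w": 7 * 24 * 3600 * 1000,
    "y": 365 * 24 * 3600 * 1000,
}
ORDER = ["y", "w", "d", "h", "m", "s", "ms"]  # largest to smallest
TOKEN_RE = re.compile(r"(\d+)(ms|y|w|d|h|m|s)")  # "ms" must be tried before "m"


def parse_duration_ms(text):
    """Parse a duration such as 5m or 1h30m into milliseconds."""
    pos, last_rank, total = 0, -1, 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"invalid duration: {text!r}")
        rank = ORDER.index(m.group(2))
        if rank <= last_rank:
            raise ValueError("units must go from largest to smallest, each used once")
        total += int(m.group(1)) * UNIT_MS[m.group(2)]
        last_rank, pos = rank, m.end()
    if last_rank < 0:
        raise ValueError("empty duration")
    return total
```

With this sketch, 1h30m parses to 5,400,000 ms, while 1.5h and 30m1h are rejected.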

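3.3 Offset modifier

The offset modifier shifts the time offset of individual instant and range vectors in a query. For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:

http_requests_total offset 5m

Note that the offset modifier must follow the selector immediately. This is correct:

sum(http_requests_total{method="GET"} offset 5m)

whereas this is not:

sum(http_requests_total{method="GET"}) offset 5m // INVALID.

The same applies to range vectors. This returns the 5-minute rate that http_requests_total had a week ago:

rate(http_requests_total[5m] offset 1w)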

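Operators

Prometheus queries support the usual kinds of expression operators:

Arithmetic operators:

The arithmetic operators are +, -, *, /, %, and ^. For example, http_requests_total * 2 doubles every value of http_requests_total.

Comparison operators:

The comparison operators are ==, !=, >, <, >=, and <=. For example, http_requests_total > 100 keeps only the http_requests_total series whose value is greater than 100.

Logical operators:

The logical (set) operators are and, or, and unless. For example, http_requests_total == 5 or http_requests_total == 2 returns the http_requests_total series whose value equals 5 or 2.

Aggregation operators:

The aggregation operators are sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, and quantile. For example, max(http_requests_total) returns the largest value among the http_requests_total series.

As in ordinary arithmetic, Prometheus operators have precedence. From highest to lowest: (^) > (*, /, %) > (+, -) > (==, !=, <=, <, >=, >) > (and, unless) > (or).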
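The set semantics of and, unless, and or can be pictured with plain dictionaries. In this simplified sketch an instant vector is a dict from label set to value; the helper names are invented for illustration, and matching modifiers such as on() and ignoring() are left out.

```python
def labelset(**labels):
    """Hashable label set usable as a dict key."""
    return tuple(sorted(labels.items()))


def vec_and(left, right):
    """Keep left-hand elements whose label set also appears on the right."""
    return {k: v for k, v in left.items() if k in right}


def vec_unless(left, right):
    """Keep left-hand elements whose label set does not appear on the right."""
    return {k: v for k, v in left.items() if k not in right}


def vec_or(left, right):
    """All left-hand elements, plus right-hand elements with new label sets."""
    merged = dict(left)
    for k, v in right.items():
        merged.setdefault(k, v)
    return merged
```

In every case the values come from the left-hand side when a label set exists on both sides; the right-hand side only decides which elements are kept or added.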

4. Comments

PromQL supports line comments that start with #.

# This is a comment

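5. Metric types

Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.

Counter: a counter that only goes up

A Counter metric works like a counter: it only increases and never decreases (unless the system is reset). Common metrics such as http_requests_total and node_cpu are Counters. When naming a Counter, the _total suffix is recommended.

Counters are simple but powerful. For example, an application can record how many times an event has occurred; because the data is stored as a time series, it is easy to see how the event rate changes over time. PromQL's built-in aggregation operators and functions let users analyze this data further.

For example, use rate() to get the per-second growth rate of HTTP requests:

rate(http_requests_total[5m])

To query the 10 HTTP endpoints with the most requests in the current system:

topk(10, http_requests_total)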
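The idea behind rate() on a Counter can be sketched as follows. This is a simplified version written for illustration: it handles counter resets but, unlike the real rate(), does not extrapolate to the edges of the time window, so its results will differ slightly from Prometheus.

```python
def simple_rate(samples):
    """Per-second increase of one counter series.

    samples: list of (unix_ts, value) pairs sorted by time.
    Returns None when fewer than two samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset; count from zero again.
        increase += value if value < prev else value - prev
        prev = value
    return increase / elapsed
```

A counter that goes 100, 160, 220 over two minutes gives one request per second; if the process restarts and the counter falls back to a small value, the drop is treated as a reset and not as a negative rate.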

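Gauge: a value that can go up and down

Unlike a Counter, a Gauge reflects the current state of the system, so its samples can both rise and fall. Common examples are node_memory_MemFree (the host's currently free memory) and node_memory_MemAvailable (available memory); both are Gauges. Since node_exporter 0.16, which includes the 1.1.2 release used here, these names carry a unit suffix: node_memory_MemFree_bytes and node_memory_MemAvailable_bytes.

With a Gauge, users can read the current state of the system directly:

node_memory_MemFree_bytes

For Gauges, the built-in PromQL function delta() returns how much a sample changed over a time range. For example, to compute the difference in CPU temperature over two hours:

delta(cpu_temp_celsius{host="zeus"}[2h])

deriv() estimates the per-second derivative of a series using simple linear regression, and predict_linear() goes further and forecasts the trend. For example, to predict the free disk space 4 hours from now (adjust the job label to your own scrape configuration):

predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4 * 3600)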
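The regression behind predict_linear() can be sketched with ordinary least squares. The version below is an illustration under one stated assumption: it predicts relative to the last sample's timestamp, whereas Prometheus predicts relative to the query evaluation time. The function name is made up.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit value = intercept + slope * t and evaluate it seconds_ahead
    after the last sample.

    samples: list of (unix_ts, value) pairs for one gauge series.
    """
    n = len(samples)
    if n < 2:
        return None
    t_ref = samples[-1][0]
    xs = [t - t_ref for t, _ in samples]  # time relative to the last sample
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead
```

For free disk space that shrinks steadily from 1000 to 820 units over an hour, the sketch predicts about 100 units left four hours later; a negative prediction would mean the disk is expected to fill up before then.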

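Analyzing data distribution with Histogram and Summary

Besides Counter and Gauge, Prometheus also defines the Histogram and Summary metric types, which are mainly used to count and analyze the distribution of samples.

In most cases people reach for the average of some quantity, such as average CPU usage or average page response time. The problem with this is easy to see. Take the average response time of API calls: if most requests finish within about 100ms but a few take 5s, the average hides the slow requests, and some pages respond far more slowly than the typical (median) request. This is known as the long-tail problem.

To tell whether the system is slow on average or slow in the long tail, the simplest approach is to group requests by latency range, for example counting how many requests took 0~10ms and how many took 10~20ms. This quickly reveals why the system is slow. Histogram and Summary exist to solve exactly this problem: they give a quick view of how the monitoring samples are distributed.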
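Grouping by latency range is what a Histogram's buckets do, with one twist: Prometheus buckets are cumulative, each counting every observation less than or equal to its upper bound (the le label). A minimal sketch, with hypothetical latencies and bucket bounds in seconds:

```python
import math


def cumulative_buckets(observations, upper_bounds):
    """Count observations per cumulative bucket, the way `le` buckets work.

    Returns a list of (upper_bound, count); the last bucket is +Inf and
    therefore counts every observation.
    """
    bounds = sorted(upper_bounds) + [math.inf]
    return [(b, sum(1 for o in observations if o <= b)) for b in bounds]
```

With latencies of 4ms, 8ms, 12ms, 15ms, 100ms, and 5s and bounds at 10ms, 20ms, and 100ms, the counts are 2, 4, 5, and 6: the single 5s request shows up only in the +Inf bucket, which is exactly the long tail that an average would blur.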

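Part 7: File-Based Service Discovery in Prometheus

In environments where monitored targets are created and destroyed dynamically, a pull-based monitoring system such as Prometheus cannot keep defining its targets statically with static_configs. The solution is to introduce an intermediary (a service registry) that knows the access details of every current target; Prometheus only has to ask this intermediary which targets exist. This pattern is called service discovery.

With file-based discovery, users define all monitoring targets in JSON or YAML files.

An example of registering targets through a JSON file: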

vim prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'pushgateway'
    scrape_interval: 10s # scrape every 10 seconds
    honor_labels: true
    static_configs:
    - targets: ['192.168.60.100:9091']
      labels:
        instance: pushgateway

  - job_name: 'node-exporter'
    file_sd_configs:
    - files: ['targets/config.json']   # location of the JSON file
      refresh_interval: 5s             # how often the file is re-read

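Edit the JSON file /mine/config.json (the location must match the files entry under file_sd_configs; a relative path there is resolved against the directory containing prometheus.yml):

[
  {
    "targets": ["192.168.184.129:9100"],
    "labels": {"hostname": "test1"}
  },
  {
    "targets": ["192.168.184.128:9100"],
    "labels": {"hostname": "test2"}
  }
]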
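In practice the target file is usually generated by a script or a deployment tool. The sketch below shows one way to do that, assuming a simple mapping from hostname label to IP address; the helper names are invented for illustration. It writes to a temporary file and renames it, so Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port=9100):
    """hosts: {hostname_label: ip}. Returns the list structure file_sd expects."""
    return [
        {"targets": [f"{ip}:{port}"], "labels": {"hostname": name}}
        for name, ip in hosts.items()
    ]


def write_targets_atomically(path, groups):
    """Write the JSON next to the destination, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

Calling write_targets_atomically("/mine/config.json", build_target_groups({"test1": "192.168.184.129", "test2": "192.168.184.128"})) reproduces the file above, and Prometheus picks up later changes on its next refresh without a restart.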

Part 8: Pushing Data to the Pushgateway

Gauge example:

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class GaugePushExample {
    public static void main(String[] args) throws Exception {
        String url = "192.168.60.100:9091";
        PushGateway pg = new PushGateway(url);
        Gauge gauge = Gauge.build("my_metrics", "This is my custom metric.")
                .labelNames("app", "job_name", "timestamp")
                .create();
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        simpleDateFormat.setTimeZone(TimeZone.getTimeZone("Asia/Shanghai"));
        int count = 0;
        while (true) {
            CollectorRegistry registry = new CollectorRegistry(true);
            // Set five samples per round. Every new label value creates a new time series,
            // so counter- or timestamp-valued labels like these grow without bound.
            for (int i = 0; i < 5; i++) {
                gauge.labels("my_app" + count++, "my_metrics",
                        simpleDateFormat.format(System.currentTimeMillis())).set(Math.random());
            }
            gauge.register(registry);
            // The grouping key identifies this group of metrics on the Pushgateway.
            Map<String, String> groupingKey = new HashMap<String, String>();
            groupingKey.put("instance", "my_instance");
            pg.pushAdd(registry, "my_job", groupingKey);
            TimeUnit.MILLISECONDS.sleep(200);
        }
    }
}

Counter example:

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.PushGateway;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class CounterPushExample {
    public static void main(String[] args) throws Exception {
        String url = "192.168.60.100:9091";
        PushGateway pushGateway = new PushGateway(url);
        Counter counter = Counter.build("my_counter", "this is a http request counter")
                .labelNames("cpu", "metrics")
                .create();
        while (true) {
            CollectorRegistry registry = new CollectorRegistry(true);
            // Increase the counter for one label combination by 20 per round.
            Counter.Child labels = counter.labels("cpu0", "my_metrics");
            labels.inc(20);
            counter.register(registry);
            Map<String, String> groupingKey = new HashMap<>();
            groupingKey.put("instance", "my_counter_instance");
            pushGateway.pushAdd(registry, "my_counter_job", groupingKey);
            // Pause between pushes so the gateway is not flooded.
            TimeUnit.SECONDS.sleep(1);
        }
    }
}

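Part 9: Deleting a Specific Metric from Prometheus

Since version 2.0, Prometheus has provided a simple admin API that can be used to delete bad metric data.

Let's look at the admin API. It currently offers three endpoints: snapshot, delete series, and clean tombstones. The API is disabled by default; to use it, start Prometheus with the --web.enable-admin-api flag.

The following command starts Prometheus quickly with the API enabled; for a more complete startup procedure, see the earlier article.

./prometheus --web.enable-admin-api

1. Deleting data

The delete endpoint removes metric data within a time range. The actual data still exists on disk; it is cleaned up in a future compaction, or can be cleaned up explicitly through the clean tombstones endpoint.

On success, the endpoint returns 204:

POST /api/v1/admin/tsdb/delete_series
PUT /api/v1/admin/tsdb/delete_series

The endpoint accepts three parameters:

match[]=<series_selector> : a series selector (a metric name, optionally with label matchers) choosing the series to delete; it can be repeated, and at least one is required
start=<rfc3339 | unix_timestamp> : start timestamp
end=<rfc3339 | unix_timestamp> : end timestamp

If no start and end time are given, all matching data in the database is deleted.

Here are a few examples.

Delete all data for a given metric name:

curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=node_cpu_seconds_total'

Delete all data for a given metric name and label value:

curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=node_cpu_seconds_total{mode="idle"}'

Delete metric data within a given time range:

curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?start=1578301194&end=1578301694&match[]=node_cpu_seconds_total{mode="idle"}'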
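When the delete call is scripted, the selector must be URL-encoded, which is what curl's -g flag sidesteps on the command line. The following sketch builds the same requests with the Python standard library; the helper names are assumptions made for illustration, and the endpoint path is the one listed above.

```python
import urllib.request
from urllib.parse import urlencode


def delete_series_url(base_url, selectors, start=None, end=None):
    """Build the delete_series URL with properly encoded match[] parameters."""
    params = [("match[]", selector) for selector in selectors]
    if start is not None:
        params.append(("start", str(start)))
    if end is not None:
        params.append(("end", str(end)))
    return f"{base_url}/api/v1/admin/tsdb/delete_series?{urlencode(params)}"


def delete_series(url, timeout=10):
    """Send the POST request and return the HTTP status (204 on success)."""
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, delete_series(delete_series_url("http://127.0.0.1:9090", ['node_cpu_seconds_total{mode="idle"}'], 1578301194, 1578301694)) corresponds to the last curl command above. The call only works when Prometheus was started with --web.enable-admin-api.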

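2. Cleaning up data

The clean tombstones endpoint removes the data already deleted through delete_series from disk and cleans up the existing tombstones. Use it after deleting series to free up space.

On success, it returns 204.

POST /api/v1/admin/tsdb/clean_tombstones
PUT /api/v1/admin/tsdb/clean_tombstones

Example:

curl -X POST http://127.0.0.1:9090/api/v1/admin/tsdb/clean_tombstones

This endpoint takes no parameters.

Caveat:

If the server's clock differs from your local machine's clock, a warning about the time difference is reported.

Solution:

ntpdate ntp.api.bz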

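Part 10: Prometheus Storage

Prometheus 2.x stores its time series database on the local disk by default, and can also send data to third-party storage services.

Local storage

Prometheus saves sample data on the local disk in a custom storage format.

Storage format

Prometheus groups samples into two-hour windows and stores the data produced in each window in one block. Each block is a separate directory containing all the sample data for that window (chunks), a metadata file (meta.json), and an index file (index). The index file maps metric names and labels to the time series in the sample data. If time series are deleted through the API during this period, the deletion records are kept in a separate tombstone file.

The block for incoming samples is kept in memory and is not yet fully persisted to disk. To make sure data can be recovered after a crash or restart, Prometheus records incoming samples in a write-ahead log (WAL) and replays it on startup. WAL files are stored in the wal directory in segments of 128MB. They contain raw data that has not yet been compacted, so they are considerably larger than regular block files. Prometheus normally keeps at least three WAL files, but a heavily loaded server that needs to retain more than two hours of raw data will have more than three.

The directory layout that Prometheus uses for block data looks like this:

./data
   |- 01BKGV7JBM69T2G1BGBGM6KB12 # block
      |- meta.json  # metadata
      |- wal        # write-ahead log
        |- 000002
        |- 000001
   |- 01BKGTZQ1SYQJTR4PB43C8PD98  # block
      |- meta.json  # metadata
      |- index   # index file
      |- chunks  # sample data
        |- 000001
      |- tombstones # deletion records
   |- 01BKGTZQ1HHWHV8FBJXW1Y3W0K
      |- meta.json
      |- wal
        |- 000001

The initial two-hour blocks are eventually compacted into longer blocks in the background.

[info] Note
Local storage is not replicated and cannot form a cluster, so it is limited by a single node in scalability and durability: if the local disk or the node fails, the data cannot be recovered or migrated. Treat local storage as a short-lived sliding window of recent data. That said, if your durability requirements are modest, local disk can hold up to several years of data.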

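Remote storage

Prometheus local storage is bound to a single node, so it offers neither durable long-term persistence nor flexible scaling. To stay simple, Prometheus does not try to solve these problems itself. Instead it defines two standard interfaces (remote_write/remote_read) through which users can store data in any third-party storage service. In Prometheus this is called Remote Storage.

Prometheus integrates with remote storage in two ways.

Remote Write

Users can specify a Remote Write URL in the Prometheus configuration file. Once it is set, Prometheus sends the samples it collects over HTTP to an adapter, and the adapter can forward them to any external service: a real storage system, a public cloud storage service, a message queue, or anything else.

Remote Read

Remote Read also goes through an adapter. When a user issues a query, Prometheus sends a query request (matchers, time ranges) to the URL configured in remote_read. The adapter fetches the corresponding data from the third-party storage service according to the request and converts it into raw Prometheus samples, which it returns to the Prometheus server.

Once the samples arrive, Prometheus processes them locally with PromQL.

[info] Note
With remote read enabled, Prometheus only fetches a set of raw series data (selected by label matchers and time range) from the remote storage; rule evaluation and the metadata API are handled against Prometheus local storage only. This limits the scalability of remote read, because all samples must first be loaded into the Prometheus server and processed there. Prometheus therefore does not currently support fully distributed query processing.

Both the remote read and remote write protocols use snappy-compressed protocol buffer encoding over HTTP. The protocols are not yet considered stable and may later be replaced by gRPC over HTTP/2, provided that every hop between Prometheus and the remote storage can be assumed to support HTTP/2.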

Supported remote storage

The Prometheus community currently provides Remote Storage integrations for a number of third-party databases:

Storage service	Supported mode
AppOptics	write
Chronix	write
Cortex	read/write
CrateDB	read/write
Elasticsearch	write
Gnocchi	write
Graphite	write
InfluxDB	read/write
IRONdb	read/write
Kafka	write
M3DB	read/write
OpenTSDB	write
PostgreSQL/TimescaleDB	read/write
SignalFx	write
Splunk	write
TiKV	read/write
VictoriaMetrics	write
Wavefront	write
