Application Monitoring with Prometheus

Definition and Role of Application Monitoring

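For engineers, finishing the development task for some phase of a piece of software often means it is "done". Seen across the software's whole life cycle, though, finished code is only the beginning: the software still has to run as expected and work toward the goals people set for it. Monitoring is the usual way to check both, typically in visual form.

Monitoring is usually divided by target into infrastructure monitoring, middleware monitoring, application monitoring, and business monitoring. The table below lists what each one monitors and what it is used for: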

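Monitoring type	Monitored object	Checks that the software runs as expected	Checks that business goals are met
Infrastructure monitoring	Servers, storage, and other parts of the software's runtime environment	√	
Middleware monitoring	Shared software such as databases and message queues	√	
Application monitoring	Software that implements specific business requirements	√	√
Business monitoring	Business metrics		√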

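Infrastructure, middleware, and applications all sit inside the same hardware and software system, and a problem in any one of them can make the software misbehave. In practice, monitoring at these three levels usually has to work together, with the data analyzed in correlation.

Because the application is the main carrier of the business, application monitoring can sometimes show directly whether business goals are being met. Throughput and latency are examples: when they are what the business cares about most, application monitoring is in effect doing the job of business monitoring.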

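Prometheus, a Powerful Tool for Application Monitoring

Prometheus is an open-source monitoring system that describes how an application is running in terms of metrics.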

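Components and Ecosystem

The architecture diagram on the Prometheus website shows that, besides the main Prometheus Server, which scrapes metrics and stores them as time series data, Prometheus also includes components such as Pushgateway, which accepts metrics that are pushed to it, and Alertmanager, which handles alerts. For application monitoring, the parts that matter most are Prometheus Server and the Prometheus client libraries, which are integrated into applications to produce metrics.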

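Features

The main features listed on the Prometheus website are:

A multi-dimensional metric data model (each time series sample in Prometheus carries a timestamp, a metric name, and labels)
PromQL, a query language for metrics (filtering raw metric data by label and applying functions such as rate of change and top-N makes metrics more flexible to express)
No dependence on distributed storage; single server nodes are autonomous
Time series collection by pulling over HTTP (compared with the JSON-RPC protocol used in Zabbix, HTTP is closer to the mainstream way of making remote calls in web applications)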
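
To make the data model and label filtering above more concrete, here is a minimal sketch in plain Go. The Sample type and the selectSamples function are hypothetical names invented for this illustration, and the equality-only matching is a simplified stand-in for a PromQL label selector, not how Prometheus itself stores or queries data.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a hypothetical, simplified view of one point in a time series:
// a metric name, a set of labels, a timestamp, and a value.
type Sample struct {
	Name      string
	Labels    map[string]string
	Timestamp time.Time
	Value     float64
}

// selectSamples keeps the samples whose name matches and whose labels contain
// every key/value pair in matchers. It only supports equality, which is a
// small subset of what a real PromQL selector can express.
func selectSamples(samples []Sample, name string, matchers map[string]string) []Sample {
	var out []Sample
	for _, s := range samples {
		if s.Name != name {
			continue
		}
		matched := true
		for k, v := range matchers {
			if s.Labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	now := time.Now()
	samples := []Sample{
		{Name: "http_requests_total", Labels: map[string]string{"method": "GET", "code": "200"}, Timestamp: now, Value: 1027},
		{Name: "http_requests_total", Labels: map[string]string{"method": "POST", "code": "200"}, Timestamp: now, Value: 3},
		{Name: "http_requests_total", Labels: map[string]string{"method": "GET", "code": "500"}, Timestamp: now, Value: 12},
	}
	// Roughly what the selector http_requests_total{method="GET"} would pick.
	for _, s := range selectSamples(samples, "http_requests_total", map[string]string{"method": "GET"}) {
		fmt.Println(s.Name, s.Labels, s.Value)
	}
}
```

Each distinct combination of metric name and label values is its own series, which is why one metric name can be sliced along several dimensions at query time.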

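Deploying Prometheus Operator in a k8s Cluster

Prometheus Operator is one way to deploy and maintain Prometheus in a k8s cluster. On top of server-side components such as Prometheus Server and Alertmanager, it also turns configuration such as monitoring targets and alerting rules into k8s resources, which makes them easier to manage inside the cluster.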

Demo Project: kube-prometheus

The kube-prometheus project on GitHub was split out of Prometheus Operator and is mainly used to set up a demo environment quickly.

Deploy Prometheus Operator to the k8s cluster directly with kubectl apply:

git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus/manifests
kubectl apply -f setup
kubectl apply -f .

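In the commands above, kubectl apply -f setup mainly creates the monitoring namespace, the Prometheus Operator controller, and the CRDs (CustomResourceDefinitions).

kubectl apply -f . then creates custom resources of the kinds just defined, along with their ServiceAccounts and related configuration.

The custom resources relate to each other as follows: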

The Role of ServiceMonitor

The resource most closely tied to application monitoring is ServiceMonitor. Its YAML file looks something like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: web
  selector:
    matchLabels:
      alertmanager: main

A ServiceMonitor selects the objects to monitor (k8s Services) by label, specifies which port and URL path to pull metrics from, and defines the interval between pulls.

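In essence, a ServiceMonitor is an abstraction over a metrics source (endpoint) in the Prometheus configuration. Each time a ServiceMonitor resource is created, Prometheus Operator adds the matching entries to the configuration file of the Prometheus custom resource, which achieves the same result as configuring native Prometheus. Work that used to be done by hand in one central configuration is thus automated through CRDs.

Almost every resource in the example above has a corresponding ServiceMonitor, which means each of them has an HTTP/HTTPS endpoint that exposes metrics. That is because Google and CoreOS required, when designing these components, that each one provide an interface exposing its own state. The output is plain text in the format Prometheus specifies, like this: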

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0.0008235
go_gc_duration_seconds_sum 0.0008235
go_gc_duration_seconds_count 8

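Each metric, or group of metrics, includes a description (HELP), a metric type (TYPE), a metric name, and the current value. After fetching this text over HTTP, Prometheus converts it into its own time series data model, adding the timestamp dimension among other things.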
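
As a rough illustration of that conversion, the sketch below parses sample lines like the ones shown above into a name, a label set, and a value. parseLine and Sample are hypothetical names, and the parser is deliberately simplified: it assumes no escaped quotes or commas inside label values and no trailing timestamp, so it is not a replacement for the real parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample is a hypothetical holder for one parsed line.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseLine parses one line of the text format. It returns ok=false for blank
// lines and for comment lines such as "# HELP ..." and "# TYPE ...".
func parseLine(line string) (Sample, bool, error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return Sample{}, false, nil
	}
	// The value is whatever follows the last space.
	sp := strings.LastIndex(line, " ")
	if sp < 0 {
		return Sample{}, false, fmt.Errorf("missing value in %q", line)
	}
	value, err := strconv.ParseFloat(line[sp+1:], 64)
	if err != nil {
		return Sample{}, false, err
	}
	s := Sample{Labels: map[string]string{}, Value: value}
	series := strings.TrimSpace(line[:sp])
	open := strings.Index(series, "{")
	if open < 0 {
		s.Name = series // no label set
		return s, true, nil
	}
	if !strings.HasSuffix(series, "}") {
		return Sample{}, false, fmt.Errorf("unterminated label set in %q", line)
	}
	s.Name = series[:open]
	for _, pair := range strings.Split(series[open+1:len(series)-1], ",") {
		if pair == "" {
			continue
		}
		k, v, found := strings.Cut(pair, "=")
		if !found {
			return Sample{}, false, fmt.Errorf("bad label %q", pair)
		}
		s.Labels[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
	}
	return s, true, nil
}

func main() {
	text := `# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="1"} 0.0008235
go_gc_duration_seconds_count 8`
	for _, line := range strings.Split(text, "\n") {
		s, ok, err := parseLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if ok {
			fmt.Println(s.Name, s.Labels, s.Value)
		}
	}
}
```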

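Implementing Custom Metrics in a Go Application

Applications we develop ourselves can also expose Prometheus-style metrics by integrating the official Prometheus client library.

Taking an application written in Go as an example, first create a subdirectory (package) in the project to declare and register the metrics to expose: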

stat/prometheus.go:

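package stat

import "github.com/prometheus/client_golang/prometheus"

// TestRequestCounter counts requests to the /test endpoint.
// It is exported so that other packages can increment it.
var TestRequestCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "test_request_total",
	Help: "Total count of test request",
})

func init() {
	// Register with the default registry, which promhttp.Handler() serves.
	prometheus.MustRegister(TestRequestCounter)
}

The example declares a Counter metric that counts the test HTTP requests and registers it in the init function, so importing this package from another Go file registers the metric automatically. The variable name is capitalized because main.go, which lives in a different package, has to reach it.

Counter is one of Prometheus's four metric types (Counter, Gauge, Histogram, Summary). It describes a value that only ever increases, such as the number of requests an HTTP service has received. See the official Prometheus documentation for details.

main.go:

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/go-app/stat" // the package that declares the metric, written above
)

func main() {
	r := gin.Default()
	// Expose the registered metrics for Prometheus to scrape.
	r.GET("/metrics", gin.WrapH(promhttp.Handler()))
	r.GET("/test", func(c *gin.Context) {
		stat.TestRequestCounter.Inc()
		c.JSON(http.StatusOK, gin.H{
			"text": "hello world",
		})
	})
	r.Run(":9090")
}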

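Once a metric is declared, it still has to be updated at the right moments. In this example the handler function increments the request counter each time /test is requested. To count all requests instead, the update can be moved into a middleware so that every request increments the counter.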

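Application monitoring in practice is far more complex than the example above, because different applications offer different metrics to collect, and even applications of the same kind call for a different emphasis in different business scenarios. Even so, Google's SRE practice distills four golden signals:

Latency: the time a service takes to handle a request
Traffic: a measure of the load on the system, usually requests per second for an HTTP service
Errors: the rate at which requests fail
Saturation: how "full" the service is, usually measured by a specific metric of the most constrained resource in the system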

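These signals can also be expressed with Prometheus's four basic metric types. Roughly:

Counter ==> request count, request traffic
Gauge ==> system saturation (instantaneous)
Histogram ==> distribution of request latency across intervals
Summary ==> median, 90th percentile, 95th percentile, and so on of request latency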
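
To show what "distribution across intervals" means for a Histogram, here is a small dependency-free sketch of cumulative buckets, the form in which histogram data appears in the text exposition format. The histogram type and the linearBuckets helper are written for this illustration only; they mimic the idea of the client library's bucket helper and are not the library's implementation.

```go
package main

import "fmt"

// linearBuckets returns count upper bounds: start, start+width, start+2*width, ...
func linearBuckets(start, width float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start + float64(i)*width
	}
	return bounds
}

// histogram keeps cumulative bucket counts: cumulative[i] is the number of
// observations less than or equal to upperBounds[i]. count plays the role of
// the +Inf bucket, and sum is the total of all observed values.
type histogram struct {
	upperBounds []float64
	cumulative  []uint64
	count       uint64
	sum         float64
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		cumulative:  make([]uint64, len(upperBounds)),
	}
}

// observe records one value in every bucket whose upper bound covers it.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.cumulative[i]++
		}
	}
	h.count++
	h.sum += v
}

func main() {
	// Five buckets with upper bounds 110, 120, 130, 140, 150 (milliseconds).
	h := newHistogram(linearBuckets(110, 10, 5))
	for _, ms := range []float64{105, 125, 160} {
		h.observe(ms)
	}
	for i, ub := range h.upperBounds {
		fmt.Printf("le=%v %d\n", ub, h.cumulative[i])
	}
	fmt.Printf("le=+Inf %d\nsum=%v count=%d\n", h.count, h.sum, h.count)
}
```

A Summary differs in that the quantiles are computed inside the application, whereas histogram buckets are plain counters from which quantiles can be estimated at query time.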

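Guided by the golden signals, I designed a slightly more complex example:
Suppose there is a message queue with a fixed capacity, and HTTP requests to /push and /pop add or remove one record. Request latency and error rate rise as the queue gets close to full (for /push) or close to empty (for /pop).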

Below are the complete example code and the resulting Grafana dashboard, for reference only:
stat/prometheus.go:

package stat

import "github.com/prometheus/client_golang/prometheus"

var (
	// MqRequestCounter counts every request, labelled by direction (push or pop).
	MqRequestCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "mq",
		Name:      "request_total",
		Help:      "Total count of requests",
	}, []string{"direction"})

	// MqErrRequestCounter counts failed requests only.
	MqErrRequestCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "mq",
		Name:      "err_request_total",
		Help:      "Total count of failed request",
	}, []string{"direction"})

	// MqRequestDurationHistogram buckets request duration in milliseconds,
	// with upper bounds 110, 120, 130, 140 and 150.
	MqRequestDurationHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "mq",
		Name:      "request_duration_distribution",
		Help:      "Distribution state of request duration",
		Buckets:   prometheus.LinearBuckets(110, 10, 5),
	}, []string{"direction"})

	// MqRequestDurationSummary tracks the 50th, 90th and 99th percentiles.
	MqRequestDurationSummary = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Namespace:  "mq",
		Name:       "request_duration_quantiles",
		Help:       "Quantiles of request duration",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	}, []string{"direction"})

	// MqCapacitySaturation is the fill level of the queue, from 0 to 1.
	MqCapacitySaturation = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "mq",
		Name:      "capacity_saturation",
		Help:      "Capacity saturation of the message queue",
	})
)

func init() {
	prometheus.MustRegister(MqRequestCounter, MqErrRequestCounter, MqRequestDurationHistogram, MqRequestDurationSummary, MqCapacitySaturation)
}

main.go:

package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"strings"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/go-app/stat"
)

type queueConfig struct {
	length       int
	maxErrorRate float64
	maxDuration  time.Duration
}

// messageQueue is not guarded by a lock; the demo client sends one request at a time.
type messageQueue struct {
	qc    queueConfig
	queue []string
}

var pushErrCount, popErrCount int

// push adds one record. Latency and error rate grow as the queue fills up.
func (m *messageQueue) push() (ok bool, duration time.Duration) {
	startTime := time.Now()
	labels := prometheus.Labels{"direction": "push"}
	stat.MqRequestCounter.With(labels).Inc()

	factor := float64(len(m.queue)) / float64(m.qc.length)
	fixedDuration := time.Duration(float64(m.qc.maxDuration)*factor) + time.Millisecond*100
	time.Sleep(fixedDuration)

	errorRate := m.qc.maxErrorRate * factor
	// Fail at random, and always fail when the queue is already full.
	if rand.Intn(100) < int(errorRate*100) || len(m.queue) >= m.qc.length {
		ok = false
		pushErrCount++
		stat.MqErrRequestCounter.With(labels).Inc()
	} else {
		ok = true
		m.queue = append(m.queue, "#")
	}

	duration = time.Since(startTime)
	durationMs := float64(duration / time.Millisecond)
	stat.MqRequestDurationHistogram.With(labels).Observe(durationMs)
	stat.MqRequestDurationSummary.With(labels).Observe(durationMs)

	fmt.Printf("%v", strings.Join(m.queue, ""))
	fmt.Printf("\t Factor: %v Success:%v Duration:%v PushErrCount:%v\n", factor, ok, duration, pushErrCount)
	return
}

// pop removes one record. Latency and error rate grow as the queue empties.
func (m *messageQueue) pop() (ok bool, duration time.Duration) {
	startTime := time.Now()
	labels := prometheus.Labels{"direction": "pop"}
	stat.MqRequestCounter.With(labels).Inc()

	factor := float64(m.qc.length-len(m.queue)) / float64(m.qc.length)
	fixedDuration := time.Duration(float64(m.qc.maxDuration)*factor) + time.Millisecond*100
	time.Sleep(fixedDuration)

	errorRate := m.qc.maxErrorRate * factor
	// Fail at random, and always fail when there is nothing left to pop.
	if rand.Intn(100) < int(errorRate*100) || len(m.queue) == 0 {
		ok = false
		popErrCount++
		stat.MqErrRequestCounter.With(labels).Inc()
	} else {
		ok = true
		m.queue = m.queue[:len(m.queue)-1]
	}

	duration = time.Since(startTime)
	durationMs := float64(duration / time.Millisecond)
	stat.MqRequestDurationHistogram.With(labels).Observe(durationMs)
	stat.MqRequestDurationSummary.With(labels).Observe(durationMs)

	fmt.Printf("%v", strings.Join(m.queue, ""))
	fmt.Printf("\t Factor: %v Success:%v Duration:%v PopErrCount:%v\n", factor, ok, duration, popErrCount)
	return
}

func main() {
	r := gin.Default()
	r.GET("/metrics", gin.WrapH(promhttp.Handler()))
	api := r.Group("/api")

	qc := queueConfig{
		length:       100,
		maxErrorRate: 0.2,
		maxDuration:  50 * time.Millisecond,
	}
	mq := messageQueue{
		qc:    qc,
		queue: make([]string, 0, qc.length),
	}
	rand.Seed(time.Now().UnixNano())

	api.POST("/push", func(c *gin.Context) {
		ok, duration := mq.push()
		c.JSON(http.StatusOK, gin.H{
			"success":  ok,
			"duration": duration,
			"length":   len(mq.queue),
		})
	})
	api.POST("/pop", func(c *gin.Context) {
		ok, duration := mq.pop()
		c.JSON(http.StatusOK, gin.H{
			"success":  ok,
			"duration": duration,
			"length":   len(mq.queue),
		})
	})

	// Sample the fill level of the queue every five seconds.
	go func() {
		for {
			saturation := float64(len(mq.queue)) / float64(mq.qc.length)
			stat.MqCapacitySaturation.Set(saturation)
			time.Sleep(time.Second * 5)
		}
	}()

	r.Run(":9090")
}

client.go (sends requests to the application continuously):

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

type Body struct {
	Success  bool          `json:"success"`
	Duration time.Duration `json:"duration"`
	Length   int           `json:"length"`
}

// post sends one empty POST request and decodes the JSON reply.
func post(url string) (*Body, error) {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	body := &Body{}
	if err := json.Unmarshal(data, body); err != nil {
		return nil, err
	}
	return body, nil
}

func main() {
	baseUrl := "http://localhost:9090"
	if len(os.Args) > 1 {
		baseUrl = os.Args[1]
	}
	pushUrl := baseUrl + "/api/push"
	popUrl := baseUrl + "/api/pop"

	// Push until the queue holds 100 records, then pop until it is empty, and repeat.
	pushFlag := true
	for {
		url := popUrl
		if pushFlag {
			url = pushUrl
		}
		body, err := post(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			time.Sleep(time.Second) // back off instead of spinning when the server is down
			continue
		}
		fmt.Printf("%v\n", body)
		if pushFlag && body.Length == 100 {
			pushFlag = false
		} else if !pushFlag && body.Length == 0 {
			pushFlag = true
		}
	}
}

Grafana dashboard: