Kubernetes in Practice: Deploying a Prometheus + Grafana Monitoring and Alerting Platform
Preface
When we deployed a web site earlier, the architecture diagram included a monitoring component. Building an effective monitoring platform is essential for operations work: it is the only way to keep our servers and services running reliably and efficiently. There are several popular open-source monitoring tools, such as Zabbix, Nagios, Open-Falcon, and Prometheus, each with its own strengths and weaknesses; interested readers can look them up. For monitoring a Kubernetes cluster, however, Prometheus is the most natural fit, so in this article we will walk through deploying Prometheus to monitor Kubernetes end to end.
Main topics
1. Prometheus architecture
2. K8S monitoring metrics and implementation approach
3. Deploying Prometheus on the K8S platform
4. Configuration for K8S-based service discovery
5. Deploying Grafana on the K8S platform
6. Monitoring Pods, Nodes, and resource objects in the K8S cluster
7. Visualizing Prometheus monitoring data with Grafana
8. Alerting rules and alert notifications
1 Prometheus Architecture
What is Prometheus
Prometheus is a monitoring system originally built at SoundCloud. It became a community open-source project in 2012 and has a very active developer and user community. To emphasize its open-source nature and independent maintenance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016, becoming the second hosted project after Kubernetes.
Official sites:
https://prometheus.io
https://github.com/prometheus
Prometheus components and architecture
- Prometheus Server: collects metrics, stores time-series data, and provides a query interface
- Client Library: client libraries for instrumenting application code
- Push Gateway: short-term storage for metrics, mainly used by short-lived jobs
- Exporters: collect metrics from existing third-party services and expose them as metrics endpoints
- Alertmanager: handles alerts
- Web UI: a simple web interface
Data model
Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same series.
Each time series is uniquely identified by its metric name and a set of key-value pairs (also called labels).
Time series format:
<metric name>{<label name>=<label value>, …}
Example: api_http_requests_total{method="POST", handler="/messages"}
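The identity rule above can be illustrated with a short sketch (plain Python, not Prometheus code; the helper name `series_id` is made up for illustration): the metric name plus the full label set identifies a series, and label order does not matter.

```python
def series_id(metric_name, labels):
    """Build the canonical identity of a time series: name + sorted labels."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric_name}{{{label_str}}}"

a = series_id("api_http_requests_total", {"method": "POST", "handler": "/messages"})
b = series_id("api_http_requests_total", {"handler": "/messages", "method": "POST"})
c = series_id("api_http_requests_total", {"method": "GET", "handler": "/messages"})

print(a)        # api_http_requests_total{handler="/messages",method="POST"}
print(a == b)   # True  — label order does not matter
print(a == c)   # False — a different label value means a different series
```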
Jobs and instances
Instance: an endpoint that can be scraped is called an instance (Instance).
Job: a collection of instances with the same purpose is called a job (Job).
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
2 K8S Monitoring Metrics and Implementation Approach
K8S monitoring metrics
Kubernetes itself:
- Node resource utilization
- Number of Nodes
- Number of Pods (per Node)
- Resource object status
Pod monitoring:
- Number of Pods (per project)
- Container resource utilization
- Application metrics
Architecture for monitoring K8S with Prometheus
Service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
3 Deploying Prometheus on the K8S Platform
3.1 Cluster environment
3.2 Project repository
You can git clone the project code directly, or write the manifests yourself; the implementation code is shown in the steps below.
[root@k8s-master src]# git clone https://github.com/zhangdongdong7/k8s-prometheus.git
Cloning into 'k8s-prometheus'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
[root@k8s-master src]# cd k8s-prometheus/
[root@k8s-master k8s-prometheus]# ls
alertmanager-configmap.yaml kube-state-metrics-rbac.yaml prometheus-rbac.yaml
alertmanager-deployment.yaml kube-state-metrics-service.yaml prometheus-rules.yaml
alertmanager-pvc.yaml node_exporter-0.17.0.linux-amd64.tar.gz prometheus-service.yaml
alertmanager-service.yaml node_exporter.sh prometheus-statefulset-static-pv.yaml
grafana.yaml OWNERS prometheus-statefulset.yaml
kube-state-metrics-deployment.yaml prometheus-configmap.yaml README.md
3.3 使用RBAC进行授权
RBAC(Role-Based Access Control,基于角色的访问控制):负责完成授权(Authorization)工作。
编写授权yaml
[root@k8s-master prometheus-k8s]# vim prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rbac.yaml
3.4 Configuration management
A ConfigMap is used to hold configuration that does not need to be encrypted.
Change the node IP addresses under the kubernetes-nodes job to match your own environment; nothing else needs to be modified.
[root@k8s-master prometheus-k8s]# vim prometheus-configmap.yaml
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-nodes
      scrape_interval: 30s
      static_configs:
      - targets:
        - 192.168.73.135:9100   # change these IPs to your own node addresses
        - 192.168.73.138:9100
        - 192.168.73.139:9100
        - 192.168.73.140:9100

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml
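To make the relabel_configs in the kubernetes-service-endpoints job less opaque, here is a hypothetical mini-simulation (plain Python, not Prometheus code) of two of its actions: "keep" drops targets whose annotation does not match, and "replace" rewrites __address__ from the prometheus.io/port annotation using the same regex as the config above.

```python
import re

def relabel(labels):
    """Simulate a 'keep' action followed by an '__address__' rewrite."""
    # action: keep — only scrape services annotated prometheus.io/scrape=true
    if labels.get("__meta_kubernetes_service_annotation_prometheus_io_scrape") != "true":
        return None  # target dropped
    # action: replace — join source labels with ';' and apply the regex
    src = ";".join([labels["__address__"],
                    labels.get("__meta_kubernetes_service_annotation_prometheus_io_port", "")])
    m = re.match(r"([^:]+)(?::\d+)?;(\d+)", src)
    if m:
        labels["__address__"] = f"{m.group(1)}:{m.group(2)}"
    return labels

t = relabel({
    "__address__": "10.0.0.194:80",
    "__meta_kubernetes_service_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_service_annotation_prometheus_io_port": "8080",
})
print(t["__address__"])                      # 10.0.0.194:8080
print(relabel({"__address__": "10.0.0.5"}))  # None — no scrape annotation, dropped
```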
3.5 Deploying Prometheus as a StatefulSet
Here a StorageClass provides dynamic provisioning to persist Prometheus's data; for how to set it up, see the earlier article "NFS dynamic storage provisioning in k8s". Alternatively, you can use prometheus-statefulset-static-pv.yaml for static provisioning.
[root@k8s-master prometheus-k8s]# vim prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9090/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      - name: prometheus-server
        image: "prom/prometheus:v2.2.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        # based on 10 running nodes with 30 pods each
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 200m
            memory: 1000Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        - name: prometheus-rules
          mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage  # change to your own StorageClass name
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-statefulset.yaml
Check the status:
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-5bd5f9dbd9-wv45t 1/1 Running 1 8d
kubernetes-dashboard-7d77666777-d5ng4 1/1 Running 5 14d
prometheus-0 2/2 Running 6 14d
You can see a pod named prometheus-0; this is the stateful deployment created by the StatefulSet controller. If both containers show Running, everything is fine; otherwise, run kubectl describe pod prometheus-0 -n kube-system to see the error details.
3.6 Creating a Service to expose the access port
A NodePort is used here to pin the access port to a fixed value rather than a random one, which makes access easier.
[root@k8s-master prometheus-k8s]# vim prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
    nodePort: 30090  # fixed external access port
  selector:
    k8s-app: prometheus
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-service.yaml
Check:
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME READY STATUS RESTARTS AGE
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          8d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          14d
pod/prometheus-0                            2/2     Running   6          14d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    13d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP   13d
3.7 Web access
Use any NodeIP plus the NodePort to access the UI: http://NodeIP:Port — in this example, http://192.168.73.139:30090
A successful access looks like the screenshot below:
4 Deploying Grafana on the K8S Platform
As the web access above shows, Prometheus's built-in UI offers limited functionality: its visualization capabilities are incomplete and not enough for day-to-day monitoring, so Prometheus is usually combined with Grafana for data visualization.
Official sites:
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
https://grafana.com/grafana/download
The project cloned earlier already contains a Grafana yaml; adjust it to your own environment.
4.1 Deploying grafana with a StatefulSet
[root@k8s-master prometheus-k8s]# vim grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage  # same StorageClass as Prometheus
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30091
  selector:
    app: grafana
4.2 Accessing the Grafana web UI
Use any NodeIP plus the NodePort: http://NodeIP:Port — in this example, http://192.168.73.139:30091
On the login page, the default username and password are both admin; you will be prompted to change the password after logging in.
The post-login interface looks like this:
Step one: add a data source. Click the "create your first data source" icon and fill it in as shown in the figure.
Step two: click the green Save & Test button at the bottom; the message "Data source is working" indicates the data source was added successfully.
4.3 Collecting Pod, Node, and resource-object data from the K8S cluster
1) Pods
The kubelet on each node exposes the cAdvisor metrics endpoint, which provides performance metrics for all Pods and containers on that node; it is enabled by default when kubelet is installed.
Exposed endpoints:
https://NodeIP:10255/metrics/cadvisor
https://NodeIP:10250/metrics/cadvisor
2) Nodes
node_exporter is needed to collect node resource utilization.
https://github.com/prometheus/node_exporter
Documentation: https://prometheus.io/docs/guides/node-exporter/
Use the node_exporter.sh script to deploy the node_exporter collector on every server; the script can be run as-is without modification.
[root@k8s-master prometheus-k8s]# cat node_exporter.sh
#!/bin/bash
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
[root@k8s-master prometheus-k8s]# ./node_exporter.sh
Check the node_exporter process to confirm it is running:
[root@k8s-master prometheus-k8s]# ps -ef|grep node_exporter
root 6227 1 0 Oct08 ? 00:06:43 /usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
root 118269 117584 0 23:27 pts/0 00:00:00 grep --color=auto node_exporter
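node_exporter serves its metrics as plain text in the Prometheus exposition format. As an illustrative sketch (a hypothetical parser, not part of any Prometheus library), one sample line can be split into name, labels, and value like this:

```python
def parse_sample(line):
    """Split 'name{label="v",...} value' into (name, labels, value)."""
    name_part, _, value = line.rpartition(" ")
    if "{" in name_part:
        name, rest = name_part.split("{", 1)
        labels = dict(kv.split("=", 1) for kv in rest.rstrip("}").split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
    else:
        name, labels = name_part, {}
    return name, labels, float(value)

line = 'node_filesystem_free_bytes{device="/dev/sda1",fstype="xfs"} 1.2e+10'
name, labels, value = parse_sample(line)
print(name)              # node_filesystem_free_bytes
print(labels["fstype"])  # xfs
print(value)             # 12000000000.0
```

A real scraper must also handle HELP/TYPE comment lines and escaped label values, which this sketch ignores.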
3) Resource objects
kube-state-metrics collects status information about the various resource objects in k8s; it only needs to be deployed on the master node.
https://github.com/kubernetes/kube-state-metrics
Create the RBAC yaml to authorize kube-state-metrics:
[root@k8s-master prometheus-k8s]# vim kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-rbac.yaml
Write the Deployment and ConfigMap yaml to deploy the kube-state-metrics pod; no modification is needed.
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.3.0
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: lizhenliang/kube-state-metrics:v1.3.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: lizhenliang/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --container=kube-state-metrics
        - --cpu=100m
        - --extra-cpu=1m
        - --memory=100Mi
        - --extra-memory=2Mi
        - --threshold=5
        - --deployment=kube-state-metrics
      volumes:
      - name: config-volume
        configMap:
          name: kube-state-metrics-config
---
# Config map for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-deployment.yaml
2. Write the Service yaml to expose the metrics ports:
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-service.yaml
3. Check the pod and svc status. pod/kube-state-metrics-7c76bdbf68-kqqgd is running normally and ports 8080 and 8081 are exposed:
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME READY STATUS RESTARTS AGE
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   0          9d
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          9d
pod/grafana-0                               1/1     Running   2          15d
pod/kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          14d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          16d
pod/prometheus-0                            2/2     Running   6          15d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              13d
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        15d
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       14d
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   14d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       17d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      14d
[root@k8s-master prometheus-k8s]#
5 Visualizing Prometheus Monitoring Data with Grafana
When collecting data with Prometheus we usually want to monitor the cluster's Pods, Nodes, and resource objects, so the corresponding exporters and collectors must be installed to provide APIs for data retrieval; this was set up in section 4.3. The state of each collector can be checked under Status > Targets in the Prometheus UI, as shown below:
Only when every target is in the UP state can the built-in UI be used to query the data for a given metric, as shown below:
As the screenshots show, Prometheus's own visualization is fairly limited and does not meet our needs, so Grafana is used for visualizing Prometheus data instead. Grafana was deployed in the previous section; you could add dashboards and panels by hand to display the metrics you care about, but the Grafana community offers many mature dashboard templates that can be used directly and then adapted to your environment by tweaking the panels' queries.
https://grafana.com/grafana/dashboards
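When a panel's query needs debugging, the same PromQL the UI runs can also be issued against Prometheus's HTTP API at /api/v1/query. As a small illustrative sketch (the helper `query_url` is made up; the base URL assumes the NodePort 30090 set up earlier), building such a request URL looks like this:

```python
from urllib.parse import urlencode

def query_url(base, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

url = query_url("http://192.168.73.139:30090", 'up{job="kubernetes-nodes"}')
print(url)
# http://192.168.73.139:30090/api/v1/query?query=up%7Bjob%3D%22kubernetes-nodes%22%7D
```

Fetching that URL (e.g. with curl) returns a JSON body whose data.result holds the matching series and their current values.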
Recommended templates:
Cluster resource monitoring: dashboard ID 3119; import it as shown in the figure.
If a panel shows no data after the template is imported, open the panel's edit view, check its PromQL query, and debug the query in the Prometheus UI to verify it returns values. The adjusted dashboard looks like this:
Resource status monitoring: dashboard ID 6417
Import the resource-status template the same way; after adjustment, the dashboard shows the status of the various k8s resources:
Node monitoring: dashboard ID 9276
Import it the same way; after adjustment, the dashboard shows the basic situation on each node:
6 Deploying Alertmanager on K8S
6.1 Steps to deploy Alertmanager
6.2 Deploying alerting
Email is used here to deliver alert notifications.
First prepare a sender mailbox with SMTP sending enabled.
A ConfigMap stores the alerting rules. Write the rules yaml, adjusting and extending the rules for your own environment. This is where prometheus is more cumbersome than zabbix: every alerting rule has to be defined by hand.
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} has stopped working"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rules.yaml
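To see what arithmetic a rule like NodeMemoryUsage actually performs, here is a small illustrative sketch (plain Python, not PromQL; the function name is made up) computing memory usage percent from the raw node_exporter gauges and comparing it to the 80% threshold:

```python
def memory_usage_percent(mem_total, mem_free, cached, buffers):
    """100 - (free + cached + buffers) / total * 100, as in the rule's expr."""
    return 100 - (mem_free + cached + buffers) / mem_total * 100

GiB = 1024 ** 3
usage = memory_usage_percent(mem_total=8 * GiB, mem_free=1 * GiB,
                             cached=0, buffers=0)
print(usage)        # 87.5
fires = usage > 80  # the alert fires only after this holds for the 'for: 1m' window
print(fires)        # True
```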
3. Write the Alertmanager ConfigMap yaml and configure the email sender settings:
[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'xxx.com.cn:25'      # your SMTP server
      smtp_from: 'xxx@126.com.cn'          # configure according to your sender mailbox
      smtp_auth_username: 'xxx@qq.com.cn'
      smtp_auth_password: 'xxxxx'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "zhangdongdong27459@goldwind.com.cn"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml
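The route settings above pace the emails: group_wait delays the first notification for a new alert group, and repeat_interval suppresses re-sends while the alert keeps firing. A hypothetical sketch of that gating logic (plain Python, not Alertmanager code; all names made up, times in seconds):

```python
def should_notify(now, first_seen, last_sent, group_wait=10, repeat_interval=60):
    """Return True when a notification should go out for a firing alert group."""
    if now - first_seen < group_wait:
        return False        # still inside the initial group_wait window
    if last_sent is None:
        return True         # first notification for this group
    return now - last_sent >= repeat_interval  # re-send only after repeat_interval

print(should_notify(now=5,  first_seen=0, last_sent=None))  # False (group_wait)
print(should_notify(now=15, first_seen=0, last_sent=None))  # True  (first email)
print(should_notify(now=40, first_seen=0, last_sent=15))    # False (< repeat_interval)
print(should_notify(now=80, first_seen=0, last_sent=15))    # True  (re-send)
```

Real Alertmanager additionally batches alerts per group and applies group_interval when new alerts join an existing group, which this sketch leaves out.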
4. Create a PVC for data persistence. This yaml uses the same StorageClass as the Prometheus installation for dynamic provisioning; adjust it to your own environment:
[root@k8s-master prometheus-k8s]# vim alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
5. Write the Deployment yaml to deploy the alertmanager pod:
[root@k8s-master prometheus-k8s]# vim alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: prometheus-alertmanager
        image: "prom/alertmanager:v0.14.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/alertmanager.yml
        - --storage.path=/data
        - --web.external-url=/
        ports:
        - containerPort: 9093
        readinessProbe:
          httpGet:
            path: /#/status
            port: 9093
          initialDelaySeconds: 30
          timeoutSeconds: 30
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: "/data"
          subPath: ""
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
      - name: prometheus-alertmanager-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9093/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: storage-volume
        persistentVolumeClaim:
          claimName: alertmanager
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
6. Create the alertmanager Service to expose its port:
[root@k8s-master prometheus-k8s]# vim alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  type: "ClusterIP"
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 9093
  selector:
    k8s-app: alertmanager
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
7. Check the deployment status; pod/alertmanager-5d75d5688f-fmlq6 and service/alertmanager are running normally:
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/alertmanager-5d75d5688f-fmlq6 2/2 Running 4 10d 172.17.15.2 192.168.73.140 <none> <none>
pod/coredns-5bd5f9dbd9-qxvmz 1/1 Running 0 42m 172.17.33.2 192.168.73.138 <none> <none>
pod/grafana-0 1/1 Running 3 16d 172.17.31.2 192.168.73.139 <none> <none>
pod/kube-state-metrics-7c76bdbf68-hv56m 2/2 Running 0 23h 172.17.15.3 192.168.73.140 <none> <none>
pod/kubernetes-dashboard-7d77666777-d5ng4 1/1 Running 6 17d 172.17.31.4 192.168.73.139 <none> <none>
pod/prometheus-0                            2/2     Running   8          16d   172.17.83.2   192.168.73.135   <none>   <none>

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE   SELECTOR
service/alertmanager ClusterIP 10.0.0.207 <none> 80/TCP 14d k8s-app=alertmanager
service/grafana NodePort 10.0.0.74 <none> 80:30091/TCP 16d app=grafana
service/kube-dns ClusterIP 10.0.0.2 <none> 53/UDP,53/TCP 42m k8s-app=kube-dns
service/kube-state-metrics ClusterIP 10.0.0.194 <none> 8080/TCP,8081/TCP 15d k8s-app=kube-state-metrics
service/kubernetes-dashboard NodePort 10.0.0.127 <none> 443:30001/TCP 18d k8s-app=kubernetes-dashboard
service/prometheus NodePort 10.0.0.33 <none> 9090:30090/TCP 15d k8s-app=prometheus
6.3 Testing alert delivery
Log in to the Prometheus web UI and open the Alerts menu; the four alerting rules defined in prometheus-rules.yaml are listed there.
Since the rules define an InstanceDown alert, we can stop the kubelet on one of the nodes (here on the master, 192.168.73.135) to test whether an alert email arrives:
[root@k8s-master prometheus-k8s]# kubectl get node
NAME STATUS ROLES AGE VERSION
192.168.73.135 Ready <none> 18d v1.15.2
192.168.73.138 Ready <none> 17d v1.15.2
192.168.73.139 Ready <none> 19d v1.15.2
192.168.73.140 Ready <none> 19d v1.15.2
[root@k8s-master prometheus-k8s]# systemctl stop kubelet
[root@k8s-master prometheus-k8s]# kubectl get node
NAME STATUS ROLES AGE VERSION
192.168.73.135 NotReady <none> 18d v1.15.2
192.168.73.138 Ready <none> 17d v1.15.2
192.168.73.139 Ready <none> 19d v1.15.2
192.168.73.140 Ready <none> 19d v1.15.2
After a short wait, refresh the alerts page in the web UI: the InstanceDown rule turns pink and shows "2 active".
After the five minutes required by the rule, check the configured inbox: an InstanceDown alert email has arrived. The sending interval can be tuned in alertmanager-configmap.yaml. Once the stopped kubelet is brought back up, no further alert emails are sent.
If you have read this far, you should have a solid idea of how to deploy a Prometheus monitoring platform on a k8s cluster for data visualization and alerting. Reading is no substitute for doing, though, so I encourage you to try it all yourself.