Deploying prometheus + grafana + alertmanager + chronograf + prometheus-webhook-dingtalk + loki with docker-compose
Directory tree:
├── alertmanager
│ ├── alertmanager.yml
│ └── alertmanager.yml_bak
├── chronograf
├── docker-compose.yaml
├── grafana
│ └── grafana.ini
├── influxdb
│ └── influxdb.conf
├── loki
│ └── loki-local-config.yaml
├── prometheus
│ ├── kafka
│ │ └── kafka.yaml_bak
│ ├── prod
│ │ ├── jvm.yml
│ │ ├── jvm.yml_bak
│ │ ├── nginx.yml
│ │ ├── node.yml
│ │ └── rocketmq.yml_bak
│ ├── prometheus.yml
│ ├── rule
│ │ ├── alert.yml
│ │ ├── node_down.yml
│ │ └── rules_rocketmq.yml
│ └── test
│ ├── jvm.yml.bak
│ ├── nginx.yml
│ └── node.yml
├── prometheus-am-executor
└── prometheus-webhook-dingtalk
    └── config.yml
1. Write the docker-compose.yml
version: "3"

#networks:
#  loki:

#volumes:
#  prometheus:

services:
  loki:
    container_name: loki
    image: grafana/loki:master-96515e3
    ports:
      - "3100:3100"
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - ./loki:/etc/loki
    command: -config.file=/etc/loki/loki-local-config.yaml
    restart: always
#    networks:
#      - loki

  prometheus:
    container_name: prometheus
    image: prom/prometheus:v2.20.0-rc.1
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      #- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/:/etc/prometheus/
      #- prometheus:/prometheus:/etc/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: always
    depends_on:
      - influxdb

  alertmanager:
    image: prom/alertmanager:v0.21.0
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - ./alertmanager/:/etc/alertmanager/
    ports:
      - "9093:9093"
    depends_on:
      - prometheus
      - prometheus-webhook-dingtalk
#      - webhook-adapter

#  webhook-adapter:
#    image: guyongquan/webhook-adapter:latest
#    container_name: webhook-adapter
#    hostname: webhook-adapter
#    restart: always
#    command:
#      - '--adapter=/app/prometheusalert/wx.js=/adapter/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=99ee3f72-2831-4bf0-9940-0f21124487eb'

  # DingTalk plugin
  prometheus-webhook-dingtalk:
    image: timonwong/prometheus-webhook-dingtalk:latest
    container_name: prometheus-webhook-dingtalk
    ports:
      - "8060:8060"
    hostname: prometheus-webhook-dingtalk
    restart: always
    #command:
    #  - '--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=b71a4d4fd97611aef8f5d1dd8250f0b9065d9c1d11acdf42fe795c93c5ec37f1'
    volumes:
      - type: bind
        source: ./prometheus-webhook-dingtalk/config.yml
        target: /etc/prometheus-webhook-dingtalk/config.yml
        read_only: true

  grafana:
    container_name: grafana
    image: grafana/grafana:8.4.0
    ports:
      - "3000:3000"
    privileged: true
    environment:
      - TZ=Asia/Shanghai
      - GF_EXPLORE_ENABLED=true
    volumes:
      - ./grafana/grafana.ini:/etc/grafana/grafana.ini
      - ./grafana/data:/var/lib/grafana
    restart: always
    depends_on:
      - prometheus
      - loki

#  promtail:
#    image: grafana/promtail:make-images-static-26a87c9
#    volumes:
#      - .:/etc/promtail
#      - /var/log:/var/log
#    command:
#      -config.file=/etc/promtail/promtail-docker-config.yaml
#    networks:
#      - loki

  influxdb:
    image: influxdb:1.7
    container_name: influxdb
    restart: always
    hostname: influxdb
    ports:
      - "8086:8086"
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./influxdb:/var/lib/influxdb
      - ./influxdb/influxdb.conf:/etc/influxdb/influxdb.conf
    command: -- godebug=madvdontneed=1

  chronograf:
    image: chronograf:1.7
    container_name: chronograf
    restart: always
    hostname: chronograf
    ports:
      - "8083:8888"
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - ./chronograf:/var/lib/chronograf
    depends_on:
      - influxdb

  cadvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    hostname: cadvisor
    ports:
      - "8080:8080"
    privileged: true
    environment:
      - TZ=Asia/Shanghai
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
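Before bringing the stack up, it helps to confirm that every host path bind-mounted in the compose file actually exists, since Docker will otherwise create empty directories in their place. A minimal sketch (the helper name and path list are ours, derived from the compose file above):

```python
from pathlib import Path

# Relative host paths bind-mounted in docker-compose.yaml above.
MOUNTS = [
    "loki/loki-local-config.yaml",
    "alertmanager/alertmanager.yml",
    "prometheus/prometheus.yml",
    "prometheus-webhook-dingtalk/config.yml",
    "grafana/grafana.ini",
    "influxdb/influxdb.conf",
]

def missing_mounts(base, mounts):
    """Return the mount paths that do not exist under the project root."""
    root = Path(base)
    return [m for m in mounts if not (root / m).exists()]

# Run from the project root; anything printed here needs to be created first.
print(missing_mounts(".", MOUNTS))
```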
Note: the following is the WeChat Work (企业微信) notification configuration.

# Add webhook-adapter to the alertmanager service's depends_on, then define the service:
  webhook-adapter:
    image: guyongquan/webhook-adapter:latest
    container_name: webhook-adapter
    hostname: webhook-adapter
    restart: always
    command:
      - '--adapter=/app/prometheusalert/wx.js=/adapter/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=940-0f21124487eb'
The corresponding alertmanager.yml then looks like this:
global:
  resolve_timeout: 5m
route:
  group_by: [...]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 30m
  receiver: 'wx-robot'
  routes:
    - receiver: 'wx-robot'
      continue: true
      match_re:
        alertname: ".*"
    - receiver: 'rxy'
      continue: true
      match_re:
        alertname: (InstanceDown|NodeFilesystemUsage2)
        env: prod
receivers:
  - name: 'wx-robot'
    webhook_configs:
      - url: 'http://webhook-adapter/adapter/wx'
        send_resolved: true
  - name: 'rxy'
    webhook_configs:
#      - url: 'http://api.aiops.com/alert/api/event/prometheus/XXXXXXXXXXXXXXXXXX'
#        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
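The route tree above sends everything to wx-robot and, additionally, InstanceDown / NodeFilesystemUsage2 alerts from prod to rxy, because both child routes set continue: true. A simplified model of that matching logic (one level of routes, anchored regex matching as Alertmanager applies for match_re; the function and route dicts are illustrative, not Alertmanager's API):

```python
import re

# Simplified model of the two child routes in the alertmanager.yml above.
ROUTES = [
    {"receiver": "wx-robot", "match_re": {"alertname": ".*"}, "continue": True},
    {"receiver": "rxy",
     "match_re": {"alertname": "(InstanceDown|NodeFilesystemUsage2)", "env": "prod"},
     "continue": True},
]

def matched_receivers(labels, routes, default="wx-robot"):
    """Walk child routes in order, collecting every matching receiver."""
    hit = []
    for r in routes:
        if all(re.fullmatch(p, labels.get(k, "")) for k, p in r["match_re"].items()):
            hit.append(r["receiver"])
            if not r["continue"]:
                break
    return hit or [default]  # no child matched -> parent receiver

print(matched_receivers({"alertname": "InstanceDown", "env": "prod"}, ROUTES))
print(matched_receivers({"alertname": "NodeMemoryUsage"}, ROUTES))
```

An InstanceDown alert from prod reaches both receivers; a NodeMemoryUsage alert only reaches wx-robot.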
2. DingTalk configuration
alertmanager/alertmanager.yml is configured as follows:
global:
  resolve_timeout: 5m
route:
  receiver: prometheus-webhook-dingtalk
  group_wait: 10s
  group_interval: 15s
  repeat_interval: 30s
  group_by: [alertname]
  routes:
    - receiver: prometheus-webhook-dingtalk
      group_wait: 10s
receivers:
  - name: prometheus-webhook-dingtalk
    webhook_configs:
      - url: http://prometheus-webhook-dingtalk:8060/dingtalk/webhook1/send
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
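The inhibit_rules block appears in both variants: a firing critical alert silences a warning alert when all the listed labels are equal (a label missing on both sides also counts as equal). A small sketch of that semantics (the helper is illustrative, not Alertmanager code):

```python
# Labels that must agree between source and target, per the config above.
EQUAL = ["alertname", "dev", "instance"]

def inhibited(target, firing, equal=EQUAL):
    """True if some firing critical alert silences this warning alert."""
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing
    )

crit = {"alertname": "NodeFilesystemUsage", "instance": "172.16.0.154:9100", "severity": "critical"}
warn = {"alertname": "NodeFilesystemUsage", "instance": "172.16.0.154:9100", "severity": "warning"}
print(inhibited(warn, [crit]))
```

With both disk-usage rules firing on the same instance, only the critical notification goes out.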
The corresponding service definition in docker-compose.yml:
# DingTalk plugin
prometheus-webhook-dingtalk:
  image: timonwong/prometheus-webhook-dingtalk:latest
  container_name: prometheus-webhook-dingtalk
  ports:
    - "8060:8060"
  hostname: prometheus-webhook-dingtalk
  restart: always
  #command:
  #  - '--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXXXXXXXXXXXXXXXX'
  volumes:
    - type: bind
      source: ./prometheus-webhook-dingtalk/config.yml
      target: /etc/prometheus-webhook-dingtalk/config.yml
      read_only: true
Partial prometheus-webhook-dingtalk/config.yml:
targets:
  webhook1:
    # Replace with your DingTalk robot's webhook URL
    url: https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    mention:
      all: true
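For reference, the plugin ultimately posts a DingTalk markdown message to the webhook URL above. A hand-rolled payload of the same general shape (illustrative only; the exact fields are defined by the plugin's templates, and the helper name is ours):

```python
import json

def dingtalk_markdown(title, text):
    """Build a DingTalk robot markdown payload, @-mentioning everyone."""
    return json.dumps({
        "msgtype": "markdown",
        "markdown": {"title": title, "text": text},
        "at": {"isAtAll": True},   # mirrors `mention: all: true` above
    })

payload = dingtalk_markdown("InstanceDown", "**InstanceDown**: 172.16.0.154:9100 is down")
print(payload)
```

Posting a payload like this with curl to the robot URL is also a quick way to verify the access_token before wiring up Alertmanager.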
3. The remaining configuration is identical in both cases
Contents of loki/loki-local-config.yaml:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 16

# Maximum queryable history
chunk_store_config:
  max_look_back_period: 168h

# Enable the retention policy (expired chunks are deleted);
# retention_period must be a multiple of 168h
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
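The multiple-of-168h constraint comes from the index period: retention is applied whole table (one week) at a time. A quick sanity check for candidate values (hypothetical helper):

```python
def valid_retention(period_hours, index_period_hours=168):
    """Loki table-manager retention must be a whole multiple of the index period."""
    return period_hours % index_period_hours == 0

# 168h (1 week) and 336h (2 weeks) are fine; 200h would be rejected.
print(valid_retention(168), valid_retention(336), valid_retention(200))
```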
The main Prometheus configuration file:
prometheus/prometheus.yml
global:
  scrape_interval: 40s     # how often to scrape targets
  evaluation_interval: 40s # how often to evaluate rules
  scrape_timeout: 20s      # per-scrape timeout
#  external_labels:
#    monitor: 'codelab-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - rule/*.yml
#  - "node_down.yml"
#  - "memory_over.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['prometheus:9090']

#  - job_name: 'node'
#    scrape_interval: 5s
#    static_configs:
#      - targets: ['172.31.28.235:9100']
#      - targets: ['172.31.32.111:9100']
#        labels:
#          env: 'prod'
#          job: 'node'

#  - job_name: 'vpc_md_kafka'
#    scrape_interval: 5s
#    static_configs:
#      - targets: ['172.63.11.78:9308']
#    file_sd_configs:
#      - files:
#          - kafka/*.yml
#        refresh_interval: 30s

  - job_name: 'Biking_JVM_Monitor'
#    scrape_interval: 5s
    file_sd_configs:
      - files:
          - prod/*.yml
        refresh_interval: 40s

  - job_name: 'node-test'
#    scrape_interval: 5s
    file_sd_configs:
      - files:
          - test/*.yml
        refresh_interval: 40s
      - files:
          - pre/*.yml
        refresh_interval: 30s

#  - job_name: 'nginx_status_module'  # scrape nginx metrics
#    metrics_path: '/metrics'         # path of the metrics endpoint
#    file_sd_configs:
#      - files:
#          - prod/nginx.yml
#    scrape_interval: 10s

#  - job_name: mysql-prod
#    static_configs:
#      - targets: ['172.31.11.215:9104']
#        labels:
#          env: prod

#  - job_name: nginx-prod
#    static_configs:
#      - targets: ['172.31.30.4:9913','172.31.29.125:9913']
#        labels:
#          env: prod

# for elasticsearch
#  - job_name: 'elasticsearch'
#    scrape_interval: 60s
#    scrape_timeout: 60s
#    static_configs:
#      - targets: ['172.63.9.12:9308','172.63.9.32:9308','172.63.9.211:9308']
#        labels:
#          service: elasticsearch

#  - job_name: 'clickhouse'
#    scrape_interval: 60s
#    scrape_timeout: 60s
#    static_configs:
#      - targets: ['172.63.9.211:9363']
#        labels:
#          service: clickhouse

  - job_name: 'rocketmq'
    file_sd_configs:
      - files:
          - prod/rocketmq.yml

remote_write:
  - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus"
    basic_auth:
      username: admin
      password: admin
    queue_config:
      capacity: 1000
      max_shards: 500
      min_shards: 200
      max_samples_per_send: 200
      batch_send_deadline: 10s

remote_read:
  - url: "http://influxdb:8086/api/v1/prom/read?db=prometheus"
    basic_auth:
      username: admin
      password: admin
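One thing worth checking in a config like this: Prometheus requires scrape_timeout to be no larger than scrape_interval, and a job that overrides only the interval (like the 'prometheus' job's 5s above) still inherits the global 20s timeout, which newer Prometheus versions reject or cap. A small checker sketch (the helpers are ours, not part of Prometheus):

```python
def to_seconds(d):
    """Convert a simple Prometheus duration like '40s', '5m', '1h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(d[:-1]) * units[d[-1]]

def timeout_ok(scrape_interval, scrape_timeout):
    """Prometheus requires scrape_timeout <= scrape_interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)

print(timeout_ok("40s", "20s"))  # global settings are consistent
print(timeout_ok("5s", "20s"))   # job-level 5s interval + inherited 20s timeout is flagged
```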
The prometheus directory contains the following target files.
Under the kafka directory:
kafka/kafka.yaml
- targets:
    - '172.63.11.78:9308'
Under the prod directory:
JVM scrape targets, prod/jvm.yml:
##################################
#- targets:
# - 'xxx.xxx.xxx.xxx:10039'
# labels:
# env: prod
# project: biking
# name: kubiex-xxx
prod/nginx.yml:
- targets:
    - '172.16.0.165:9113'
  labels:
    env: Biking_Prod_Admin
prod/node.yml
- targets:
    - '172.16.0.154:9100'
    - '172.16.0.155:9100'
    - '172.16.0.156:9100'
prod/rocketmq.yml:
- targets:
    - '172.63.1.135:5557'
  labels:
    env: prod
    project: biking
    name: rocketmq
    cluster: cluster
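Because these jobs use file_sd_configs, Prometheus re-reads the target files on every refresh_interval with no restart needed. file_sd also accepts JSON, so target files can be generated programmatically; a sketch (paths and helper are illustrative):

```python
import json, os, tempfile

def write_file_sd(path, targets, labels=None):
    """Write a Prometheus file_sd target group as JSON."""
    group = {"targets": targets}
    if labels:
        group["labels"] = labels
    with open(path, "w") as f:
        json.dump([group], f, indent=2)

# e.g. regenerate something like prod/node.yml as prod/node.json
tmp = os.path.join(tempfile.gettempdir(), "node.json")
write_file_sd(tmp, ["172.16.0.154:9100", "172.16.0.155:9100"], {"env": "prod"})
print(open(tmp).read())
```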
rule/alert.yml:
groups:
- name: NodeMemoryUsage
  rules:
    - alert: NodeMemoryUsage
      expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
      for: 5m
      annotations:
        summary: "{{$labels.instance}}: high memory usage"
        description: "{{$labels.instance}}: memory usage is above 90% (current: {{ $value }})"
- name: NodeFilesystemUsage
  rules:
    - alert: NodeFilesystemUsage
      expr: ((node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}))) > 85
      for: 5m
      annotations:
        summary: "{{$labels.instance}}: high disk partition usage"
        description: "{{$labels.mountpoint}}: partition usage is above 85% (current: {{ $value }})"
- name: NodeFilesystemUsage2
  rules:
    - alert: NodeFilesystemUsage2
      expr: ((node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}))) > 90
      for: 1m
      annotations:
        summary: "{{$labels.instance}}: high disk partition usage"
        description: "{{$labels.mountpoint}}: partition usage is above 90% (current: {{ $value }})"
- name: NodeCpuUsage
  rules:
    - alert: NodeCpuUsage
      expr: (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 120 > 90
      for: 5m
      annotations:
        summary: "{{$labels.instance}}: high CPU usage"
        description: "{{$labels.instance}}: CPU usage is above 92% (current: {{ $value }})"
- name: Node_load1
  rules:
    - alert: Node_load1
      expr: (sum(node_load5{job='node-prod'}) by (instance)) > (sum(count(node_cpu_seconds_total{job='node-prod',mode='system'}) by (cpu,instance)) by (instance))
      for: 5m
      annotations:
        summary: "{{$labels.instance}}: high system load"
        description: "{{$labels.instance}}: system load exceeds the CPU count (current: {{ $value }})"
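The NodeMemoryUsage expression treats buffers and cache as reclaimable, matching how `free` reports "available" memory. The same arithmetic in plain Python, with illustrative byte counts:

```python
def memory_used_percent(total, free, buffers, cached):
    """Mirror of the NodeMemoryUsage PromQL expression, values in bytes."""
    return (total - (free + buffers + cached)) / total * 100

# 16 GB total, 1 GB free, 0.5 GB buffers, 2.5 GB cached -> 12/16 used
usage = memory_used_percent(total=16e9, free=1e9, buffers=0.5e9, cached=2.5e9)
print(round(usage, 1), usage > 90)  # 75.0 False -> below the 90% alert threshold
```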
rule/node_down.yml
groups:
- name: InstanceDown
  rules:
    - alert: InstanceDown
      expr: up == 0
      for: 10m
      annotations:
        summary: "Service {{ $labels.instance }} is down"
        description: "{{ $labels.instance }}: (port 9100 is a server node, xxxx9 is a Java service)"
rule/rules_rocketmq.yml
# rules_rocketmq.yml
groups:
- name: rocketmq
  rules:
    - alert: RocketMQ Exporter is Down
      expr: up{job="rocketmq"} == 0
      for: 20s
      labels:
        severity: 'disaster'
      annotations:
        summary: RocketMQ {{ $labels.instance }} is down
    - alert: RocketMQ message backlog
      expr: (sum(irate(rocketmq_producer_offset[1m])) by (topic) - on(topic) group_right sum(irate(rocketmq_consumer_offset[1m])) by (group,topic)) > 5
      for: 5m
      labels:
        severity: 'warning'
      annotations:
        summary: RocketMQ (group={{ $labels.group }} topic={{ $labels.topic }}) backlog = {{ $value }}
    - alert: GroupGetLatencyByStoretime consumer group latency too high
      expr: rocketmq_group_get_latency_by_storetime/1000 > 10 and rate(rocketmq_group_get_latency_by_storetime[5m]) > 0
      for: 3m
      labels:
        severity: warning
      annotations:
        description: 'consumer {{$labels.group}} on {{$labels.broker}}, {{$labels.topic}} consume time lags behind message store time (behind value is {{$value}}).'
        summary: consumer group consume latency too high
    - alert: RocketMQClusterProduceHigh cluster TPS > 60
      expr: sum(rocketmq_producer_tps) by (cluster) >= 60
      for: 3m
      labels:
        severity: warning
      annotations:
        description: '{{$labels.cluster}} sending TPS too high. now TPS = {{ $value }}'
        summary: cluster send tps too high
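The backlog rule above compares the per-topic producer offset rate against the consumer offset rate; a sustained positive gap means consumers are falling behind. The same arithmetic in plain Python (illustrative rates in messages/second):

```python
def backlog_rate(producer_rate, consumer_rate):
    """Net growth of the backlog: production rate minus consumption rate."""
    return producer_rate - consumer_rate

rate = backlog_rate(producer_rate=120.0, consumer_rate=110.0)
# A gap above 5 msgs/s sustained for 5m fires the backlog alert.
print(rate, rate > 5)
```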
Finally, start everything from the project directory:
docker-compose up -d
docker-compose ps
Name                          Command                          State          Ports
----------------------------------------------------------------------------------------------------------------------
alertmanager /bin/alertmanager --config ... Up 0.0.0.0:9093->9093/tcp,:::9093->9093/tcp
cadvisor /usr/bin/cadvisor -logtostderr Up (healthy) 0.0.0.0:8080->8080/tcp,:::8080->8080/tcp
chronograf /entrypoint.sh chronograf Up 0.0.0.0:8083->8888/tcp,:::8083->8888/tcp
grafana /run.sh Up 0.0.0.0:3000->3000/tcp,:::3000->3000/tcp
influxdb /entrypoint.sh -- godebug= ... Up 0.0.0.0:8086->8086/tcp,:::8086->8086/tcp
loki /usr/bin/loki -config.file ... Up 0.0.0.0:3100->3100/tcp,:::3100->3100/tcp
prometheus /bin/prometheus --config.f ... Up 0.0.0.0:9090->9090/tcp,:::9090->9090/tcp
prometheus-webhook-dingtalk /bin/prometheus-webhook-di ... Up 0.0.0.0:8060->8060/tcp,:::8060->8060/tcp
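Beyond `docker-compose ps`, a quick way to confirm the stack is actually reachable is to probe each published port. A minimal sketch (the port map below just restates the compose file; run it on the Docker host):

```python
import socket

# Published ports from docker-compose.yaml above.
PORTS = {"grafana": 3000, "prometheus": 9090, "alertmanager": 9093,
         "loki": 3100, "influxdb": 8086, "dingtalk-webhook": 8060}

def port_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in PORTS.items():
    state = "up" if port_open("127.0.0.1", port) else "down"
    print(f"{name:17} {port:5} {state}")
```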
4. Reset the Grafana admin password
docker exec -it grafana /bin/bash
bash-5.1$ grafana-cli admin reset-admin-password XXXXXXXXXX
INFO[02-24|20:52:28] Starting Grafana logger=settings version= commit= branch= compiled=1970-01-01T00:00:00+0000
INFO[02-24|20:52:28] The state of unified alerting is still not defined. The decision will be made during as we run the database migrations logger=settings
WARN[02-24|20:52:28] falling back to legacy setting of 'min_interval_seconds'; please use the configuration option in the `unified_alerting` section if Grafana 8 alerts are enabled. logger=settings
INFO[02-24|20:52:28] Config loaded from logger=settings file=/usr/share/grafana/conf/defaults.ini
INFO[02-24|20:52:28] Config overridden from Environment variable logger=settings var="GF_PATHS_DATA=/var/lib/grafana"
INFO[02-24|20:52:28] Config overridden from Environment variable logger=settings var="GF_PATHS_LOGS=/var/log/grafana"
INFO[02-24|20:52:28] Config overridden from Environment variable logger=settings var="GF_PATHS_PLUGINS=/var/lib/grafana/plugins"
INFO[02-24|20:52:28] Config overridden from Environment variable logger=settings var="GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
INFO[02-24|20:52:28] Config overridden from Environment variable logger=settings var="GF_EXPLORE_ENABLED=true"
INFO[02-24|20:52:28] Path Home logger=settings path=/usr/share/grafana
INFO[02-24|20:52:28] Path Data logger=settings path=/var/lib/grafana
INFO[02-24|20:52:28] Path Logs logger=settings path=/var/log/grafana
INFO[02-24|20:52:28] Path Plugins logger=settings path=/var/lib/grafana/plugins
INFO[02-24|20:52:28] Path Provisioning logger=settings path=/etc/grafana/provisioning
INFO[02-24|20:52:28] App mode production logger=settings
INFO[02-24|20:52:28] Connecting to DB logger=sqlstore dbtype=sqlite3
WARN[02-24|20:52:28] SQLite database file has broader permissions than it should logger=sqlstore path=/var/lib/grafana/grafana.db mode=-rw-r--r-- expected=-rw-r-----
INFO[02-24|20:52:28] Starting DB migrations logger=migrator
INFO[02-24|20:52:28] migrations completed logger=migrator performed=0 skipped=381 duration=344.934µs

Admin password changed successfully ✔
bash-5.1$
After logging in, the JVM dashboard we configured looks as shown below.
Login user: admin
Password: the password we just reset