Prometheus+Alertmanager+webhook-dingtalk实现钉钉告警
本文最后更新于 2024-07-22,
若内容或图片失效,请留言反馈。部分素材来自网络,若不小心影响到您的利益, 请联系我 删除。
本站只有Telegram群组为唯一交流群组, 点击加入
文章内容有误?申请成为本站文章修订者或作者? 向站长提出申请
原文:https://www.cnblogs.com/JIKes/p/18183559
创建钉钉机器人
安装及启动
安装配置只涉及到安装及正常启动无误,并不涉及配置内容!
创建 /usr/local/prometheus
目录,设计到所有安装的服务,咱们都放到此目录中。
mkdir /usr/local/prometheus
配置时间同步 && 定时同步
yum -y install ntpdate
ntpdate ntp1.aliyun.com
echo "0 1 * * * ntpdate ntp1.aliyun.com" >> /var/spool/cron/root
crontab -l
Prometheus安装启动
wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
tar zxf prometheus-2.53.1.linux-amd64.tar.gz
mv prometheus-2.53.1.linux-amd64 /usr/local/prometheus/prometheus
配置Prometheus使用systemd管理
vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus-Server
Documentation=https://prometheus.io/
After=network.target
[Service]
ExecStart=/usr/local/prometheus/prometheus/prometheus --web.listen-address=0.0.0.0:9091 --config.file=/usr/local/prometheus/prometheus/prometheus.yml
User=root
[Install]
WantedBy=multi-user.target
启动 && 开机自启
systemctl enable prometheus --now
systemctl status prometheus
浏览器访问 http://IP:9091
,显示下图表示无误~
Node_export安装启动
安装node_export
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar zxf node_exporter-v1.8.2.linux-amd64.tar.gz
mv node_exporter-v1.8.2.linux-amd64 /usr/local/prometheus/node_exporter
配置node_exporter使用systemd管理
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus-Server
After=network.target
[Service]
ExecStart=/usr/local/prometheus/node_exporter/node_exporter --web.listen-address=0.0.0.0:9100
User=root
[Install]
WantedBy=multi-user.target
启动 && 开机自启
systemctl enable node_exporter --now
systemctl status node_exporter
浏览器访问 http://IP:9100/metrics,显示下图表示无误~
Alertmanager安装启动
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar zxf alertmanager-0.27.0.linux-amd64.tar.gz
mv alertmanager-0.27.0.linux-amd64 /usr/local/prometheus/alertmanager
配置Alertmanager使用systemd管理
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Prometheus-Server
After=network.target
[Service]
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --cluster.advertise-address=0.0.0.0:9093 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
User=root
[Install]
WantedBy=multi-user.target
启动 && 开机自启
systemctl enable alertmanager --now
systemctl status alertmanager
浏览器访问 http://IP:9093,显示下图表示无误~
Webhook-dingtalk安装启动
安装webhook-dingtalk插件
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
tar zxf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/prometheus/webhook-dingtalk
配置webhook-dingtalk使用systemd管理
cp /usr/local/prometheus/webhook-dingtalk/config.example.yml /usr/local/prometheus/webhook-dingtalk/config.yml
vim /usr/lib/systemd/system/webhook.service
[Unit]
Description=Prometheus-Server
After=network.target
[Service]
ExecStart=/usr/local/prometheus/webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus/webhook-dingtalk/config.yml
User=root
[Install]
WantedBy=multi-user.target
启动 && 开机自启
systemctl enable webhook.service --now
systemctl status webhook.service
验证,查看端口是否启动
netstat -anput |grep 8060
配置及测试
Webhook-dingtalk配置钉钉webhook地址
Webhook-dingtalk配置相对比较简单,只改以下三处即可,如下图:
加签秘钥、webhook地址是咱们在钉钉创建机器人时获取的!
vim /usr/local/prometheus/webhook-dingtalk/config.yml
添加钉钉报警模板
vim /usr/local/prometheus/webhook-dingtalk/template.tmpl
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
**告警主题**: {{ .Annotations.summary }}
**告警类型**: {{ .Labels.alertname }}
**告警级别**: {{ .Labels.severity }}
**告警主机**: {{ .Labels.instance }}
**告警信息**: {{ index .Annotations "description" }}
**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
**告警主题**: {{ .Annotations.summary }}
**告警类型**: {{ .Labels.alertname }}
**告警级别**: {{ .Labels.severity }}
**告警主机**: {{ .Labels.instance }}
**告警信息**: {{ index .Annotations "description" }}
**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**恢复时间**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}
{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====侦测到{{ .Alerts.Firing | len }}个故障====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**====恢复{{ .Alerts.Resolved | len }}个故障====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
重启
systemctl restart webhook
systemctl status webhook
Alertmanager配置钉钉告警
vim /usr/local/prometheus/alertmanager/alertmanager.yml
route:
group_by: ['dingding']
group_wait: 30s
group_interval: 1h
repeat_interval: 1h
receiver: 'dingding.webhook1'
routes:
- receiver: 'dingding.webhook1'
match_re:
alertname: ".*"
receivers:
- name: 'dingding.webhook1'
webhook_configs:
- url: 'http://16.32.15.200:8060/dingtalk/webhook1/send'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
systemctl restart alertmanager
systemctl status alertmanager
Prometheus集成Alertmanager及告警规则配置
vim /usr/local/prometheus/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- 16.32.15.200:9093
rule_files:
- "/usr/local/prometheus/prometheus/rule/node_exporter.yml"
scrape_configs:
- job_name: "VMware-16.32.15.200"
static_configs:
- targets: ["16.32.15.200:59100"]
添加node_exporter告警规则
mkdir /usr/local/prometheus/prometheus/rule
vim /usr/local/prometheus/prometheus/rule/node_exporter.yml
groups:
- name: 服务器资源监控
rules:
- alert: 内存使用率过高
expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
for: 3m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 内存使用率过高, 请尽快处理!"
description: "{{ $labels.instance }}内存使用率超过80%,当前使用率{{ $value }}%."
- alert: 服务器宕机
expr: up == 0
for: 1s
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 服务器宕机, 请尽快处理!"
description: "{{$labels.instance}} 服务器延时超过3分钟,当前状态{{ $value }}. "
- alert: CPU高负荷
expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} CPU使用率过高,请尽快处理!"
description: "{{$labels.instance}} CPU使用大于90%,当前使用率{{ $value }}%. "
- alert: 磁盘IO性能
expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流入磁盘IO使用率过高,请尽快处理!"
description: "{{$labels.instance}} 流入磁盘IO大于90%,当前使用率{{ $value }}%."
- alert: 网络流入
expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流入网络带宽过高,请尽快处理!"
description: "{{$labels.instance}} 流入网络带宽持续5分钟高于100M. RX带宽使用量{{$value}}."
- alert: 网络流出
expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流出网络带宽过高,请尽快处理!"
description: "{{$labels.instance}} 流出网络带宽持续5分钟高于100M. RX带宽使用量{$value}}."
- alert: TCP连接数
expr: node_netstat_Tcp_CurrEstab > 10000
for: 2m
labels:
severity: 严重告警
annotations:
summary: " TCP_ESTABLISHED过高!"
description: "{{$labels.instance}} TCP_ESTABLISHED大于100%,当前使用率{{ $value }}%."
- alert: 磁盘容量
expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90
for: 1m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.mountpoint}} 磁盘分区使用率过高,请尽快处理!"
description: "{{$labels.instance}} 磁盘分区使用大于90%,当前使用率{{ $value }}%."
重启服务
systemctl restart prometheus
systemctl status prometheus
访问Prometheus Web页面可以查看到添加的规则,如下图:
测试告警
故意将node_exporter停止掉,模拟服务器宕机
systemctl stop node_exporter
Prometheus会将告警信息发送给Alertmanager,所以说Alertmanager页面可以看到告警信息如下图:
评论
隐私政策
你无需删除空行,直接评论以获取最佳展示效果