2024-02-19    2024-02-20

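Background

The VPN link between our server ng-ec-pay-1 and a third party has dropped several times. Because there was no monitoring or alerting, none of these outages was handled promptly. To detect and resolve such failures quickly, the VPN link needs to be monitored with Prometheus.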

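Approach

After some research, blackbox_exporter turned out to support probing over HTTP, HTTPS, TCP, ICMP, and DNS.

blackbox_exporter running on ng-ec-pay-1 can probe port 443 on the third party's host 172.27.5.41 (172.27.5.41:443). When the probe succeeds, the VPN link is considered healthy.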
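
The idea can be tried by hand before installing anything. The sketch below is not part of the original setup: `tcp_check` is a hypothetical helper that uses bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command to test whether a TCP connection can be opened, which is roughly what a TCP connect probe checks.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): print "up" if a TCP
# connection to host:port can be opened within the timeout, otherwise "down".
tcp_check() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  # Inside the child bash, $0 is the host and $1 is the port.
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "up"
  else
    echo "down"
  fi
}
```

Run on ng-ec-pay-1, `tcp_check 172.27.5.41 443` would print `up` while the VPN link carries traffic and `down` once the connection attempt fails or times out.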

Installing blackbox_exporter

Deploy blackbox_exporter on the ng-ec-pay-1 server:

cd /usr/local/src/
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.19.0/blackbox_exporter-0.19.0.linux-amd64.tar.gz
tar -zxf blackbox_exporter-0.19.0.linux-amd64.tar.gz -C /opt/
ln -s /opt/blackbox_exporter-0.19.0.linux-amd64 /opt/blackbox_exporter

Manage the service with systemd:

cat > /etc/systemd/system/blackbox_exporter.service << EOF
[Unit]
Description=Blackbox_exporter daemon
After=network.target

[Service]
ExecStart=/opt/blackbox_exporter/blackbox_exporter --config.file=/opt/blackbox_exporter/blackbox.yml
User=root
Group=root
PrivateTmp=True

[Install]
WantedBy=multi-user.target
EOF

For ICMP probes, be sure to start the service as root. Otherwise probe_success returns 0 and the following errors are logged:

ts=2021-09-25T15:45:39.601164346Z caller=main.go:130 module=icmp target=192.168.199.213 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2021-09-25T15:45:39.601271144Z caller=main.go:130 module=icmp target=192.168.199.213 level=error msg="Error listening to socket" err="listen ip4:icmp 0.0.0.0: socket: operation not permitted"
ts=2021-09-25T15:45:39.601360342Z caller=main.go:320 module=icmp target=192.168.199.213 level=error msg="Probe failed" duration_seconds=0.00092078

Starting the service

Start the service and enable it at boot:

systemctl daemon-reload
systemctl enable blackbox_exporter.service
systemctl start blackbox_exporter.service

Verifying the service

[root@ali-microloan-ng-ec-pay-1 ~]# netstat -lntp | grep 9115
tcp6       0      0 :::9115                 :::*                    LISTEN      1865/blackbox_expor
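
Beyond checking the listening port, a probe can be triggered by hand through the exporter's `/probe` endpoint. The sketch below is an addition, not part of the original procedure: `probe_success_value` is a hypothetical helper that extracts the `probe_success` sample from the metrics text. It assumes `curl` is installed and that a `tcp_connect` module is defined in `/opt/blackbox_exporter/blackbox.yml`.

```shell
# Hypothetical helper (not from the original post): read /probe output on
# stdin and print the value of the probe_success metric.
probe_success_value() {
  # Comment lines start with "#", so only the sample line has
  # "probe_success" as its first field.
  awk '$1 == "probe_success" { print $2 }'
}

# Usage on ng-ec-pay-1 (assumes the exporter is reachable on 127.0.0.1:9115):
# curl -s 'http://127.0.0.1:9115/probe?module=tcp_connect&target=172.27.5.41:443' | probe_success_value
```

A printed value of 1 would indicate that the TCP connection through the VPN succeeded; 0 would indicate a failed probe.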

Prometheus

Add the following to the Prometheus configuration file:

  - job_name: wemabank_vpn_tcp
    honor_timestamps: true
    scrape_interval: 10s
    scrape_timeout: 10s
    metrics_path: /probe
    params:
    # Probe the target host over TCP
      module: [tcp_connect]
    static_configs:
      - targets:
      # Address and port to probe
        - 172.27.5.41:443
        # Host label information
        labels:
          dc: ali-ng-ec
          env: product
          hostgroup: fg-ng-ec-microloan
          instance: ali-microloan-ng-ec-pay-1
          lan: 10.139.7.192
          node: ali-microloan-ng-ec-pay-1
          service: blackbox_exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
      # Address of blackbox_exporter
        replacement: 10.139.7.192:9115
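
To make the relabeling easier to follow: the listed target is copied into the `target` query parameter, kept as the `instance` label, and the scrape address is then replaced by the exporter's address. The sketch below is only an illustration, with `scrape_url` as a hypothetical helper; it prints the request Prometheus effectively ends up making, ignoring any URL-encoding Prometheus applies to the query parameters.

```shell
# Hypothetical helper (not from the original post): build the URL that the
# relabeled scrape is equivalent to.
scrape_url() {
  local exporter="$1" module="$2" target="$3"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

scrape_url 10.139.7.192:9115 tcp_connect 172.27.5.41:443
```

Fetching the printed URL with `curl` from a host that can reach 10.139.7.192:9115 should return the same metrics Prometheus scrapes for this job.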

Reload the configuration file:

supervisorctl signal HUP prometheus

Opening 10.139.7.192:9115 in a browser shows the recent probe records and their logs.


Alert configuration

Add an alerting rule. A probe_success value of 1 means the probe succeeded; a value of 0 means it failed.

  - alert: wemabank_vpn
    annotations:
      description: 'wemabank vpn is down for more than 2mins'
    expr: probe_success{node="ali-microloan-ng-ec-pay-1"} == 0
    for: 2m
    labels:
      severity: critical

Because the hostgroup label is already set, alerts are sent through Alertmanager to the responsible business team.
