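Monitor Node Health

Node Problem Detector is a daemon for monitoring and reporting on a node's health. You can run it as a DaemonSet or as a standalone daemon. Node Problem Detector collects information about node problems from various daemons and reports these conditions to the API server as Node Conditions or as Events.

To learn how to install and use Node Problem Detector, see the Node Problem Detector project documentation.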

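Before you begin

You need a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with it. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one with minikube or use a hosted Kubernetes playground.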

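Limitations

Node Problem Detector uses the kernel log format to report kernel issues. To learn how to extend the kernel log format, see Add support for other log formats.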

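Enabling Node Problem Detector

Some cloud providers enable Node Problem Detector as an addon. You can also enable Node Problem Detector with kubectl or by creating an addon DaemonSet.

Using kubectl to enable Node Problem Detector

kubectl provides the most flexible way to manage Node Problem Detector. You can overwrite the default configuration to fit your environment or to detect customized node problems. For example:

Create a Node Problem Detector configuration similar to node-problem-detector.yaml: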

debug/node-problem-detector.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/

Note

You should verify that the system log directory is right for your operating system distribution.

Start Node Problem Detector with kubectl:

kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
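
After applying the manifest, it can be useful to confirm that a detector Pod is running on each node and to see which conditions a node currently reports. The following is a minimal sketch of one way to do that. The function names `list_npd_pods` and `show_node_conditions` are hypothetical helpers, not part of kubectl, and the `KUBECTL` variable is an assumption added here so that the helpers can be pointed at a different binary. The namespace and label are the ones used in the manifest above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the namespace and label come from the manifest above.

# Print the Node Problem Detector Pods and the node each one runs on.
list_npd_pods() {
  "${KUBECTL:-kubectl}" get pods \
    --namespace kube-system \
    --selector k8s-app=node-problem-detector \
    --output wide
}

# Print every condition of one node as "Type=Status", one per line.
show_node_conditions() {
  local node="$1"
  "${KUBECTL:-kubectl}" get node "$node" \
    --output 'jsonpath={range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

Call `list_npd_pods` first, then `show_node_conditions <node-name>` with a node name taken from your own cluster. Which condition types appear depends on the detector configuration in use.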

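Using an Addon pod to enable Node Problem Detector

If you are using a custom cluster bootstrap solution and don't need to overwrite the default configuration, you can use the Addon pod to further automate the deployment.

Create node-problem-detector.yaml, and save the configuration in the Addon pod's directory /etc/kubernetes/addons/node-problem-detector on a control plane node.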

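Overwrite the configuration

The default configuration is embedded when the Docker image of Node Problem Detector is built.

However, you can use a ConfigMap to overwrite the configuration:

Change the configuration files in config/.

Create the ConfigMap node-problem-detector-config: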

kubectl create configmap node-problem-detector-config --from-file=config/

Change node-problem-detector.yaml to use the ConfigMap:

debug/node-problem-detector-configmap.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config

Recreate Node Problem Detector with the new configuration file:

# If you have a node-problem-detector running, delete before recreating
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
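
Because the DaemonSet is deleted and recreated, you may want to wait until the new Pods are ready before relying on the updated configuration. The sketch below wraps `kubectl rollout status` for that purpose. The function name `wait_for_npd`, the `KUBECTL` variable, and the 120-second default timeout are assumptions chosen for illustration. The DaemonSet name and namespace come from the manifest above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the recreated DaemonSet has rolled out.
# The default timeout of 120s is an arbitrary choice, not a recommended value.
wait_for_npd() {
  local timeout="${1:-120s}"
  "${KUBECTL:-kubectl}" rollout status daemonset/node-problem-detector-v0.1 \
    --namespace kube-system \
    --timeout="$timeout"
}
```

Run `wait_for_npd` with no argument to use the default timeout, or pass a duration such as `wait_for_npd 5m` on a larger cluster.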

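Note

This approach only applies to a Node Problem Detector started with kubectl.

Overwriting the configuration is not supported if Node Problem Detector runs as a cluster Addon. The Addon manager does not support ConfigMap.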

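Problem Daemons

A problem daemon is a sub-daemon of Node Problem Detector. It monitors specific kinds of node problems and reports them to Node Problem Detector. Several types of problem daemons are supported.

A SystemLogMonitor type of daemon monitors the system logs and reports problems and metrics according to predefined rules. You can customize the configurations for different log sources such as filelog, kmsg, kernel, abrt, and systemd.
A SystemStatsMonitor type of daemon collects various health-related system stats as metrics. You can customize its behavior by updating its configuration file.
A CustomPluginMonitor type of daemon invokes and checks various node problems by running user-defined scripts. You can use different custom plugin monitors to monitor different problems and customize the daemon behavior by updating the configuration file.
A HealthChecker type of daemon checks the health of the kubelet and container runtime on a node.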

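Adding support for other log formats

The system log monitor currently supports file-based logs, journald, and kmsg. Additional sources can be added by implementing a new log watcher.

Adding custom plugin monitors

By developing a custom plugin, you can extend Node Problem Detector to execute monitor scripts written in any language. The monitor scripts must conform to the plugin protocol in exit code and standard output. For more information, see the plugin interface proposal.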
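
To make the exit-code and standard-output contract concrete, here is a minimal sketch of a monitor script. It assumes the convention used by the Node Problem Detector custom plugin monitor, where exit code 0 means no problem, 1 means a problem was detected, and any other code means the status is unknown. Check the plugin interface proposal before relying on these values. The check itself (the presence of a marker file) and the function name `check_marker` are hypothetical and stand in for a real health probe.

```shell
#!/usr/bin/env bash
# Hypothetical custom plugin: reports a problem when a marker file exists.
# Assumed exit-code convention: 0 = OK, 1 = problem detected, other = unknown.

OK=0
NONOK=1
UNKNOWN=2

# check_marker FILE: print a short status message and return a status code.
check_marker() {
  local marker="$1"
  if [ -z "$marker" ]; then
    echo "no marker path given"
    return "$UNKNOWN"
  fi
  if [ -e "$marker" ]; then
    echo "marker $marker present"
    return "$NONOK"
  fi
  echo "marker $marker absent"
  return "$OK"
}

# Run the check only when a path argument is supplied on the command line.
if [ "$#" -gt 0 ]; then
  check_marker "$1"
  exit "$?"
fi
```

The message printed to standard output is kept to a single short line, since the detector is expected to use it as the human-readable message for the reported problem. A custom plugin monitor configuration would then reference the script path; that configuration is not shown here.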

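Exporter

An exporter reports node problems and/or metrics to certain backends. The following exporters are supported:

Kubernetes exporter: this exporter reports node problems to the Kubernetes API server. Temporary problems are reported as Events and permanent problems are reported as Node Conditions.
Prometheus exporter: this exporter reports node problems and metrics locally as Prometheus (or OpenMetrics) metrics. You can specify the IP address and port for the exporter using command line arguments.
Stackdriver exporter: this exporter reports node problems and metrics to the Stackdriver Monitoring API. The exporting behavior can be customized using a configuration file.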
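
As an illustration of consuming the Prometheus exporter's output, the sketch below filters Prometheus text-format metrics for problems that are currently active. It assumes that the exporter publishes a gauge named `problem_gauge` whose value is non-zero while a problem is present. That metric name is an assumption about the exporter's output, and `active_problems` is a hypothetical helper, so verify both against the metrics your detector actually serves.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the problem_gauge samples whose value is not 0.
# Assumes samples carry no trailing timestamp, so the last field is the value.
active_problems() {
  awk 'index($0, "problem_gauge{") == 1 && $NF != 0 { print }'
}
```

On a node, you could pipe the exporter's metrics endpoint into the helper, for example `curl -s http://127.0.0.1:20257/metrics | active_problems`. The address and port in that command are assumptions; substitute the values you configured through the exporter's command line arguments.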

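Recommendations and restrictions

It is recommended to run Node Problem Detector in your cluster to monitor node health. When running Node Problem Detector, you can expect extra resource overhead on each node. Usually this is fine, because:

The kernel log grows relatively slowly.
A resource limit is set for Node Problem Detector.
Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector benchmark result.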

Last modified August 24, 2023 at 6:38 PM PST: Use code_sample shortcode instead of code shortcode (e8b136c3b3)