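Guide to Deploying Prometheus on Kubernetes

Overview

Prometheus is an open-source systems monitoring and alerting toolkit designed for cloud-native environments. It pulls metrics over HTTP, works with a wide range of exporters, and provides the PromQL query language. This guide describes how to deploy Prometheus on a Kubernetes cluster with Helm, using the kube-prometheus-stack chart.

Contents

  1. Prerequisites
  2. Deploying Prometheus with Helm
  3. Network configuration
  4. Verification and access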

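1. Prerequisites

1.1 Kubernetes platform requirements

  • Kubernetes version: 1.20 or later
  • Nodes: at least 2 nodes, each with at least 2 CPU cores and 4 GB of memory
  • Storage: a default StorageClass must be configured (for example NFS or Local Path)

1.2 Required components

Make sure the following components are available in the cluster:

  • An Ingress Controller (for example Nginx Ingress)
  • A default StorageClass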
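The requirements above can be checked from a workstation before anything is installed. The following is a minimal sketch, assuming kubectl is already configured for the target cluster and that sort supports -V (GNU coreutils). version_ge and preflight are hypothetical helper names, and the kubelet version of each node is used as a stand-in for the cluster version.

```shell
# version_ge A B succeeds when version A is equal to or newer than version B.
# Relies on `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# preflight checks node versions, the default StorageClass, and IngressClasses.
# Hypothetical helper: returns 0 when every check passes, 1 otherwise.
preflight() {
  status=0
  # Kubelet version of every node, for example "v1.27.3"
  for v in $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'); do
    version_ge "${v#v}" "1.20" || { echo "node runs $v, older than 1.20" >&2; status=1; }
  done
  # kubectl marks the default StorageClass with a "(default)" suffix
  kubectl get storageclass | grep -q '(default)' \
    || { echo "no default StorageClass found" >&2; status=1; }
  # An installed ingress controller normally registers at least one IngressClass
  [ -n "$(kubectl get ingressclass -o name)" ] \
    || { echo "no IngressClass found" >&2; status=1; }
  return "$status"
}
```

After sourcing the snippet, run preflight; a non-zero exit status means at least one requirement was reported as missing on stderr.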

2. Deploying Prometheus with Helm

2.1 Add the Prometheus Helm repository

# Add the prometheus-community Helm repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

2.2 Configure Prometheus values

Create a values file named prometheus-values.yaml:

# Prometheus settings
prometheus:
  enabled: true
  annotations: {}

  # Prometheus Service settings
  service:
    type: ClusterIP
    port: 9090
    targetPort: 9090
    nodePort: 30090  # used only when type is NodePort
    annotations: {}
    labels: {}
    clusterIP: ""
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

  # Prometheus Ingress settings (disabled; section 3.1 creates the Ingress separately)
  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts:
      - prometheus.example.com
    paths:
      - /
    pathType: ImplementationSpecific
    tls: []

  # Spec of the Prometheus custom resource managed by the operator
  prometheusSpec:
    # Image. The chart joins registry and repository, so they are kept separate.
    image:
      registry: quay.io
      repository: prometheus/prometheus
      tag: v2.44.0
      sha: ""

    # Resource requests and limits
    resources:
      limits:
        cpu: 1000m
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi

    # Persistent storage. storageClassName is left unset so that the default
    # StorageClass is used; an empty string ("") would request a volume with
    # no class and disable dynamic provisioning.
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # ServiceMonitor selection. An empty selector with
    # ...NilUsesHelmValues set to false selects all ServiceMonitors.
    serviceMonitorSelector: {}
    serviceMonitorSelectorNilUsesHelmValues: false

    # PodMonitor selection
    podMonitorSelector: {}
    podMonitorSelectorNilUsesHelmValues: false

    # Rule selection
    ruleSelector: {}
    ruleSelectorNilUsesHelmValues: false

    # Additional Alertmanager endpoints
    alertingEndpoints: []

    # External labels
    externalLabels: {}

    # Remote write
    remoteWrite: []

    # Remote read
    remoteRead: []

    # Retention period
    retention: 10d

    # WAL compression
    walCompression: true

    # Admin API
    enableAdminAPI: false

    # When true, the operator stops reconciling this Prometheus resource
    paused: false

    # Image pull policy
    imagePullPolicy: IfNotPresent

    # Image pull secrets
    imagePullSecrets: []

    # Node selector
    nodeSelector: {}

    # Tolerations
    tolerations: []

    # Affinity
    affinity: {}

    # Security context
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 2000
      fsGroup: 2000

    # Priority class name
    priorityClassName: ""

    # Init containers
    initContainers: []

    # Additional containers
    containers: []

    # Additional volumes
    volumes: []

    # Additional volume mounts
    volumeMounts: []

    # NOTE: the remaining keys in prometheusSpec are not listed in the chart's
    # default values (verify with `helm show values`). Helm ignores keys that
    # a chart does not read, so they are expected to have no effect.

    # Config reloader image
    configReloaderImage:
      repository: quay.io/prometheus-operator/prometheus-config-reloader
      tag: v0.66.0

    # Config reloader resources
    configReloaderResources: {}

    # Endpoints
    endpoints: []

    # Monitoring namespace
    monitoringNamespace: ""

    # Monitored service endpoints
    monitoringServiceEndpoints: []

    # Monitored pod endpoints
    monitoringPodEndpoints: []

    # Monitoring rules
    monitoringRules: []

    # Monitoring alerts
    monitoringAlerts: []

    # Service discovery
    monitoringServiceDiscovery: []

    # Pod discovery
    monitoringPodDiscovery: []

    # Node discovery
    monitoringNodeDiscovery: []

    # Kubernetes discovery
    monitoringKubernetesDiscovery: []

    # File discovery
    monitoringFileDiscovery: []

    # DNS discovery
    monitoringDNSDiscovery: []

    # EC2 discovery
    monitoringEC2Discovery: []

    # Azure discovery
    monitoringAzureDiscovery: []

    # GCE discovery
    monitoringGCEDiscovery: []

    # OpenStack discovery
    monitoringOpenStackDiscovery: []

    # Triton discovery
    monitoringTritonDiscovery: []

    # Kubernetes SD config
    monitoringKubernetesSDConfig: []

    # HTTP SD config
    monitoringHTTPSDConfig: []

# Alertmanager settings
alertmanager:
  enabled: true
  annotations: {}

  # Alertmanager Service settings
  service:
    type: ClusterIP
    port: 9093
    targetPort: 9093
    nodePort: 30093  # used only when type is NodePort
    annotations: {}
    labels: {}
    clusterIP: ""
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

  # Alertmanager Ingress settings (disabled; section 3.1 creates the Ingress separately)
  ingress:
    enabled: false
    annotations: {}
    labels: {}
    hosts:
      - alertmanager.example.com
    paths:
      - /
    pathType: ImplementationSpecific
    tls: []

  # Alertmanager configuration. Every alert is routed to the 'null' receiver,
  # so no notifications are sent until a real receiver is added.
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
        - match:
            alertname: Watchdog
          receiver: 'null'
    receivers:
      - name: 'null'
    templates:
      - '/etc/alertmanager/config/*.tmpl'

  # Alertmanager persistent storage. storageClassName is left unset so that
  # the default StorageClass is used.
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

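# Node Exporter settings
nodeExporter:
  enabled: true

# Settings for the bundled node exporter are read from the subchart key
# prometheus-node-exporter rather than from nodeExporter.
prometheus-node-exporter:
  # Node Exporter Service settings
  service:
    type: ClusterIP
    port: 9100
    targetPort: 9100
    nodePort: 30100  # used only when type is NodePort
    annotations: {}
    labels: {}

  # Node Exporter resources
  resources:
    limits:
      cpu: 200m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 30Mi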

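# Kube State Metrics settings
kubeStateMetrics:
  enabled: true

# Settings for the bundled kube-state-metrics are read from the subchart key
# kube-state-metrics rather than from kubeStateMetrics.
kube-state-metrics:
  # Kube State Metrics Service settings
  service:
    type: ClusterIP
    port: 8080
    targetPort: 8080
    nodePort: 30800  # used only when type is NodePort
    annotations: {}
    labels: {}

  # Kube State Metrics resources
  resources:
    limits:
      cpu: 100m
      memory: 150Mi
    requests:
      cpu: 50m
      memory: 100Mi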

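# Prometheus Pushgateway settings
# NOTE: kube-prometheus-stack does not bundle Pushgateway, so this block is
# expected to have no effect with that chart. Pushgateway is published as the
# separate prometheus-community/prometheus-pushgateway chart.
pushgateway:
  enabled: true

  # Pushgateway Service settings
  service:
    type: ClusterIP
    port: 9091
    targetPort: 9091
    nodePort: 30091
    annotations: {}
    labels: {}

  # Pushgateway resources
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 50m
      memory: 50Mi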

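# Grafana settings
grafana:
  enabled: false  # Grafana is deployed separately

# Other settings
# The chart reads additionalScrapeConfigs, additionalAlertRelabelConfigs, and
# additionalAlertManagerConfigs under prometheus.prometheusSpec; at the top
# level they have no effect.
additionalPrometheusRules: []
additionalScrapeConfigs: []
additionalAlertRelabelConfigs: []
additionalAlertManagerConfigs: []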

# Kubernetes metrics settings
# NOTE: none of the keys below are listed in the chart's default values
# (verify with `helm show values`); Helm ignores keys a chart does not read.
kubernetesServiceMonitors:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesPodMonitors:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesProbes:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesAlertmanagers:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesPrometheuses:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesThanosRulers:
  enabled: true
  selectors:
    matchLabels: {}

kubernetesServiceMonitorsCRD:
  enabled: true

kubernetesPodMonitorsCRD:
  enabled: true

kubernetesProbesCRD:
  enabled: true

kubernetesAlertmanagersCRD:
  enabled: true

kubernetesPrometheusesCRD:
  enabled: true

kubernetesThanosRulersCRD:
  enabled: true
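Unless a chart ships a strict values schema, Helm does not reject keys that the chart never reads, so a typo or a key borrowed from another chart is silently ignored. One rough way to spot such keys is to compare the top-level keys of the values file with the chart defaults. The sketch below assumes top-level keys are written unindented; top_level_keys and unknown_keys are hypothetical helpers, and default-values.yaml is an assumed file name.

```shell
# top_level_keys FILE prints the sorted, unique top-level keys of a YAML file.
# Only unindented "key:" lines are considered; comments are skipped.
top_level_keys() {
  grep -oE '^[A-Za-z0-9_.-]+:' "$1" | tr -d ':' | sort -u
}

# unknown_keys MINE DEFAULTS prints keys that appear in MINE but not in DEFAULTS.
unknown_keys() {
  defaults=$(top_level_keys "$2")
  # -x: whole-line match, -F: fixed strings, -v: keep lines that do NOT match
  top_level_keys "$1" | grep -vxF -e "$defaults" || true
}

# Usage (requires helm and the repository added in section 2.1):
#   helm show values prometheus-community/kube-prometheus-stack \
#     --version 48.1.0 > default-values.yaml
#   unknown_keys prometheus-values.yaml default-values.yaml
```

Any key printed by unknown_keys is worth checking against the chart documentation. The comparison only covers top-level keys, so nested mistakes still need a manual review.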

2.3 Install Prometheus

# Create the namespace
kubectl create namespace monitoring

# Install the kube-prometheus-stack chart as a release named "prometheus"
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --version 48.1.0
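helm install fails when a release with the same name already exists, which makes the step awkward to rerun after a change to the values file. A possible alternative, shown here as a sketch, is helm upgrade --install, which installs the release the first time and upgrades it afterwards. install_stack is a hypothetical wrapper, and the --wait and --timeout settings are assumptions to adjust.

```shell
# install_stack [VALUES_FILE] installs or upgrades the "prometheus" release.
# VALUES_FILE defaults to prometheus-values.yaml.
install_stack() {
  # --create-namespace creates "monitoring" if it does not exist yet
  # --wait blocks until the release's resources are ready or --timeout expires
  helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --values "${1:-prometheus-values.yaml}" \
    --version 48.1.0 \
    --wait --timeout 10m
}
```

Running install_stack again after editing prometheus-values.yaml applies the changes to the existing release.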

3. Network configuration

3.1 Create the Ingress

Create a file named prometheus-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  # Selects the NGINX ingress controller (assumes an IngressClass named nginx)
  ingressClassName: nginx
  # Request paths are passed through unchanged, because the Prometheus and
  # Alertmanager UIs and APIs are served from sub-paths such as /graph and /api
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-kube-prometheus-prometheus
                port:
                  number: 9090
    - host: alertmanager.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-kube-prometheus-alertmanager
                port:
                  number: 9093

Apply the Ingress configuration:

kubectl apply -f prometheus-ingress.yaml
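Host-based routing can be checked before any DNS or hosts-file entry exists by sending the Host header directly to the ingress controller's address. The sketch below assumes the controller listens on plain HTTP port 80 and uses the /-/healthy endpoint that both Prometheus and Alertmanager expose. check_ingress is a hypothetical helper, and 203.0.113.10 in the usage line is a placeholder address.

```shell
# check_ingress INGRESS_IP HOST prints the HTTP status code returned for
# HOST's health endpoint when the request is sent straight to INGRESS_IP.
check_ingress() {
  curl -sS -o /dev/null -w '%{http_code}' -H "Host: $2" "http://$1/-/healthy"
}

# Usage:
#   check_ingress 203.0.113.10 prometheus.example.com
#   check_ingress 203.0.113.10 alertmanager.example.com
```

A status of 200 suggests the request reached the backing service; a 404 or 503 from the controller usually points at a host, service name, or port mismatch in the Ingress.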

4. Verification and access

4.1 Check the status of the deployment

# Check the status of the pods
kubectl get pods -n monitoring

# Check the status of the Services
kubectl get svc -n monitoring

# Check the status of the Ingress
kubectl get ingress -n monitoring
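Right after installation the pods may still be starting, so a single kubectl get pods can give a misleading picture. As a sketch, kubectl wait can block until every pod in the namespace reports Ready. wait_for_stack is a hypothetical wrapper and the 300s default timeout is an assumption. Because --all also matches pods of finished Jobs, which never become Ready, the command can time out if such pods are present.

```shell
# wait_for_stack [NAMESPACE] [TIMEOUT] waits until all pods in NAMESPACE are Ready.
# Defaults: namespace "monitoring", timeout "300s".
wait_for_stack() {
  kubectl wait --namespace "${1:-monitoring}" \
    --for=condition=Ready pods --all \
    --timeout="${2:-300s}"
}
```

The function returns a non-zero status on timeout, which makes it usable as a gate in a deployment script.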

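4.2 Access the Prometheus web UI

  1. Add name resolution entries to your local /etc/hosts file, replacing <node-IP> with the address of a node on which the Ingress Controller is reachable:

    <node-IP> prometheus.example.com
    <node-IP> alertmanager.example.com
  2. Open the following URLs in a browser:

    • Prometheus: http://prometheus.example.com
    • Alertmanager: http://alertmanager.example.com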
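When the Ingress or the hosts-file entries are not in place yet, the same UIs can be reached through a port-forward. This is a sketch that assumes the release name prometheus and the monitoring namespace used in this guide; forward_ui is a hypothetical helper.

```shell
# forward_ui prometheus|alertmanager forwards the chosen Service to localhost.
# The command keeps running until interrupted with Ctrl-C.
forward_ui() {
  case "$1" in
    prometheus)
      kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 ;;
    alertmanager)
      kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 ;;
    *)
      echo "usage: forward_ui prometheus|alertmanager" >&2
      return 2 ;;
  esac
}
```

While forward_ui prometheus is running, the UI is available at http://localhost:9090; for Alertmanager the address is http://localhost:9093.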

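4.3 Functional verification

Querying metrics

  1. Open the Prometheus web UI
  2. Enter up in the query box and run the query
  3. Verify that every target returns the value 1, which means it was scraped successfully

Testing alerts

  1. Open the Alertmanager web UI
  2. Check whether any alerts are firing
  3. Verify that alert notifications are delivered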
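The same check can be scripted against the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below assumes the UI is reachable at the base URL passed in. query_prometheus and count_series are hypothetical helpers, and count_series is a rough substitute for a JSON parser that simply counts "value" entries in the response.

```shell
# query_prometheus BASE_URL PROMQL prints the raw JSON response of an instant query.
query_prometheus() {
  # -G sends the URL-encoded data as a query string in a GET request
  curl -sS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# count_series reads an instant-query response on stdin and prints how many
# series it contains by counting "value":[ occurrences.
count_series() {
  grep -o '"value":\[' | wc -l | tr -d ' '
}

# Usage: count the targets that are currently down
#   query_prometheus http://prometheus.example.com 'up == 0' | count_series
```

A result of 0 for up == 0 means no scraped target is currently reporting as down.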
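To exercise Alertmanager without waiting for a real alert, a test alert can be posted to its v2 API at /api/v2/alerts. This is a sketch: test_alert_payload and send_test_alert are hypothetical helpers, and the alert name, severity, and summary are placeholder values. With the configuration in section 2.2 every alert is routed to the 'null' receiver, so the alert should appear in the UI but no notification is sent until a real receiver is configured.

```shell
# test_alert_payload prints a one-element alert list for the Alertmanager v2 API.
# Label and annotation values are placeholders.
test_alert_payload() {
  printf '%s\n' '[{"labels":{"alertname":"TestAlert","severity":"info"},"annotations":{"summary":"Manual test alert"}}]'
}

# send_test_alert BASE_URL posts the payload to BASE_URL/api/v2/alerts.
send_test_alert() {
  # --data-binary @- reads the request body from stdin and implies POST
  test_alert_payload | curl -sS -H 'Content-Type: application/json' \
    --data-binary @- "$1/api/v2/alerts"
}

# Usage:
#   send_test_alert http://alertmanager.example.com
```

Since the payload carries no end time, the alert is expected to clear on its own once it stops being re-sent and the resolve timeout has passed.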