Prometheus, Alertmanager, and Grafana

This page describes the core observability stack deployed via the kube-prometheus-stack Helm chart. The stack provides centralized metrics collection, alerting with multi-channel routing, and visualization through Grafana. Alerting rules, dashboard discovery, and automated silence management are all configured declaratively through GitOps.

Prometheus Stack Architecture

The kube-prometheus-stack is the primary metrics engine, managing the lifecycle of Prometheus instances, Alertmanager, and various exporters. It is configured to use an OCI-based chart source <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L10-L10" min=10 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
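As a rough sketch of what an OCI-based chart source can look like, the fragment below assumes Flux is the GitOps controller. The helmrelease.yaml file name suggests a Flux HelmRelease, but this page does not confirm it. The resource names, interval, and tag are placeholders rather than values taken from the repository.

```yaml
# Hypothetical Flux resources; names, interval and tag are illustrative only.
# Older Flux releases serve OCIRepository under source.toolkit.fluxcd.io/v1beta2.
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: kube-prometheus-stack
spec:
  interval: 1h
  url: oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack
  ref:
    tag: "0.0.0"   # placeholder, pin to the chart version actually in use
  layerSelector:
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  interval: 1h
  # chartRef points the release at the OCI artifact instead of a HelmRepository.
  chartRef:
    kind: OCIRepository
    name: kube-prometheus-stack
```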

Metrics Ingestion and Storage

Prometheus is configured with several ingestion mechanisms:

  • OTLP Receiver: Enabled to support OpenTelemetry Protocol natively <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L82-L82" min=82 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
  • Remote Write Receiver: Enabled to allow external metrics sources to push data to the cluster <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L83-L83" min=83 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
  • Service/Pod Monitors: Setting podMonitorSelectorNilUsesHelmValues: false and serviceMonitorSelectorNilUsesHelmValues: false leaves the monitor selectors empty, so the Prometheus Operator discovers every PodMonitor and ServiceMonitor in all namespaces regardless of Helm release labels <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L93-L97" min=93 max=97 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
  • Retention: Data is retained for up to 14 days or until the TSDB reaches 50GB, whichever limit is hit first <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L98-L99" min=98 max=99 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
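The settings listed above could be expressed in the chart values roughly as follows. This is a minimal sketch, not a copy of the repository file: the key names follow the kube-prometheus-stack values schema as commonly documented, and enableOTLPReceiver in particular is assumed to be available in the chart and operator versions in use.

```yaml
# Sketch of kube-prometheus-stack values (nested under the release's values).
prometheus:
  prometheusSpec:
    # Push-based ingestion paths.
    enableOTLPReceiver: true          # assumed key; needs a recent operator release
    enableRemoteWriteReceiver: true
    # "false" leaves the selectors empty, which matches all monitors
    # instead of only those labelled with this Helm release.
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    # Oldest blocks are dropped once either limit is reached.
    retention: 14d
    retentionSize: 50GB
```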

Data Flow and Code Entities

The following diagram illustrates how metrics flow from exporters through the Operator-managed entities into Prometheus.

Metrics Pipeline Overview

[Flowchart Diagram]

Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L77-L112" min=77 max=112 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>, <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L113-L138" min=113 max=138 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>

Custom Alerting Rules

The deployment includes several custom PrometheusRule groups defined within the Helm values to monitor infrastructure-specific health.

| Alert Name | Logic / Expression | Severity | Purpose |
|---|---|---|---|
| ZfsUnexpectedPoolState | node_zfs_zpool_state{state!="online"} > 0 | critical | Detects ZFS pool degradation or failure <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L180-L180" min=180 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |
| OomKilled | Detects OOMKilled reason in pod termination status over 10m | critical | Alerts on memory exhaustion for specific containers <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L159-L168" min=159 max=168 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |
| DockerhubRateLimitRisk | Counts images from docker.io seen in the last 30s | critical | Predicts potential DockerHub rate limiting when > 100 containers pull from Hub <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L147-L152" min=147 max=152 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |

Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L142-L183" min=142 max=183 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>
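To illustrate how a custom rule can be declared inside the Helm values, the sketch below wraps the ZfsUnexpectedPoolState expression from the table in the chart's additionalPrometheusRulesMap key. Using that key is an assumption, as are the map key, group name, and summary text, which are hypothetical; only the alert name, expression, and severity come from this page.

```yaml
# Sketch only: "zfs-rules", the group name and the summary are illustrative.
additionalPrometheusRulesMap:
  zfs-rules:
    groups:
      - name: zfs
        rules:
          - alert: ZfsUnexpectedPoolState
            # Fires for any pool reporting a state other than "online".
            expr: node_zfs_zpool_state{state!="online"} > 0
            labels:
              severity: critical
            annotations:
              summary: A ZFS pool is not in the online state
```

The chart renders each entry of this map into its own PrometheusRule resource, which the operator then loads into Prometheus.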

Alertmanager Routing and Notification

Alertmanager handles the deduplication, grouping, and routing of alerts to external providers. The configuration is stored in a Kubernetes Secret named alertmanager-secret <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L41-L41" min=41 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.

Routing Logic

  1. Watchdog (Heartbeat): The always-firing Watchdog alert is sent to a dedicated heartbeat URL every 5 minutes, so a missed heartbeat signals that the alerting pipeline itself is down <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L11-L16" min=11 max=16 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
  2. VolSync Remediation: Alerts for VolSyncVolumeOutOfSync are routed to a remediation-webhook which triggers automated recovery jobs <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L21-L25" min=21 max=25 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
  3. Discord & Pushover: Critical alerts are routed to both Discord (for chat-based visibility) and Pushover (for mobile push notifications) <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L26-L33" min=26 max=33 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
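The three routes above can be sketched as an Alertmanager routing tree. This is an illustrative configuration, not the contents of alertmanager.yaml: the default receiver, grouping keys, URLs, and credentials are placeholders, and only the Watchdog, VolSyncVolumeOutOfSync, remediation-webhook, Discord, and Pushover names and the 5-minute heartbeat come from this page.

```yaml
# Sketch of a routing tree; all URLs and credentials are placeholders.
route:
  receiver: discord                 # hypothetical default receiver
  group_by: ["alertname", "job"]    # hypothetical grouping
  routes:
    - receiver: heartbeat
      matchers:
        - alertname = "Watchdog"
      group_wait: 0s
      group_interval: 5m
      repeat_interval: 5m           # re-send the heartbeat every 5 minutes
    - receiver: remediation-webhook
      matchers:
        - alertname = "VolSyncVolumeOutOfSync"
    - receiver: discord
      matchers:
        - severity = "critical"
      continue: true                # keep matching so Pushover also fires
    - receiver: pushover
      matchers:
        - severity = "critical"
receivers:
  - name: heartbeat
    webhook_configs:
      - url: https://heartbeat.example.invalid/ping
  - name: remediation-webhook
    webhook_configs:
      - url: http://remediation.example.invalid/hook
  - name: discord
    discord_configs:
      - webhook_url: https://discord.example.invalid/webhook
  - name: pushover
    pushover_configs:
      - user_key: "REPLACE_ME"
        token: "REPLACE_ME"
```

The continue: true flag is one way to deliver the same critical alert to two receivers; without it, Alertmanager stops at the first matching sibling route.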

Alert Routing Logic

[Flowchart Diagram]

Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L4-L53" min=4 max=53 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>

Grafana and Dashboard Discovery

Grafana is managed as a separate component, but it integrates with the stack via sidecar discovery. The k8s-sidecar container (often used in Loki and Gatus deployments) watches for ConfigMaps with specific labels to automatically import dashboards.

  • Dashboard Discovery: ConfigMaps labeled with grafana_dashboard: "1" are automatically picked up <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/loki/app/helmrelease.yaml#L111-L111" min=111 file-path="kubernetes/apps/observability/loki/app/helmrelease.yaml">Hii</FileRef>.
  • Folder Organization: The sidecar uses the grafana_folder annotation to organize dashboards into UI categories <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/loki/app/helmrelease.yaml#L109-L109" min=109 file-path="kubernetes/apps/observability/loki/app/helmrelease.yaml">Hii</FileRef>.
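A dashboard ConfigMap that the sidecar would pick up might look like the sketch below. The ConfigMap name, namespace, folder name, and dashboard JSON are hypothetical; only the grafana_dashboard label and grafana_folder annotation come from this page.

```yaml
# Sketch only: name, namespace, folder and JSON body are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard
  namespace: observability
  labels:
    grafana_dashboard: "1"        # label the sidecar watches for
  annotations:
    grafana_folder: Example       # target folder in the Grafana UI
data:
  example.json: |
    {
      "title": "Example dashboard",
      "uid": "example-dashboard",
      "panels": []
    }
```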

Silence Operator

The silence-operator provides a Kubernetes-native way to manage Alertmanager silences using the Silence Custom Resource Definition (CRD) <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/app/helmrelease.yaml#L17-L19" min=17 max=19 file-path="kubernetes/apps/observability/silence-operator/app/helmrelease.yaml">Hii</FileRef>.

Key silences implemented include:

  • Infrastructure Noise: Silencing NodeFilesystemAlmostOutOfSpace for the gateway firewall and high memory utilization on specific compute nodes <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L5-L36" min=5 max=36 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
  • KEDA Autoscaling: Silencing KubeHpaMaxedOut for KEDA-managed HPA resources <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L41-L48" min=41 max=48 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
  • Known Issues: Temporary silences for DisconnectedOutposts in Authentik and etcdHighNumberOfFailedGRPCRequests during backup windows <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L76-L90" min=76 max=90 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
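A Silence resource for the KEDA case might look like the following sketch. The apiVersion and matcher field names are assumptions that depend on the silence-operator release installed (older releases are understood to use a different API group and an isRegex flag instead of matchType), and the resource name, namespace, and HPA name pattern are hypothetical.

```yaml
# Sketch only: verify apiVersion and matcher fields against the installed CRD.
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: keda-hpa-maxed-out        # hypothetical
  namespace: observability        # hypothetical
spec:
  matchers:
    - name: alertname
      matchType: "="
      value: KubeHpaMaxedOut
    # KEDA names the HPAs it creates with a keda-hpa- prefix.
    - name: horizontalpodautoscaler
      matchType: "=~"
      value: keda-hpa-.*
```

Because the silence lives in Git as a manifest, removing the file removes the silence, which keeps temporary exceptions reviewable.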

Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/app/helmrelease.yaml#L30-L30" min=30 file-path="kubernetes/apps/observability/silence-operator/app/helmrelease.yaml">Hii</FileRef>, <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L1-L91" min=1 max=91 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>