Prometheus, Alertmanager, and Grafana
Relevant source files
- .renovate/minecraft.json5
- .renovaterc.json5
- bootstrap/helmfile.d/00-crds.yaml
- infrastructure/ansible/playbooks/monitoring.yaml
- kubernetes/apps/games/minecraft/app/helmrelease.yaml
- kubernetes/apps/home-automation/home-assistant/ks.yaml
- kubernetes/apps/network/envoy-gateway/app/ocirepository.yaml
- kubernetes/apps/observability/exporters/kustomization.yaml
- kubernetes/apps/observability/exporters/nut-exporter/app/dashboard/kustomization.yaml
- kubernetes/apps/observability/exporters/nut-exporter/app/kustomization.yaml
- kubernetes/apps/observability/exporters/nut-exporter/app/prometheusrule.yaml
- kubernetes/apps/observability/exporters/nut-exporter/app/servicemonitor.yaml
- kubernetes/apps/observability/exporters/smartctl-exporter/app/helmrelease.yaml
- kubernetes/apps/observability/gatus/app/externalsecret.yaml
- kubernetes/apps/observability/gatus/app/grafana-dashboard.yaml
- kubernetes/apps/observability/gatus/app/helmrelease.yaml
- kubernetes/apps/observability/gatus/app/kustomization.yaml
- kubernetes/apps/observability/gatus/app/resources/config.yaml
- kubernetes/apps/observability/gatus/ks.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/externalsecret.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/kustomization.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/ocirepository.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/scrapeconfig.yaml
- kubernetes/apps/observability/kube-prometheus-stack/ks.yaml
- kubernetes/apps/observability/loki/app/helmrelease.yaml
- kubernetes/apps/observability/loki/app/kustomization.yaml
- kubernetes/apps/observability/loki/ks.yaml
- kubernetes/apps/observability/pyroscope/app/kustomization.yaml
- kubernetes/apps/observability/pyroscope/ks.yaml
- kubernetes/apps/observability/silence-operator/app/helmrelease.yaml
- kubernetes/apps/observability/silence-operator/ks.yaml
- kubernetes/apps/observability/silence-operator/silences/silences.yaml
- kubernetes/apps/observability/tempo/app/kustomization.yaml
- kubernetes/apps/observability/tempo/ks.yaml
This page details the core observability stack deployed via the kube-prometheus-stack Helm chart. The stack provides centralized metrics collection with Prometheus, alert routing to multiple notification channels through Alertmanager, and visualization through Grafana. Alerting rules, dashboard discovery, and Alertmanager silences are all configured through GitOps.
Prometheus Stack Architecture
The kube-prometheus-stack is the primary metrics engine, managing the lifecycle of Prometheus instances, Alertmanager, and various exporters. It is configured to use an OCI-based chart source <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L10-L10" min=10 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
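As a rough illustration of what an OCI-based chart source looks like in Flux, the sketch below pairs an `OCIRepository` with a `HelmRelease` that points at it through `chartRef`. The resource names, intervals, API versions, and the tag placeholder are assumptions for illustration; the values pinned in the repository may differ, and older Flux releases serve `OCIRepository` under `v1beta2`.

```yaml
# Sketch only: names, intervals, and the tag placeholder are assumptions.
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: kube-prometheus-stack
spec:
  interval: 15m
  layerSelector:
    # Select the Helm chart layer from the OCI artifact.
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
  ref:
    tag: "<chart-version>" # placeholder, not the version used in the repo
  url: oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  interval: 1h
  # chartRef replaces the classic chart.spec block when the source is OCI.
  chartRef:
    kind: OCIRepository
    name: kube-prometheus-stack
```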
Metrics Ingestion and Storage
Prometheus is configured with several ingestion mechanisms:
- OTLP Receiver: Enabled to support the OpenTelemetry Protocol natively <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L82-L82" min=82 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
- Remote Write Receiver: Enabled to allow external metrics sources to push data to the cluster <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L83-L83" min=83 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
- Service/Pod Monitors: The stack uses `podMonitorSelectorNilUsesHelmValues: false` and `serviceMonitorSelectorNilUsesHelmValues: false` to ensure the Prometheus Operator discovers all monitors across all namespaces regardless of Helm labels <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L93-L97" min=93 max=97 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
- Retention: Data is retained for 14 days or until it reaches 50GB in size <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L98-L99" min=98 max=99 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
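The ingestion and retention settings listed above could be expressed in the chart values roughly as follows. This is a minimal sketch, assuming the values sit under `prometheus.prometheusSpec` and that the chart version in use exposes `enableOTLPReceiver`; the actual file may include further selector overrides not shown here.

```yaml
# Sketch of HelmRelease values; key placement is an assumption.
prometheus:
  prometheusSpec:
    # Accept metrics pushed over OTLP (assumes the chart exposes this key).
    enableOTLPReceiver: true
    # Accept metrics pushed via the Prometheus remote write protocol.
    enableRemoteWriteReceiver: true
    # With these set to false, a nil selector matches every monitor
    # instead of only those carrying the Helm release label.
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    # Whichever limit is reached first triggers data removal.
    retention: 14d
    retentionSize: 50GB
```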
Data Flow and Code Entities
The following diagram illustrates how metrics flow from exporters through the Operator-managed entities into Prometheus.
Metrics Pipeline Overview
[Flowchart Diagram]
Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L77-L112" min=77 max=112 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>, <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L113-L138" min=113 max=138 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>
Custom Alerting Rules
The deployment includes several custom PrometheusRule groups defined within the Helm values to monitor infrastructure-specific health.
| Alert Name | Logic / Expression | Severity | Purpose |
|---|---|---|---|
| `ZfsUnexpectedPoolState` | `node_zfs_zpool_state{state!="online"} > 0` | critical | Detects ZFS pool degradation or failure <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L180-L180" min=180 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |
| `OomKilled` | Detects OOMKilled reason in pod termination status over 10m | critical | Alerts on memory exhaustion for specific containers <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L159-L168" min=159 max=168 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |
| `DockerhubRateLimitRisk` | Counts images from docker.io seen in the last 30s | critical | Predicts potential DockerHub rate limiting when > 100 containers pull from Hub <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L147-L152" min=147 max=152 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef> |
Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L142-L183" min=142 max=183 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>
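To show how one of these rules might be declared in the chart values, here is a hedged sketch of the `ZfsUnexpectedPoolState` alert. The expression and severity come from the table above; the `additionalPrometheusRulesMap` key, the group name, the `for` duration, and the annotation text are assumptions for illustration.

```yaml
# Sketch only: map key, group name, "for" duration, and annotation are assumed.
additionalPrometheusRulesMap:
  zfs-rules:
    groups:
      - name: zfs.rules
        rules:
          - alert: ZfsUnexpectedPoolState
            # Fires for any pool reporting a state other than "online".
            expr: node_zfs_zpool_state{state!="online"} > 0
            for: 15m # assumed hold time before the alert fires
            labels:
              severity: critical
            annotations:
              summary: A ZFS pool is reporting a state other than online.
```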
Alertmanager Routing and Notification
Alertmanager handles the deduplication, grouping, and routing of alerts to external providers. The configuration is stored in a Kubernetes Secret named `alertmanager-secret` <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml#L41-L41" min=41 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml">Hii</FileRef>.
Routing Logic
- Watchdog (Heartbeat): Sent to a dedicated heartbeat URL every 5 minutes to ensure the alerting pipeline is alive <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L11-L16" min=11 max=16 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
- VolSync Remediation: Alerts for `VolSyncVolumeOutOfSync` are routed to a `remediation-webhook` which triggers automated recovery jobs <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L21-L25" min=21 max=25 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
- Discord & Pushover: Critical alerts are routed to both Discord (for chat-based visibility) and Pushover (for mobile push notifications) <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L26-L33" min=26 max=33 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>.
Alert Routing Logic
[Flowchart Diagram]
Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml#L4-L53" min=4 max=53 file-path="kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml">Hii</FileRef>
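The routing described in this section could look roughly like the Alertmanager configuration below. It is a sketch under stated assumptions: the receiver names other than `remediation-webhook`, the catch-all `"null"` receiver, all URLs, and the credential placeholders are hypothetical, and the real file may combine Discord and Pushover differently (for example, as separate routes with `continue: true`).

```yaml
# Sketch only: URLs, credentials, and most receiver names are placeholders.
route:
  receiver: "null" # assumed catch-all that drops unmatched alerts
  group_by: ["alertname", "job"]
  routes:
    # Watchdog fires constantly; re-sending it every 5 minutes acts as a heartbeat.
    - receiver: heartbeat
      matchers:
        - alertname = "Watchdog"
      repeat_interval: 5m
    # VolSync out-of-sync alerts go to the automated remediation hook.
    - receiver: remediation-webhook
      matchers:
        - alertname = "VolSyncVolumeOutOfSync"
    # Critical alerts fan out to both notification channels.
    - receiver: critical-notify
      matchers:
        - severity = "critical"
receivers:
  - name: "null"
  - name: heartbeat
    webhook_configs:
      - url: https://heartbeat.example.invalid/ping
  - name: remediation-webhook
    webhook_configs:
      - url: http://remediation.example.invalid/hook
  - name: critical-notify
    discord_configs:
      - webhook_url: https://discord.example.invalid/webhook
    pushover_configs:
      - user_key: "<pushover-user-key>"
        token: "<pushover-api-token>"
```

Alertmanager evaluates child routes in order and stops at the first match unless `continue: true` is set, so in this sketch the more specific Watchdog and VolSync routes sit above the severity-based one.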
Grafana and Dashboard Discovery
Grafana is managed as a separate component, but it integrates with the stack via sidecar discovery. The k8s-sidecar container (often used in Loki and Gatus deployments) watches for ConfigMaps with specific labels to automatically import dashboards.
- Dashboard Discovery: ConfigMaps labeled with `grafana_dashboard: "1"` are automatically picked up <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/loki/app/helmrelease.yaml#L111-L111" min=111 file-path="kubernetes/apps/observability/loki/app/helmrelease.yaml">Hii</FileRef>.
- Folder Organization: The sidecar uses the `grafana_folder` annotation to organize dashboards into UI categories <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/loki/app/helmrelease.yaml#L109-L109" min=109 file-path="kubernetes/apps/observability/loki/app/helmrelease.yaml">Hii</FileRef>.
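A dashboard ConfigMap that the sidecar would discover might look like the sketch below. The label and annotation keys are the ones named above; the ConfigMap name, folder value, and the trivial dashboard JSON are hypothetical.

```yaml
# Sketch only: name, folder value, and dashboard body are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard
  labels:
    # The sidecar watches for this label to find dashboards.
    grafana_dashboard: "1"
  annotations:
    # Target folder in the Grafana UI.
    grafana_folder: Observability
data:
  example.json: |
    {"title": "Example", "panels": []}
```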
Silence Operator
The silence-operator provides a Kubernetes-native way to manage Alertmanager silences using the Silence Custom Resource Definition (CRD) <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/app/helmrelease.yaml#L17-L19" min=17 max=19 file-path="kubernetes/apps/observability/silence-operator/app/helmrelease.yaml">Hii</FileRef>.
Key silences implemented include:
- Infrastructure Noise: Silencing `NodeFilesystemAlmostOutOfSpace` for the gateway firewall and high memory utilization on specific compute nodes <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L5-L36" min=5 max=36 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
- KEDA Autoscaling: Silencing `KubeHpaMaxedOut` for KEDA-managed HPA resources <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L41-L48" min=41 max=48 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
- Known Issues: Temporary silences for `DisconnectedOutposts` in Authentik and `etcdHighNumberOfFailedGRPCRequests` during backup windows <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L76-L90" min=76 max=90 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>.
Sources: <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/app/helmrelease.yaml#L30-L30" min=30 file-path="kubernetes/apps/observability/silence-operator/app/helmrelease.yaml">Hii</FileRef>, <FileRef file-url="https://github.com/chaijunkin/home-ops/blob/b5f8d898/kubernetes/apps/observability/silence-operator/silences/silences.yaml#L1-L91" min=1 max=91 file-path="kubernetes/apps/observability/silence-operator/silences/silences.yaml">Hii</FileRef>
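For orientation, a `Silence` resource for the KEDA case might be written as in the sketch below. It assumes the `observability.giantswarm.io/v1alpha2` API with `matchType` matchers; the older `monitoring.giantswarm.io/v1alpha1` API uses different matcher fields, and the repository may use either. The resource name and the HPA name pattern are illustrative (KEDA prefixes the HPAs it creates with `keda-hpa-`).

```yaml
# Sketch only: API version, resource name, and regex are assumptions.
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: keda-hpa-maxed-out
spec:
  matchers:
    # Match the alert by name exactly.
    - name: alertname
      value: KubeHpaMaxedOut
      matchType: "="
    # Limit the silence to HPAs created by KEDA.
    - name: horizontalpodautoscaler
      value: keda-hpa-.*
      matchType: "=~"
```

Keeping silences as resources in Git means they are reviewed and versioned like any other manifest, rather than being created by hand in the Alertmanager UI.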