Observability Stack
Relevant source files
- .renovate/minecraft.json5
- .renovaterc.json5
- bootstrap/helmfile.d/00-crds.yaml
- infrastructure/ansible/playbooks/monitoring.yaml
- kubernetes/apps/games/minecraft/app/helmrelease.yaml
- kubernetes/apps/home-automation/home-assistant/ks.yaml
- kubernetes/apps/network/envoy-gateway/app/ocirepository.yaml
- kubernetes/apps/observability/gatus/app/externalsecret.yaml
- kubernetes/apps/observability/gatus/app/grafana-dashboard.yaml
- kubernetes/apps/observability/gatus/app/helmrelease.yaml
- kubernetes/apps/observability/gatus/app/kustomization.yaml
- kubernetes/apps/observability/gatus/app/resources/config.yaml
- kubernetes/apps/observability/gatus/ks.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/externalsecret.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/ocirepository.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml
- kubernetes/apps/observability/kube-prometheus-stack/app/scrapeconfig.yaml
- kubernetes/apps/observability/loki/app/helmrelease.yaml
The observability stack provides a comprehensive view of the health, performance, and security of the cluster. It follows a “Single Pane of Glass” philosophy, aggregating metrics, logs, traces, and status checks into a unified Grafana-centric workflow. The stack is designed for high availability: critical components such as Gatus and Alertmanager feature external resilience and multi-channel notification routing.
Pipeline Architecture
The observability pipeline is composed of specialized collectors that feed into centralized storage and visualization engines.
Data Flow Overview
The diagram below illustrates how telemetry flows from cluster resources to the end-user.
Title: Telemetry Data Flow
[Flowchart Diagram]
Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 113-138, kubernetes/apps/observability/loki/app/helmrelease.yaml 35-65, kubernetes/apps/observability/gatus/app/resources/config.yaml 23-31
Core Components
Prometheus, Alertmanager, and Grafana
The foundation of the stack is the kube-prometheus-stack. It manages the lifecycle of Prometheus and Alertmanager using the Prometheus Operator. Key features include:
- Metric Collection: Aggregates data from `node-exporter` for host metrics and `kube-state-metrics` for cluster resource state (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 113-138).
- Alerting: Custom rules handle specific failure modes such as ZFS pool degradation (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 171-182) and Docker Hub rate limiting (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 142-154).
- Routing: Alertmanager routes critical alerts to Discord and Pushover, while the `remediation-webhook` receiver triggers automated fixes (kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml 10-33).
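To make the alerting item concrete, here is a minimal sketch of what a custom rule for ZFS pool degradation might look like as a standalone `PrometheusRule`. It is not the repository's actual rule: the resource name, alert name, `for` duration, and severity are hypothetical, and the expression assumes the `node_zfs_zpool_state` metric exposed by the node-exporter ZFS collector.

```yaml
# Sketch only: names, thresholds, and labels below are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zfs-pool-health            # hypothetical resource name
  namespace: observability         # assumed namespace
spec:
  groups:
    - name: zfs.rules
      rules:
        - alert: ZfsPoolNotOnline  # hypothetical alert name
          # Assumes node-exporter reports one series per pool and state,
          # with value 1 for the state the pool is currently in.
          expr: node_zfs_zpool_state{state!="online"} > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: 'ZFS pool {{ $labels.zpool }} on {{ $labels.instance }} is {{ $labels.state }}'
```

The same rule body could instead be embedded in the chart values, which is where the cited line ranges suggest the repository keeps its rules.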
For details, see Prometheus, Alertmanager, and Grafana.
Loki, Tempo, and Pyroscope
This layer handles non-metric telemetry:
- Loki: Deployed in `SingleBinary` mode using `tsdb` indexing for efficient log storage with a 14-day retention period (kubernetes/apps/observability/loki/app/helmrelease.yaml 35-65).
- Tempo & Pyroscope: Provide distributed tracing and continuous profiling to identify bottlenecks in application performance.
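The Loki settings described above could be expressed roughly as follows. This is a sketch assuming the value layout of the `grafana/loki` Helm chart (6.x); only the `SingleBinary` mode, `tsdb` index, and 14-day retention come from the text, while the schema start date, index prefix, and filesystem storage are assumptions.

```yaml
# Sketch of Helm values, not the repository's actual HelmRelease.
deploymentMode: SingleBinary
loki:
  auth_enabled: false              # assumed: single-tenant setup
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem               # assumed: local volume rather than object storage
  schemaConfig:
    configs:
      - from: "2024-04-01"         # assumed schema start date
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_      # assumed prefix
          period: 24h
  limits_config:
    retention_period: 14d          # 14-day retention from the text
  compactor:
    retention_enabled: true        # retention is only enforced when the compactor applies it
    delete_request_store: filesystem
singleBinary:
  replicas: 1
# The scalable-mode targets are disabled when running as a single binary.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```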
For details, see Loki, Tempo, and Pyroscope.
Gatus and Health Monitoring
Gatus serves as the public-facing status page, performing periodic health checks on internal and external endpoints.
- Auto-Discovery: Uses a `gatus-sidecar` to automatically discover `HTTPRoute` and `Service` resources via the Kubernetes API (kubernetes/apps/observability/gatus/app/helmrelease.yaml 32-48).
- Persistence: Stores state in a SQLite database (kubernetes/apps/observability/gatus/app/resources/config.yaml 5-9).
- Exporters: Specialized exporters like `smartctl-exporter` and `nut-exporter` provide deep visibility into hardware health and UPS status.
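For orientation, a Gatus configuration with SQLite persistence and one manually defined health check might look like the sketch below. The database path, endpoint name, group, URL, and thresholds are hypothetical; endpoints generated by the sidecar are not shown.

```yaml
# Sketch only: path, endpoint, and thresholds are illustrative assumptions.
storage:
  type: sqlite
  path: /config/sqlite.db          # assumed location on the persistent volume
endpoints:
  - name: example-service          # hypothetical endpoint
    group: external
    url: https://example.com/health
    interval: 1m
    conditions:
      - "[STATUS] == 200"          # the check passes only on HTTP 200
      - "[RESPONSE_TIME] < 500"    # and when the response arrives within 500 ms
```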
For details, see Gatus, Exporters, and Health Monitoring.
Automated Remediation
The cluster implements self-healing patterns where Alertmanager triggers Kubernetes Jobs to resolve known issues.
- Webhook Server: Receives alerts (e.g., `VolSyncVolumeOutOfSync`) and initiates remediation scripts (kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml 46-49).
- VolSync Remediation: Automatically triggers replication or recovery jobs when volumes fall out of sync.
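One way to wire this pattern in Alertmanager is sketched below. The `remediation-webhook` receiver name and the `VolSyncVolumeOutOfSync` alert come from the text; the `notifications` receiver, grouping, service URL, and credential placeholders are assumptions rather than the repository's actual configuration.

```yaml
# Sketch only: everything except the receiver and alert names cited in the text is assumed.
route:
  receiver: notifications          # hypothetical default receiver
  group_by: ["alertname", "namespace"]
  routes:
    - receiver: remediation-webhook
      matchers:
        - 'alertname = "VolSyncVolumeOutOfSync"'
      continue: true               # keep matching later routes so people are still notified
    - receiver: notifications
      matchers:
        - 'severity = "critical"'
receivers:
  - name: remediation-webhook
    webhook_configs:
      # Hypothetical in-cluster URL of the webhook server.
      - url: http://remediation-webhook.observability.svc.cluster.local:9000/hooks/volsync
        send_resolved: false       # a resolved alert should not trigger another remediation run
  - name: notifications
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/REPLACE_ME
    pushover_configs:
      - user_key: REPLACE_ME
        token: REPLACE_ME
```

Setting `continue: true` on the remediation route is a design choice: without it, an alert matched there would stop at the webhook and never reach the human-facing channels.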
For details, see Automated Remediation and Webhooks.
Code-to-System Mapping
The following diagram maps high-level observability concepts to the specific Helm releases and CRDs defined in the codebase.
Title: Observability Code Entities
[Class Diagram]
Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 5-6, kubernetes/apps/observability/loki/app/helmrelease.yaml 19-20, kubernetes/apps/observability/gatus/app/helmrelease.yaml 5-6, kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml 40-53
Component Summary Table
| Component | Role | Persistence | Access |
|---|---|---|---|
| Prometheus | Metric Aggregation | 50Gi `openebs-hostpath` (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 109-112) | prometheus.cloudjur.com |
| Loki | Log Management | 50Gi `openebs-hostpath` (kubernetes/apps/observability/loki/app/helmrelease.yaml 80-81) | Grafana Datasource |
| Gatus | Status Monitoring | SQLite, VolSync-backed (kubernetes/apps/observability/gatus/app/resources/config.yaml 5-9) | gatus.cloudjur.com |
| Alertmanager | Alert Routing | 1Gi `local-hostpath` (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 50-53) | alertmanager.cloudjur.com |
Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml 27-112, kubernetes/apps/observability/gatus/app/helmrelease.yaml 125-172, kubernetes/apps/observability/loki/app/helmrelease.yaml 35-81