Observability Stack


The observability stack provides a comprehensive view into the health, performance, and security of the cluster. It follows a “Single Pane of Glass” philosophy by aggregating metrics, logs, traces, and status monitoring into a unified Grafana-centric workflow. The stack is designed for high availability, with critical components like Gatus and Alertmanager featuring external resilience and multi-channel notification routing.

Pipeline Architecture

The observability pipeline is composed of specialized collectors that feed into centralized storage and visualization engines.

Data Flow Overview

The diagram below illustrates how telemetry flows from cluster resources to the end-user.

[Flowchart diagram: Telemetry Data Flow]

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:113-138, kubernetes/apps/observability/loki/app/helmrelease.yaml:35-65, kubernetes/apps/observability/gatus/app/resources/config.yaml:23-31

Core Components

Prometheus, Alertmanager, and Grafana

The foundation of the stack is the kube-prometheus-stack. It manages the lifecycle of Prometheus and Alertmanager using the Prometheus Operator. Key features include:

Metric Collection: Aggregates data from node-exporter for host metrics and kube-state-metrics for cluster resource state (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:113-138).
Alerting: Custom rules handle specific failure modes such as ZFS pool degradation (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:171-182) and Dockerhub rate limiting (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:142-154).
Routing: Alertmanager routes critical alerts to Discord and Pushover, while the remediation-webhook triggers automated fixes (kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml:10-33).
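
As a rough illustration of the routing described above, the fragment below shows how a native Alertmanager configuration could fan critical alerts out to both Discord and Pushover. It is a sketch, not the repository's alertmanager.yaml: the receiver names, grouping labels, the `severity` label value, and all credential placeholders are assumptions. The repository may instead express this through the Prometheus Operator's AlertmanagerConfig resource, which uses different field names.

```yaml
# Hypothetical Alertmanager routing sketch (native config format).
route:
  receiver: "null"                  # assumed default: drop anything unmatched
  group_by: [alertname, namespace]  # assumed grouping labels
  routes:
    - receiver: discord
      matchers:
        - 'severity = "critical"'
      continue: true                # keep evaluating so Pushover is notified too
    - receiver: pushover
      matchers:
        - 'severity = "critical"'

receivers:
  - name: "null"
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/REPLACE_ME"  # placeholder
  - name: pushover
    pushover_configs:
      - user_key: "REPLACE_ME"      # placeholder
        token: "REPLACE_ME"         # placeholder
```

The `continue: true` flag matters in this layout: without it, Alertmanager stops at the first matching sibling route, so only one of the two channels would be notified. Attaching both `discord_configs` and `pushover_configs` to a single receiver would be an equally valid alternative.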

For details, see Prometheus, Alertmanager, and Grafana.

Loki, Tempo, and Pyroscope

This layer handles non-metric telemetry:

Loki: Deployed in SingleBinary mode using tsdb indexing for efficient log storage with a 14-day retention period (kubernetes/apps/observability/loki/app/helmrelease.yaml:35-65).
Tempo and Pyroscope: Provide distributed tracing and continuous profiling, respectively, to help identify application performance bottlenecks.
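
The Loki settings named above (SingleBinary mode, tsdb indexing, 14-day retention) could be expressed roughly as follows, assuming the grafana/loki Helm chart is in use. This is a minimal sketch rather than the repository's helmrelease.yaml: the schema start date, filesystem object store, index prefix, and replica count are assumptions.

```yaml
# Hypothetical Helm values sketch for the grafana/loki chart.
deploymentMode: SingleBinary
loki:
  schemaConfig:
    configs:
      - from: "2024-04-01"          # assumed schema start date
        store: tsdb                 # tsdb index, as stated in the text
        object_store: filesystem    # assumed: local volume rather than S3
        schema: v13
        index:
          prefix: loki_index_       # assumed prefix
          period: 24h
  limits_config:
    retention_period: 14d           # 14-day retention, as stated in the text
  compactor:
    retention_enabled: true         # the compactor is what enforces retention
    delete_request_store: filesystem
singleBinary:
  replicas: 1                       # assumed replica count
```

Setting `retention_period` alone does not delete anything. Loki only removes expired chunks when the compactor runs with `retention_enabled: true`.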

For details, see Loki, Tempo, and Pyroscope.

Gatus and Health Monitoring

Gatus serves as the public-facing status page, performing periodic health checks on internal and external endpoints.

Auto-Discovery: Uses a gatus-sidecar to automatically discover HTTPRoute and Service resources via the Kubernetes API (kubernetes/apps/observability/gatus/app/helmrelease.yaml:32-48).
Persistence: Stores state in a SQLite database (kubernetes/apps/observability/gatus/app/resources/config.yaml:5-9).
Exporters: Specialized exporters such as smartctl-exporter and nut-exporter expose disk (S.M.A.R.T.) health and UPS status, respectively.
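
To make the persistence and health-check model concrete, here is a minimal sketch of a Gatus configuration with SQLite storage and one hand-written endpoint. The database path, endpoint name, group, URL, interval, and conditions are all hypothetical. According to the text above, endpoints in this stack are discovered by the gatus-sidecar rather than written by hand.

```yaml
# Hypothetical Gatus configuration sketch.
storage:
  type: sqlite
  path: /config/sqlite.db           # assumed path on the persistent volume

endpoints:
  - name: example-app               # hypothetical endpoint
    group: internal                 # hypothetical group label
    url: "https://app.example.com/healthz"
    interval: 1m                    # assumed check frequency
    conditions:
      - "[STATUS] == 200"           # healthy only on HTTP 200
      - "[RESPONSE_TIME] < 1000"    # and a response in under 1000 ms
```

An endpoint is reported healthy only when every listed condition passes, so the response-time condition turns a slow but reachable service into a visible failure on the status page.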

For details, see Gatus, Exporters, and Health Monitoring.

Automated Remediation

The cluster implements self-healing patterns in which alerts routed by Alertmanager trigger Kubernetes Jobs to resolve known issues.

Webhook Server: Receives alerts (e.g., VolSyncVolumeOutOfSync) and initiates remediation scripts (kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml:46-49).
VolSync Remediation: Automatically triggers replication or recovery jobs when volumes fall out of sync.
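
The handoff from alert to remediation could look like the following Alertmanager fragment, which routes VolSyncVolumeOutOfSync to a webhook receiver. Only the alert name and the remediation-webhook name come from the text above. The service URL, port, path, and the `continue` and `send_resolved` choices are assumptions for illustration.

```yaml
# Hypothetical fragment: the top-level default receiver is omitted for brevity.
route:
  routes:
    - receiver: remediation-webhook
      matchers:
        - 'alertname = "VolSyncVolumeOutOfSync"'
      continue: true                # assumed: the alert should still reach human-facing receivers

receivers:
  - name: remediation-webhook
    webhook_configs:
      # Hypothetical in-cluster service address and hook path.
      - url: "http://remediation-webhook.observability.svc.cluster.local:9000/hooks/volsync"
        send_resolved: false        # assumed: remediation runs on firing alerts only
```

Alertmanager delivers webhook notifications as an HTTP POST with a JSON body that includes each alert's labels, so a webhook server can use those labels to decide which volume to act on.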

For details, see Automated Remediation and Webhooks.

Code-to-System Mapping

The following diagram maps high-level observability concepts to the specific Helm releases and CRDs defined in the codebase.

[Class diagram: Observability Code Entities]

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:5-6, kubernetes/apps/observability/loki/app/helmrelease.yaml:19-20, kubernetes/apps/observability/gatus/app/helmrelease.yaml:5-6, kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml:40-53

Component Summary Table

Component	Role	Persistence	Access
Prometheus	Metric Aggregation	50Gi `openebs-hostpath` (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:109-112)	`prometheus.cloudjur.com`
Loki	Log Management	50Gi `openebs-hostpath` (kubernetes/apps/observability/loki/app/helmrelease.yaml:80-81)	Grafana Datasource
Gatus	Status Monitoring	SQLite, `volsync`-backed (kubernetes/apps/observability/gatus/app/resources/config.yaml:5-9)	`gatus.cloudjur.com`
Alertmanager	Alert Routing	1Gi `local-hostpath` (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:50-53)	`alertmanager.cloudjur.com`

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:27-112, kubernetes/apps/observability/gatus/app/helmrelease.yaml:125-172, kubernetes/apps/observability/loki/app/helmrelease.yaml:35-81
