Observability Stack

The observability stack provides a comprehensive view into the health, performance, and security of the cluster. It follows a “Single Pane of Glass” philosophy by aggregating metrics, logs, traces, and status monitoring into a unified Grafana-centric workflow. The stack is designed for high availability, with critical components like Gatus and Alertmanager featuring external resilience and multi-channel notification routing.

Pipeline Architecture

The observability pipeline is composed of specialized collectors that feed into centralized storage and visualization engines.

Data Flow Overview

The diagram below illustrates how telemetry flows from cluster resources to the end-user.

Title: Telemetry Data Flow

[Flowchart Diagram]

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:113-138, kubernetes/apps/observability/loki/app/helmrelease.yaml:35-65, kubernetes/apps/observability/gatus/app/resources/config.yaml:23-31


Core Components

Prometheus, Alertmanager, and Grafana

The foundation of the stack is the kube-prometheus-stack. It manages the lifecycle of Prometheus and Alertmanager using the Prometheus Operator. Key features include:

For details, see Prometheus, Alertmanager, and Grafana.

Loki, Tempo, and Pyroscope

This layer handles non-metric telemetry: logs (Loki), distributed traces (Tempo), and continuous profiling data (Pyroscope).

For details, see Loki, Tempo, and Pyroscope.
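
Because these backends are consumed through Grafana, a datasource provisioning file is one way to wire them in. The fragment below is a minimal sketch, not the repository's actual configuration: the service hostnames are hypothetical, and the ports are the upstream defaults (3100 for Loki, 3200 for Tempo, 4040 for Pyroscope), assumed here rather than taken from the Helm values.

```yaml
# Sketch of a Grafana datasource provisioning file.
# All URLs are hypothetical; adjust service names, namespaces, and ports
# to match the actual Helm releases.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.observability.svc.cluster.local:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.observability.svc.cluster.local:3200
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    access: proxy
    url: http://pyroscope.observability.svc.cluster.local:4040
```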

Gatus and Health Monitoring

Gatus serves as the public-facing status page, performing periodic health checks on internal and external endpoints.
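
As an illustration of what such checks can look like, the following is a minimal sketch of a Gatus endpoint configuration. The endpoint names, groups, URLs, intervals, and thresholds are all hypothetical and are not taken from the repository's config.yaml; only the overall schema (endpoints with a url, an interval, and a list of conditions) follows Gatus conventions.

```yaml
# Hypothetical Gatus endpoints: one internal, one external.
endpoints:
  - name: example-internal-service        # hypothetical service
    group: internal
    url: "http://example.default.svc.cluster.local:8080/healthz"
    interval: 1m                          # how often the check runs
    conditions:
      - "[STATUS] == 200"                 # HTTP status must be 200
      - "[RESPONSE_TIME] < 500"           # response time in milliseconds
  - name: example-external-site           # hypothetical site
    group: external
    url: "https://example.com"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[CERTIFICATE_EXPIRATION] > 48h"  # flag certificates close to expiry
```

Each endpoint is reported as healthy only when every listed condition evaluates to true.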

For details, see Gatus, Exporters, and Health Monitoring.

Automated Remediation

The cluster implements self-healing patterns where Alertmanager triggers Kubernetes Jobs to resolve known issues.
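
One common way to build this pattern is to route specific alerts to a webhook receiver, which then creates the remediation Job. The fragment below is a sketch under that assumption: the alert name, receiver names, and webhook URL are hypothetical, and the handler behind the URL (the component that actually creates the Kubernetes Job) is not shown.

```yaml
# Sketch of an Alertmanager configuration that forwards one known alert
# to a remediation webhook. All names and URLs are hypothetical.
route:
  receiver: default
  group_by: ["alertname", "namespace"]
  routes:
    - receiver: remediation-webhook
      matchers:
        - 'alertname = "ExampleKnownIssue"'
      continue: true                # keep evaluating later sibling routes
receivers:
  - name: default
  - name: remediation-webhook
    webhook_configs:
      - url: "http://remediation-handler.observability.svc.cluster.local:8080/hooks/example"
        send_resolved: false        # only firing alerts trigger remediation
```

Setting `continue: true` lets the alert also match any sibling routes listed after this one, so a remediation route does not have to swallow the alert before other notification routes see it.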

For details, see Automated Remediation and Webhooks.


Code-to-System Mapping

The following diagram maps high-level observability concepts to the specific Helm releases and CRDs defined in the codebase.

Title: Observability Code Entities

[Class Diagram]

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:5-6, kubernetes/apps/observability/loki/app/helmrelease.yaml:19-20, kubernetes/apps/observability/gatus/app/helmrelease.yaml:5-6, kubernetes/apps/observability/kube-prometheus-stack/app/resources/alertmanager.yaml:40-53

Component Summary Table

| Component | Role | Persistence | Access |
| --- | --- | --- | --- |
| Prometheus | Metric Aggregation | 50Gi openebs-hostpath (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:109-112) | prometheus.cloudjur.com |
| Loki | Log Management | 50Gi openebs-hostpath (kubernetes/apps/observability/loki/app/helmrelease.yaml:80-81) | Grafana Datasource |
| Gatus | Status Monitoring | SQLite, volsync-backed (kubernetes/apps/observability/gatus/app/resources/config.yaml:5-9) | gatus.cloudjur.com |
| Alertmanager | Alert Routing | 1Gi local-hostpath (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:50-53) | alertmanager.cloudjur.com |

Sources: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml:27-112, kubernetes/apps/observability/gatus/app/helmrelease.yaml:125-172, kubernetes/apps/observability/loki/app/helmrelease.yaml:35-81