Automated Remediation and Webhooks

The automated remediation system in this repository provides a self-healing mechanism for common cluster operational issues, specifically focusing on VolSync backup failures. The system leverages Prometheus alerts to trigger targeted Kubernetes Jobs that resolve stale locks or hung snapshots without manual intervention.

System Architecture and Data Flow

The remediation pipeline operates as an event-driven loop starting from the observability stack and ending with a corrective action in the application namespace.

Remediation Data Flow

  1. Alerting: Prometheus fires the VolSyncVolumeOutOfSync alert.
  2. Webhook Trigger: Alertmanager sends a POST request to the remediation-webhook service.
  3. Job Creation: The webhook executes volsync-remediation.sh, which uses kubectl to create a one-off Job from the remediation CronJob template.
  4. Analysis: The remediation Job queries the Prometheus API to identify the specific namespace and object causing the alert.
  5. Action: The Job executes logic to unlock Restic repositories, patch finalizers, and trigger manual syncs.
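
Steps 3 and 4 above can be sketched in shell as follows. This is a minimal illustration, not the repository's script: the CronJob name, the Prometheus URL, and the assumption that the alert carries VolSync's obj_namespace and obj_name labels are all hypothetical. It relies only on `kubectl create job --from=cronjob/...`, the Prometheus `/api/v1/alerts` endpoint, and `jq`.

```shell
#!/usr/bin/env bash
# Sketch only: CRONJOB_NAME, JOBS_NAMESPACE and PROMETHEUS_URL are hypothetical
# placeholders, not values taken from the repository.
CRONJOB_NAME="${CRONJOB_NAME:-volsync-remediation}"
JOBS_NAMESPACE="${JOBS_NAMESPACE:-jobs}"
PROMETHEUS_URL="${PROMETHEUS_URL:-http://prometheus.example.svc:9090}"

# Step 3: spawn a one-off Job from the CronJob's jobTemplate.
# The epoch suffix keeps Job names unique across repeated alerts.
create_remediation_job() {
  local job_name
  job_name="${CRONJOB_NAME}-$(date +%s)"
  kubectl --namespace "$JOBS_NAMESPACE" create job "$job_name" \
    --from="cronjob/${CRONJOB_NAME}"
}

# Step 4: read an /api/v1/alerts response on stdin and print one
# "<namespace> <object>" line per firing alert with the given name.
# Assumes the alert carries obj_namespace/obj_name labels (an assumption).
extract_alert_targets() {
  local alert_name="$1"
  jq -r --arg name "$alert_name" '
    .data.alerts[]
    | select(.labels.alertname == $name and .state == "firing")
    | "\(.labels.obj_namespace) \(.labels.obj_name)"'
}

# Fetch the current alerts from Prometheus and extract the targets.
list_out_of_sync_targets() {
  curl -fsS "${PROMETHEUS_URL}/api/v1/alerts" \
    | extract_alert_targets "VolSyncVolumeOutOfSync"
}
```

Keeping the parsing in its own function that reads from stdin makes it possible to exercise it against a saved API response without a live Prometheus instance.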

Component Relationship Diagram

This diagram maps the natural language flow to specific code entities and files.

“Remediation System Map”

[Flowchart Diagram]

Sources: kubernetes/apps/observability/webhook/app/helmrelease.yaml (lines 5-20), kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh (lines 1-15), kubernetes/apps/jobs/remediation/app/helmrelease.yaml (lines 21-42)


Observability Webhook Server

The remediation-webhook is a specialized receiver for Alertmanager notifications. It is deployed using the bjw-s/app-template and is designed to interact with the Kubernetes API from within the observability namespace.
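
To exercise a receiver like this by hand, one option is to POST a minimal Alertmanager-style payload to it. The sketch below assumes a hypothetical service URL, port, and path, and hypothetical label values; only the overall payload shape (version "4", a top-level status, and an alerts array with labels) follows Alertmanager's webhook format.

```shell
#!/usr/bin/env bash
# Sketch only: WEBHOOK_URL (host, port, path) and the label values are
# hypothetical and must be adapted to the actual deployment.
WEBHOOK_URL="${WEBHOOK_URL:-http://remediation-webhook.observability.svc:9000/}"

# Print a minimal Alertmanager-style webhook body for one firing alert.
build_test_payload() {
  local namespace="$1" object="$2"
  printf '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"VolSyncVolumeOutOfSync","obj_namespace":"%s","obj_name":"%s"}}]}\n' \
    "$namespace" "$object"
}

# POST the payload the way Alertmanager would.
send_test_alert() {
  build_test_payload "$1" "$2" \
    | curl -fsS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "$WEBHOOK_URL"
}

# Usage (hypothetical namespace and object): send_test_alert media plex
```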

Implementation Details

Sources: kubernetes/apps/observability/webhook/app/helmrelease.yaml (lines 1-119), kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh (lines 1-15)


VolSync Remediation Logic

The core remediation logic resides in a CronJob defined in the jobs namespace. Although it is defined as a CronJob, its schedule is set to February 30th (0 0 30 2 *), a date that never occurs, so the job runs only when triggered manually or by the webhook (kubernetes/apps/jobs/remediation/app/helmrelease.yaml, lines 24-25).
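
The reason this schedule never fires can be checked with a small helper. The function below is a hypothetical illustration, not part of the repository; it handles only plain numeric day-of-month and month fields and treats February as having at most 29 days.

```shell
#!/usr/bin/env bash
# Return 0 when a cron day-of-month/month pair can never occur.
# Handles plain numeric fields only; ranges, lists and steps are out of scope.
never_fires() {
  local day="$1" month="$2" max
  case "$month" in
    2) max=29 ;;            # February, leap years included
    4|6|9|11) max=30 ;;     # April, June, September, November
    *) max=31 ;;
  esac
  [ "$day" -gt "$max" ]
}

# "0 0 30 2 *" has day-of-month 30 and month 2.
never_fires 30 2 && echo "schedule never fires"
```

Because the day-of-week field is a wildcard, only the day-of-month and month fields decide whether the schedule can match.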

VolSync Recovery Process

The remediation script performs the following sequence for every alert labeled VolSyncVolumeOutOfSync:

  1. State Verification: It checks for failed pods in the volsync-src-${name} job. It proceeds only if there are Failed or Error pods and zero Running or Succeeded pods, so that active syncs are not interrupted (kubernetes/apps/jobs/remediation/app/helmrelease.yaml, lines 55-75).
  2. Restic Unlocking: If a lock is suspected, it creates a temporary Job named volsync-unlock-${app_name}-r2 that uses the restic/restic image to run unlock --remove-all. This Job inherits its environment variables from the application's VolSync secret (kubernetes/apps/jobs/remediation/app/helmrelease.yaml, lines 82-104).
  3. Finalizer Cleanup: It patches the VolumeSnapshot named volsync-$name-src to remove its finalizers, then deletes the snapshot to clear hung states (kubernetes/apps/jobs/remediation/app/helmrelease.yaml, lines 115-116).
  4. Manual Trigger: Finally, it patches the ReplicationSource with a manual trigger timestamp to restart the backup immediately (kubernetes/apps/jobs/remediation/app/helmrelease.yaml, lines 112-113).
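
The four steps above can be sketched as shell functions. This is an approximation under stated assumptions rather than the script embedded in the HelmRelease: the secret name passed to render_unlock_job, the image tag, and the way pod counts are obtained are hypothetical. The resource names, the restic arguments, and the spec.trigger.manual field follow the description above and VolSync's ReplicationSource API.

```shell
#!/usr/bin/env bash
# Sketch only. The secret name passed to render_unlock_job and the restic
# image tag are hypothetical; resource names follow the steps above.

# Step 1: remediate only when something failed and nothing is running or done.
should_remediate() {
  local failed="$1" running="$2" succeeded="$3"
  [ "$failed" -gt 0 ] && [ "$running" -eq 0 ] && [ "$succeeded" -eq 0 ]
}

# Step 2: print a Job manifest that runs `restic unlock --remove-all`,
# loading repository credentials from the given secret.
render_unlock_job() {
  local app_name="$1" namespace="$2" secret="$3"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-unlock-${app_name}-r2
  namespace: ${namespace}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: unlock
          image: restic/restic:latest
          args: ["unlock", "--remove-all"]
          envFrom:
            - secretRef:
                name: ${secret}
EOF
}

# Step 3: drop finalizers from the hung snapshot, then delete it.
clear_snapshot() {
  local name="$1" namespace="$2"
  kubectl --namespace "$namespace" patch volumesnapshot "volsync-${name}-src" \
    --type merge --patch '{"metadata":{"finalizers":null}}'
  kubectl --namespace "$namespace" delete volumesnapshot "volsync-${name}-src" \
    --ignore-not-found --wait=false
}

# Step 4: set spec.trigger.manual to a fresh value so VolSync syncs now.
trigger_sync() {
  local name="$1" namespace="$2"
  kubectl --namespace "$namespace" patch replicationsource "$name" \
    --type merge --patch "{\"spec\":{\"trigger\":{\"manual\":\"$(date +%s)\"}}}"
}
```

A caller would typically pipe the rendered manifest into `kubectl apply -f -` and run the remaining steps only when should_remediate succeeds, for example `should_remediate "$failed" "$running" "$succeeded" && clear_snapshot "$name" "$ns"`.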

Logic Entity Diagram

This diagram illustrates the internal functions and logic branches of the remediation script.

“Remediation Script Logic Flow”

[Flowchart Diagram]

Sources: kubernetes/apps/jobs/remediation/app/helmrelease.yaml (lines 41-118)


Jobs Namespace and Pattern

The jobs namespace serves as a central execution environment for administrative tasks. It is managed via a Flux Kustomization that pulls from kubernetes/apps/jobs/remediation/app (kubernetes/apps/jobs/remediation/ks.yaml, lines 1-22).

Remediation HelmRelease Pattern

The remediation job uses the app-template with a specific configuration for high-reliability administrative tasks:

The remediation system interacts with standard VolSync components defined in the repository:

Sources: kubernetes/apps/jobs/remediation/app/helmrelease.yaml (lines 139-165), kubernetes/components/volsync/r2.yaml (lines 1-17), kubernetes/apps/volsync-system/volsync/app/mutatingadmissionpolicy.yaml (lines 1-53)