Automated Remediation and Webhooks
Relevant source files
- docs/ai-context/SCHEDULES.md
- kubernetes/apps/jobs/kustomization.yaml
- kubernetes/apps/jobs/remediation/app/helmrelease.yaml
- kubernetes/apps/jobs/remediation/app/kustomization.yaml
- kubernetes/apps/jobs/remediation/ks.yaml
- kubernetes/apps/observability/webhook/app/helmrelease.yaml
- kubernetes/apps/observability/webhook/app/kustomization.yaml
- kubernetes/apps/observability/webhook/app/resources/hooks.yaml
- kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh
- kubernetes/apps/observability/webhook/ks.yaml
- kubernetes/apps/volsync-system/volsync/app/mutatingadmissionpolicy.yaml
- kubernetes/apps/volsync-system/volsync/maintenance/externalsecret.yaml
- kubernetes/apps/volsync-system/volsync/maintenance/kopiamaintenance.yaml
- kubernetes/apps/volsync-system/volsync/maintenance/kustomization.yaml
- kubernetes/apps/volsync-system/volsync/maintenance/tbd.yaml
- kubernetes/components/volsync/kopia.yaml
- kubernetes/components/volsync/pvc.yaml
- kubernetes/components/volsync/r2.yaml
The automated remediation system in this repository provides a self-healing mechanism for common cluster operational issues, specifically focusing on VolSync backup failures. The system leverages Prometheus alerts to trigger targeted Kubernetes Jobs that resolve stale locks or hung snapshots without manual intervention.
System Architecture and Data Flow
The remediation pipeline operates as an event-driven loop starting from the observability stack and ending with a corrective action in the application namespace.
Remediation Data Flow
- Alerting: Prometheus detects a `VolSyncVolumeOutOfSync` alert.
- Webhook Trigger: Alertmanager sends a POST request to the `remediation-webhook` service.
- Job Creation: The webhook executes `volsync-remediation.sh`, which uses `kubectl` to create a one-off Job from the `remediation` CronJob template.
- Analysis: The `remediation` Job queries the Prometheus API to identify the specific namespace and object causing the alert.
- Action: The Job executes logic to unlock Restic repositories, patch finalizers, and trigger manual syncs.
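The Analysis step can be illustrated with a small sketch. The source does not say which Prometheus endpoint or query the Job uses, so this example assumes an instant query against the built-in `ALERTS` series through `/api/v1/query`. The `PROM_URL` value is a hypothetical placeholder, and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: endpoint, query, and PROM_URL are assumptions, not taken from the repository.

# Hypothetical in-cluster Prometheus address.
PROM_URL="${PROM_URL:-http://prometheus.observability.svc:9090}"

# Build a PromQL selector for firing instances of one alert.
build_alert_query() {
    alertname="$1"
    printf 'ALERTS{alertname="%s",alertstate="firing"}\n' "$alertname"
}

# Print the curl invocation that would run the query; nothing is sent.
print_query_command() {
    query="$(build_alert_query "$1")"
    printf "curl -sG '%s/api/v1/query' --data-urlencode 'query=%s'\n" "$PROM_URL" "$query"
}

print_query_command VolSyncVolumeOutOfSync
```

The labels on each returned series would then identify the affected namespace and object. The exact label names depend on how the alert rule is written.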
Component Relationship Diagram
This diagram maps the natural language flow to specific code entities and files.
“Remediation System Map”
[Flowchart Diagram]
Sources: kubernetes/apps/observability/webhook/app/helmrelease.yaml:5-20, kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh:1-15, kubernetes/apps/jobs/remediation/app/helmrelease.yaml:21-42
Observability Webhook Server
The remediation-webhook is a specialized receiver for Alertmanager notifications. It is deployed using the bjw-s/app-template and is designed to interact with the Kubernetes API from within the observability namespace.
Implementation Details
- Kubernetes Integration: The pod includes an `initContainer` named `copy-kubectl` that extracts the `kubectl` binary from a utility image into a shared `emptyDir` volume at `/kubectl` (kubernetes/apps/observability/webhook/app/helmrelease.yaml:24-32).
- RBAC: The service account is granted `ClusterRole` permissions to `create` jobs and `get` the `remediation` cronjob in the `jobs` namespace (kubernetes/apps/observability/webhook/app/helmrelease.yaml:95-107).
- Script Execution: When triggered, it runs `volsync-remediation.sh`. This script generates a unique `JOB_NAME` using a timestamp and executes `/kubectl/kubectl create job "${JOB_NAME}" --from=cronjob/remediation -n jobs` (kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh:5-12).
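A minimal sketch of the script execution step follows. The `kubectl` command line is the one quoted above; the `remediation` name prefix and the use of a Unix timestamp are assumptions for illustration, since the source only says the name is derived from a timestamp. The command is printed instead of run, so the sketch is safe to execute outside a cluster.

```shell
#!/bin/sh
# Sketch only: the name prefix and timestamp format are assumptions.

# Build a unique Job name from a prefix and a timestamp.
build_job_name() {
    prefix="$1"
    timestamp="$2"
    printf '%s-%s\n' "$prefix" "$timestamp"
}

# Print the kubectl invocation for a given Job name; nothing is created.
print_create_command() {
    job_name="$1"
    printf '/kubectl/kubectl create job "%s" --from=cronjob/remediation -n jobs\n' "$job_name"
}

JOB_NAME="$(build_job_name remediation "$(date +%s)")"
print_create_command "$JOB_NAME"
```

Deriving the name from a timestamp keeps repeated alert deliveries from colliding on the same Job name, as long as they do not arrive within the same timestamp resolution.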
Sources: kubernetes/apps/observability/webhook/app/helmrelease.yaml:1-119, kubernetes/apps/observability/webhook/app/resources/volsync-remediation.sh:1-15
VolSync Remediation Logic
The core remediation logic resides in a CronJob defined in the jobs namespace. Its schedule is set to February 30th (`0 0 30 2 *`), a date that never occurs, so the CronJob never fires on its own and runs only when a Job is created from it manually or by the webhook (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:24-25).
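The schedule `0 0 30 2 *` asks for midnight on day 30 of month 2. The sketch below, using hypothetical helper names, shows why that never matches: February has at most 29 days under Gregorian calendar rules. Because the day-of-week field is `*`, only the day-of-month and month fields constrain the schedule.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Number of days in February for a given year (Gregorian leap-year rules).
days_in_february() {
    year="$1"
    if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
        echo 29
    else
        echo 28
    fi
}

# Day 30 of month 2 can match only if February has at least 30 days.
schedule_can_fire() {
    if [ "$(days_in_february "$1")" -ge 30 ]; then
        echo yes
    else
        echo no
    fi
}

for y in 2023 2024 2100; do
    printf '%s: February has %s days, schedule can fire: %s\n' \
        "$y" "$(days_in_february "$y")" "$(schedule_can_fire "$y")"
done
```

Creating a Job with `kubectl create job --from=cronjob/...` copies the CronJob's job template and does not depend on the schedule, which is why the template remains usable.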
VolSync Recovery Process
The remediation script performs the following sequence for every alert labeled VolSyncVolumeOutOfSync:
- State Verification: It checks for failed pods in the `volsync-src-${name}` job. It only proceeds if there are `Failed` or `Error` pods and zero `Running` or `Succeeded` pods, to avoid interrupting active syncs (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:55-75).
- Restic Unlocking: If a lock is suspected, it creates a temporary Job `volsync-unlock-${app_name}-r2` using the `restic/restic` image to run `unlock --remove-all`. This job inherits environment variables from the application’s VolSync secret (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:82-104).
- Finalizer Cleanup: It patches the `VolumeSnapshot` named `volsync-$name-src` to remove finalizers and deletes the snapshot to clear hung states (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:115-116).
- Manual Trigger: Finally, it patches the `ReplicationSource` with a manual trigger timestamp to restart the backup immediately (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:112-113).
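The state check and the two patch steps above can be sketched as follows. This is an illustration under assumptions, not the script from the HelmRelease: the function names, the merge-patch form, the namespace `media`, and the app name `example-app` are hypothetical, and the sketch assumes the `ReplicationSource` shares the `name` used in `volsync-src-${name}`. Commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: command forms and names are assumptions; the real script may differ.

# Succeeds (exit 0) only when at least one mover pod has failed (Failed or Error)
# and none are running or succeeded, so an active sync is never interrupted.
should_remediate() {
    failed="$1"
    running="$2"
    succeeded="$3"
    [ "$failed" -gt 0 ] && [ "$running" -eq 0 ] && [ "$succeeded" -eq 0 ]
}

# Print (not run) the snapshot cleanup and manual-trigger commands.
print_recovery_commands() {
    ns="$1"
    name="$2"
    stamp="$3"
    finalizer_patch='{"metadata":{"finalizers":null}}'
    trigger_patch="{\"spec\":{\"trigger\":{\"manual\":\"${stamp}\"}}}"
    printf '%s\n' "kubectl -n ${ns} patch volumesnapshot volsync-${name}-src --type merge -p '${finalizer_patch}'"
    printf '%s\n' "kubectl -n ${ns} delete volumesnapshot volsync-${name}-src"
    printf '%s\n' "kubectl -n ${ns} patch replicationsource ${name} --type merge -p '${trigger_patch}'"
}

# Example: one failed pod, nothing running or succeeded.
if should_remediate 1 0 0; then
    print_recovery_commands media example-app "$(date +%s)"
fi
```

Setting `spec.trigger.manual` to a new value is how VolSync is asked for an immediate sync, so a fresh timestamp on each run guarantees the value changes.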
Logic Entity Diagram
This diagram illustrates the internal functions and logic branches of the remediation script.
“Remediation Script Logic Flow”
[Flowchart Diagram]
Sources: kubernetes/apps/jobs/remediation/app/helmrelease.yaml:41-118
Jobs Namespace and Pattern
The jobs namespace serves as a central execution environment for administrative tasks. It is managed via a Flux Kustomization that pulls from kubernetes/apps/jobs/remediation/app (kubernetes/apps/jobs/remediation/ks.yaml:1-22).
Remediation HelmRelease Pattern
The remediation job uses the app-template with a specific configuration for high-reliability administrative tasks:
- RBAC Permissions: The job requires broad permissions across multiple API groups to perform its duties:
  - `volsync.backube`: `replicationsources` (get, list, patch) (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:145-147)
  - `snapshot.storage.k8s.io`: `volumesnapshots` (delete, get, list, patch) (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:148-150)
  - `batch`: `jobs` (create, get, list, delete) (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:151-153)
- Security Context: Despite its high privilege in the Kubernetes API, the container runs with a restricted security context: `readOnlyRootFilesystem: true` and all capabilities dropped (kubernetes/apps/jobs/remediation/app/helmrelease.yaml:119-122).
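One way to confirm the permissions listed above is `kubectl auth can-i` with impersonation. The sketch below prints one check per verb and resource. The service account name `remediation` and the target namespace `media` are hypothetical placeholders, and the source does not state whether the grants are cluster-wide or per namespace.

```shell
#!/bin/sh
# Sketch only: service account name and namespace are hypothetical placeholders.

# Print one "kubectl auth can-i" check per verb for a single resource.
print_checks_for() {
    subject="$1"
    ns="$2"
    resource="$3"
    shift 3
    for verb in "$@"; do
        printf 'kubectl auth can-i %s %s -n %s --as=%s\n' "$verb" "$resource" "$ns" "$subject"
    done
}

# Print checks for every verb/resource pair listed in the RBAC section.
print_rbac_checks() {
    sa="system:serviceaccount:jobs:$1"
    target_ns="$2"
    print_checks_for "$sa" "$target_ns" replicationsources.volsync.backube get list patch
    print_checks_for "$sa" "$target_ns" volumesnapshots.snapshot.storage.k8s.io delete get list patch
    print_checks_for "$sa" "$target_ns" jobs.batch create get list delete
}

print_rbac_checks remediation media
```

Each printed command answers `yes` or `no` when run against a cluster by a user who is allowed to impersonate the service account.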
Related VolSync Components
The remediation system interacts with standard VolSync components defined in the repository:
- Kopia/Restic Configuration: Secrets like `${APP}-volsync-r2` provide the necessary credentials for the unlock jobs (kubernetes/components/volsync/r2.yaml:1-17).
- Admission Policies: A `MutatingAdmissionPolicy` named `volsync-mover-jitter` adds a random sleep (0-30s) to VolSync jobs to prevent simultaneous snapshot requests, reducing the likelihood of the errors the remediation system is designed to fix (kubernetes/apps/volsync-system/volsync/app/mutatingadmissionpolicy.yaml:10-53).
Sources: kubernetes/apps/jobs/remediation/app/helmrelease.yaml:139-165, kubernetes/components/volsync/r2.yaml:1-17, kubernetes/apps/volsync-system/volsync/app/mutatingadmissionpolicy.yaml:1-53