[Figure: backup_recovery.png]

Backup and Recovery Architecture

DRP provides a comprehensive backup and recovery system that enables automated backup and recovery operations across both single endpoints and High Availability (HA) clusters. Note that it backs up only DRP and its contents. The backup system assumes a running DRP endpoint from which backup commands can be run and on which backups can be stored, plus a running DRP on each endpoint being backed up. DRP and content version v4.15 at minimum is required on all involved endpoints.

This section focuses on the architecture of the backup and recovery system. Operational instructions can be found in the Backup and Recovery Tutorial.

The backup and recovery system provides three primary functions:

  • Automated backup scheduling and execution across multiple DRP endpoints
  • Restore operations with support for both single nodes and HA cluster reconstruction
  • Centralized backup management with retention policies, protection mechanisms, and validation

System Components

The backup and recovery architecture consists of several key components that work together to provide comprehensive data protection:

Backup Manager

The Backup Manager is a DRP endpoint that is configured to manage backup and recovery operations for other DRP endpoints. It serves as the central coordination point for all backup and restore activities.

Key responsibilities:

  • Maintains backup schedules and triggers automated backups
  • Stores backup archives and manages retention policies
  • Coordinates restore operations across target endpoints
  • Manages backup runner machines and restore agent deployments

Note

A single DRP endpoint can function as a backup manager while also serving other roles (endpoint manager, content server, etc.).

Backup Server

A Backup Server is a specialized Resource Broker, uniquely configured for a specific DRP endpoint being backed up. There is a one-to-one mapping between backup servers and target endpoints. For every backup server, one or more Backup Runner machines are associated to provide direct connectivity to the endpoint nodes. In single-node setups, a single backup runner is used. In HA cluster scenarios, one backup runner is created per node in the cluster. This structure ensures that each endpoint has exactly one backup server and n runners, where n corresponds to the number of nodes in that endpoint.

Each backup server maintains the following key information:

  • Endpoint IP and credentials to communicate with the restore agent running on each endpoint node
  • References to the runner machines on each node
  • List of backups taken for the backup server
  • Backup and restore state/readiness
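The one-server, n-runner relationship described above can be sketched as a small data model. This is only an illustration: the class names, field names, and the `build_backup_server` helper are hypothetical and do not reflect DRP's actual object schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BackupRunner:
    """Hypothetical record for one backup runner machine (one per endpoint node)."""
    node_id: str
    node_ip: str


@dataclass
class BackupServer:
    """Hypothetical record for a backup server (one per target endpoint)."""
    endpoint_id: str
    runners: List[BackupRunner] = field(default_factory=list)
    backups: List[str] = field(default_factory=list)
    state: str = "initializing"  # first value listed for backup/backup-server-state


def build_backup_server(endpoint_id: str, node_ips: List[str]) -> BackupServer:
    """Create one backup server with exactly one runner per endpoint node."""
    if not node_ips:
        raise ValueError("an endpoint must have at least one node")
    runners = [
        BackupRunner(node_id=f"{endpoint_id}-node-{i}", node_ip=ip)
        for i, ip in enumerate(node_ips)
    ]
    return BackupServer(endpoint_id=endpoint_id, runners=runners)
```

A single-node endpoint yields one runner, while a three-node HA cluster yields three runners under the same backup server.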

Backup Runner Machines

Backup Runner Machines are machines created on the backup manager endpoint that maintain persistent connections to target endpoints. Each target endpoint node has a corresponding backup runner machine.

Architecture mapping:

  • Single endpoint: 1 backup runner machine
  • HA cluster with N nodes: N backup runner machines (one per cluster node)

Backup runner machines serve as:

  • A connection between the backup manager and the target endpoint(s)
  • Workorder owners for all restore operations on target endpoint(s)

Each backup runner maintains the following key information:

  • Target endpoint node IP address and identifier
  • HA state data for cluster reconstruction
  • Connection status to corresponding restore agents

Restore Agents

Restore Agents (drp-restore-agent) run on target endpoint nodes and provide the execution environment for restore operations. These agents:

  • Maintain persistent connections to their corresponding backup runners
  • Execute restore operations locally on the target endpoint
  • Remain active even when the DRP endpoint is down, enabling disaster recovery
  • Handle backup archive download, extraction, and service coordination

Backup and Restore Process

For detailed instructions on how to set up, back up, and restore a DRP endpoint, please refer to the Operator Instructions.

State Management

The backup and recovery system maintains state tracking to ensure reliable operations:

Backup Server States

Backup server state is dictated by the backup/backup-server-state param:

  • initializing: Backup server setup in progress
  • agents-deployed: Restore agents successfully deployed to all target nodes
  • manual-intervention: Manual restore agent installation required
  • ready: System ready for backup and restore operations

Backup Operation States

Backup operation state is dictated by the backup/backup-state param:

  • in-progress: Backup operation currently executing
  • completed: Backup operation finished successfully
  • failed: Backup operation encountered errors

Restore Operation States

Restore operation state is dictated by the backup/restore-state param:

  • initiated: Restore operation started, backup file selected
  • file-staged: Backup file available on fileserver for download
  • restoring-nodes: Download and extraction in progress on target nodes
  • restoring-nodes-complete: All nodes have completed data restoration
  • rebuilding-cluster: HA cluster reconstruction in progress (HA only)
  • rebuilding-cluster-complete: HA cluster successfully rebuilt (HA only)
  • restarting-services: DRP services restarting on all nodes
  • completed: Restore operation finished successfully
  • failed: Restore operation encountered errors
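The restore states above can be read as a progression in which the two cluster-rebuild states apply only to HA endpoints. The sketch below assumes the states advance in the order listed and that `failed` can be entered from any step; both are assumptions, and the function names are hypothetical.

```python
from typing import List

# State names are taken from the list above; the ordering is an assumption.
RESTORE_STATES = [
    "initiated",
    "file-staged",
    "restoring-nodes",
    "restoring-nodes-complete",
    "rebuilding-cluster",
    "rebuilding-cluster-complete",
    "restarting-services",
    "completed",
]
HA_ONLY_STATES = {"rebuilding-cluster", "rebuilding-cluster-complete"}


def expected_restore_path(is_ha: bool) -> List[str]:
    """Return the success path, skipping HA-only states for single nodes."""
    return [s for s in RESTORE_STATES if is_ha or s not in HA_ONLY_STATES]


def next_restore_state(current: str, is_ha: bool) -> str:
    """Return the state that follows `current` on the success path."""
    path = expected_restore_path(is_ha)
    if current not in path:
        raise ValueError(f"state {current!r} is not on the restore path")
    idx = path.index(current)
    if idx == len(path) - 1:
        raise ValueError(f"state {current!r} is terminal")
    return path[idx + 1]
```

Under these assumptions, a single-node restore moves from `restoring-nodes-complete` straight to `restarting-services`, while an HA restore passes through both cluster-rebuild states first.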

Storage Architecture

The backup and recovery system stores state and backups in a reliable, structured way.

Archive Organization

Backup archives are organized in a hierarchical structure on the backup manager endpoint:

/var/lib/drp-backups/
├── endpoint-id-1/
│   ├── archives/
│   │   ├── backup-endpoint-1-2025-07-17-134816.tgz
│   │   ├── backup-endpoint-1-2025-07-16-134816.tgz
│   │   └── ...
│   └── backups/
│       └── latest-uncompressed/
└── endpoint-id-2/
    ├── archives/
    └── backups/

Backup location is dictated by the backup/destination-dir-base param.
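As an illustration of the layout above, the following sketch builds an archive path from a base directory, an endpoint ID, and a timestamp. The file name pattern (`backup-<name>-<YYYY-MM-DD-HHMMSS>.tgz`) is inferred from the example tree rather than documented, and the `archive_path` helper is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(
    base_dir: str,
    endpoint_id: str,
    endpoint_name: str,
    taken_at: datetime,
) -> PurePosixPath:
    """Build <base>/<endpoint-id>/archives/backup-<name>-<timestamp>.tgz.

    The timestamp format is an assumption inferred from the example
    file names; it is not a documented DRP contract.
    """
    stamp = taken_at.strftime("%Y-%m-%d-%H%M%S")
    file_name = f"backup-{endpoint_name}-{stamp}.tgz"
    return PurePosixPath(base_dir) / endpoint_id / "archives" / file_name
```

Here `base_dir` stands in for the value of the backup/destination-dir-base param, for example `/var/lib/drp-backups`.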

Archive Scheduling

The system supports flexible archive creation based on configurable schedules:

  • Always: Create archive for every backup operation
  • Daily: Create archive once per day at specified hour
  • Weekly: Create archive once per week on specified day
  • Monthly: Create archive once per month on specified day

Archive scheduling is dictated by the backup/archive-schedule param.
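The schedule options can be pictured as a decision made at the end of each backup run. The sketch below is one possible interpretation: the argument names (`hour`, `weekday`, `day_of_month`, `last_archive`) are hypothetical, the actual value format of backup/archive-schedule is not shown here, and "once per period" is approximated by refusing a second archive on the same calendar day.

```python
from datetime import datetime
from typing import Optional

KNOWN_SCHEDULES = {"always", "daily", "weekly", "monthly"}


def should_archive(
    schedule: str,
    now: datetime,
    last_archive: Optional[datetime] = None,
    *,
    hour: int = 0,          # used by "daily"
    weekday: int = 0,       # used by "weekly"; 0 = Monday, as in datetime.weekday()
    day_of_month: int = 1,  # used by "monthly"
) -> bool:
    """Decide whether this backup run should also produce an archive."""
    mode = schedule.lower()
    if mode not in KNOWN_SCHEDULES:
        raise ValueError(f"unknown archive schedule: {schedule!r}")
    if mode == "always":
        return True
    # Assumption: at most one scheduled archive per calendar day.
    if last_archive is not None and last_archive.date() == now.date():
        return False
    if mode == "daily":
        return now.hour == hour
    if mode == "weekly":
        return now.weekday() == weekday
    return now.day == day_of_month  # monthly
```

A real scheduler would also need to handle backup runs that miss the configured hour or day; this sketch simply skips the archive in that case.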

Retention Management

Backup retention is controlled through configurable policies:

  • Retention count: Maximum number of backups to maintain, dictated by backup/retention-count
  • Protected backups: Archives exempt from automatic deletion
    • Backups can be marked as "protected" in the UX (see tutorial)
  • Space monitoring: Automatic alerts when storage approaches capacity
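The interaction between the retention count and protected backups can be sketched as a pruning function. This is a minimal illustration that assumes protected archives do not count toward backup/retention-count; the `Archive` type and `archives_to_delete` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Archive:
    """Hypothetical archive record; `taken_at` is a sortable timestamp string."""
    name: str
    taken_at: str
    protected: bool = False


def archives_to_delete(archives: List[Archive], retention_count: int) -> List[Archive]:
    """Return the unprotected archives that fall outside the retention count.

    Assumption: protected archives are never deleted and are not counted
    against the retention count.
    """
    if retention_count < 0:
        raise ValueError("retention_count must be non-negative")
    unprotected = sorted(
        (a for a in archives if not a.protected),
        key=lambda a: a.taken_at,
        reverse=True,  # newest first
    )
    return unprotected[retention_count:]
```

With a retention count of 2 and four archives, one of them protected, only the oldest unprotected archive is selected for deletion.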

Backup Contents

Backup contents include the following data by default:

  • DRP object data
  • Jobs, work orders, and other logs

By setting backup/include-artifacts to true on a backup server, more data can be included:

  • Content packs
  • Plugin providers
  • Contents of the files directory
  • Contents of the isos directory

Including artifacts will make the initial backup process take longer and create larger backup files, but ensures a more complete backup that can be used for full system restoration.
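The effect of backup/include-artifacts on backup scope can be summarized in a few lines. The category labels below are copied from the lists above; the `backup_contents` helper is hypothetical and only models the selection, not the backup itself.

```python
from typing import List

DEFAULT_CONTENTS = [
    "DRP object data",
    "Jobs, work orders, and other logs",
]
ARTIFACT_CONTENTS = [
    "Content packs",
    "Plugin providers",
    "Contents of the files directory",
    "Contents of the isos directory",
]


def backup_contents(include_artifacts: bool = False) -> List[str]:
    """List what a backup covers for a given include-artifacts setting."""
    contents = list(DEFAULT_CONTENTS)
    if include_artifacts:
        contents.extend(ARTIFACT_CONTENTS)
    return contents
```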

Network Considerations

Reliable network connectivity is required for all backup and restore operations. The system assumes uninterrupted, bidirectional communication between the backup manager and all downstream endpoints.

Requirements:

  • Persistent connections must be maintained between backup runner machines and restore agents. This requires a stable connection between the backup manager and the downstream endpoints.
  • The backup manager must be able to transfer files, including backup archives and scripts, to each downstream endpoint.
  • HA cluster nodes must have direct connectivity to each other for cluster coordination.
  • Any instability or restriction in the network path may result in backup or restore failure.
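Before registering an endpoint, a basic TCP reachability check from the backup manager host can catch obvious network restrictions. The sketch below uses only the Python standard library; the `can_connect` helper is hypothetical, the host and port to test depend on your deployment, and a successful connection does not by itself prove that a persistent connection will stay up.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running the same check from each HA node toward its peers gives a rough view of the node-to-node connectivity that cluster coordination relies on.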

Operational Constraints and Caveats

Important

If an endpoint changes configuration (e.g., from a single-node setup to an HA cluster), you must manually remove and re-add the associated backup server and runners. The system does not automatically detect or adapt to structural changes in endpoints.

Important

If restore agent deployment fails due to permission issues (e.g., lack of root), manual script installation is required using the value of backup/restore-agent-script-generator.

Important

DRP must already be running on the endpoint being backed up. Changes in DRP versions are not recorded: when DRP is restarted during a restore operation, the existing dr-provision binary is used.

Important

The backup system only captures DRP-managed data. External system configurations (e.g., firewall rules, certificates in non-standard locations) must be backed up separately.

Important

Retention policies (backup/retention-count) will remove old backups unless they are explicitly marked as protected. Always mark critical backups to avoid unintended deletion.