Backup and Recovery Architecture¶

DRP provides a comprehensive backup and recovery system that automates backup and recovery operations across both single endpoints and High Availability (HA) clusters. Note that it backs up only DRP and its contents. The system assumes a running DRP endpoint from which backup commands can be run and on which backups can be stored, as well as a running DRP on each endpoint being backed up. A DRP and content version of at least v4.15 is required on all involved endpoints.

This section focuses on the architecture of the backup and recovery system. The operational instructions can be found at Backup and Recovery Tutorial.

The backup and recovery system provides three primary functions:

Automated backup scheduling and execution across multiple DRP endpoints
Restore operations with support for both single nodes and HA cluster reconstruction
Centralized backup management with retention policies, protection mechanisms, and validation

System Components¶

The backup and recovery architecture consists of several key components that work together to provide comprehensive data protection:

Backup Manager¶

The Backup Manager is a DRP endpoint that is configured to manage backup and recovery operations for other DRP endpoints. It serves as the central coordination point for all backup and restore activities.

Key responsibilities:

Maintains backup schedules and triggers automated backups
Stores backup archives and manages retention policies
Coordinates restore operations across target endpoints
Manages backup runner machines and restore agent deployments

Note

A single DRP endpoint can function as a backup manager while also serving other roles (endpoint manager, content server, etc.).

Backup Server¶

A Backup Server is a specialized Resource Broker, uniquely configured for a specific DRP endpoint being backed up. There is a one-to-one mapping between backup servers and target endpoints. For every backup server, one or more Backup Runner machines are associated to provide direct connectivity to the endpoint nodes. In single-node setups, a single backup runner is used. In HA cluster scenarios, one backup runner is created per node in the cluster. This structure ensures that each endpoint has exactly one backup server and n runners, where n corresponds to the number of nodes in that endpoint.

Each backup server maintains the following key information:

Endpoint IP and credentials to communicate with the restore agent running on each endpoint node
References to the runner machines on each node
List of backups taken for the backup server
Backup and restore state/readiness

Backup Runner Machines¶

Backup Runner Machines are machines created on the backup manager endpoint that maintain persistent connections to target endpoints. Each target endpoint node has a corresponding backup runner machine.

Architecture mapping:

Single endpoint: 1 backup runner machine
HA cluster with N nodes: N backup runner machines (one per cluster node)

Backup runner machines serve as:

A connection between the backup manager and the target endpoint(s)
Workorder owners for all restore operations on target endpoint(s)

Each backup runner maintains the following key information:

Target endpoint node IP address and identifier
HA state data for cluster reconstruction
Connection status to corresponding restore agents
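
The relationship between a backup server and its runners can be sketched as a small data model. This is an illustrative sketch only, not DRP's actual object schema: the class names, field names, and the `build_backup_server` helper are hypothetical, chosen to mirror the information listed above (one server per endpoint, one runner per node).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackupRunner:
    """One runner per target endpoint node (field names are hypothetical)."""
    node_id: str
    node_ip: str
    agent_connected: bool = False  # connection status to the restore agent


@dataclass
class BackupServer:
    """One backup server per target endpoint (field names are hypothetical)."""
    endpoint_id: str
    runners: List[BackupRunner] = field(default_factory=list)
    backups: List[str] = field(default_factory=list)
    state: str = "initializing"


def build_backup_server(endpoint_id: str, nodes: Dict[str, str]) -> BackupServer:
    """Create a server with exactly one runner per node.

    `nodes` maps a node identifier to its IP address.
    """
    if not nodes:
        raise ValueError("an endpoint must have at least one node")
    runners = [BackupRunner(node_id=n, node_ip=ip) for n, ip in nodes.items()]
    return BackupServer(endpoint_id=endpoint_id, runners=runners)


# A single-node endpoint gets one runner; a three-node HA cluster gets three.
single = build_backup_server("endpoint-1", {"node-a": "192.0.2.10"})
cluster = build_backup_server(
    "endpoint-2",
    {"node-a": "192.0.2.21", "node-b": "192.0.2.22", "node-c": "192.0.2.23"},
)
print(len(single.runners), len(cluster.runners))
```

Because the runner list is derived from the node list at creation time, a model like this does not adapt when an endpoint later changes shape, for example from a single node to an HA cluster.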

Restore Agents¶

Restore Agents (drp-restore-agent) run on target endpoint nodes and provide the execution environment for restore operations. These agents:

Maintain persistent connections to their corresponding backup runners
Execute restore operations locally on the target endpoint
Remain active even when the DRP endpoint is down, enabling disaster recovery
Handle backup archive download, extraction, and service coordination

Backup and Restore Process¶

For detailed instructions on how to set up, back up, and restore a DRP endpoint, refer to the Operator Instructions.

State Management¶

The backup and recovery system maintains state tracking to ensure reliable operations:

Backup Server States¶

Backup server state is dictated by the backup/backup-server-state param:

initializing: Backup server setup in progress
agents-deployed: Restore agents successfully deployed to all target nodes
manual-intervention: Manual restore agent installation required
ready: System ready for backup and restore operations

Backup Operation States¶

Backup operation state is dictated by the backup/backup-state param:

in-progress: Backup operation currently executing
completed: Backup operation finished successfully
failed: Backup operation encountered errors

Restore Operation States¶

Restore operation state is dictated by the backup/restore-state param:

initiated: Restore operation started, backup file selected
file-staged: Backup file available on fileserver for download
restoring-nodes: Download and extraction in progress on target nodes
restoring-nodes-complete: All nodes have completed data restoration
rebuilding-cluster: HA cluster reconstruction in progress (HA only)
rebuilding-cluster-complete: HA cluster successfully rebuilt (HA only)
restarting-services: DRP services restarting on all nodes
completed: Restore operation finished successfully
failed: Restore operation encountered errors
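
The restore states above can be read as a linear progression in which the two cluster-rebuild states apply only to HA endpoints. The sketch below assumes the states occur in the order they are listed and that `failed` can be reached from any step; neither assumption is confirmed here. The helper names `restore_path` and `next_state` are hypothetical.

```python
from typing import List, Optional

# Assumed happy-path order, taken from the order of the list above.
RESTORE_STATES = [
    "initiated",
    "file-staged",
    "restoring-nodes",
    "restoring-nodes-complete",
    "rebuilding-cluster",
    "rebuilding-cluster-complete",
    "restarting-services",
    "completed",
]
HA_ONLY = {"rebuilding-cluster", "rebuilding-cluster-complete"}
TERMINAL = {"completed", "failed"}


def restore_path(is_ha: bool) -> List[str]:
    """Return the expected state sequence, skipping HA-only states for single nodes."""
    return [s for s in RESTORE_STATES if is_ha or s not in HA_ONLY]


def next_state(current: str, is_ha: bool) -> Optional[str]:
    """Return the state expected after `current`, or None if `current` is terminal.

    Raises ValueError for an unknown state, or for an HA-only state
    when the endpoint is a single node.
    """
    if current in TERMINAL:
        return None
    path = restore_path(is_ha)
    return path[path.index(current) + 1]


print(next_state("restoring-nodes-complete", is_ha=False))
print(next_state("restoring-nodes-complete", is_ha=True))
```

For a single node the step after `restoring-nodes-complete` is `restarting-services`, while for an HA cluster it is `rebuilding-cluster`.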

Storage Architecture¶

The backup and recovery system stores state and backups in a reliable, structured way.

Archive Organization¶

Backup archives are organized in a hierarchical structure on the backup manager endpoint:

/var/lib/drp-backups/
├── endpoint-id-1/
│   ├── archives/
│   │   ├── backup-endpoint-1-2025-07-17-134816.tgz
│   │   ├── backup-endpoint-1-2025-07-16-134816.tgz
│   │   └── ...
│   └── backups/
│       └── latest-uncompressed/
└── endpoint-id-2/
    ├── archives/
    └── backups/

Backup location is dictated by the backup/destination-dir-base param.
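
As an illustration of the layout above, the following sketch builds an archive path from a base directory, an endpoint directory, and a timestamp. The file name pattern is inferred from the example listing only, and the function name and arguments are hypothetical. It also assumes that the value of backup/destination-dir-base corresponds to the root directory shown in the tree.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(base_dir: str, endpoint_id: str, endpoint_name: str,
                 taken_at: datetime) -> PurePosixPath:
    """Build <base>/<endpoint-id>/archives/backup-<name>-<timestamp>.tgz.

    The timestamp format (YYYY-MM-DD-HHMMSS) is inferred from the example
    file names and is an assumption, not a documented contract.
    """
    stamp = taken_at.strftime("%Y-%m-%d-%H%M%S")
    file_name = f"backup-{endpoint_name}-{stamp}.tgz"
    return PurePosixPath(base_dir) / endpoint_id / "archives" / file_name


path = archive_path(
    "/var/lib/drp-backups",
    "endpoint-id-1",
    "endpoint-1",
    datetime(2025, 7, 17, 13, 48, 16),
)
print(path)
# /var/lib/drp-backups/endpoint-id-1/archives/backup-endpoint-1-2025-07-17-134816.tgz
```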

Archive Scheduling¶

The system supports flexible archive creation based on configurable schedules:

Always: Create archive for every backup operation
Daily: Create archive once per day at specified hour
Weekly: Create archive once per week on specified day
Monthly: Create archive once per month on specified day

Archive scheduling is dictated by the backup/archive-schedule param.
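
The schedule options can be thought of as a decision made after each backup: should this backup also produce an archive? The sketch below is one possible reading, not the actual implementation. The exact value format of backup/archive-schedule is not described here, so the `hour`, `weekday`, and `day_of_month` arguments are hypothetical, as is the rule that at most one archive is created per calendar day for the non-"always" schedules.

```python
from datetime import datetime
from typing import Optional


def should_archive(schedule: str, now: datetime,
                   last_archive: Optional[datetime] = None,
                   hour: int = 0, weekday: int = 0,
                   day_of_month: int = 1) -> bool:
    """Decide whether the backup taken at `now` should be archived.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    schedule = schedule.lower()
    if schedule == "always":
        return True
    # Assumption: never create a second archive on the same calendar day.
    if last_archive is not None and last_archive.date() == now.date():
        return False
    if schedule == "daily":
        return now.hour >= hour
    if schedule == "weekly":
        return now.weekday() == weekday
    if schedule == "monthly":
        return now.day == day_of_month
    raise ValueError(f"unknown schedule: {schedule!r}")


now = datetime(2025, 7, 17, 13, 48, 16)
yesterday = datetime(2025, 7, 16, 2, 0, 0)
print(should_archive("daily", now, last_archive=yesterday, hour=2))
print(should_archive("daily", now, last_archive=now, hour=2))
```

The first call returns True because the configured hour has passed and no archive exists for the day; the second returns False because an archive was already created on the same day.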

Retention Management¶

Backup retention is controlled through configurable policies:

Retention count: Maximum number of backups to maintain, dictated by backup/retention-count
Protected backups: Archives exempt from automatic deletion. Backups can be marked as "protected" in the UX (see tutorial).
Space monitoring: Automatic alerts when storage approaches capacity
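
The interaction between the retention count and protected backups can be sketched as a pruning function. This is a minimal illustration under stated assumptions: the `Archive` type and `select_for_deletion` function are hypothetical, and the sketch assumes that protected archives do not count toward backup/retention-count, which is not confirmed here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Archive:
    name: str
    taken_at: datetime
    protected: bool = False


def select_for_deletion(archives: Iterable[Archive],
                        retention_count: int) -> List[Archive]:
    """Return the unprotected archives that exceed the retention count.

    The newest `retention_count` unprotected archives are kept.
    Protected archives are never returned.
    """
    if retention_count < 0:
        raise ValueError("retention_count must be non-negative")
    unprotected = sorted(
        (a for a in archives if not a.protected),
        key=lambda a: a.taken_at,
        reverse=True,  # newest first
    )
    return unprotected[retention_count:]


archives = [
    Archive("a1", datetime(2025, 7, 14), protected=True),
    Archive("a2", datetime(2025, 7, 15)),
    Archive("a3", datetime(2025, 7, 16)),
    Archive("a4", datetime(2025, 7, 17)),
]
to_delete = select_for_deletion(archives, retention_count=2)
print([a.name for a in to_delete])
```

With a retention count of 2, only `a2` is selected: `a3` and `a4` are the two newest unprotected archives, and `a1` is skipped because it is protected even though it is the oldest.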

Backup Contents¶

Backup contents include the following data by default:

DRP object data
Jobs, work orders, and other logs

By setting backup/include-artifacts to true on a backup server, more data can be included:

Content packs
Plugin providers
Contents of the files directory
Contents of the isos directory

Including artifacts will make the initial backup process take longer and create larger backup files, but ensures a more complete backup that can be used for full system restoration.

Network Considerations¶

Reliable network connectivity is required for all backup and restore operations. The system assumes uninterrupted, bidirectional communication between the backup manager and all downstream endpoints.

Requirements:

Persistent connections must be maintained between backup runner machines and restore agents. This depends on a stable connection between the backup manager and the downstream endpoints.
The backup manager must be able to transfer files, including backup archives and scripts, to each downstream endpoint.
HA cluster nodes must have direct connectivity to each other for cluster coordination.
Any instability or restriction in the network path may result in backup or restore failure.

Operational Constraints and Caveats¶

Important

If an endpoint changes configuration (e.g., from a single-node setup to an HA cluster), you must manually remove and re-add the associated backup server and runners. The system does not automatically detect or adapt to structural changes in endpoints.

Important

If restore agent deployment fails due to permission issues (e.g., lack of root), manual script installation is required using the value of backup/restore-agent-script-generator.

Important

DRP must already be running on the endpoint being backed up. Changes in DRP versions are not recorded. When DRP is restarted during a restore operation, the existing dr-provision binary is used to restart it.

Important

The backup system only captures DRP-managed data. External system configurations (e.g., firewall rules, certificates in non-standard locations) must be backed up separately.

Important

Retention policies (backup/retention-count) will remove old backups unless they are explicitly marked as protected. Always mark critical backups to avoid unintended deletion.