
Discovery

Purpose

This document describes the architectural patterns and mechanisms by which Digital Rebar Provision (DRP) discovers and manages infrastructure resources. It clarifies the distinction between physical systems and their representation as managed machine objects within DRP.

Scope

Discovery in DRP is initiated through two primary entry points: PXE boot and manual join-up. The architecture ensures that all discovered machines are consistently inventoried, classified, and integrated into the DRP system.

Terminology

  • System: Refers to the physical or virtual resource (server, VM, or container) being managed.
  • Machine: The DRP object that represents and tracks the discovered system within the endpoint database.
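The distinction can be pictured as two separate types. This is a minimal sketch for illustration only: the class names mirror the terms above, but the fields and the `machine_for` helper are assumptions and do not reflect the actual DRP data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class System:
    """The physical or virtual resource itself (fields are hypothetical)."""
    serial_number: str
    mac_addresses: Tuple[str, ...]


@dataclass
class Machine:
    """The object the endpoint stores to represent and track a System."""
    name: str
    hardware_addrs: List[str] = field(default_factory=list)
    bootenv: str = ""  # empty string means "not set"


def machine_for(system: System, name: str) -> Machine:
    """Build the tracking object from what a system reports about itself."""
    addrs = sorted(mac.lower() for mac in system.mac_addresses)
    return Machine(name=name, hardware_addrs=addrs)
```

The system exists whether or not DRP knows about it; the machine exists only once discovery or a user has created it.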

Adding Machines to DRP Use Cases

| Case | Description | Initiator | BootEnv | Typical Flow |
|------|-------------|-----------|---------|--------------|
| 1 | Machine does not exist in DRP | PXE boot from DRP | defaultBootEnv | Discovery is performed by sledgehammer (default) |
| 2 | Machine exists in DRP with bootenv | PXE boot from DRP | sledgehammer | Updates existing machine |
| 3 | Machine exists in DRP and is an install bootenv | PXE boot from DRP | sledgehammer | Agent install task run manually or via post-install scripts (esxi, kickstart/preseed, image-deploy), config injected, agent started |
| 4 | Machine is in DRP and boots locally | BIOS/UEFI | local | Agent already installed as system service, runner starts, no new content pulled |
| 5 | Machine is not in DRP and running | Join-up Script | local | User creates machines, installs agent, basic inventory is collected, agent waits for future control |
| 6 | Pre-configured machine in DRP runs join-up to start agent | Join-up Script | local | User creates machines, installs agent for future control, requires correct HardwareAddrs, may fail if control.sh is missing |
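The six cases reduce to a decision on three inputs: whether the machine already exists in DRP, what initiated contact, and the machine's bootenv. The sketch below shows that decision. The function name, the initiator labels, and the use of an "-install" name suffix to recognize an install bootenv are all assumptions made for illustration, not DRP behavior.

```python
def classify_case(known: bool, initiator: str, bootenv: str = "") -> int:
    """Return the use-case number (1-6) for one contact with the endpoint.

    initiator is one of "pxe", "local-boot", or "join-up" (hypothetical labels).
    """
    if initiator == "pxe":
        if not known:
            return 1  # new machine, discovered by sledgehammer
        # Assumption: install bootenvs are recognizable by an "-install" suffix.
        return 3 if bootenv.endswith("-install") else 2
    if initiator == "local-boot":
        if known:
            return 4  # agent already installed as a system service
        raise ValueError("local boot of an unknown machine is not a listed case")
    if initiator == "join-up":
        return 6 if known else 5
    raise ValueError(f"unknown initiator: {initiator!r}")
```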

Details

Case 1: Machine does not exist in DRP and PXE boots

  • Machine enters discovery (sledgehammer, no machine-specific info)
  • Sledgehammer pulls start-up.sh (similar to join-up.sh), then common-bootstrap.sh
  • Creates machine, runs universal-discover
  • Switches to control.sh; assumes sledgehammer as target environment
  • Timing is consistent with other cases

Case 2: Machine exists in DRP with sledgehammer bootenv and PXE boots

  • Uses sledgehammer files and discovery (same as Case 1)
  • Updates machine instead of creating
  • Switches to control.sh

Case 3: Machine is in DRP and is an "install" bootenv

  • Agent install task run manually by installation process
    • ESXi: SSH in, pull Python code/config, start agent
    • Kickstart/Preseed: Post-install script pulls agent/config, starts agent
    • Image-deploy/Eikon: Post-install scripts injected for first boot
  • Bootenvs handle agent/config acquisition and startup
  • Common-bootstrap usually not used (machine info known, config injected)

Why use Case 3 with join-up?

  • Installation image methods may not allow per-machine config injection, but do allow common config
  • OpenShift allows service injection, but not per-machine
  • A join-up service for a known machine solves this
  • Cases 1-3 could be unified for simpler code paths

Case 4: Machine is in DRP and boots locally

  • Agent already installed as system service with config
  • Runner starts, runs existing configuration
  • No new content pulled

Case 6: Pre-configured machine in DRP runs join-up to start agent

  • User creates machines in DRP, installs agent for future control
  • Machines set to local bootenv
  • Requires Machine object with correct HardwareAddrs
  • May fail if control.sh (or equivalents) are missing

PXE Boot

A machine is typically discovered by PXE booting into the sledgehammer boot environment. Tasks there first identify the machine, then run detailed inventory, classification, and validation, after which the machine is driven through application pipelines.

Process Overview

  1. The system is powered on and the BIOS or EFI is configured to boot from the network.
    1. The system and endpoint complete the DHCP DORA exchange (Discover, Offer, Request, Acknowledge).
    2. TFTP and/or HTTP services on the endpoint, together with the preferences that set the default workflow, default stage, default bootenv, and unknown bootenv, determine what is served to the system. When DRP is installed on the endpoint with --universal, the installer sets these preferences so that the system uses the discovery bootenv, which includes the sledgehammer boot files.
  2. If the machine is known to the endpoint, the machine's bootenv is evaluated.
    • If it is not set, the endpoint's default bootenv preference is used.
    • If the machine is not known to the endpoint, the endpoint's unknown bootenv preference is used to serve the correct files.
  3. The machine PXE chainloads from the endpoint or fetches files over HTTP, depending on its capabilities.
  4. Once booted to sledgehammer, an agent is brought up and the join-up.sh script is run.
  5. The join-up.sh script gathers information such as the known UUID, memory, cloud ID, serial number, MAC addresses, and other system information, then passes it to the endpoint.
  6. The endpoint compares this information with what it knows about hardware and MAC addresses and generates a score. It then reports whether the system is known; if it is, the endpoint returns the machine object, including the appropriate API token.
    • If the system is not known, the join-up.sh script attempts to create the machine.
  7. After the join-up.sh script completes, the agent is updated, configured and started on the system. The agent is also called the machine runner.
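Steps 2, 6, and 6's sub-step can be sketched as two small decisions: which bootenv to serve, and whether reported hardware matches an existing machine. The following is an illustrative sketch only. The preference keys `defaultBootEnv` and `unknownBootEnv`, the dictionary layout, and in particular the scoring weights and threshold are assumptions; the source does not specify how the endpoint computes its score. The sketch also folds machine creation into one helper, whereas in the process above the join-up.sh script requests it.

```python
from typing import List, Optional, Tuple


def select_bootenv(machine: Optional[dict], prefs: dict) -> str:
    """Pick the bootenv to serve, following step 2 of the process."""
    if machine is None:
        return prefs["unknownBootEnv"]  # machine not known to the endpoint
    # Known machine: use its bootenv, or the default preference if unset.
    return machine.get("BootEnv") or prefs["defaultBootEnv"]


def match_score(reported: dict, machine: dict) -> int:
    """Hypothetical score: 1 per shared MAC, 2 per matching UUID or serial."""
    reported_macs = {m.lower() for m in reported.get("macs", [])}
    known_macs = {m.lower() for m in machine.get("HardwareAddrs", [])}
    score = len(reported_macs & known_macs)
    for key in ("uuid", "serial"):
        if reported.get(key) and reported.get(key) == machine.get(key):
            score += 2
    return score


def find_or_create(reported: dict, machines: List[dict],
                   threshold: int = 1) -> Tuple[dict, bool]:
    """Return (machine, created). Create a machine when nothing scores enough."""
    best = max(machines, key=lambda m: match_score(reported, m), default=None)
    if best is not None and match_score(reported, best) >= threshold:
        return best, False
    new_machine = {
        "Name": reported.get("serial") or "unnamed",
        "HardwareAddrs": sorted(m.lower() for m in reported.get("macs", [])),
        "uuid": reported.get("uuid"),
        "serial": reported.get("serial"),
        "BootEnv": "",
    }
    machines.append(new_machine)
    return new_machine, True
```

Normalizing MAC addresses to lower case before comparing avoids false mismatches between tools that report them in different cases.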

Workflows Associated with join-up and discovery

Universal Start Workflow

...

Universal Discovery Workflow

  1. The endpoint updates workflow, stages, tasks, and templates to run through the universal-discover workflow as defined by the default workflow preference.
  2. The machine runner takes the files offered by the endpoint file server, stages them locally, and runs them in the order defined by the machine object's task list. For the universal-discover workflow, the following occurs at a high level:
    1. start-callback
      • Interact with external services using basic API calls before the discovery process begins.
    2. pre-flexiflow
      • Inject tasks prior to discovery.
    3. inventory
      • Discover detailed information about the system, including firmware, BIOS, RAID, network, and chassis information.
    4. classify
      • Use the rack plugin or custom query to help determine what the system is and how to drive the machine further towards end-state.
    5. post-flexiflow
      • Inject tasks after discovery.
    6. validation
      • Validate the system using custom scripts. Currently, this step does nothing.
    7. complete-callback
      • Interact with external services using basic API calls after the discovery process is complete.
    8. workflow-chain
      • Notify the endpoint to continue to the next defined workflow.
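The ordering above can be modeled as a list of named steps run strictly in sequence, each able to read what earlier steps recorded. This is a minimal sketch under stated assumptions: the step names come from the list above, but `run_steps`, the handler signature, and the shared context dictionary are hypothetical and are not how the machine runner is implemented.

```python
from typing import Callable, Dict, List

# Step names taken from the universal-discover outline above, in order.
UNIVERSAL_DISCOVER_STEPS = [
    "start-callback",
    "pre-flexiflow",
    "inventory",
    "classify",
    "post-flexiflow",
    "validation",
    "complete-callback",
    "workflow-chain",
]

Step = Callable[[dict], None]


def run_steps(steps: List[str], handlers: Dict[str, Step], context: dict) -> List[str]:
    """Run each step in list order; a step with no handler is a no-op."""
    completed = []
    for name in steps:
        handler = handlers.get(name)
        if handler is not None:
            handler(context)  # an exception here stops the run at this step
        completed.append(name)
    return completed
```

For example, an `inventory` handler could store `memory_gb` in the context and a `classify` handler could read it to choose a role; because `inventory` precedes `classify` in the list, the value is always available when classification runs.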

Read More

Architectural Documentation

Operational Documentation

Reference Documentation