Hardware Burn-in¶
This page provides a high-level description of burn-in.
What Is Hardware Burn-in?¶
Hardware burn-in is the process of subjecting newly provisioned servers to sustained, intensive workloads before they are placed into production service. The goal is to surface latent hardware defects — marginal memory modules, failing drives, overheating CPUs, or unreliable network adapters — while the machine is still under vendor warranty and before it carries production traffic. A server that passes burn-in has been demonstrated to be stable under load and thermally within spec, giving operators significantly higher confidence in its reliability.
Burn-in is particularly important in large-scale deployments where individual server acceptance testing is impractical. By automating the process through DRP, hundreds of machines can be burned in concurrently without manual intervention, and results are collected centrally for audit and disposition.
How DRP Orchestrates Burn-in¶
DRP drives burn-in as a workflow phase within the machine lifecycle, typically positioned after hardware configuration (RAID, BIOS, firmware) and before OS installation. The burn-in workflow uses the burnin content pack, which provides stages and tasks that run stress tools against CPU, memory, storage, and network subsystems.
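As a rough illustration, a DRP Workflow that positions burn-in between hardware configuration and OS installation might look like the sketch below. The workflow and stage names here are hypothetical; consult the burnin content pack for the actual stage list.

```json
{
  "Name": "burnin-then-install",
  "Stages": [
    "hardware-config",
    "burnin-cpu",
    "burnin-memory",
    "burnin-disk",
    "os-install"
  ]
}
```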
The burn-in workflow runs inside Sledgehammer (the DRP discovery OS), which means the machine boots from the network and executes entirely in RAM. No disk state is modified during burn-in, allowing the same system to proceed to OS installation immediately after passing. The tasks in the burnin content pack:
- Stress the CPU — run multi-threaded compute workloads at 100% utilization for a configurable duration to detect thermal throttling, core failures, and clock instability.
- Test memory — run memory test passes (typically memtest or a kernel-based stress tool) to identify bit errors, address line failures, and ECC correction rates.
- Exercise storage — run sequential and random read/write patterns against all detected drives to identify high error rates or failing sectors.
- Exercise the network — drive traffic through the detected network adapters to surface link errors, packet loss, and unreliable interfaces.
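The CPU stage's behavior can be approximated with a short Python sketch: one busy-loop worker per logical core, held at 100% utilization for a configurable duration. This is an illustration of the technique only, not the content pack's actual implementation (which typically shells out to dedicated stress tools).

```python
import multiprocessing as mp
import os
import time

def spin(duration_s: float) -> None:
    """Busy-loop doing integer math to hold one core at full utilization."""
    end = time.monotonic() + duration_s
    x = 0
    while time.monotonic() < end:
        x = (x * 31 + 7) % 1_000_003  # arbitrary compute to defeat idling

def burn_cpu(duration_s: float) -> None:
    """Run one spinner per logical core, mirroring a multi-threaded burn-in task."""
    procs = [mp.Process(target=spin, args=(duration_s,))
             for _ in range(os.cpu_count() or 1)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    burn_cpu(2.0)
```

A real burn-in task would also sample temperatures and clock speeds during the run to detect thermal throttling, which this sketch omits.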
Pass/Fail Criteria and Result Collection¶
Each burn-in task records its results as DRP parameters on the Machine object. Pass/fail thresholds — duration, error rate limits, minimum throughput — are configurable via parameters and can be set at the machine, profile, or global level. This allows different hardware classes to have different acceptance criteria (for example, storage-optimized nodes may have stricter disk requirements).
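The machine-over-profile-over-global precedence described above can be sketched as a simple ordered lookup. The parameter names used here (e.g. `burnin/disk-max-errors`) are illustrative, not the content pack's actual parameter names.

```python
def resolve_param(name, machine_params, profile_params, global_params, default=None):
    """Resolve a burn-in threshold with machine > profile > global precedence."""
    for scope in (machine_params, profile_params, global_params):
        if name in scope:
            return scope[name]
    return default

# A storage-optimized node tightens the global disk error limit:
machine = {"burnin/disk-max-errors": 0}
profile = {"burnin/cpu-duration": 3600}
global_ = {"burnin/disk-max-errors": 5, "burnin/cpu-duration": 1800}

assert resolve_param("burnin/disk-max-errors", machine, profile, global_) == 0
assert resolve_param("burnin/cpu-duration", machine, profile, global_) == 3600
```

This is why a stricter per-machine or per-profile setting wins without having to change the global default.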
When a machine fails a burn-in test, the task exits with a failure code that stops the workflow and marks the machine as failed and not runnable. The failure reason is recorded in the job log and as a parameter on the Machine object. Operators can then inspect the machine, replace hardware, and re-run the burn-in workflow without re-running the earlier hardware configuration stages — the parameter flags that track completion of each phase ensure idempotent re-execution.
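The pass/fail gate at the end of a task can be illustrated in outline: compare observed errors to the configured limit, record a reason (in DRP, via the job log and a Machine parameter), and exit nonzero on failure so the workflow stops. The function and parameter names below are hypothetical, not the content pack's API.

```python
import sys

def gate(error_count: int, max_errors: int, record) -> int:
    """Return a task exit code: 0 on pass, 1 on fail.

    `record` stands in for whatever mechanism persists the result
    (job log line, Machine parameter, etc.).
    """
    if error_count > max_errors:
        record(f"burn-in failed: {error_count} errors (limit {max_errors})")
        return 1
    record("burn-in passed")
    return 0

if __name__ == "__main__":
    sys.exit(gate(error_count=0, max_errors=5, record=print))
```

Because the recorded reason travels with the Machine object, an operator can diagnose the failure later without rerunning the test.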
Machines that pass all burn-in tests have their workflow advanced to the OS installation phase. The burn-in results remain attached to the Machine object as a historical record, accessible via the DRP API and portal.