Troubleshooting Runner does not connect
I have booted a machine (or used the join-up script) but the machine is not showing as runnable (aka Ready) or is not being created during discovery.
I've confirmed that the machine is booted/provisioned, but Digital Rebar is not connecting.
That generally means that the agent/runner is not starting.
There are a number of reasons why the DRP runner may not be able to start on a system. This article will help identify where to look for logs. Generally, the details in the logs are sufficient to identify why the runner failed to start or connect.
Depending on your machine registration approach, there are differences between discovery and join-up processes. While some suggestions here may not apply to your specific case, we've included several troubleshooting scenarios for completeness.
Note: We will be troubleshooting from the machine back towards the Digital Rebar server, so these instructions assume that you can log into the machine(s) in question.
Did Discovery Start control.sh?¶
For Sledgehammer and netboot installation, Digital Rebar injects values into the bootstrap process.
Run cat /proc/cmdline to ensure that the
provisioner.web values are being set.
Note: join-up systems will not have specialized settings.
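The check above can be wrapped in a small helper. This is a minimal sketch, not part of Digital Rebar: the function name check_provisioner_web and its optional file argument are hypothetical, and it only assumes that the values appear on the kernel command line as arguments beginning with provisioner.web.

```shell
# Report whether a kernel command line carries provisioner.web settings.
# The optional path argument is an illustrative convenience (an assumption);
# it defaults to the live /proc/cmdline.
check_provisioner_web() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split the command line into one argument per line, then keep matches.
    matches=$(tr ' ' '\n' < "$cmdline_file" | grep '^provisioner\.web' || true)
    if [ -n "$matches" ]; then
        echo "found: $matches"
        return 0
    fi
    echo "missing: no provisioner.web value in $cmdline_file"
    return 1
}
```

Calling check_provisioner_web with no argument inspects the running system. On join-up systems a "missing" result is expected.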
Can DRPCLI connect?¶
One of the first things that the runner will check is access to the
DRP server. Test that access from the machine by running
drpcli -E [https://endpoint IP] -P [password] machines whoami.
Success with whoami confirms connectivity to the DRP server and also shows how the server identifies the machine in question.
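To make the result of the whoami test easier to read, the call can be wrapped as follows. This is a sketch under stated assumptions: the function name check_drp_access is hypothetical, and it uses only the -E and -P flags and the machines whoami subcommand shown above.

```shell
# Run "machines whoami" against a DRP endpoint and report the outcome.
# Arguments: endpoint URL and password, passed to the -E and -P flags
# from the text. The function name is hypothetical.
check_drp_access() {
    endpoint="$1"
    password="$2"
    if output=$(drpcli -E "$endpoint" -P "$password" machines whoami 2>&1); then
        echo "connected: the server answered whoami"
        echo "$output"
        return 0
    fi
    echo "failed: whoami did not succeed against $endpoint"
    echo "$output"
    return 1
}
```

For example, check_drp_access [https://endpoint IP] [password] prints the server's reply on success, or the error text from drpcli on failure.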
Did discovery-common-bootstrap.sh get created?¶
One of the first things created during runner bootstrap is the
/tmp/discovery-common-bootstrap.sh file. Check to see if that file exists.
If it is missing, then the basic discovery process failed. You will need to review how your BootEnv was created. Additional details may be in the Digital Rebar server logs.
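A quick way to perform this check is sketched below. The function name check_bootstrap_file and its optional path argument are hypothetical; only the default path comes from the text above.

```shell
# Report whether the discovery bootstrap script was created.
# The optional path argument is an illustrative convenience (an assumption).
check_bootstrap_file() {
    bootstrap="${1:-/tmp/discovery-common-bootstrap.sh}"
    if [ -f "$bootstrap" ]; then
        echo "present: $bootstrap"
        return 0
    fi
    echo "missing: $bootstrap"
    return 1
}
```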
Is control.sh available?¶
During discovery or join-up of
known machines, machines will download the
control.sh bootstrap script from the DRP Server at
http://[drp ip]:8091/machines/[machine uuid]/control.sh. If this
file is not available at this URL then the process will fail.
This will happen if the machine is using
It can also happen if the DRP Unknown Workflow/Stage/BootEnv values are
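The availability of control.sh can be tested from the machine with curl. The sketch below assumes curl is installed. The function names control_sh_url and check_control_sh are hypothetical, and the URL layout is taken directly from the text above.

```shell
# Build the control.sh URL for a machine, using the layout from the text.
control_sh_url() {
    drp_ip="$1"
    machine_uuid="$2"
    echo "http://${drp_ip}:8091/machines/${machine_uuid}/control.sh"
}

# Report whether control.sh can be fetched for the given DRP IP and UUID.
check_control_sh() {
    url=$(control_sh_url "$1" "$2")
    # -f makes curl return non-zero on HTTP errors such as 404;
    # -sS hides the progress meter but still prints error messages.
    if curl -fsS -o /dev/null "$url"; then
        echo "available: $url"
        return 0
    fi
    echo "not available: $url"
    return 1
}
```

Usage would look like check_control_sh [drp ip] [machine uuid], with both values supplied by the operator.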
Did control.sh or join-up.sh start?¶
When attaching discovered or netbooted machines, the system uses
control.sh to connect.
When attaching to existing machines, such as cloud joins, the system
uses join-up.sh to connect.
In many cases, the machine's UUID is already present on the system as
/etc/rs-uuid. Check to see if this file exists and contains the
machine's UUID. You can also check the join-up.sh script by simply running it
on the system and watching the results.
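The UUID file can be inspected with a short helper like the one below. This is a sketch only: the function name show_machine_uuid and its optional path argument are hypothetical, and it checks that the file is non-empty without validating the UUID format.

```shell
# Print the stored machine UUID, or report that the file is missing or empty.
# The optional path argument is an illustrative convenience (an assumption).
show_machine_uuid() {
    uuid_file="${1:-/etc/rs-uuid}"
    # -s is true only when the file exists and has a size greater than zero.
    if [ ! -s "$uuid_file" ]; then
        echo "missing or empty: $uuid_file"
        return 1
    fi
    echo "uuid: $(cat "$uuid_file")"
}
```

The printed value can then be compared with the identity the server reports for the machine.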
Is the runner starting then failing?¶
If the runner is starting and then failing, then the runner logs will provide helpful information.
Either review the configuration at
/var/lib/drp-agent or try
journalctl -u sledgehammer.
Did Jobs start running?¶
In some cases, the runner starts but then fails. This may indicate a configuration or resource problem on the machine.
If the runner was able to start, then job logs will be created under
/tmp/runner-* on the system. You should also be able to find matching
entries in the Digital Rebar job logs.
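To see whether any job logs exist on the machine, the runner-* entries can be listed as sketched below. The function name list_runner_logs and its optional directory argument are hypothetical; only the /tmp/runner-* pattern comes from the text above.

```shell
# List runner-* entries, which indicate that the runner started jobs.
# The optional directory argument is an illustrative convenience (an assumption).
list_runner_logs() {
    base_dir="${1:-/tmp}"
    found=0
    for entry in "$base_dir"/runner-*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        [ -e "$entry" ] || continue
        echo "$entry"
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no runner-* entries under $base_dir"
        return 1
    fi
}
```

An empty result suggests the runner never reached the point of running jobs, which points back to the earlier checks in this article.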
What about DRP Server logs?¶
While runner connect issues are generally machine related, DRP Server logs (or lack of them) can be a helpful diagnostic tool.
Use journalctl -u dr-provision to observe the logs.
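When only the most recent entries matter, the journalctl call can be limited to the tail of the log. The sketch below assumes a systemd host. The function name show_unit_logs and the default of 50 lines are hypothetical choices, not Digital Rebar conventions.

```shell
# Show the most recent journal entries for a systemd unit.
# Arguments: unit name and an optional line count (default 50, an assumption).
show_unit_logs() {
    unit="$1"
    lines="${2:-50}"
    # -n limits the output to the last N entries; --no-pager prints directly.
    journalctl -u "$unit" -n "$lines" --no-pager
}
```

For example, show_unit_logs dr-provision on the DRP server, or show_unit_logs sledgehammer 100 on a machine.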
Additional resources and information related to this Knowledge Base article.
For more information on join-up.sh, see the
runner, agent, does not start, not runnable