Troubleshooting Runner does not connect
I have booted a machine (or used the join-up script), but the machine is not showing as runnable (aka Ready) or is not being created during discovery.
I've confirmed that the machine is booted/provisioned, but Digital Rebar is not connecting.
Solution¶
That generally means that the agent/runner is not starting.
There are a number of reasons why the DRP runner may fail to start on a system. This article will help you identify where to look for logs. Generally, the details in the logs are sufficient to identify why the runner failed to start or connect.
Depending on your machine registration approach, there are differences between discovery and join-up processes. While some suggestions here may not apply to your specific case, we've included several troubleshooting scenarios for completeness.
Note: We will be troubleshooting from the machine back towards the Digital Rebar server, so these instructions assume that you can log into the machine(s) in question.
Did Discovery Start control.sh?¶
For Sledgehammer and netboot installation, Digital Rebar injects values into the bootstrap process.
Use cat /proc/cmdline to ensure that the BOOTIF and provisioner.web values are being set.
Note: join-up systems will not have specialized settings.
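The check above can be scripted. The following is a minimal sketch, assuming both values appear on the kernel command line in key=value form; the function name check_cmdline is made up for illustration and is not part of Digital Rebar.

```shell
# Hypothetical helper: report whether the kernel command line carries the
# values that Digital Rebar injects during Sledgehammer/netboot discovery.
# Assumption: each value appears as "KEY=value", separated by spaces.
check_cmdline() {
    cmdline="$1"
    missing=0
    for key in BOOTIF provisioner.web; do
        case " $cmdline " in
            *" $key="*) echo "found: $key" ;;
            *)          echo "missing: $key"; missing=1 ;;
        esac
    done
    # Exit status 0 when both keys are present, 1 otherwise.
    return "$missing"
}
```

On the machine, it could be run as check_cmdline "$(cat /proc/cmdline)". A join-up system is expected to report both values as missing.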
Can DRPCLI connect?¶
One of the first things that the runner will check is access to the DRP server. Test that access by using
drpcli -E [https://endpoint IP] -P [password] machines whoami
Success with whoami confirms connectivity to the DRP server and also shows how the server identifies the machine in question.
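As a rough sketch, the whoami test can be wrapped so that a missing drpcli binary is reported separately from a failed connection. The function name test_drp_access is hypothetical, and the endpoint and password are placeholders that you supply.

```shell
# Hypothetical wrapper around the whoami check described above.
# Assumption: drpcli is installed somewhere on PATH.
test_drp_access() {
    endpoint="$1"
    password="$2"
    if ! command -v drpcli >/dev/null 2>&1; then
        echo "drpcli not found on PATH" >&2
        return 127
    fi
    # Same command as in the article, with the placeholders filled in.
    if drpcli -E "$endpoint" -P "$password" machines whoami; then
        echo "connection OK"
    else
        echo "could not reach $endpoint or identify this machine" >&2
        return 1
    fi
}
```

Call it with the same values you would pass to drpcli directly, for example test_drp_access "[https://endpoint IP]" "[password]".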
Did discovery-common-bootstrap.sh get created?¶
One of the first things created during runner bootstrap is the /tmp/discovery-common-bootstrap.sh file. Check to see if that file exists.
If it is missing, then the basic discovery process failed. You will need to review how your BootEnv was created. Additional details may be in the Digital Rebar server logs.
Is control.sh available?¶
During discovery or join-up of known machines, each machine downloads the control.sh bootstrap script from the DRP Server at the following URL:
http://[drp ip]:8091/machines/[machine uuid]/control.sh
If this file is not available at this URL, then the process will fail.
This will happen if the machine is using the local or ignore bootenvs.
It can also happen if the DRP Unknown Workflow/Stage/BootEnv values are not set.
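To test the download by hand, something like the following sketch could be used. It assumes curl is installed on the machine; the function names control_sh_url and check_control_sh are made up for illustration, and the URL layout is taken from the description above.

```shell
# Hypothetical helpers: build the control.sh URL and probe it with curl.
control_sh_url() {
    drp_ip="$1"
    machine_uuid="$2"
    printf 'http://%s:8091/machines/%s/control.sh\n' "$drp_ip" "$machine_uuid"
}

check_control_sh() {
    url=$(control_sh_url "$1" "$2")
    # -f makes curl exit non-zero on HTTP errors such as 404;
    # -sS hides the progress meter but still prints error messages.
    if curl -fsS -o /dev/null "$url"; then
        echo "available: $url"
    else
        echo "not available: $url" >&2
        return 1
    fi
}
```

Run check_control_sh "[drp ip]" "[machine uuid]" from the machine that is failing to connect, so the request follows the same network path the bootstrap would use.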
Did control.sh or join-up.sh start?¶
When attaching discovered or netbooted machines, the system uses control.sh to connect.
When attaching to existing machines, such as cloud joins, the system uses join-up.sh to connect.
In many cases, the machine's UUID is already present on the system as /etc/rs-uuid. Check to see if this file exists and contains the correct UUID.
Test the /tmp/control.sh or join-up.sh script by simply running it on the system and watching the results.
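The /etc/rs-uuid check can be sketched as below. This only verifies that the file exists and holds something shaped like a UUID; whether it is the correct UUID still has to be compared by eye against what the DRP server reports for the machine. The function name check_rs_uuid is hypothetical.

```shell
# Hypothetical helper: confirm the UUID file exists and is well formed.
# Assumption: the file holds a single UUID in the usual 8-4-4-4-12 hex layout.
check_rs_uuid() {
    file="${1:-/etc/rs-uuid}"
    [ -r "$file" ] || { echo "missing: $file"; return 1; }
    # Strip whitespace, including any trailing newline.
    uuid=$(tr -d '[:space:]' < "$file")
    if printf '%s\n' "$uuid" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
        echo "uuid: $uuid"
    else
        echo "malformed: $file"
        return 1
    fi
}
```

With no argument, check_rs_uuid reads /etc/rs-uuid; a different path can be passed for testing.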
Is the runner starting then failing?¶
If the runner is starting and then failing, then the runner logs will provide helpful information.
Either review the configuration at /var/lib/drp-agent or try
journalctl -u sledgehammer
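When the journal is long, it can help to narrow it to lines that look like failures. The sketch below is one way to do that; the search pattern is a guess at common failure wording, not a documented runner log format, and the function name filter_failures is made up for illustration.

```shell
# Hypothetical filter: keep only lines from standard input that look like failures.
# Assumption: failures mention one of these words; adjust the pattern as needed.
filter_failures() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'error|fail|refused|timed out|timeout' || true
}
```

For example, journalctl -u sledgehammer --no-pager -n 200 | filter_failures shows only the suspicious lines from the last 200 entries.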
Did Jobs start running?¶
In some cases, the runner starts but then fails. This may indicate a configuration or resource problem on the machine.
If the runner was able to start, then job logs will be created under /tmp/runner-* on the system. You should also be able to find matching entries in the Digital Rebar job logs.
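A quick way to see whether any jobs ran is to list the runner entries, as in this sketch. The function name list_runner_logs is hypothetical, and the base directory defaults to /tmp as described above.

```shell
# Hypothetical helper: list runner-* entries under a base directory.
list_runner_logs() {
    base="${1:-/tmp}"
    found=1
    for path in "$base"/runner-*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$path" ] || continue
        found=0
        ls -ld "$path"
    done
    [ "$found" -eq 0 ] || echo "no runner-* entries under $base"
    # Exit status 0 when at least one entry exists, 1 otherwise.
    return "$found"
}
```

If list_runner_logs reports no entries, the runner most likely never reached the point of starting jobs.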
What about DRP Server logs?¶
While runner connect issues are generally machine related, DRP Server logs (or the lack of them) can be a helpful diagnostic tool.
Use journalctl -u dr-provision to observe the logs.
Additional Information¶
Additional resources and information related to this Knowledge Base article.
See Also¶
For more information on join-up.sh, see the discover-joinup workflow.
Versions¶
All
Keywords¶
runner, agent, does not start, not runnable