Skip to content

Troubleshooting Runner does not connect

I have booted a machine (or used the join-up script) but the machine is not showing as runnable (aka Ready) or not being created during discover.

I've confirmed that the machine is booted/provisioned, but Digital Rebar does not connecting.

Solution

That generally means that the agent/runner is not starting.

There are number of reasons why the DRP runner is not able to start on a system. This article will help identify where to look for logs. Generally, details in the logs are sufficient to identify the failure to start or connect reason.

Depending on your machine registration approach, there are differences between discovery and join-up processes. While some suggestions here may not apply to your specific case, we've included several troubleshooting scenarios for completeness.

Note: We will be troubelshooting from the machine back towards Digital Rebar server, so these instructions assume that you can log into the machine(s) in question.

Did Discovery Start control.sh?

For Sledgehammer and netboot installation, Digital Rebar injects values into the bootstrap process.

Use cat /proc/cmdline to ensure that the BOOTIF and provisioner.web values are being set.

Note: join-up systems will not have specialized settings.

Can DRPCLI connect?

One of the first things that the runner will check is access to the system. Test access to the DRP server by using drpcli -E [https://endpoint IP] -P [password] machines whoami.

Success with whoami confirms connectivity to the DRP server and also shows how the server identifies the machine in question.grep

Did discovery-common-bootstrap.sh get created?

One of the first things creating during runner bootstrap is the /tmp/discovery-common-bootstrap.sh file. Check to see if that file exists.

If it is missing, then the basic discovery process failed. You will need to review how your BootEnv was created. Additional details may be in the Digital Rebar server logs.

Is control.sh available?

During discovery or join-up of known machines machines will download the control.sh bootstrap script from the DRP Server from http://[drp ip]:8091/machines/[machine uuid]/control.sh. If this file is not avialable at this URL then the process will fail.

This will happen if the machine is using local or ignore bootenvs. It can also happen if the DRP Unknown Workflow/Stage/BootEnv values are not set.

Did control.sh or join-up.sh start?

When attaching discovered or netbooted machines, the system uses control.sh to connect.

When attaching to existing machines, such as cloud joins, the system uses join-up.sh to connect.

In many cases, the machine\'s UUID is already presentnon the system as /etc/rs-uuid. Check to see if this file exists and contains the correct UUID.

Test the /tmp/control.sh or join-up.sh script by simply running it on the system and watching the results.

Is the runner starting then failing?

If the runner is starting and then failing, then the runner logs will provide helpful information.

Either review configuration at /var/lib/drp-agent or try journalctl -u sledgehammer

Did Jobs start running?

In some cases, the runner starts but then fails. This may indiciate a configuration or resource problem on the machine.

If the runner was able to start, then jobs lobs will be created under /tmp/runner-* on the system. You also should be able to find matching logs on the Digital Rebar job log entries.

What about DRP Server logs?

While runner connect issues are generally machine related. DRP Server logs (or lack of them) can be a helpful diagnostic tool.

Use journalctl -u dr-provision to observe the logs.

Additional Information

Additional resources and information related to this Knowledge Base article.

See Also

For more information on join-up.sh, see the discover-joinup workflow

Versions

All

Keywords

runner, agent, does not start, not runnable

Revision Information

KB Article     :  kb-00055
initial release:  Wed 16 Dec 2020 11:15:57 AM CST
updated release:  Wed 16 Dec 2020 11:15:57 AM CST