Troubleshooting Guide

Common Issues

Installation Failures

Check API server health:

Bash

# Test API server health
curl -k https://api.demo.k8s.local:6443/healthz

# Verify API server version
curl -k https://api.demo.k8s.local:6443/version
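
During an installation the API server may not answer yet, so a single `curl` can be misleading. The following is a minimal sketch of a retry loop around the same `/healthz` check. The function name `wait_for_healthz` and the default attempt count and delay are illustrative assumptions, not part of any OpenShift tooling.

```bash
# Poll a /healthz URL until it returns "ok" or the attempts run out.
# Usage: wait_for_healthz <url> [attempts] [delay-seconds]
# Example: wait_for_healthz https://api.demo.k8s.local:6443/healthz 30 10
wait_for_healthz() {
  local url=$1 attempts=${2:-30} delay=${3:-10} i=1
  while [ "$i" -le "$attempts" ]; do
    # A healthy API server answers /healthz with the literal body "ok".
    if [ "$(curl -ks --max-time 5 "$url")" = "ok" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

The function returns non-zero when the endpoint never becomes healthy, so it can gate later steps in a script.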

Check node and machine status:

Bash

oc get nodes
oc get machines
oc describe node <node-name>

Review events:

Bash

oc get events --sort-by='.metadata.creationTimestamp'

Examine operator status:

Bash

oc get clusteroperators
oc describe co <operator-name>
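
When the operator list is long, it can help to narrow it down to operators that are unavailable or degraded. The sketch below assumes tab-separated input of the form `name<TAB>Available<TAB>Degraded`, produced by the `jsonpath` query shown in the comment. The function name `unhealthy_operators` is hypothetical.

```bash
# Print the names of operators that are not Available or are Degraded.
# Expects tab-separated lines on stdin: <name> <Available status> <Degraded status>
#
# Usage:
#   oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}' \
#     | unhealthy_operators
unhealthy_operators() {
  awk -F '\t' '$2 != "True" || $3 == "True" { print $1 }'
}
```

Each name it prints is a candidate for `oc describe co <operator-name>`.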

Check machine configuration:

Bash

oc get pods -n openshift-machine-config-operator
oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-server

Check installation logs:

Bash

openshift-install gather bootstrap --dir /root/cluster

Network Issues

Verify DNS resolution:

Bash

# Check API server resolution
dig api.demo.k8s.local +short

# Check internal API server resolution
dig api-int.demo.k8s.local +short

# Check application wildcard DNS (any name under *.apps should resolve; "test" is arbitrary)
dig test.apps.demo.k8s.local +short
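
The three lookups can be combined into one pass that reports which names fail to resolve. This is a minimal sketch: the function name `check_dns` is hypothetical, and it treats an empty `dig +short` answer as a failure.

```bash
# Report which hostnames resolve; returns non-zero if any name fails.
# Usage: check_dns api.demo.k8s.local api-int.demo.k8s.local test.apps.demo.k8s.local
check_dns() {
  local name answer failed=0
  for name in "$@"; do
    # `dig +short` prints nothing when the name has no answer.
    answer=$(dig +short "$name" | head -n 1)
    if [ -n "$answer" ]; then
      echo "OK   $name -> $answer"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
```

To test the wildcard record, pass any concrete hostname under `apps.demo.k8s.local`. A literal `*` would be expanded by the shell or sent as a literal label.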

Check pod networking:

Bash

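# The openshift-sdn namespace exists only on clusters using OpenShift SDN;
# on OVN-Kubernetes clusters, check the openshift-ovn-kubernetes namespace instead
oc get pods -n openshift-sdn
oc logs -n openshift-sdn -l app=sdn

# Show the cluster network configuration, including the network type in use
oc get network.config.openshift.io cluster -o yaml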

Review service endpoints:

```bash
oc get endpoints -A
oc get svc -A
```

Test network connectivity:

Bash

oc debug node/<node-name> -- chroot /host ip addr show
oc debug node/<node-name> -- chroot /host ping -c 3 <target-ip>

Resource Constraints

Check resource usage:

Bash

oc adm top nodes
oc adm top pods --containers=true --all-namespaces

Review quota usage:

Bash

oc get resourcequota -A
oc describe quota -n <namespace>

Monitor storage:

Bash

oc get pv,pvc --all-namespaces
oc get volumeattachment
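
On a busy cluster the claim list is easier to read when reduced to claims that are not bound. The sketch below assumes the default column order of `oc get pvc --all-namespaces --no-headers` (NAMESPACE, NAME, STATUS, ...). The function name `unbound_pvcs` is hypothetical.

```bash
# Print "<namespace>/<name> <status>" for every claim that is not Bound.
# Usage: oc get pvc --all-namespaces --no-headers | unbound_pvcs
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Empty output means every claim is bound. Anything listed is worth a closer look with `oc describe pvc`.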

Certificate Issues

Check certificate status:

Bash

oc get csr
oc get secret -n openshift-config
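
Certificate signing requests stuck in `Pending` are a common reason nodes fail to join. The sketch below pulls the pending request names out of the CSR table, assuming the condition is the last column of `oc get csr --no-headers`. The function name `pending_csrs` is hypothetical.

```bash
# Print the names of CSRs whose condition (last column) is Pending.
# Usage: oc get csr --no-headers | pending_csrs
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

After checking that a pending request comes from an expected node, approve it with `oc adm certificate approve <csr-name>`.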

Review certificate expiration:

Bash

oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
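
The annotation is a timestamp, which is easier to act on as a number of days remaining. This is a minimal sketch that assumes GNU `date` (for the `-d` option) and an RFC 3339 timestamp such as `2031-05-13T12:35:32Z`. The function name `days_until` is hypothetical.

```bash
# Print the whole days from now until the given timestamp (negative if already past).
# Requires GNU date.
# Usage:
#   days_until "$(oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer \
#     -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')"
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=$(date -u +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

A small or negative result suggests the certificate needs attention soon.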

Verify API server certificates:

```bash
oc get apiserver cluster -o yaml
```

Authentication and Authorization

Check identity provider configuration:

Bash

oc get oauth cluster -o yaml
oc get identity

Review role bindings:

Bash

oc get clusterrolebinding
oc get rolebinding --all-namespaces

Registry Issues

Check registry status:

Bash

oc get pods -n openshift-image-registry
oc get configs.imageregistry.operator.openshift.io cluster -o yaml

Review storage configuration:

Bash

oc get pvc -n openshift-image-registry
oc describe pvc -n openshift-image-registry

Collecting Diagnostics

Gather must-gather data:

Bash

# General cluster data
oc adm must-gather

# Specific component data (this example uses the Red Hat Advanced Cluster Management image)
oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.8

Review cluster logs:

Bash

# Control plane logs
oc logs -n openshift-controller-manager deployment/controller-manager

# Node logs
oc adm node-logs <node-name> -u kubelet

# Specific pod logs
oc logs -n <namespace> <pod-name> --previous

Export cluster state:

Bash

# Common workload resources in all namespaces (the "all" alias does not cover every resource type)
oc get all -A -o yaml > cluster-state.yaml

# Specific component state
oc get nodes -o yaml > nodes-state.yaml
oc get co -o yaml > operators-state.yaml

Check etcd health:

Bash

oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint health
oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint status -w table

Monitor API server metrics:

Bash

oc get --raw /metrics | grep apiserver_request_duration_seconds
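
The raw histogram output is large, so a per-verb average is often a more useful first look. The sketch below divides the `_sum` series by the `_count` series for each `verb` label. The function name `avg_latency_by_verb` is hypothetical, and it assumes the standard Prometheus text format with the sample value as the last field on each line.

```bash
# Print "<verb> <average seconds>" from apiserver_request_duration_seconds
# _sum and _count samples read on stdin, sorted by verb.
# Usage: oc get --raw /metrics | avg_latency_by_verb
avg_latency_by_verb() {
  awk '
    /^apiserver_request_duration_seconds_(sum|count)[{]/ {
      # Extract the value of the verb="..." label; skip lines without one.
      if (match($0, /verb="[^"]*"/) == 0) next
      verb = substr($0, RSTART + 6, RLENGTH - 7)
      if ($0 ~ /_sum[{]/) sum[verb] += $NF
      else count[verb] += $NF
    }
    END {
      for (v in count)
        if (count[v] > 0) printf "%s %.3f\n", v, sum[v] / count[v]
    }
  ' | sort
}
```

The averages cover the whole lifetime of the API server process, so they point to consistently slow verbs, not short spikes.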

DRP-Specific Troubleshooting

Check DRP machine status:

```bash
drpcli machines show <machine-uuid>
```

Examine task execution:

Bash

# DRP records each task run as a job; inspect the job and its log
drpcli jobs show <job-uuid>
drpcli jobs log <job-uuid>

API Health Verification

Test API server health:

Bash

# Test API server health directly
curl -k https://api.demo.k8s.local:6443/healthz

# Get API server version
curl -k https://api.demo.k8s.local:6443/version

Best Practices

- Maintain cluster documentation, including:
  - Network configuration
  - Storage layout
  - Authentication setup
  - Custom configurations
- Implement systematic log collection and retention
- Create and maintain runbooks for common issues
- Document configuration changes and their rationale
- Establish clear escalation paths for different types of issues