Skip to content

Troubleshooting Guide

Common Issues

Installation Failures

  1. Check API server health:

    Bash
    # Test API server health
    curl -k https://api.demo.k8s.local:6443/healthz
    
    # Verify API server version
    curl -k https://api.demo.k8s.local:6443/version
    

  2. Check node and machine status:

    Bash
    oc get nodes
    oc get machines
    oc describe node <node-name>
    

  3. Review events:

    Bash
    oc get events --sort-by='.metadata.creationTimestamp'
    

  4. Examine operator status:

    Bash
    oc get clusteroperators
    oc describe co <operator-name>
    

  5. Check machine configuration:

    Bash
    oc get pods -n openshift-machine-config-operator
    oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-server
    

  6. Check installation logs:

    Bash
    openshift-install gather bootstrap --dir /root/cluster
    

Network Issues

  1. Verify DNS resolution:

    Bash
    # Check API server resolution
    dig api.demo.k8s.local +short
    
    # Check internal API server resolution
    dig api-int.demo.k8s.local +short
    
    # Check application wildcard DNS
    dig *.apps.demo.k8s.local +short
    

  2. Check pod networking:

    Bash
    oc get pods -n openshift-sdn
    oc logs -n openshift-sdn -l app=sdn
    oc get network.config.openshift.io cluster -o yaml
    

  3. Review service endpoints:

    Bash
    oc get endpoints -A
    oc get svc -A
    

  4. Test network connectivity:

    Bash
    oc debug node/<node-name> -- chroot /host ip addr show
    oc debug node/<node-name> -- chroot /host ping <target-ip>
    

Resource Constraints

  1. Check resource usage:

    Bash
    oc adm top nodes
    oc adm top pods --containers=true --all-namespaces
    

  2. Review quota usage:

    Bash
    oc get resourcequota -A
    oc describe quota -n <namespace>
    

  3. Monitor storage:

    Bash
    oc get pv,pvc --all-namespaces
    oc get volumeattachment
    

Certificate Issues

  1. Check certificate status:

    Bash
    oc get csr
    oc get secret -n openshift-config
    

  2. Review certificate expiration:

    Bash
    oc get secret -n openshift-kube-apiserver-operator kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
    

  3. Verify API server certificates:

    Bash
    oc get apiserver cluster -o yaml
    

Authentication and Authorization

  1. Check identity provider configuration:

    Bash
    oc get oauth cluster -o yaml
    oc get identity
    

  2. Review role bindings:

    Bash
    oc get clusterrolebinding
    oc get rolebinding --all-namespaces
    

Registry Issues

  1. Check registry status:

    Bash
    oc get pods -n openshift-image-registry
    oc get configs.imageregistry.operator.openshift.io cluster -o yaml
    

  2. Review storage configuration:

    Bash
    oc get pvc -n openshift-image-registry
    oc describe pvc -n openshift-image-registry
    

Collecting Diagnostics

  1. Gather must-gather data:

    Bash
    # General cluster data
    oc adm must-gather
    
    # Specific component data
    oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.8
    

  2. Review cluster logs:

    Bash
    # Control plane logs
    oc logs -n openshift-controller-manager deployment/controller-manager
    
    # Node logs
    oc adm node-logs <node-name> -u kubelet
    
    # Specific pod logs
    oc logs -n <namespace> <pod-name> --previous
    

  3. Export cluster state:

    Bash
    # Full cluster state
    oc get all -A -o yaml > cluster-state.yaml
    
    # Specific component state
    oc get nodes -o yaml > nodes-state.yaml
    oc get co -o yaml > operators-state.yaml
    

  4. Check etcd health:

    Bash
    oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint health
    oc rsh -n openshift-etcd etcd-<control-plane-node> etcdctl endpoint status -w table
    

  5. Monitor API server metrics:

    Bash
    oc get --raw /metrics | grep apiserver_request_duration_seconds
    

DRP-Specific Troubleshooting

  1. Check DRP machine status:

    Bash
    drpcli machines show <machine-uuid>
    

  2. Examine task execution:

    Bash
    drpcli tasks status <task-uuid>
    drpcli tasks logs <task-uuid>
    

API Health Verification

Test API server health:

Bash
# Test API server health directly
curl -k https://api.demo.k8s.local:6443/healthz

# Get API server version
curl -k https://api.demo.k8s.local:6443/version

Best Practices

  1. Maintain cluster documentation including:
  2. Network configuration
  3. Storage layout
  4. Authentication setup
  5. Custom configurations

  6. Implement systematic log collection and retention

  7. Create and maintain runbooks for common issues

  8. Document configuration changes and their rationale

  9. Establish clear escalation paths for different types of issues