CKA Certified Kubernetes Administrator Storage and Troubleshooting

Managing persistent data and diagnosing cluster failures are not just exam topics; they are daily realities for Kubernetes administrators. Your ability to provision storage correctly and troubleshoot issues methodically directly affects application availability and is rigorously tested on the CKA exam. This guide breaks down the core concepts and systematic approaches you need to master.

Understanding Persistent Storage Fundamentals

In Kubernetes, a PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using a StorageClass. Think of a PV as a physical disk resource available to the cluster. A PersistentVolumeClaim (PVC) is a request for storage by a user. Just as a pod consumes node resources, a PVC consumes PV resources. The lifecycle runs through Provisioning, Binding, Using, and Reclaiming; the PV's reclaim policy (Retain or Delete) determines whether the volume is kept or deleted once its claim is released.
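
A minimal sketch of static provisioning, assuming a single-node practice cluster: one PV and one PVC that can bind to it. The names demo-pv and demo-pvc, the class name manual, the 1Gi size, and the hostPath /mnt/data are all hypothetical values chosen for illustration.

```shell
# Write a hypothetical PV and a matching PVC to one manifest file.
cat > pv-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# On a cluster you would then run (not executed here):
#   kubectl apply -f pv-pvc.yaml
#   kubectl get pv,pvc
```

The claim binds because its class name, access mode, and requested size are all satisfied by the volume. Here the class name only pairs the two objects; no StorageClass object or provisioner is involved.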

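A critical configuration is the volume access mode, which defines how a volume can be mounted. The four modes are:

ReadWriteOnce (RWO): The volume can be mounted as read-write by a single node. Multiple pods on that node can still share it.
ReadOnlyMany (ROX): The volume can be mounted read-only by many nodes.
ReadWriteMany (RWX): The volume can be mounted as read-write by many nodes.
ReadWriteOncePod (RWOP): The volume can be mounted as read-write by a single pod in the entire cluster. Supported only for CSI volumes.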

Your choice dictates which workloads can use the storage, and the storage backend must actually support the mode you request. For example, a single-pod database typically uses RWO, while content that a farm of web servers on different nodes must both read and write needs RWX.

A StorageClass provides a way for administrators to describe the "classes" of storage they offer. Different classes might map to quality-of-service levels, backup policies, or arbitrary policies determined by the cluster administrators. Each StorageClass contains a provisioner field, which determines what volume plugin is used for provisioning PVs (e.g., legacy in-tree names such as kubernetes.io/aws-ebs and kubernetes.io/gce-pd, or a CSI driver name such as csi.example.com). When a PVC references a StorageClass, dynamic provisioning creates the backing PV automatically, either immediately or, if the class sets volumeBindingMode: WaitForFirstConsumer, once a pod that uses the claim is scheduled.
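
As a rough illustration of dynamic provisioning, the manifest below pairs a StorageClass with a PVC that references it. The class name fast, the claim name fast-claim, and the 5Gi size are hypothetical, and csi.example.com is the placeholder provisioner from the text, not a real driver.

```shell
# Write a hypothetical StorageClass and a PVC that requests it.
cat > sc-pvc.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: csi.example.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-claim
spec:
  storageClassName: fast
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF

# On a cluster with a working provisioner you would then run (not executed here):
#   kubectl apply -f sc-pvc.yaml
#   kubectl get sc,pvc
```

With WaitForFirstConsumer the claim is expected to stay Pending until a pod that mounts it is scheduled; that state on its own is not a fault.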

Troubleshooting Pod Scheduling and Application Failures

When a pod is stuck in Pending, the cause is almost always scheduling. Your first command should be kubectl describe pod <pod-name>. Examine the Events section at the bottom for clear messages like "Insufficient cpu" or "Insufficient memory." A node selector or affinity/anti-affinity rule that is too restrictive, an untolerated taint, or an unbound PersistentVolumeClaim is also reported here.
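
When the describe output is long, the same events can be pulled on their own. The helper below is a small sketch; the function name pending_reasons is hypothetical, and it assumes kubectl is already pointed at the right cluster.

```shell
# Usage: pending_reasons <pod-name> [namespace]
pending_reasons() {
  local pod="$1" ns="${2:-default}"
  # Events attached to this pod, oldest first.
  # FailedScheduling entries carry the scheduler's explanation.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

For example, pending_reasons web-0 prod lists only the events recorded for the pod web-0 in the prod namespace (both names are placeholders).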

For running pods with issues (e.g., CrashLoopBackOff, Error), a structured approach is key:

Inspect the Pod: Use kubectl describe pod for an overview of state, events, and configuration.
Examine Logs: Use kubectl logs <pod-name>. For multi-container pods, specify the container with -c <container-name>. Use -f to follow logs in real-time and -p to get logs from a previously crashed instance of the container.
Execute into the Pod: If logs are insufficient, gain shell access for deeper inspection with kubectl exec -it <pod-name> -- /bin/sh. This allows you to check internal files, processes, and network connectivity.
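
The three steps above can be sketched as one helper. The function name triage_pod and the 50-line tail are arbitrary choices for illustration, not part of any standard tooling.

```shell
# Usage: triage_pod <pod-name> [namespace]
triage_pod() {
  local pod="$1" ns="${2:-default}"
  # 1. State, configuration, and recent events.
  kubectl describe pod "$pod" -n "$ns"
  # 2. Output of the current container, then of the previous
  #    (crashed) instance; the second call fails if there is none.
  kubectl logs "$pod" -n "$ns" --tail=50
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
  # 3. Interactive shell, only if the image ships one:
  #    kubectl exec -it "$pod" -n "$ns" -- /bin/sh
}
```

The exec step is left as a comment because it is interactive and because minimal images may not contain /bin/sh at all.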

A common exam scenario involves troubleshooting a misconfigured kubelet. If a node is reporting NotReady and pods cannot be scheduled onto it, SSH into the node and check the kubelet service status: systemctl status kubelet. Examine its logs with journalctl -u kubelet. Common causes include incorrect TLS certificates and a misconfigured or stopped container runtime. After fixing the cause, restart the service with systemctl restart kubelet.
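
A short sketch of that node-side check, meant to be run on the affected node with root privileges. The function name check_kubelet and the 50-line limit are illustrative assumptions.

```shell
# Run on the NotReady node, typically as root or with sudo.
check_kubelet() {
  # Is the service active, and what did it last report?
  systemctl status kubelet --no-pager
  # Last 50 kubelet log lines; certificate and runtime errors
  # usually show up here.
  journalctl -u kubelet --no-pager -n 50
  # After fixing the cause (not run automatically):
  #   systemctl daemon-reload && systemctl restart kubelet
}
```

The restart is deliberately left as a comment so the helper only reads state and never changes it.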

Diagnosing Networking and Control Plane Issues

Networking problems often manifest as an inability of pods to communicate with each other (cluster-internal networking) or with external services. Start by verifying the CoreDNS (or kube-dns) service is running: kubectl get pods -n kube-system -l k8s-app=kube-dns. If the CoreDNS pods are failing, check their logs. Next, verify basic connectivity by executing into a pod and using commands like nslookup kubernetes.default (to test service discovery) and curl to other pod IPs or service names.
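
The DNS checks described above might be strung together as follows. This is a sketch under assumptions: the Service name kube-dns matches kubeadm defaults, the throwaway pod name dns-test is hypothetical, and the busybox:1.28 image must be pullable from the cluster.

```shell
check_dns() {
  # CoreDNS pods keep the legacy k8s-app=kube-dns label.
  kubectl get pods -n kube-system -l k8s-app=kube-dns
  # The Service in front of them and the endpoints behind it.
  kubectl get svc,endpoints kube-dns -n kube-system
  # Resolve the API server Service from a temporary pod that is
  # deleted again when the command exits.
  kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never \
    -- nslookup kubernetes.default
}
```

If the lookup fails while the CoreDNS pods are Running, an empty endpoints list in the second command is a useful next clue.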

For control plane problems, your troubleshooting depends on the component. The CKA exam expects you to know where key component logs are located:

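kube-apiserver: Typically runs as a static pod. Check its logs with kubectl logs -n kube-system <kube-apiserver-pod-name>. If the API server itself is down, kubectl cannot reach it, so read the container logs on the control plane node instead with crictl ps -a and crictl logs <container-id>, or under /var/log/pods/.
kube-scheduler & kube-controller-manager: Also often run as static pods. Use kubectl logs similarly.
etcd: On kubeadm clusters etcd also runs as a static pod, so kubectl logs or crictl logs applies. Where etcd is installed as a systemd service on the host, check its logs with journalctl -u etcd.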

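If a control plane component is down, you may need to restart it. A component running as a systemd service is restarted with systemctl restart <service>. For a static pod, fix its manifest in /etc/kubernetes/manifests/ on the control plane node; the kubelet watches that directory and recreates the pod whenever the file changes. If the kubelet itself is stopped, static pods will not come back until it is running again.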
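
When a static pod needs a restart without any manifest change, one common approach is to move the manifest out of the watched directory and back again. The helper below is a sketch of that idea; the function name bounce_static_pod and the 20-second default wait are assumptions, and the directory default is the kubeadm path mentioned above.

```shell
# Usage: bounce_static_pod <manifest-file> [manifest-dir] [wait-seconds]
bounce_static_pod() {
  local name="$1" dir="${2:-/etc/kubernetes/manifests}" wait="${3:-20}"
  local stash
  stash="$(mktemp -d)"
  # Removing the manifest from the watched directory makes the
  # kubelet stop the pod.
  mv "$dir/$name" "$stash/$name"
  sleep "$wait"
  # Putting it back makes the kubelet create the pod again.
  mv "$stash/$name" "$dir/$name"
  rmdir "$stash"
}
```

On a control plane node this would be called as root, for example bounce_static_pod kube-scheduler.yaml. Keeping the stashed copy in a separate temporary directory avoids the kubelet picking up a stray backup file from the manifest directory.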

Node and Resource Investigation Techniques

When a node fails, you must investigate from the node itself. The primary tool is journalctl, the systemd journal utility. To see all logs from the kubelet (the primary node agent), use journalctl -u kubelet. To filter for recent errors, you can use journalctl -u kubelet --since "5 minutes ago" | grep -i error. For issues related to the container runtime (such as containerd or Docker), inspect its logs in the same way, for example journalctl -u containerd.

Resource monitoring is crucial for diagnosing performance-related scheduling failures. On the node, use classic Linux tools:

top or htop to view overall CPU/Memory usage.
df -h to check disk space on critical mounts like /var/lib/kubelet.
ss or netstat to investigate network connections and port conflicts.

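From the cluster level, kubectl describe node <node-name> provides a comprehensive summary of the node's conditions, capacity, allocatable resources, and a breakdown of the resource requests and limits of all pods on the node. Look for MemoryPressure, DiskPressure, or PIDPressure conditions reporting True and for nodes whose allocated requests are close to allocatable. These figures are requests and limits, not live usage; for actual consumption use kubectl top nodes and kubectl top pods, which require metrics-server.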
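
To compare node conditions across the whole cluster at a glance, a JSONPath query can stand in for reading each describe output. This is a sketch; the function name node_conditions is hypothetical.

```shell
node_conditions() {
  # One line per node: name, Ready, MemoryPressure, DiskPressure.
  # Each condition prints True, False, or Unknown.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}
```

A healthy node is expected to show True for Ready and False for both pressure conditions.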

Common Pitfalls

Confusing Volume Access Modes: ReadWriteOnce restricts a volume to one node, not to one pod. A Deployment with multiple replicas sharing a single RWO PVC works only while every replica lands on the node that has the volume mounted; replicas scheduled onto other nodes cannot attach it and stay in ContainerCreating, often with a Multi-Attach error in their events. For a multi-pod deployment, you typically need a storage backend that supports ReadWriteMany or a separate volume for each pod (using a StatefulSet).
Ignoring the StorageClass Setting: For dynamic provisioning, the PVC must either name a class in spec.storageClassName or omit the field entirely so that the default StorageClass is applied. Setting storageClassName: "" does the opposite: it disables dynamic provisioning and binds only to a pre-provisioned PV that has no class. A PVC that gets no StorageClass and matches no existing PV will remain Pending indefinitely.
Incomplete Troubleshooting: Jumping to conclusions before gathering all evidence. Always follow the sequence: describe -> logs -> exec. Check the output of kubectl get events --all-namespaces for cluster-wide warnings. Overlooking the kubectl describe output is a frequent exam mistake.
Misdiagnosing Networking Issues: Assuming a connectivity problem is a network policy issue before checking CoreDNS and basic pod-to-pod IP connectivity. Always verify DNS resolution and service endpoints (kubectl get endpoints <service-name>) before diving into NetworkPolicy rules.

Summary

PersistentVolumes (PVs) are cluster storage resources, while PersistentVolumeClaims (PVCs) are user requests for that storage; a claim binds to a volume whose capacity, access modes, and StorageClass satisfy the request.
Dynamic provisioning with StorageClasses automates PV creation and is a core administrative pattern.
A systematic troubleshooting methodology is non-negotiable: start with kubectl describe for state and events, then use kubectl logs for application output, and finally kubectl exec for deeper inspection.
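Control plane components (kube-apiserver, kube-scheduler, kube-controller-manager, and on kubeadm clusters etcd) often run as static pods; their logs are accessible via kubectl logs, via crictl logs on the control plane node when the API server is down, or via journalctl for components installed as systemd services.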
Node-level troubleshooting requires SSH access and the use of journalctl -u kubelet and OS tools (top, df, ss) to diagnose resource and service failures.
Always verify the fundamental requirements first during exam troubleshooting: pod resource requests, node status, volume access modes, and service DNS resolution.
