
[cka] Troubleshooting - Worker Node Failure

by Geunny 2024. 8. 25.

1. Fix the broken cluster

 

controlplane ~ ➜  k get nodes
NAME           STATUS     ROLES           AGE   VERSION
controlplane   Ready      control-plane   11m   v1.30.0
node01         NotReady   <none>          11m   v1.30.0

controlplane ~ ➜  k get pods -A
NAMESPACE      NAME                                   READY   STATUS    RESTARTS   AGE
kube-flannel   kube-flannel-ds-c8pzs                  1/1     Running   0          11m
kube-flannel   kube-flannel-ds-g7bmr                  1/1     Running   0          11m
kube-system    coredns-768b85b76f-b8z7m               1/1     Running   0          11m
kube-system    coredns-768b85b76f-pm728               1/1     Running   0          11m
kube-system    etcd-controlplane                      1/1     Running   0          11m
kube-system    kube-apiserver-controlplane            1/1     Running   0          11m
kube-system    kube-controller-manager-controlplane   1/1     Running   0          11m
kube-system    kube-proxy-wmrb5                       1/1     Running   0          11m
kube-system    kube-proxy-x76qh                       1/1     Running   0          11m
kube-system    kube-scheduler-controlplane            1/1     Running   0          11m

controlplane ~ ➜  k describe nodes node01
Name:               node01
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node01
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"a2:4c:87:6a:82:c9"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.22.85.9
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 25 Aug 2024 13:12:29 +0000
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  node01
  AcquireTime:     <unset>
  RenewTime:       Sun, 25 Aug 2024 13:22:21 +0000
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------    -----------------                 ------------------                ------              -------
  NetworkUnavailable   False     Sun, 25 Aug 2024 13:12:35 +0000   Sun, 25 Aug 2024 13:12:35 +0000   FlannelIsUp         Flannel is running on this node
  MemoryPressure       Unknown   Sun, 25 Aug 2024 13:18:05 +0000   Sun, 25 Aug 2024 13:23:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Sun, 25 Aug 2024 13:18:05 +0000   Sun, 25 Aug 2024 13:23:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Sun, 25 Aug 2024 13:18:05 +0000   Sun, 25 Aug 2024 13:23:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Sun, 25 Aug 2024 13:18:05 +0000   Sun, 25 Aug 2024 13:23:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  192.22.85.9
  Hostname:    node01
Capacity:
  cpu:                36
  ephemeral-storage:  1016057248Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             214587052Ki
  pods:               110
Allocatable:
  cpu:                36
  ephemeral-storage:  936398358207
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             214484652Ki
  pods:               110
System Info:
  Machine ID:                 69ee5c89434f4d5baea262a6ecc698fe
  System UUID:                ccf22a91-925e-0514-bae9-ba19f8cc85c8
  Boot ID:                    8a9382a1-cb7b-462d-a4a2-a8e3e1d79f13
  Kernel Version:             5.4.0-1106-gcp
  OS Image:                   Ubuntu 22.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.26
  Kubelet Version:            v1.30.0
  Kube-Proxy Version:         v1.30.0
PodCIDR:                      10.244.1.0/24
PodCIDRs:                     10.244.1.0/24
Non-terminated Pods:          (2 in total)
  Namespace                   Name                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                     ------------  ----------  ---------------  -------------  ---
  kube-flannel                kube-flannel-ds-c8pzs    100m (0%)     0 (0%)      50Mi (0%)        0 (0%)         11m
  kube-system                 kube-proxy-x76qh         0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (0%)  0 (0%)
  memory             50Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type     Reason                   Age                From             Message
  ----     ------                   ----               ----             -------
  Normal   Starting                 11m                kube-proxy       
  Normal   Starting                 11m                kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      11m                kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  11m (x2 over 11m)  kubelet          Node node01 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    11m (x2 over 11m)  kubelet          Node node01 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     11m (x2 over 11m)  kubelet          Node node01 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  11m                kubelet          Updated Node Allocatable limit across pods
  Normal   NodeReady                11m                kubelet          Node node01 status is now: NodeReady
  Normal   RegisteredNode           11m                node-controller  Node node01 event: Registered Node node01 in Controller
  Normal   NodeNotReady             45s                node-controller  Node node01 status is now: NodeNotReady

 

The kube-system pods look fine, but node01's kubelet appears to be the problem. SSH into node01.

 

ssh node01

node01 ~ ✖ ps -ef | grep kubelet


node01 ~ ✖ systemctl start kubelet

controlplane ~ ➜  k get nodes
NAME           STATUS   ROLES           AGE   VERSION
controlplane   Ready    control-plane   16m   v1.30.0
node01         Ready    <none>          16m   v1.30.0

Checking with ps -ef shows no kubelet process at all, so the service is not running. Start it, and the node returns to Ready.
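The check-and-start sequence above can be sketched as a small script. The systemctl calls assume a systemd-managed kubelet, as on this lab node, so they are shown as comments here:

```shell
# Detect a missing kubelet process; pgrep -x matches the exact process name.
if ! pgrep -x kubelet > /dev/null; then
    echo "kubelet not running"
    # On the node itself you would then bring the service up:
    # systemctl start kubelet      # start it now
    # systemctl enable kubelet     # keep it enabled across reboots
fi
```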

 

2. The cluster is broken again. Investigate and fix the issue.

 

SSH into node01 and try starting the kubelet the same way as before, but this time it fails to start.

Check the kubelet's state with journalctl -u kubelet.

Aug 25 13:37:29 node01 kubelet[13319]: E0825 13:37:29.883623   13319 run.go:74] "command failed" err="failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"
Aug 25 13:37:40 node01 kubelet[13385]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Aug 25 13:37:40 node01 kubelet[13385]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Aug 25 13:37:40 node01 kubelet[13385]: I0825 13:37:40.129231   13385 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Aug 25 13:37:40 node01 kubelet[13385]: E0825 13:37:40.130733   13385 run.go:74] "command failed" err="failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"

 

The log shows the kubelet failing to load its client CA file: /etc/kubernetes/pki/WRONG-CA-FILE.crt does not exist. This is a kubelet configuration error on the node, not a kube-apiserver problem. Fix the path in the node's kubelet config and restart the kubelet.

node01 /var/lib/kubelet ➜  ls
checkpoints        kubeadm-flags.env     plugins_registry
config.yaml        memory_manager_state  pod-resources
cpu_manager_state  pki                   pods
device-plugins     plugins

node01 /var/lib/kubelet ➜  vi config.yaml 

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt  # was WRONG-CA-FILE.crt; ca.crt is the kubeadm default
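For reference, the same edit can be done non-interactively with sed. This is a sketch against a throwaway copy of the file; on node01 the real path is /var/lib/kubelet/config.yaml, and /etc/kubernetes/pki/ca.crt is assumed as the kubeadm default CA location:

```shell
# Work on a temporary copy so the sketch is self-contained.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  x509:
    clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt
EOF

# Point clientCAFile at the kubeadm default CA certificate.
sed -i 's|clientCAFile: .*|clientCAFile: /etc/kubernetes/pki/ca.crt|' "$cfg"
grep clientCAFile "$cfg"
# On the node, follow the edit with: systemctl restart kubelet
```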

 

 

 

3. The cluster is broken again. Investigate and fix the issue.

 

controlplane ~ ➜  k get nodes
NAME           STATUS     ROLES           AGE   VERSION
controlplane   Ready      control-plane   32m   v1.30.0
node01         NotReady   <none>          31m   v1.30.0

controlplane ~ ➜ ssh node01
Last login: Sun Aug 25 13:30:44 2024 from 192.22.85.6

node01 ~ ➜  journalctl -u kubelet


Aug 25 13:11:50 node01 kubelet[1915]: E0825 13:11:50.215820    1915 run.go:74] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"

 

 

The journalctl entry above appears to be from an earlier startup; the actual cause turns up in the kubelet's kubeconfig, /etc/kubernetes/kubelet.conf, which points at the API server on port 6553 instead of 6443. Fix it.

 

node01 ~ ➜  vi /etc/kubernetes/kubelet.conf

---

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://controlplane:6553 ## -> 6443
    
---

# restart the kubelet after the fix

node01 ~ ➜  systemctl restart kubelet
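This fix can also be scripted. A sketch against a throwaway copy of kubelet.conf (6443 is the default kubeadm API server port):

```shell
# Work on a temporary copy so the sketch is self-contained.
conf=$(mktemp)
cat > "$conf" <<'EOF'
apiVersion: v1
clusters:
- cluster:
    server: https://controlplane:6553
EOF

# Rewrite the API server endpoint to the default port 6443.
sed -i 's|https://controlplane:6553|https://controlplane:6443|' "$conf"
grep 'server:' "$conf"
# On the node, follow the edit with: systemctl restart kubelet
```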

 

 
