
This post looks at how to troubleshoot failures that occur while starting up or operating Kubernetes.

Kubernetes has become an essential element for every customer moving toward the cloud, and you need to be able to design Kubernetes environments whether they run on a CSP or on-premises.

However, most customers and operators want to install Kubernetes and learn by trial and error before fully understanding it, and this turns out to be harder than expected.

So in this post we will walk through a few steps showing how to locate and diagnose a problem when something goes wrong in Kubernetes.

# This example assumes a single failure scenario and explains how to work through it.

The Incident

Problem: after starting Kubernetes, calico-kube-controllers and coredns remain stuck in the Pending state.

1. First, run kubectl get pods -o wide --all-namespaces to check the status of all pods currently deployed by Kubernetes.

[kubectl get pods -o wide --all-namespaces]


[root@kubemaster success_log]# kubectl get pods -o wide --all-namespaces 
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES 
kube-system   calico-kube-controllers-6c954486fd-tmlnl   0/1     Pending   0          25m                               
kube-system   calico-node-ksmgr                          1/1     Running   1          26m   172.21.20.51   kubeworker               
kube-system   calico-node-rr5gm                          1/1     Running   1          26m   172.21.20.50   kubemaster               
kube-system   coredns-7ff7988d-rd2l5                     0/1     Pending   0          25m                               
kube-system   dns-autoscaler-748467c887-bhqxv            0/1     Pending   0          25m                               
kube-system   kube-apiserver-kubemaster                  1/1     Running   0          27m   172.21.20.50   kubemaster               
kube-system   kube-controller-manager-kubemaster         1/1     Running   0          27m   172.21.20.50   kubemaster               
kube-system   kube-proxy-85csh                           1/1     Running   0          26m   172.21.20.51   kubeworker               
kube-system   kube-proxy-9p9z7                           1/1     Running   0          27m   172.21.20.50   kubemaster               
kube-system   kube-scheduler-kubemaster                  1/1     Running   0          27m   172.21.20.50   kubemaster               
kube-system   kubernetes-dashboard-76b866f6cf-xfnm7      0/1     Pending   0          25m                               
kube-system   nginx-proxy-kubeworker                     1/1     Running   0          26m   172.21.20.51   kubeworker               
kube-system   nodelocaldns-kr7q5                         1/1     Running   0          25m   172.21.20.50   kubemaster               
kube-system   nodelocaldns-plh7m                         1/1     Running   0          25m   172.21.20.51  kubeworker

[root@kubemaster success_log]# 


Troubleshooting

Kubernetes does not finish starting up within a reasonable time, and several pods remain in the Pending state.
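
If the namespace contains many pods, it can help to list only the ones stuck in Pending before digging into a specific pod. This is just a convenience on top of step 1, using kubectl's standard field selector:

kubectl get pods -n kube-system --field-selector=status.phase=Pending -o wide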

2. To examine a Pending pod in detail, run kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system.

[kubectl describe pod [pod_name]]


[root@kubemaster success_log]# kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system 
Name:                 calico-kube-controllers-6c954486fd-tmlnl 
Namespace:            kube-system 
Priority:             2000000000 
Priority Class Name:  system-cluster-critical 
Node:                  
Labels:               k8s-app=calico-kube-controllers 
                      kubernetes.io/cluster-service=true 
                      pod-template-hash=6c954486fd 
Annotations:           
Status:               Pending 
IP: 
IPs:                   
Controlled By:        ReplicaSet/calico-kube-controllers-6c954486fd 
Containers: 
  calico-kube-controllers: 
    Image:      172.31.85.33:13000/calico/kube-controllers:v3.7.3 
    Port:        
    Host Port:   
    Limits: 
      cpu:     100m 
      memory:  256M 
    Requests: 
      cpu:      30m 
      memory:   64M 
    Readiness:  exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3 
    Environment: 
      ETCD_ENDPOINTS:     https://172.21.20.50:2379 
      ETCD_CA_CERT_FILE:  /etc/calico/certs/ca_cert.crt 
      ETCD_CERT_FILE:     /etc/calico/certs/cert.crt 
      ETCD_KEY_FILE:      /etc/calico/certs/key.pem 
    Mounts: 
      /etc/calico/certs from etcd-certs (ro) 
      /var/run/secrets/kubernetes.io/serviceaccount from calico-kube-controllers-token-n4wwv (ro) 
Conditions: 
  Type           Status 
  PodScheduled   False 
Volumes: 
  etcd-certs: 
    Type:          HostPath (bare host directory volume) 
    Path:          /etc/calico/certs 
    HostPathType: 
  calico-kube-controllers-token-n4wwv: 
    Type:        Secret (a volume populated by a Secret) 
    SecretName:  calico-kube-controllers-token-n4wwv 
    Optional:    false 
QoS Class:       Burstable 
Node-Selectors:  beta.kubernetes.io/os=linux 
Tolerations:     CriticalAddonsOnly 
                 node-role.kubernetes.io/master:NoSchedule 
                 node.kubernetes.io/not-ready:NoExecute for 300s 
                 node.kubernetes.io/unreachable:NoExecute for 300s 
Events: 
  Type     Reason            Age        From               Message 
  ----     ------            ----       ----               ------- 
  Warning  FailedScheduling    default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
  Warning  FailedScheduling    default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

[root@kubemaster success_log]# kubectl get replicaset --all-namespaces 
NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE 
kube-system   calico-kube-controllers-6c954486fd   1         1         0       27m 
kube-system   coredns-7ff7988d                     1         1         0       27m 
kube-system   dns-autoscaler-748467c887            1         1         0       27m 
kube-system   kubernetes-dashboard-76b866f6cf      1         1         0       27m 
[root@kubemaster success_log]# 


The Events section of the describe output shows the message "0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.", which tells us why the pod stays in the Pending state.

As the message indicates, Kubernetes uses taints and tolerations to control where pods are placed.

We will cover taints and tolerations in detail later. For now, we can infer that the pod needs a toleration matching the taints on the nodes, and because it does not have one, it cannot be scheduled onto those nodes.
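
As a quick preview of what we will confirm in step 5, the taints currently set on each node can be listed directly from the node objects. This is only a convenience query; the same information appears in the kubectl describe node output later in this post.

# Show the taints of every node, one line each
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Or grep them out of the describe output for a single node
kubectl describe node kubemaster | grep -A 1 Taints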

3. Once this has been confirmed, the next step is usually to look at the pod's own logs with kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system.

[kubectl logs [pod_name] -n [namespace_name]]


[root@kubemaster success_log]# kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system 

[root@kubemaster success_log]# 


However, kubectl logs only captures output after a pod has been scheduled and started, so in the current state, where the pod cannot be scheduled at all, no logs have been recorded.

With that in mind, let's continue troubleshooting.
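
Because the container never started, the scheduler's complaints live in the event stream rather than in the pod logs. If you want only the events that belong to this particular pod rather than the whole namespace (the next step queries the whole namespace), a field selector on the event's involvedObject works:

kubectl get events -n kube-system --field-selector involvedObject.name=calico-kube-controllers-6c954486fd-tmlnl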

4. Since no pod logs were produced, check all the events generated by Kubernetes with kubectl get events -n kube-system.

[kubectl get events -n kube-system]


[root@kubemaster success_log]# kubectl get events -n kube-system 
LAST SEEN   TYPE      REASON              OBJECT                                          MESSAGE 
   Warning   FailedScheduling    pod/calico-kube-controllers-6c954486fd-tmlnl    0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
   Warning   FailedScheduling    pod/calico-kube-controllers-6c954486fd-tmlnl    0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
31m         Warning   FailedCreate        replicaset/calico-kube-controllers-6c954486fd   Error creating: pods "calico-kube-controllers-6c954486fd-" is forbidden: error looking up service account kube-system/calico-kube-controllers: serviceaccount "calico-kube-controllers" not found 
31m         Normal    SuccessfulCreate    replicaset/calico-kube-controllers-6c954486fd   Created pod: calico-kube-controllers-6c954486fd-tmlnl 
31m         Normal    ScalingReplicaSet   deployment/calico-kube-controllers              Scaled up replica set calico-kube-controllers-6c954486fd to 1 
   Normal    Scheduled           pod/calico-node-ksmgr                           Successfully assigned kube-system/calico-node-ksmgr to kubeworker 
31m         Normal    Pulled              pod/calico-node-ksmgr                           Container image "172.31.85.33:13000/calico/cni:v3.7.3" already present on machine 
31m         Normal    Created             pod/calico-node-ksmgr                           Created container install-cni 
31m         Normal    Started             pod/calico-node-ksmgr                           Started container install-cni 
31m         Normal    Pulled              pod/calico-node-ksmgr                           Container image "172.31.85.33:13000/calico/node:v3.7.3" already present on machine 
31m         Normal    Created             pod/calico-node-ksmgr                           Created container calico-node 
31m         Normal    Started             pod/calico-node-ksmgr                           Started container calico-node 
31m         Warning   Unhealthy           pod/calico-node-ksmgr                           Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 172.21.20.50

2019-10-25 05:22:06.381 [INFO][179] readiness.go 88: Number of node(s) with BGP peering established = 0 
31m         Normal    Killing             pod/calico-node-ksmgr                           Stopping container calico-node 
   Normal    Scheduled           pod/calico-node-rr5gm                           Successfully assigned kube-system/calico-node-rr5gm to kubemaster 
31m         Normal    Pulled              pod/calico-node-rr5gm                           Container image "172.31.85.33:13000/calico/cni:v3.7.3" already present on machine 
31m         Normal    Created             pod/calico-node-rr5gm                           Created container install-cni 
31m         Normal    Started             pod/calico-node-rr5gm                           Started container install-cni 
31m         Normal    Pulled              pod/calico-node-rr5gm                           Container image "172.31.85.33:13000/calico/node:v3.7.3" already present on machine 
31m         Normal    Created             pod/calico-node-rr5gm                           Created container calico-node 
31m         Normal    Started             pod/calico-node-rr5gm                           Started container calico-node 
31m         Normal    Killing             pod/calico-node-rr5gm                           Stopping container calico-node 
31m         Warning   Unhealthy           pod/calico-node-rr5gm                           Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503 
31m         Warning   FailedCreate        daemonset/calico-node                           Error creating: pods "calico-node-" is forbidden: error looking up service account kube-system/calico-node: serviceaccount "calico-node" not found 
31m         Normal    SuccessfulCreate    daemonset/calico-node                           Created pod: calico-node-ksmgr 
31m         Normal    SuccessfulCreate    daemonset/calico-node                           Created pod: calico-node-rr5gm 
   Warning   FailedScheduling    pod/coredns-7ff7988d-rd2l5                      0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
   Warning   FailedScheduling    pod/coredns-7ff7988d-rd2l5                      0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
30m         Normal    SuccessfulCreate    replicaset/coredns-7ff7988d                     Created pod: coredns-7ff7988d-rd2l5 
30m         Normal    ScalingReplicaSet   deployment/coredns                              Scaled up replica set coredns-7ff7988d to 1 
   Warning   FailedScheduling    pod/dns-autoscaler-748467c887-bhqxv             0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
   Warning   FailedScheduling    pod/dns-autoscaler-748467c887-bhqxv             0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
30m         Normal    SuccessfulCreate    replicaset/dns-autoscaler-748467c887            Created pod: dns-autoscaler-748467c887-bhqxv 
30m         Normal    ScalingReplicaSet   deployment/dns-autoscaler                       Scaled up replica set dns-autoscaler-748467c887 to 1 
33m         Normal    Pulled              pod/kube-apiserver-kubemaster                   Container image "172.31.85.33:13000/google-containers/kube-apiserver:v1.16.0" already present on machine 
33m         Normal    Created             pod/kube-apiserver-kubemaster                   Created container kube-apiserver 
33m         Normal    Started             pod/kube-apiserver-kubemaster                   Started container kube-apiserver 
33m         Normal    Pulled              pod/kube-controller-manager-kubemaster          Container image "172.31.85.33:13000/google-containers/kube-controller-manager:v1.16.0" already present on machine 
33m         Normal    Created             pod/kube-controller-manager-kubemaster          Created container kube-controller-manager 
33m         Normal    Started             pod/kube-controller-manager-kubemaster          Started container kube-controller-manager 
32m         Normal    Pulled              pod/kube-controller-manager-kubemaster          Container image "172.31.85.33:13000/google-containers/kube-controller-manager:v1.16.0" already present on machine 
32m         Normal    Created             pod/kube-controller-manager-kubemaster          Created container kube-controller-manager 
32m         Normal    Started             pod/kube-controller-manager-kubemaster          Started container kube-controller-manager 
33m         Normal    LeaderElection      endpoints/kube-controller-manager               kubemaster_cd17761a-5ec7-419b-ba0b-c15877040954 became leader 
32m         Normal    LeaderElection      endpoints/kube-controller-manager               kubemaster_860a0eeb-16ba-4380-8011-8412d3b5e241 became leader 
   Normal    Scheduled           pod/kube-proxy-85csh                            Successfully assigned kube-system/kube-proxy-85csh to kubeworker 
31m         Normal    Pulled              pod/kube-proxy-85csh                            Container image "172.31.85.33:13000/google-containers/kube-proxy:v1.16.0" already present on machine 
31m         Normal    Created             pod/kube-proxy-85csh                            Created container kube-proxy 
31m         Normal    Started             pod/kube-proxy-85csh                            Started container kube-proxy 
   Normal    Scheduled           pod/kube-proxy-9p9z7                            Successfully assigned kube-system/kube-proxy-9p9z7 to kubemaster 
32m         Normal    Pulled              pod/kube-proxy-9p9z7                            Container image "172.31.85.33:13000/google-containers/kube-proxy:v1.16.0" already present on machine 
32m         Normal    Created             pod/kube-proxy-9p9z7                            Created container kube-proxy 
32m         Normal    Started             pod/kube-proxy-9p9z7                            Started container kube-proxy 
32m         Normal    SuccessfulCreate    daemonset/kube-proxy                            Created pod: kube-proxy-9p9z7 
31m         Normal    SuccessfulCreate    daemonset/kube-proxy                            Created pod: kube-proxy-85csh 
33m         Normal    Pulled              pod/kube-scheduler-kubemaster                   Container image "172.31.85.33:13000/google-containers/kube-scheduler:v1.16.0" already present on machine 
33m         Normal    Created             pod/kube-scheduler-kubemaster                   Created container kube-scheduler 
33m         Normal    Started             pod/kube-scheduler-kubemaster                   Started container kube-scheduler 
32m         Normal    Pulled              pod/kube-scheduler-kubemaster                   Container image "172.31.85.33:13000/google-containers/kube-scheduler:v1.16.0" already present on machine 
32m         Normal    Created             pod/kube-scheduler-kubemaster                   Created container kube-scheduler 
32m         Normal    Started             pod/kube-scheduler-kubemaster                   Started container kube-scheduler 
33m         Normal    LeaderElection      endpoints/kube-scheduler                        kubemaster_6aca6003-392a-4b81-827c-34b05c8f7fb6 became leader 
32m         Normal    LeaderElection      endpoints/kube-scheduler                        kubemaster_57750948-7d68-44c3-90ed-7d89b7f8fab5 became leader 
   Warning   FailedScheduling    pod/kubernetes-dashboard-76b866f6cf-xfnm7       0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
   Warning   FailedScheduling    pod/kubernetes-dashboard-76b866f6cf-xfnm7       0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate. 
30m         Normal    SuccessfulCreate    replicaset/kubernetes-dashboard-76b866f6cf      Created pod: kubernetes-dashboard-76b866f6cf-xfnm7 
30m         Normal    ScalingReplicaSet   deployment/kubernetes-dashboard                 Scaled up replica set kubernetes-dashboard-76b866f6cf to 1 
31m         Normal    Pulled              pod/nginx-proxy-kubeworker                      Container image "172.31.85.33:13000/library/nginx:1.15" already present on machine 
31m         Normal    Created             pod/nginx-proxy-kubeworker                      Created container nginx-proxy 
31m         Normal    Started             pod/nginx-proxy-kubeworker                      Started container nginx-proxy 
   Normal    Scheduled           pod/nodelocaldns-kr7q5                          Successfully assigned kube-system/nodelocaldns-kr7q5 to kubemaster 
30m         Normal    Pulled              pod/nodelocaldns-kr7q5                          Container image "172.31.85.33:13000/google-containers/k8s-dns-node-cache:1.15.5" already present on machine 
30m         Normal    Created             pod/nodelocaldns-kr7q5                          Created container node-cache 
30m         Normal    Started             pod/nodelocaldns-kr7q5                          Started container node-cache 
   Normal    Scheduled           pod/nodelocaldns-plh7m                          Successfully assigned kube-system/nodelocaldns-plh7m to kubeworker 
30m         Normal    Pulled              pod/nodelocaldns-plh7m                          Container image "172.31.85.33:13000/google-containers/k8s-dns-node-cache:1.15.5" already present on machine 
30m         Normal    Created             pod/nodelocaldns-plh7m                          Created container node-cache 
30m         Normal    Started             pod/nodelocaldns-plh7m                          Started container node-cache 
30m         Normal    SuccessfulCreate    daemonset/nodelocaldns                          Created pod: nodelocaldns-kr7q5 
30m         Normal    SuccessfulCreate    daemonset/nodelocaldns                          Created pod: nodelocaldns-plh7m 
[root@kubemaster success_log]# 
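
The event list above is not ordered by time. When it grows long, sorting by creation time or filtering for warnings (both standard kubectl options) makes the failure pattern easier to spot:

# Oldest events first
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp

# Only the Warning events
kubectl get events -n kube-system --field-selector type=Warning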


5. Next, check the state of the nodes joined to the cluster with kubectl describe node kubemaster (and likewise for kubeworker).

Because taints and tolerations are applied at the node level, check the current node state and confirm whether pods can actually be scheduled onto it.

[kubectl describe node [node name]]


[root@kubemaster kubespray]# kubectl describe node kubemaster 
Name:               kubemaster 
Roles:              master 
Labels:             beta.kubernetes.io/arch=amd64 
                    beta.kubernetes.io/os=linux 
                    kubernetes.io/arch=amd64 
                    kubernetes.io/hostname=kubemaster 
                    kubernetes.io/os=linux 
                    node-role.kubernetes.io/master= 
Annotations:        alpha.kubernetes.io/provided-node-ip: 172.21.20.50 
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock 
                    node.alpha.kubernetes.io/ttl: 0 
                    volumes.kubernetes.io/controller-managed-attach-detach: true 
CreationTimestamp:  Fri, 25 Oct 2019 03:57:24 -0400 
Taints:             node-role.kubernetes.io/master:NoSchedule 
                    node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule 
Unschedulable:      false 
Conditions: 
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message 
  ----                 ------  -----------------                 ------------------                ------                       ------- 
  NetworkUnavailable   False   Fri, 25 Oct 2019 03:59:17 -0400   Fri, 25 Oct 2019 03:59:17 -0400   CalicoIsUp                   Calico is running on this node 
  MemoryPressure       False   Fri, 25 Oct 2019 04:00:54 -0400   Fri, 25 Oct 2019 03:57:21 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available 
  DiskPressure         False   Fri, 25 Oct 2019 04:00:54 -0400   Fri, 25 Oct 2019 03:57:21 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure 
  PIDPressure          False   Fri, 25 Oct 2019 04:00:54 -0400   Fri, 25 Oct 2019 03:57:21 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available 
  Ready                True    Fri, 25 Oct 2019 04:00:54 -0400   Fri, 25 Oct 2019 03:59:14 -0400   KubeletReady                 kubelet is posting ready status 
Addresses: 
  InternalIP:  172.21.20.50 
  Hostname:    kubemaster 
Capacity: 
 cpu:                16 
 ephemeral-storage:  104846316Ki 
 hugepages-1Gi:      0 
 hugepages-2Mi:      0 
 memory:             16264568Ki 
 pods:               110 
Allocatable: 
 cpu:                15800m 
 ephemeral-storage:  96626364666 
 hugepages-1Gi:      0 
 hugepages-2Mi:      0 
 memory:             15662168Ki 
 pods:               110 
System Info: 
 Machine ID:                 0fe9ccd97b313b47afe0d301b5145d20 
 System UUID:                7880C4F4-A248-47B6-AD36-FC3C6C4A5E52 
 Boot ID:                    58af9a9a-1fe2-4dff-8e08-ce213ee25951 
 Kernel Version:             3.10.0-957.21.3.el7.x86_64 
 OS Image:                   Red Hat Enterprise Linux Server 7.6 (Maipo) 
 Operating System:           linux 
 Architecture:               amd64 
 Container Runtime Version:  docker://19.3.4 
 Kubelet Version:            v1.16.0 
 Kube-Proxy Version:         v1.16.0 
PodCIDR:                     10.233.64.0/24 
PodCIDRs:                    10.233.64.0/24 
Non-terminated Pods:         (6 in total) 
  Namespace                  Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE 
  ---------                  ----                                  ------------  ----------  ---------------  -------------  --- 
  kube-system                calico-node-mvbkm                     150m (0%)     300m (1%)   64M (0%)         500M (3%)      111s 
  kube-system                kube-apiserver-kubemaster             250m (1%)     0 (0%)      0 (0%)           0 (0%)         2m55s 
  kube-system                kube-controller-manager-kubemaster    200m (1%)     0 (0%)      0 (0%)           0 (0%)         2m55s 
  kube-system                kube-proxy-cwtdz                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m15s 
  kube-system                kube-scheduler-kubemaster             100m (0%)     0 (0%)      0 (0%)           0 (0%)         2m55s 
  kube-system                nodelocaldns-lwqqk                    100m (0%)     0 (0%)      70Mi (0%)        170Mi (1%)     74s 
Allocated resources: 
  (Total limits may be over 100 percent, i.e., overcommitted.) 
  Resource           Requests        Limits 
  --------           --------        ------ 
  cpu                800m (5%)       300m (1%) 
  memory             137400320 (0%)  678257920 (4%) 
  ephemeral-storage  0 (0%)          0 (0%) 
Events: 
  Type    Reason                   Age                    From                    Message 
  ----    ------                   ----                   ----                    ------- 
  Normal  Starting                 3m40s                  kubelet, kubemaster     Starting kubelet. 
  Normal  NodeHasSufficientMemory  3m39s (x7 over 3m40s)  kubelet, kubemaster     Node kubemaster status is now: NodeHasSufficientMemory 
  Normal  NodeHasNoDiskPressure    3m39s (x8 over 3m40s)  kubelet, kubemaster     Node kubemaster status is now: NodeHasNoDiskPressure 
  Normal  NodeHasSufficientPID     3m39s (x8 over 3m40s)  kubelet, kubemaster     Node kubemaster status is now: NodeHasSufficientPID 
  Normal  NodeAllocatableEnforced  3m39s                  kubelet, kubemaster     Updated Node Allocatable limit across pods 
  Normal  Starting                 2m55s                  kubelet, kubemaster     Starting kubelet. 
  Normal  NodeAllocatableEnforced  2m55s                  kubelet, kubemaster     Updated Node Allocatable limit across pods 
  Normal  NodeHasSufficientMemory  2m55s                  kubelet, kubemaster     Node kubemaster status is now: NodeHasSufficientMemory 
  Normal  NodeHasNoDiskPressure    2m55s                  kubelet, kubemaster     Node kubemaster status is now: NodeHasNoDiskPressure 
  Normal  NodeHasSufficientPID     2m55s                  kubelet, kubemaster     Node kubemaster status is now: NodeHasSufficientPID 
  Normal  Starting                 2m35s                  kube-proxy, kubemaster  Starting kube-proxy. 
  Normal  NodeReady                105s                   kubelet, kubemaster     Node kubemaster status is now: NodeReady 
[root@kubemaster kubespray]# kubectl describe node kubeworker 
Name:               kubeworker 
Roles:               
Labels:             beta.kubernetes.io/arch=amd64 
                    beta.kubernetes.io/os=linux 
                    kubernetes.io/arch=amd64 
                    kubernetes.io/hostname=kubeworker 
                    kubernetes.io/os=linux 
Annotations:        alpha.kubernetes.io/provided-node-ip: 172.21.20.50 
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock 
                    node.alpha.kubernetes.io/ttl: 0 
                    volumes.kubernetes.io/controller-managed-attach-detach: true 
CreationTimestamp:  Fri, 25 Oct 2019 03:58:57 -0400 
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule 
Unschedulable:      false 
Conditions: 
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message 
  ----                 ------  -----------------                 ------------------                ------                       ------- 
  NetworkUnavailable   False   Fri, 25 Oct 2019 03:59:18 -0400   Fri, 25 Oct 2019 03:59:18 -0400   CalicoIsUp                   Calico is running on this node 
  MemoryPressure       False   Fri, 25 Oct 2019 04:01:29 -0400   Fri, 25 Oct 2019 03:58:58 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available 
  DiskPressure         False   Fri, 25 Oct 2019 04:01:29 -0400   Fri, 25 Oct 2019 03:58:58 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure 
  PIDPressure          False   Fri, 25 Oct 2019 04:01:29 -0400   Fri, 25 Oct 2019 03:58:58 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available 
  Ready                True    Fri, 25 Oct 2019 04:01:29 -0400   Fri, 25 Oct 2019 03:59:18 -0400   KubeletReady                 kubelet is posting ready status 
Addresses: 
  InternalIP:  172.21.20.50 
  Hostname:    kubeworker 
Capacity: 
 cpu:                16 
 ephemeral-storage:  104846316Ki 
 hugepages-1Gi:      0 
 hugepages-2Mi:      0 
 memory:             16264568Ki 
 pods:               110 
Allocatable: 
 cpu:                15900m 
 ephemeral-storage:  96626364666 
 hugepages-1Gi:      0 
 hugepages-2Mi:      0 
 memory:             15912168Ki 
 pods:               110 
System Info: 
 Machine ID:                 0fe9ccd97b313b47afe0d301b5145d20 
 System UUID:                A06FB514-89B8-46E5-8A39-3BB3606B6158 
 Boot ID:                    47284b35-dd79-48f7-9511-89a7bdfc01a5 
 Kernel Version:             3.10.0-957.21.3.el7.x86_64 
 OS Image:                   Red Hat Enterprise Linux Server 7.6 (Maipo) 
 Operating System:           linux 
 Architecture:               amd64 
 Container Runtime Version:  docker://19.3.4 
 Kubelet Version:            v1.16.0 
 Kube-Proxy Version:         v1.16.0 
PodCIDR:                     10.233.65.0/24 
PodCIDRs:                    10.233.65.0/24 
Non-terminated Pods:         (4 in total) 
  Namespace                  Name                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE 
  ---------                  ----                      ------------  ----------  ---------------  -------------  --- 
  kube-system                calico-node-dpg8x         150m (0%)     300m (1%)   64M (0%)         500M (3%)      2m26s 
  kube-system                kube-proxy-mrrvb          0 (0%)        0 (0%)      0 (0%)           0 (0%)         2m37s 
  kube-system                nginx-proxy-kubeworker    25m (0%)      0 (0%)      32M (0%)         0 (0%)         2m36s 
  kube-system                nodelocaldns-kq8kh        100m (0%)     0 (0%)      70Mi (0%)        170Mi (1%)     109s 
Allocated resources: 
  (Total limits may be over 100 percent, i.e., overcommitted.) 
  Resource           Requests        Limits 
  --------           --------        ------ 
  cpu                275m (1%)       300m (1%) 
  memory             169400320 (1%)  678257920 (4%) 
  ephemeral-storage  0 (0%)          0 (0%) 
Events: 
  Type    Reason                   Age                    From                    Message 
  ----    ------                   ----                   ----                    ------- 
  Normal  Starting                 2m36s                  kubelet, kubeworker     Starting kubelet. 
  Normal  NodeHasSufficientMemory  2m36s (x2 over 2m36s)  kubelet, kubeworker     Node kubeworker status is now: NodeHasSufficientMemory 
  Normal  NodeHasNoDiskPressure    2m36s (x2 over 2m36s)  kubelet, kubeworker     Node kubeworker status is now: NodeHasNoDiskPressure 
  Normal  NodeHasSufficientPID     2m36s (x2 over 2m36s)  kubelet, kubeworker     Node kubeworker status is now: NodeHasSufficientPID 
  Normal  NodeAllocatableEnforced  2m36s                  kubelet, kubeworker     Updated Node Allocatable limit across pods 
  Normal  Starting                 2m35s                  kube-proxy, kubeworker  Starting kube-proxy. 
  Normal  NodeReady                2m16s                  kubelet, kubeworker     Node kubeworker status is now: NodeReady 
[root@kubemaster kubespray]# 


- Both the Kubernetes master node and the worker node are tainted with node.cloudprovider.kubernetes.io/uninitialized, and the master node additionally carries node-role.kubernetes.io/master:NoSchedule, which keeps application pods off the master.

Kubernetes uses built-in taints such as the following to manage node state and pod placement:


node.kubernetes.io/not-ready: The node is not ready. This corresponds to the NodeCondition Ready being "False".
node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
node.kubernetes.io/out-of-disk: The node is out of disk space.
node.kubernetes.io/memory-pressure: The node has memory pressure.
node.kubernetes.io/disk-pressure: The node has disk pressure.
node.kubernetes.io/network-unavailable: The node's network is unavailable.
node.kubernetes.io/unschedulable: The node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with an "external" cloud provider, this taint is set on the node to mark it as unusable. Once a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.


In our case, no toleration is specified for the node.cloudprovider.kubernetes.io/uninitialized taint, so calico cannot be started on the master node.

Based on what we have confirmed so far, the diagnosis is: the scheduler tried to place the calico pod onto the tainted kubemaster node, but because no matching toleration was specified on the pod, it could not be scheduled.

To resolve this, we can either add a toleration to the pod or remove the taint from the node.
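
To double-check from the pod spec itself which tolerations the pod actually carries (the same information appears in the Tolerations section of the describe output above; the jsonpath query is just a compact way to print it):

kubectl get pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system -o jsonpath='{.spec.tolerations}'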

6. The simpler approach is to remove the taint: running kubectl taint nodes kubemaster node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule- (note the trailing '-') deletes the taint from the node.


[root@kubemaster kubespray]# kubectl taint nodes kubemaster node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule- 
node/kubemaster untainted 
[root@kubemaster kubespray]# kubectl taint nodes kubemaster node-role.kubernetes.io/master:NoSchedule- 
node/kubemaster untainted 
[root@kubemaster kubespray]# kubectl taint nodes kubeworker node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule- 
node/kubeworker untainted 
[root@kubemaster kubespray]#


Once the taints are removed as above, the pods start up as shown below.


[root@kubemaster kubespray]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-6c954486fd-2j5xg   1/1     Running   0          4m8s
kube-system   calico-node-dpg8x                          1/1     Running   1          4m26s
kube-system   calico-node-mvbkm                          1/1     Running   1          4m26s
kube-system   coredns-7ff7988d-c74gd                     1/1     Running   2          3m55s
kube-system   coredns-7ff7988d-fnp5x                     0/1     Error     0          43s
kube-system   dns-autoscaler-748467c887-g7flm            1/1     Running   0          3m51s
kube-system   kube-apiserver-kubemaster                  1/1     Running   0          5m30s
kube-system   kube-controller-manager-kubemaster         1/1     Running   0          5m30s
kube-system   kube-proxy-cwtdz                           1/1     Running   0          5m50s
kube-system   kube-proxy-mrrvb                           1/1     Running   0          4m37s
kube-system   kube-scheduler-kubemaster                  1/1     Running   0          5m30s
kube-system   kubernetes-dashboard-76b866f6cf-vrrqg      1/1     Running   0          3m47s
kube-system   nginx-proxy-kubeworker                     1/1     Running   0          4m36s
kube-system   nodelocaldns-kq8kh                         1/1     Running   0          3m49s
kube-system   nodelocaldns-lwqqk                         1/1     Running   0          3m49s
[root@kubemaster kubespray]#


Of course, the preferable fix is to add the correct tolerations instead; we will cover this in more detail in a later post.

As a quick example, tolerations for these taints can be added to the pod spec in the following form to allow the pod to be scheduled:


...

...

tolerations:
  # This taint is set by all kubelets running `--cloud-provider=external`
  # so we should tolerate it to schedule the calico pods
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
  # This, along with the annotation above marks this pod as a critical add-on.
  - key: CriticalAddonsOnly
    operator: Exists

...

...
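
One way to apply such tolerations to a workload that is already deployed is to edit or patch its controller object. The commands below are a sketch based on the deployment name visible in the event output above; note that a JSON merge patch replaces the entire tolerations list, so every toleration the pod still needs must be included in the patch.

# Interactive edit of the deployment spec
kubectl -n kube-system edit deployment calico-kube-controllers

# Or patch the tolerations in place (a merge patch replaces the whole list)
kubectl -n kube-system patch deployment calico-kube-controllers --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[
        {"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"},
        {"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},
        {"key":"CriticalAddonsOnly","operator":"Exists"}]}}}}'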


Here is a summary of the commands we used in this post:

a) kubectl get pods -o wide --all-namespaces
b) kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system
c) kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system
d) kubectl get events -n kube-system
e) kubectl describe node kubemaster

 

There are, of course, many more kubectl commands besides these.

Sometimes you also need to look beyond pods at deployments, replicasets, and services; check containers launched through the Docker daemon with docker logs; and examine /var/log/messages, the files under /etc/kubernetes, certificates, and many other points.
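
As a rough sketch of those additional checkpoints (paths, unit names, and container naming may differ depending on the distribution and installer):

# Controller-level objects rather than individual pods
kubectl get deployments,daemonsets,replicasets,services -n kube-system

# Containers started by the kubelet through the Docker daemon
docker ps --filter name=k8s_
docker logs <container_id>

# Host-level logs and the Kubernetes configuration / certificate directories
journalctl -u kubelet --no-pager | tail -n 100
tail -n 100 /var/log/messages
ls -l /etc/kubernetes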

 

The case we walked through today is just one of the countless issues that can come up in Kubernetes.

Internalizing this specific issue is worthwhile, but more importantly, this post should prompt you to think again about the process you follow to troubleshoot a problem when one occurs.
