This post looks at how to troubleshoot failures that occur while starting up or operating Kubernetes.
Kubernetes has become an essential element for every customer moving toward the cloud, and you need to be able to design Kubernetes environments regardless of whether they run on a CSP or on-premises.
However, most customers and operators today want to install it first and learn by trial and error before fully understanding it, and that turns out to be harder than expected.
So in this post we will walk through a few steps showing how, at a minimum, to locate and diagnose a problem when something goes wrong in Kubernetes.
# This example assumes a single failure scenario and explains how to work through it.
Failure scenario
Problem: when Kubernetes starts up, the calico-kube-controllers & coredns pods remain in the Pending state.
1. First, check the status of the pods currently deployed by Kubernetes with kubectl get pods -o wide --all-namespaces.
[kubectl get pods -o wide --all-namespaces]
[root@kubemaster success_log]# kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-6c954486fd-tmlnl 0/1 Pending 0 25m
kube-system calico-node-ksmgr 1/1 Running 1 26m 172.21.20.51 kubeworker
kube-system calico-node-rr5gm 1/1 Running 1 26m 172.21.20.50 kubemaster
kube-system coredns-7ff7988d-rd2l5 0/1 Pending 0 25m
kube-system dns-autoscaler-748467c887-bhqxv 0/1 Pending 0 25m
kube-system kube-apiserver-kubemaster 1/1 Running 0 27m 172.21.20.50 kubemaster
kube-system kube-controller-manager-kubemaster 1/1 Running 0 27m 172.21.20.50 kubemaster
kube-system kube-proxy-85csh 1/1 Running 0 26m 172.21.20.51 kubeworker
kube-system kube-proxy-9p9z7 1/1 Running 0 27m 172.21.20.50 kubemaster
kube-system kube-scheduler-kubemaster 1/1 Running 0 27m 172.21.20.50 kubemaster
kube-system kubernetes-dashboard-76b866f6cf-xfnm7 0/1 Pending 0 25m
kube-system nginx-proxy-kubeworker 1/1 Running 0 26m 172.21.20.51 kubeworker
kube-system nodelocaldns-kr7q5 1/1 Running 0 25m 172.21.20.50 kubemaster
kube-system nodelocaldns-plh7m 1/1 Running 0 25m 172.21.20.51 kubeworker
[root@kubemaster success_log]#
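Tip: when a cluster runs many pods, the ones stuck in Pending can be listed on their own using kubectl's field selector. This is only a quick sketch and was not part of the session above:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending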
Troubleshooting
Symptom: Kubernetes does not finish starting up within a reasonable time, and the pods above remain Pending.
2. To see detailed information about a Pending pod, run kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system.
[kubectl describe pod [pod_name]]
[root@kubemaster success_log]# kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system
Name: calico-kube-controllers-6c954486fd-tmlnl
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node:
Labels: k8s-app=calico-kube-controllers
kubernetes.io/cluster-service=true
pod-template-hash=6c954486fd
Annotations:
Status: Pending
IP:
IPs:
Controlled By: ReplicaSet/calico-kube-controllers-6c954486fd
Containers:
calico-kube-controllers:
Image: 172.31.85.33:13000/calico/kube-controllers:v3.7.3
Port:
Host Port:
Limits:
cpu: 100m
memory: 256M
Requests:
cpu: 30m
memory: 64M
Readiness: exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
ETCD_ENDPOINTS: https://172.21.20.50:2379
ETCD_CA_CERT_FILE: /etc/calico/certs/ca_cert.crt
ETCD_CERT_FILE: /etc/calico/certs/cert.crt
ETCD_KEY_FILE: /etc/calico/certs/key.pem
Mounts:
/etc/calico/certs from etcd-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from calico-kube-controllers-token-n4wwv (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
etcd-certs:
Type: HostPath (bare host directory volume)
Path: /etc/calico/certs
HostPathType:
calico-kube-controllers-token-n4wwv:
Type: Secret (a volume populated by a Secret)
SecretName: calico-kube-controllers-token-n4wwv
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
[root@kubemaster success_log]# kubectl get replicaset --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system calico-kube-controllers-6c954486fd 1 1 0 27m
kube-system coredns-7ff7988d 1 1 0 27m
kube-system dns-autoscaler-748467c887 1 1 0 27m
kube-system kubernetes-dashboard-76b866f6cf 1 1 0 27m
[root@kubemaster success_log]#
The describe output shows that the message "0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate." in the Events section is the cause of the Pending state.
As the message indicates, Kubernetes uses taints and tolerations to control where pods may be placed.
We will cover the details later; for now, a pod must declare a toleration matching the taint set on a node in order to be scheduled there, and we can infer that this condition is not met, so the pod cannot start on either node.
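Before digging further, the taints on every node can also be listed with a single command. The sketch below relies only on standard kubectl output options and was not part of the original session:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints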
3. Normally, once this is confirmed, the next step is to check the pod's own logs for more detail with kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system.
[kubectl logs [pod_name] -n [namespace_name]]
[root@kubemaster success_log]# kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system
[root@kubemaster success_log]#
However, kubectl logs only records output produced after a pod has been scheduled and its containers have started, so in the current state, where the pod cannot be scheduled at all, nothing is logged.
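For reference, when a pod has been scheduled but its container keeps crashing or restarting, the logs of the previous container instance can usually still be retrieved; the names in brackets below are placeholders:
kubectl logs [pod_name] -n [namespace_name] --previous
kubectl logs [pod_name] -n [namespace_name] -c [container_name]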
With that in mind, let's move on to the next troubleshooting step.
4. Since no logs were produced, check all events generated by Kubernetes with kubectl get events -n kube-system.
[kubectl get events -n kube-system]
[root@kubemaster success_log]# kubectl get events -n kube-system
LAST SEEN TYPE REASON OBJECT MESSAGE
Warning FailedScheduling pod/calico-kube-controllers-6c954486fd-tmlnl 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling pod/calico-kube-controllers-6c954486fd-tmlnl 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
31m Warning FailedCreate replicaset/calico-kube-controllers-6c954486fd Error creating: pods "calico-kube-controllers-6c954486fd-" is forbidden: error looking up service account kube-system/calico-kube-controllers: serviceaccount "calico-kube-controllers" not found
31m Normal SuccessfulCreate replicaset/calico-kube-controllers-6c954486fd Created pod: calico-kube-controllers-6c954486fd-tmlnl
31m Normal ScalingReplicaSet deployment/calico-kube-controllers Scaled up replica set calico-kube-controllers-6c954486fd to 1
Normal Scheduled pod/calico-node-ksmgr Successfully assigned kube-system/calico-node-ksmgr to kubeworker
31m Normal Pulled pod/calico-node-ksmgr Container image "172.31.85.33:13000/calico/cni:v3.7.3" already present on machine
31m Normal Created pod/calico-node-ksmgr Created container install-cni
31m Normal Started pod/calico-node-ksmgr Started container install-cni
31m Normal Pulled pod/calico-node-ksmgr Container image "172.31.85.33:13000/calico/node:v3.7.3" already present on machine
31m Normal Created pod/calico-node-ksmgr Created container calico-node
31m Normal Started pod/calico-node-ksmgr Started container calico-node
31m Warning Unhealthy pod/calico-node-ksmgr Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 172.21.20.50
2019-10-25 05:22:06.381 [INFO][179] readiness.go 88: Number of node(s) with BGP peering established = 0
31m Normal Killing pod/calico-node-ksmgr Stopping container calico-node
Normal Scheduled pod/calico-node-rr5gm Successfully assigned kube-system/calico-node-rr5gm to kubemaster
31m Normal Pulled pod/calico-node-rr5gm Container image "172.31.85.33:13000/calico/cni:v3.7.3" already present on machine
31m Normal Created pod/calico-node-rr5gm Created container install-cni
31m Normal Started pod/calico-node-rr5gm Started container install-cni
31m Normal Pulled pod/calico-node-rr5gm Container image "172.31.85.33:13000/calico/node:v3.7.3" already present on machine
31m Normal Created pod/calico-node-rr5gm Created container calico-node
31m Normal Started pod/calico-node-rr5gm Started container calico-node
31m Normal Killing pod/calico-node-rr5gm Stopping container calico-node
31m Warning Unhealthy pod/calico-node-rr5gm Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
31m Warning FailedCreate daemonset/calico-node Error creating: pods "calico-node-" is forbidden: error looking up service account kube-system/calico-node: serviceaccount "calico-node" not found
31m Normal SuccessfulCreate daemonset/calico-node Created pod: calico-node-ksmgr
31m Normal SuccessfulCreate daemonset/calico-node Created pod: calico-node-rr5gm
Warning FailedScheduling pod/coredns-7ff7988d-rd2l5 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling pod/coredns-7ff7988d-rd2l5 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
30m Normal SuccessfulCreate replicaset/coredns-7ff7988d Created pod: coredns-7ff7988d-rd2l5
30m Normal ScalingReplicaSet deployment/coredns Scaled up replica set coredns-7ff7988d to 1
Warning FailedScheduling pod/dns-autoscaler-748467c887-bhqxv 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling pod/dns-autoscaler-748467c887-bhqxv 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
30m Normal SuccessfulCreate replicaset/dns-autoscaler-748467c887 Created pod: dns-autoscaler-748467c887-bhqxv
30m Normal ScalingReplicaSet deployment/dns-autoscaler Scaled up replica set dns-autoscaler-748467c887 to 1
33m Normal Pulled pod/kube-apiserver-kubemaster Container image "172.31.85.33:13000/google-containers/kube-apiserver:v1.16.0" already present on machine
33m Normal Created pod/kube-apiserver-kubemaster Created container kube-apiserver
33m Normal Started pod/kube-apiserver-kubemaster Started container kube-apiserver
33m Normal Pulled pod/kube-controller-manager-kubemaster Container image "172.31.85.33:13000/google-containers/kube-controller-manager:v1.16.0" already present on machine
33m Normal Created pod/kube-controller-manager-kubemaster Created container kube-controller-manager
33m Normal Started pod/kube-controller-manager-kubemaster Started container kube-controller-manager
32m Normal Pulled pod/kube-controller-manager-kubemaster Container image "172.31.85.33:13000/google-containers/kube-controller-manager:v1.16.0" already present on machine
32m Normal Created pod/kube-controller-manager-kubemaster Created container kube-controller-manager
32m Normal Started pod/kube-controller-manager-kubemaster Started container kube-controller-manager
33m Normal LeaderElection endpoints/kube-controller-manager kubemaster_cd17761a-5ec7-419b-ba0b-c15877040954 became leader
32m Normal LeaderElection endpoints/kube-controller-manager kubemaster_860a0eeb-16ba-4380-8011-8412d3b5e241 became leader
Normal Scheduled pod/kube-proxy-85csh Successfully assigned kube-system/kube-proxy-85csh to kubeworker
31m Normal Pulled pod/kube-proxy-85csh Container image "172.31.85.33:13000/google-containers/kube-proxy:v1.16.0" already present on machine
31m Normal Created pod/kube-proxy-85csh Created container kube-proxy
31m Normal Started pod/kube-proxy-85csh Started container kube-proxy
Normal Scheduled pod/kube-proxy-9p9z7 Successfully assigned kube-system/kube-proxy-9p9z7 to kubemaster
32m Normal Pulled pod/kube-proxy-9p9z7 Container image "172.31.85.33:13000/google-containers/kube-proxy:v1.16.0" already present on machine
32m Normal Created pod/kube-proxy-9p9z7 Created container kube-proxy
32m Normal Started pod/kube-proxy-9p9z7 Started container kube-proxy
32m Normal SuccessfulCreate daemonset/kube-proxy Created pod: kube-proxy-9p9z7
31m Normal SuccessfulCreate daemonset/kube-proxy Created pod: kube-proxy-85csh
33m Normal Pulled pod/kube-scheduler-kubemaster Container image "172.31.85.33:13000/google-containers/kube-scheduler:v1.16.0" already present on machine
33m Normal Created pod/kube-scheduler-kubemaster Created container kube-scheduler
33m Normal Started pod/kube-scheduler-kubemaster Started container kube-scheduler
32m Normal Pulled pod/kube-scheduler-kubemaster Container image "172.31.85.33:13000/google-containers/kube-scheduler:v1.16.0" already present on machine
32m Normal Created pod/kube-scheduler-kubemaster Created container kube-scheduler
32m Normal Started pod/kube-scheduler-kubemaster Started container kube-scheduler
33m Normal LeaderElection endpoints/kube-scheduler kubemaster_6aca6003-392a-4b81-827c-34b05c8f7fb6 became leader
32m Normal LeaderElection endpoints/kube-scheduler kubemaster_57750948-7d68-44c3-90ed-7d89b7f8fab5 became leader
Warning FailedScheduling pod/kubernetes-dashboard-76b866f6cf-xfnm7 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling pod/kubernetes-dashboard-76b866f6cf-xfnm7 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
30m Normal SuccessfulCreate replicaset/kubernetes-dashboard-76b866f6cf Created pod: kubernetes-dashboard-76b866f6cf-xfnm7
30m Normal ScalingReplicaSet deployment/kubernetes-dashboard Scaled up replica set kubernetes-dashboard-76b866f6cf to 1
31m Normal Pulled pod/nginx-proxy-kubeworker Container image "172.31.85.33:13000/library/nginx:1.15" already present on machine
31m Normal Created pod/nginx-proxy-kubeworker Created container nginx-proxy
31m Normal Started pod/nginx-proxy-kubeworker Started container nginx-proxy
Normal Scheduled pod/nodelocaldns-kr7q5 Successfully assigned kube-system/nodelocaldns-kr7q5 to kubemaster
30m Normal Pulled pod/nodelocaldns-kr7q5 Container image "172.31.85.33:13000/google-containers/k8s-dns-node-cache:1.15.5" already present on machine
30m Normal Created pod/nodelocaldns-kr7q5 Created container node-cache
30m Normal Started pod/nodelocaldns-kr7q5 Started container node-cache
Normal Scheduled pod/nodelocaldns-plh7m Successfully assigned kube-system/nodelocaldns-plh7m to kubeworker
30m Normal Pulled pod/nodelocaldns-plh7m Container image "172.31.85.33:13000/google-containers/k8s-dns-node-cache:1.15.5" already present on machine
30m Normal Created pod/nodelocaldns-plh7m Created container node-cache
30m Normal Started pod/nodelocaldns-plh7m Started container node-cache
30m Normal SuccessfulCreate daemonset/nodelocaldns Created pod: nodelocaldns-kr7q5
30m Normal SuccessfulCreate daemonset/nodelocaldns Created pod: nodelocaldns-plh7m
[root@kubemaster success_log]#
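As a small aside, the event list is often easier to read when sorted by time; the --sort-by option below is standard kubectl and is shown here only as a suggestion:
kubectl get events -n kube-system --sort-by='.lastTimestamp'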
5. Next, to check the state of the nodes joined to the cluster, run kubectl describe node [node_name] for each node.
Because taints are applied at the node level (while tolerations are declared on pods), check the current node state to see whether pods can be scheduled onto it.
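Before drilling into each node, a quick overview of all nodes and their status can be taken first (standard kubectl, shown here only as a pointer):
kubectl get nodes -o wide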
[kubectl describe node [node name]]
[root@kubemaster kubespray]# kubectl describe node kubemaster
Name: kubemaster
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=kubemaster
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: alpha.kubernetes.io/provided-node-ip: 172.21.20.50
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 25 Oct 2019 03:57:24 -0400
Taints: node-role.kubernetes.io/master:NoSchedule
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 25 Oct 2019 03:59:17 -0400 Fri, 25 Oct 2019 03:59:17 -0400 CalicoIsUp Calico is running on this node
MemoryPressure False Fri, 25 Oct 2019 04:00:54 -0400 Fri, 25 Oct 2019 03:57:21 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 25 Oct 2019 04:00:54 -0400 Fri, 25 Oct 2019 03:57:21 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 25 Oct 2019 04:00:54 -0400 Fri, 25 Oct 2019 03:57:21 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 25 Oct 2019 04:00:54 -0400 Fri, 25 Oct 2019 03:59:14 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.21.20.50
Hostname: kubemaster
Capacity:
cpu: 16
ephemeral-storage: 104846316Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16264568Ki
pods: 110
Allocatable:
cpu: 15800m
ephemeral-storage: 96626364666
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15662168Ki
pods: 110
System Info:
Machine ID: 0fe9ccd97b313b47afe0d301b5145d20
System UUID: 7880C4F4-A248-47B6-AD36-FC3C6C4A5E52
Boot ID: 58af9a9a-1fe2-4dff-8e08-ce213ee25951
Kernel Version: 3.10.0-957.21.3.el7.x86_64
OS Image: Red Hat Enterprise Linux Server 7.6 (Maipo)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.4
Kubelet Version: v1.16.0
Kube-Proxy Version: v1.16.0
PodCIDR: 10.233.64.0/24
PodCIDRs: 10.233.64.0/24
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-mvbkm 150m (0%) 300m (1%) 64M (0%) 500M (3%) 111s
kube-system kube-apiserver-kubemaster 250m (1%) 0 (0%) 0 (0%) 0 (0%) 2m55s
kube-system kube-controller-manager-kubemaster 200m (1%) 0 (0%) 0 (0%) 0 (0%) 2m55s
kube-system kube-proxy-cwtdz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m15s
kube-system kube-scheduler-kubemaster 100m (0%) 0 (0%) 0 (0%) 0 (0%) 2m55s
kube-system nodelocaldns-lwqqk 100m (0%) 0 (0%) 70Mi (0%) 170Mi (1%) 74s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 800m (5%) 300m (1%)
memory 137400320 (0%) 678257920 (4%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 3m40s kubelet, kubemaster Starting kubelet.
Normal NodeHasSufficientMemory 3m39s (x7 over 3m40s) kubelet, kubemaster Node kubemaster status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 3m39s (x8 over 3m40s) kubelet, kubemaster Node kubemaster status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 3m39s (x8 over 3m40s) kubelet, kubemaster Node kubemaster status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 3m39s kubelet, kubemaster Updated Node Allocatable limit across pods
Normal Starting 2m55s kubelet, kubemaster Starting kubelet.
Normal NodeAllocatableEnforced 2m55s kubelet, kubemaster Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 2m55s kubelet, kubemaster Node kubemaster status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m55s kubelet, kubemaster Node kubemaster status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m55s kubelet, kubemaster Node kubemaster status is now: NodeHasSufficientPID
Normal Starting 2m35s kube-proxy, kubemaster Starting kube-proxy.
Normal NodeReady 105s kubelet, kubemaster Node kubemaster status is now: NodeReady
[root@kubemaster kubespray]# kubectl describe node kubeworker
Name: kubeworker
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=kubeworker
kubernetes.io/os=linux
Annotations: alpha.kubernetes.io/provided-node-ip: 172.21.20.50
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 25 Oct 2019 03:58:57 -0400
Taints: node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 25 Oct 2019 03:59:18 -0400 Fri, 25 Oct 2019 03:59:18 -0400 CalicoIsUp Calico is running on this node
MemoryPressure False Fri, 25 Oct 2019 04:01:29 -0400 Fri, 25 Oct 2019 03:58:58 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 25 Oct 2019 04:01:29 -0400 Fri, 25 Oct 2019 03:58:58 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 25 Oct 2019 04:01:29 -0400 Fri, 25 Oct 2019 03:58:58 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 25 Oct 2019 04:01:29 -0400 Fri, 25 Oct 2019 03:59:18 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.21.20.50
Hostname: kubeworker
Capacity:
cpu: 16
ephemeral-storage: 104846316Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16264568Ki
pods: 110
Allocatable:
cpu: 15900m
ephemeral-storage: 96626364666
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15912168Ki
pods: 110
System Info:
Machine ID: 0fe9ccd97b313b47afe0d301b5145d20
System UUID: A06FB514-89B8-46E5-8A39-3BB3606B6158
Boot ID: 47284b35-dd79-48f7-9511-89a7bdfc01a5
Kernel Version: 3.10.0-957.21.3.el7.x86_64
OS Image: Red Hat Enterprise Linux Server 7.6 (Maipo)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.4
Kubelet Version: v1.16.0
Kube-Proxy Version: v1.16.0
PodCIDR: 10.233.65.0/24
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-dpg8x 150m (0%) 300m (1%) 64M (0%) 500M (3%) 2m26s
kube-system kube-proxy-mrrvb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m37s
kube-system nginx-proxy-kubeworker 25m (0%) 0 (0%) 32M (0%) 0 (0%) 2m36s
kube-system nodelocaldns-kq8kh 100m (0%) 0 (0%) 70Mi (0%) 170Mi (1%) 109s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 275m (1%) 300m (1%)
memory 169400320 (1%) 678257920 (4%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 2m36s kubelet, kubeworker Starting kubelet.
Normal NodeHasSufficientMemory 2m36s (x2 over 2m36s) kubelet, kubeworker Node kubeworker status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m36s (x2 over 2m36s) kubelet, kubeworker Node kubeworker status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m36s (x2 over 2m36s) kubelet, kubeworker Node kubeworker status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 2m36s kubelet, kubeworker Updated Node Allocatable limit across pods
Normal Starting 2m35s kube-proxy, kubeworker Starting kube-proxy.
Normal NodeReady 2m16s kubelet, kubeworker Node kubeworker status is now: NodeReady
[root@kubemaster kubespray]#
- Both the master node and the worker node carry the node.cloudprovider.kubernetes.io/uninitialized taint, and the master node additionally carries node-role.kubernetes.io/master:NoSchedule, which keeps application pods off the master.
Kubernetes sets the following built-in taints to reflect node conditions and control pod placement:
node.kubernetes.io/not-ready: The node is not ready. This corresponds to the NodeCondition Ready being "False".
node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
node.kubernetes.io/out-of-disk: The node has run out of disk space.
node.kubernetes.io/memory-pressure: The node is under memory pressure.
node.kubernetes.io/disk-pressure: The node is under disk pressure.
node.kubernetes.io/network-unavailable: The node's network is unavailable.
node.kubernetes.io/unschedulable: The node is unschedulable (e.g., cordoned).
node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with an "external" cloud provider, this taint is set on the node to mark it as unusable. Once a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.
In our case, no toleration for the node.cloudprovider.kubernetes.io/uninitialized taint is specified, so calico cannot be started on the master node.
Therefore, we must either remove the taint or add a matching toleration.
To summarize the diagnosis based on what we have confirmed so far: the scheduler tried to place the calico pod on the tainted master node, but because no matching toleration was specified, the pod could not be scheduled.
To resolve this, we can either add a toleration to the pod or remove the taint from the node; the general syntax is sketched below.
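For reference, the general form of the taint command is as follows; a trailing '-' after the effect removes the taint instead of adding it, and the node, key, and value names are placeholders:
kubectl taint nodes [node_name] [key]=[value]:NoSchedule
kubectl taint nodes [node_name] [key]=[value]:NoSchedule-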
6. The quickest fix is to remove the taints with kubectl taint nodes kubemaster node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule-.
[root@kubemaster kubespray]# kubectl taint nodes kubemaster node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule-
node/kubemaster untainted
[root@kubemaster kubespray]# kubectl taint nodes kubemaster node-role.kubernetes.io/master:NoSchedule-
node/kubemaster untainted
[root@kubemaster kubespray]# kubectl taint nodes kubeworker node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule-
node/kubeworker untainted
[root@kubemaster kubespray]#
Once the taints are removed, the pods start up as shown below.
[root@kubemaster kubespray]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-6c954486fd-2j5xg 1/1 Running 0 4m8s
kube-system calico-node-dpg8x 1/1 Running 1 4m26s
kube-system calico-node-mvbkm 1/1 Running 1 4m26s
kube-system coredns-7ff7988d-c74gd 1/1 Running 2 3m55s
kube-system coredns-7ff7988d-fnp5x 0/1 Error 0 43s
kube-system dns-autoscaler-748467c887-g7flm 1/1 Running 0 3m51s
kube-system kube-apiserver-kubemaster 1/1 Running 0 5m30s
kube-system kube-controller-manager-kubemaster 1/1 Running 0 5m30s
kube-system kube-proxy-cwtdz 1/1 Running 0 5m50s
kube-system kube-proxy-mrrvb 1/1 Running 0 4m37s
kube-system kube-scheduler-kubemaster 1/1 Running 0 5m30s
kube-system kubernetes-dashboard-76b866f6cf-vrrqg 1/1 Running 0 3m47s
kube-system nginx-proxy-kubeworker 1/1 Running 0 4m36s
kube-system nodelocaldns-kq8kh 1/1 Running 0 3m49s
kube-system nodelocaldns-lwqqk 1/1 Running 0 3m49s
[root@kubemaster kubespray]#
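While waiting for the pods to come up, the transition can also be watched live with kubectl's standard watch option (mentioned only as a convenience):
kubectl get pods --all-namespaces -w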
Of course, the preferred solution is to add the proper tolerations rather than removing the taints; we will cover this in more detail in a later post.
As a brief example, tolerations for the taints can be added to the pod spec in the following form to allow scheduling:
...
...
tolerations:
  # This taint is set by all kubelets running `--cloud-provider=external`
  # so we should tolerate it to schedule the calico pods
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
  # This, along with the annotation above marks this pod as a critical add-on.
  - key: CriticalAddonsOnly
    operator: Exists
...
...
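In this environment the calico manifests are generated by kubespray, so the exact file to edit may differ; as a rough sketch, the tolerations above could also be added to the running object directly, for example:
kubectl edit deployment calico-kube-controllers -n kube-system
and then inserting the tolerations block under spec.template.spec.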
The commands we reviewed in this post:
a) kubectl get pods -o wide --all-namespaces
b) kubectl describe pod calico-kube-controllers-6c954486fd-tmlnl -n kube-system
c) kubectl logs calico-kube-controllers-6c954486fd-tmlnl -n kube-system
d) kubectl get events -n kube-system
e) kubectl describe node [node_name]
There are many other kubectl commands beyond these.
At times you will also need to look beyond pods at deployments, replicasets, services, and so on; containers started directly as Docker processes may need to be checked with docker logs; and there are many other points to examine, such as /var/log/messages, the files under /etc/kubernetes, certificates, and more.
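A few node-level commands that often come up in that kind of digging are sketched below; they assume a systemd-managed kubelet and a Docker runtime, as in this environment, and the container name and ID are placeholders:
journalctl -u kubelet -f
docker ps | grep [container_name]
docker logs [container_id]
tail -f /var/log/messages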
The case we walked through today is just one of the many issues that can occur in Kubernetes.
Making this particular issue your own is worthwhile, but what matters more is thinking again about the process you follow to troubleshoot when a problem occurs.