k8s cluster

Common commands

Abbreviations

certificatesigningrequests (short name: csr)
componentstatuses (short name: cs)
configmaps (short name: cm)
customresourcedefinition (short name: crd)
daemonsets (short name: ds)
deployments (short name: deploy)
endpoints (short name: ep)
events (short name: ev)
horizontalpodautoscalers (short name: hpa)
ingresses (short name: ing)
limitranges (short name: limits)
namespaces (short name: ns)
networkpolicies (short name: netpol)
nodes (short name: no)
persistentvolumeclaims (short name: pvc)
persistentvolumes (short name: pv)
poddisruptionbudgets (short name: pdb)
pods (short name: po)
podsecuritypolicies (short name: psp)
replicasets (short name: rs)
replicationcontrollers (short name: rc)
resourcequotas (short name: quota)
serviceaccounts (short name: sa)
services (short name: svc)
statefulsets (short name: sts)
storageclasses (short name: sc)
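
These short names can be used anywhere the full resource name is accepted. As a quick illustration (the namespace is just an example), the following two commands are equivalent:

kubectl get persistentvolumeclaims -n default
kubectl get pvc -n default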

Auto-completion

sudo apt install bash-completion

source /usr/share/bash-completion/bash_completion
source <(kubectl completion bash)

echo "source <(kubectl completion bash)" >> ~/.bashrc

The above is for bash (/usr/bin/bash); if your login shell is zsh (/usr/bin/zsh), use the zsh completion instead.
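
A minimal zsh setup, assuming kubectl is already on the PATH:

echo "source <(kubectl completion zsh)" >> ~/.zshrc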

cs (master nodes)

componentstatuses

cs@debian:~$ kubectl get cs
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-2 Healthy {"health": "true"}
etcd-0 Healthy {"health": "true"}
etcd-1 Healthy {"health": "true"}

Nodes

cs@debian:~$ kubectl  get node
NAME STATUS ROLES AGE VERSION
master02 Ready <none> 101d v1.18.8
master03 Ready <none> 101d v1.18.8
node04 Ready <none> 101d v1.18.8
node05 Ready <none> 101d v1.18.8
node06 Ready <none> 101d v1.18.8

kubectl get node -o wide

Namespaces

cs@debian:~$ kubectl get namespaces 
NAME STATUS AGE
default Active 101d
devops Active 100d
kube-node-lease Active 101d
kube-public Active 101d
kube-system Active 101d
kubernetes-dashboard Active 92d

pod

cs@debian:~$ kubectl get pod
No resources found in default namespace.

cs@debian:~$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-56ff7bc666-prc6l 1/1 Running 4 11d
coredns-56ff7bc666-qwdsh 1/1 Running 8 92d
traefik-ingress-controller-7769cb875-x76rs 1/1 Running 1 46h

-n takes the NAME of a namespace; if omitted, default is used.

kubectl get pods -o wide
kubectl get pods -A -o wide

-o wide shows extra information about the resource

-A lists the resource across all namespaces

describe

Typing ingress and pressing Tab shows the completion candidates:

cs@debian:~$ kubectl get ingress -n devops 
ingressclasses.networking.k8s.io ingresses.networking.k8s.io ingressroutetcps.traefik.containo.us
ingresses.extensions ingressroutes.traefik.containo.us ingressrouteudps.traefik.containo.us

cs@debian:~$ kubectl get ingressroutetcps.traefik.containo.us -n devops
NAME AGE
redis 47h

cs@debian:~$ kubectl describe ingressroutetcps.traefik.containo.us redis -n devops
Name: redis
Namespace: devops
Labels: <none>
Annotations: API Version: traefik.containo.us/v1alpha1
Kind: IngressRouteTCP
Metadata:
Creation Timestamp: 2022-07-19T13:01:37Z
Generation: 1
Managed Fields:
API Version: traefik.containo.us/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:entryPoints:
f:routes:
Manager: kubectl
Operation: Update
Time: 2022-07-19T13:01:37Z
Resource Version: 261095
Self Link: /apis/traefik.containo.us/v1alpha1/namespaces/devops/ingressroutetcps/redis
UID: c79681ee-1bf1-4843-9666-457970d78f27
Spec:
Entry Points:
redis
Routes:
Match: HostSNI(`*`)
Services:
Name: redis-service
Port: 6379
Events: <none>

log

cs@debian:~$ kubectl logs  --tail=5 redis-app-1 -n devops
63:M 21 Jul 2022 12:15:34.720 * Synchronization with replica 121.21.25.3:6379 succeeded
63:M 21 Jul 2022 12:15:36.469 # Cluster state changed: ok
63:M 21 Jul 2022 12:15:40.604 * FAIL message received from 67f931358b8004268db0b57932293602ab3de629 about b49a123e2764b665ee898c21f983b9cda70cda00
63:M 21 Jul 2022 12:15:42.635 * Marking node a0e2f50ba382870da1ce4d23b66a1375826d6dc8 as failing (quorum reached).
63:M 21 Jul 2022 12:15:42.635 # Cluster state changed: fail

cs@debian:~$ kubectl logs -f --tail=5 redis-app-1 -n devops
63:M 21 Jul 2022 12:15:34.720 * Synchronization with replica 121.21.25.3:6379 succeeded
63:M 21 Jul 2022 12:15:36.469 # Cluster state changed: ok
63:M 21 Jul 2022 12:15:40.604 * FAIL message received from 67f931358b8004268db0b57932293602ab3de629 about b49a123e2764b665ee898c21f983b9cda70cda00
63:M 21 Jul 2022 12:15:42.635 * Marking node a0e2f50ba382870da1ce4d23b66a1375826d6dc8 as failing (quorum reached).
63:M 21 Jul 2022 12:15:42.635 # Cluster state changed: fail
^C

-f behaves like tail -f

-p, --previous[=false]: if true, print the logs of the previously terminated container in the pod
--since=0: only return logs newer than a relative duration such as 5s, 2m, or 3h; defaults to all logs. Only one of --since and --since-time may be used at a time.
--since-time="": only return logs after the given time (RFC3339 format); defaults to all logs. Only one of --since and --since-time may be used at a time.
--tail=-1: number of most recent log lines to show; the default of -1 shows all lines.
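
A couple of hedged examples combining these flags with the pod used above (output omitted):

kubectl logs -p --tail=20 redis-app-1 -n devops
kubectl logs --since=10m redis-app-1 -n devops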

patch

Update a container's image

kubectl patch pod valid-pod -p '{"spec":{"containers":[{"name":"kubernetes-serve-hostname","image":"new image"}]}}'

kubectl patch pod valid-pod --type='json' -p='[{"op": "replace", "path": "/spec/containers/0/image", "value":"new image"}]'

Set the service's external IP

kubectl patch svc <svc-name> -n <namespace> -p '{"spec": {"type": "LoadBalancer", "externalIPs":["192.168.31.241"]}}'
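
To confirm the patch took effect, checking the service is enough (same placeholder names as above):

kubectl get svc <svc-name> -n <namespace> -o wide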

scale

Scale the number of replicas up or down.

A precondition can be checked before scaling: the current replica count (--current-replicas) or --resource-version.

Shrink the replica count to 2

kubectl scale rc rc-nginx-3 --replicas=2

If the current replica count is 2, scale it out to 3

kubectl scale --current-replicas=2 --replicas=3 deployment/mysql
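
Afterwards the replica count can be verified, for example with a jsonpath query against the deployment used above:

kubectl get deployment mysql -o jsonpath='{.spec.replicas}'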

port-forward

https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#port-forward

Forwards a local port to a port on the pod, which is handy for quick testing.

kubectl port-forward deployment/redis-master 6379:6379 
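
While the forward is running, a local client can reach the pod through localhost; for the Redis example this might look like (assumes redis-cli is installed locally):

redis-cli -h 127.0.0.1 -p 6379 ping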

Restart

kubectl rollout restart deployment <deployment_name> -n <namespace>
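
The restart can then be followed until it finishes rolling out:

kubectl rollout status deployment <deployment_name> -n <namespace>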

Selectors

--field-selector

status.podIP

cs@debian:~/oss/hexo$  kubectl  get pod --field-selector status.podIP=121.21.35.3 -o wide -n devops
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-app-1 1/1 Running 1 9d 121.21.35.3 node04 <none> <none>
  1. metadata.name=my-service
  2. metadata.namespace!=default
  3. status.phase=Pending
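
Each of these can be passed straight to --field-selector; for example, for the first one (the service name is illustrative):

kubectl get services --field-selector metadata.name=my-service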

Selects all Pods whose status.phase is not Running and whose spec.restartPolicy is Always:

kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always

events

Shows detailed events from the cluster. If something broke recently, reviewing the cluster events tells you what happened before and after the failure. If you know the problem is limited to a particular namespace, you can filter the events to that namespace, as shown after the listing below.

$ kubectl get events 
LAST SEEN TYPE REASON OBJECT MESSAGE
2d18h Normal Starting node/k8s Starting kubelet.
2d18h Normal NodeHasSufficientMemory node/k8s Node k8s status is now: NodeHasSufficientMemory
2d18h Normal NodeHasNoDiskPressure node/k8s Node k8s status is now: NodeHasNoDiskPressure
2d18h Normal NodeHasSufficientPID node/k8s Node k8s status is now: NodeHasSufficientPID
2d18h Normal NodeAllocatableEnforced node/k8s Updated Node Allocatable limit across pods
2d18h Normal Starting node/k8s Starting kubelet.
2d18h Normal NodeHasSufficientMemory node/k8s Node k8s status is now: NodeHasSufficientMemory
2d18h Normal NodeHasNoDiskPressure node/k8s Node k8s status is now: NodeHasNoDiskPressure
2d18h Normal NodeHasSufficientPID node/k8s Node k8s status is now: NodeHasSufficientPID
2d18h Normal NodeAllocatableEnforced node/k8s Updated Node Allocatable limit across pods
2d18h Normal NodeReady node/k8s Node k8s status is now: NodeReady
2d18h Normal RegisteredNode node/k8s Node k8s event: Registered Node k8s in Controller
2d18h Normal Starting node/k8s Starting kube-proxy.
2d18h Normal RegisteredNode node/k8s Node k8s event: Registered Node k8s in Controller
23m Normal Starting node/k8s Starting kubelet.
23m Normal NodeHasSufficientMemory node/k8s Node k8s status is now: NodeHasSufficientMemory
23m Normal NodeHasNoDiskPressure node/k8s Node k8s status is now: NodeHasNoDiskPressure
23m Normal NodeHasSufficientPID node/k8s Node k8s status is now: NodeHasSufficientPID
23m Normal NodeAllocatableEnforced node/k8s Updated Node Allocatable limit across pods
23m Normal Starting node/k8s Starting kube-proxy.
22m Normal RegisteredNode node/k8s Node k8s event: Registered Node k8s in Controller
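
To narrow the noise to a single namespace and order the events by time, something like this works (the namespace is a placeholder):

kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp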

api-resources

kubectl api-resources -o wide --sort-by name
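
A variation that can help when scripting is listing only namespaced resources by name (a sketch using standard flags):

kubectl api-resources --namespaced=true -o name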

Debugging and troubleshooting

Configure the startup flag --feature-gates=EphemeralContainers=true on the kube-apiserver and kubelet services, then restart them.

# find the node the pod runs on and the pod name
$ kubectl get po -o wide
# locate the corresponding docker container by pod name
$ docker ps | grep centos-687ff6c787-47gvh
# using the container id from the output, attach to the container's network and run a debug container with the nicolaka/netshoot image, which bundles many network debugging tools
$ docker run -it --rm --name=debug --network=container:bb009aab414f nicolaka/netshoot bash
This drops you into the same network namespace as the pod, where network-level debugging can be done.
for pod in $(kubectl get -o name pod  -n kube-system); 
do
kubectl debug --image security/pod_scanner -p $pod /scanner.sh
done

Runs a security-scanning script in bulk across every pod in a namespace without disturbing the original containers.

Without Ephemeral Containers enabled
kubectl debug mypod -it \
--container=debug \
--image=busybox \
--copy-to=my-debugger \
--same-node=true \
--share-processes=true

--copy-to: name of the new pod
--replace=true: whether to delete the original container
--same-node=true: whether to schedule the copy onto the same node as the original
--share-processes=true: whether to share the process (PID) namespace between containers

Using Ephemeral Containers
kubectl run ephemeral-demo --image=k8s.gcr.io/pause:3.1 --restart=Never
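
The pause image has no shell, so the usual next step (as in the upstream kubectl debug documentation) is to attach an ephemeral debug container to it, roughly:

kubectl debug -it ephemeral-demo --image=busybox --target=ephemeral-demo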

kubectl debug

# with kubectl v1.12.0 or later, this can be used directly:
kubectl debug -h

Debugging a Pod

❯ kubectl debug -it vault-1 -n vault --image=k8s.org/cs/netshoot  -- bash
Defaulting debug container name to debugger-z9zr4.
If you don't see a command prompt, try pressing enter.
vault-1:/root$

Service cannot be pinged

$  cat /var/lib/kubelet/config.yaml  | grep -A 1 DNS
clusterDNS:
- 10.96.1.10

$ zgrep "cluster-cidr\|cluster-ip" /etc/kubernetes/manifests/*
/etc/kubernetes/manifests/kube-apiserver.yaml: - --service-cluster-ip-range=10.96.0.0/12
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --cluster-cidr=121.21.0.0/16
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --service-cluster-ip-range=10.96.0.0/16

dial tcp: lookup vault-2.vault-internal on 121.21.0.0:53: read udp 121.21.64.55:53568->121.21.0.0:53 i/o timeout

Check /etc/resolv.conf inside the pod: its DNS nameserver should be the cluster DNS service clusterIP, which must fall inside the serviceSubnet.

It turned out that vault-1 could not ping vault-0.vault-internal or vault-2.vault-internal.

Pod cannot be pinged

Check the kube-flannel-cfg ConfigMap (its Network range should match the cluster's pod CIDR).
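
One way to inspect it, assuming flannel was deployed into kube-system (adjust the namespace if it lives elsewhere):

kubectl get cm kube-flannel-cfg -n kube-system -o yaml | grep -A 2 Network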

[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-0 vault-0.vault-internal:8201}" error="dial tcp 121.21.80.174:8201: connect: connection refused" term=679

[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-2 vault-2.vault-internal:8201}" error="dial tcp 121.21.48.151:8201: connect: connection refused" term=679

Cleaning up the debug Pod

Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node ....

❯ kubectl delete pod node-debugger-mynode-pdx84
nsenter
sudo yum install -y util-linux
$ docker inspect -f {{.State.Pid}} nginx
# use nsenter to enter the container's network namespace (6700 is the PID printed above)
$ nsenter -n -t 6700
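
The two steps are often collapsed into a single line (container name as in the example above):

nsenter -n -t $(docker inspect -f '{{.State.Pid}}' nginx)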

Certificates

Check expiration

kubeadm certs check-expiration

For a cluster built with kubeadm, the CA certificate is valid for 10 years by default, while the certificates used by the other components are valid for 1 year.

# back up first
cp -R /etc/kubernetes/pki /tmp/pki_backup
# run on every control-plane node
[root@k8s01 ~]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

systemctl restart kubelet

cp /etc/kubernetes/admin.conf $HOME/.kube/config

memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
error: You must be logged in to the server (the server has asked for the client to provide credentials), i.e. the client is not authenticated

❯ kubectl get nodes
E0422 14:21:12.120751 102845 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.126398 102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.130678 102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.132943 102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME STATUS ROLES AGE VERSION
k8s01 Ready control-plane 399d v1.26.1
k8s02 Ready control-plane 399d v1.26.1
k8s03 Ready control-plane 399d v1.26.1
k8s04 Ready <none> 393d v1.26.1
k8s05 Ready <none> 393d v1.26.1
k8s06 Ready <none> 393d v1.26.1
k8s07 Ready <none> 179d v1.26.1
k8s08 Ready <none> 178d v1.26.1
❯ kubectl top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

https://github.com/kubernetes-sigs/metrics-server/issues/157

❯ kubectl get pods  -o wide -A | grep kubernetes-dashboard
❯ kubectl top nodes

Backup

Backing up an entire Kubernetes cluster

1. Back up etcd

etcd is the core of a Kubernetes cluster: it stores all cluster state and configuration. Backing up etcd is the key step in making the cluster state recoverable.

  • Back up the etcd data with the etcdctl tool:

    etcdctl --endpoints <etcd-endpoints> snapshot save <backup-file-path>

Backup script

backup_etcd.sh
#!/bin/bash
set -e
exec >> /var/log/backup_etcd.log

# backup settings
time=`date "+%Y-%m-%d %H:%M:%S #"`
Endpoints="https://192.168.56.101:2379,https://192.168.56.102:2379,https://192.168.56.103:2379"
EXEC="/usr/bin/etcdctl"
SSL=/opt/k8s/ssl
BackupDir="/home/www/backup/etcd"
BackupFile="snapshot.db.$time"

echo "$time backup etcd start..."
export ETCDCTL_API=3
$EXEC --endpoints $Endpoints \
  --cacert=$SSL/ca.pem \
  --cert=$SSL/kubernetes.pem \
  --key=$SSL/kubernetes-key.pem \
  snapshot save "$BackupDir/$BackupFile"
echo "$time backup etcd end!"

# keep only the $num most recent snapshots
#find $BackupDir -name "snapshot.db.*" -type f -mtime +7 -exec rm -rf {} \; > /dev/null 2>&1
num=2
pre="$BackupDir/snapshot.db"
delf=`ls -lrt $pre.* | awk '{print $9}' | head -1`
count=`ls -lrt $pre.* | awk '{print $9}' | wc -l`
if [ $count -gt $num ]
then
  rm $delf
  echo "$time delete $delf"
fi

Restore script

restore.sh
#!/bin/bash

# backup file path
BACKUP_FILE="/var/backups/etcd/etcd_backup.db"
SSL=/opt/k8s/ssl

# get the IP of interface wlan0
currentip=`ip addr show wlan0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
etcd01=https://10.20.31.103
etcd02=https://10.20.31.104
etcd03=https://10.20.31.105

# work out which etcd member this host is
if [ "$currentip" == "$(echo $etcd01 | sed 's/https:\/\///')" ]; then
  currentname="etcd01"
elif [ "$currentip" == "$(echo $etcd02 | sed 's/https:\/\///')" ]; then
  currentname="etcd02"
elif [ "$currentip" == "$(echo $etcd03 | sed 's/https:\/\///')" ]; then
  currentname="etcd03"
else
  echo "currentip:$currentip mismatching."
  exit 1
fi

# stop the etcd service
systemctl stop etcd

# move the current data aside
[ -d "/var/lib/etcd" ] && { echo "backing up current data" && mv /var/lib/etcd /tmp/etcd_bak ;}
#mv /etc/kubernetes/manifests /tmp/manifests_bak   # e.g. to stop apiserver, controller-manager, scheduler

# restore from the snapshot
ETCDCTL_API=3 etcdctl \
  --cacert=$SSL/ca.pem \
  --cert=$SSL/kubernetes.pem \
  --key=$SSL/kubernetes-key.pem \
  snapshot restore "$BACKUP_FILE" \
  --name $currentname \
  --initial-cluster=etcd01=$etcd01:2380,etcd02=$etcd02:2380,etcd03=$etcd03:2380 \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://$currentip:2380 \
  --data-dir=/var/lib/etcd

# check whether the restore succeeded
if [ $? -eq 0 ]; then
  echo "Restore successful."
else
  echo "Restore failed."
  exit 1
fi

# start etcd again
systemctl start etcd

# check the etcd cluster status
ETCDCTL_API=3 etcdctl --endpoints="$etcd01:2379,$etcd02:2379,$etcd03:2379" \
  --cacert=$SSL/ca.pem \
  --cert=$SSL/kubernetes.pem \
  --key=$SSL/kubernetes-key.pem \
  endpoint status --write-out=table

2. Back up Kubernetes resources

Back up all the Kubernetes resource objects in the cluster, such as Deployments, Services, ConfigMaps, Secrets, and so on.

  • Export all resources with kubectl:

    kubectl get all --all-namespaces -o yaml > cluster-resources.yaml
  • Back up a single namespace:

kubectl get all -n <namespace> -o yaml > namespace-backup.yaml
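
Restoring from such a dump is just applying it back; a minimal sketch (cluster-scoped objects and CRDs may need separate handling):

kubectl apply -f namespace-backup.yaml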

3. Back up persistent data

For StatefulSets and other stateful applications, the data in the persistent volumes (PVs) also needs to be backed up.

  • Follow the documentation of the underlying storage system (AWS EBS, Azure Disk, GCP Persistent Disk, etc.) to take the backup.

4. Back up configuration on the nodes

Back up the Kubernetes configuration files on every node, such as the kubelet and kube-proxy configuration files.

5. Back up the network configuration

Back up the cluster's network configuration, such as CNI plugin configuration and network policies.

6. Use a dedicated backup tool

A dedicated Kubernetes backup tool such as Velero (formerly Heptio Ark) provides automated backup and restore workflows.

  • Install Velero and configure a backup policy:

    velero install
    velero backup create <backup-name> --include-namespaces=<namespaces>
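
Restoring is symmetric; with the backup name created above:

velero restore create --from-backup <backup-name>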

7. Test restores regularly

Run restore tests on a regular basis to confirm that the backups are valid and the restore procedure works.

8. Monitoring and logging

Make sure the backup process has proper monitoring and logging, so that problems can be located and resolved quickly.

Notes:

  • Make sure backup operations do not interfere with normal cluster operation.
  • Consider backup security: encrypt the backups and store them safely.
  • Consider the storage cost and choose an appropriate storage solution.
  • Consider the recovery time objective (RTO) and recovery point objective (RPO) of the backups.