2022-07-21

k8s集群

常用命令

缩写

certificatesigningrequests (缩写 csr)
componentstatuses (缩写 cs)
configmaps (缩写 cm)
customresourcedefinition (缩写 crd)
daemonsets (缩写 ds)
deployments (缩写 deploy)
endpoints (缩写 ep)
events (缩写 ev)
horizontalpodautoscalers (缩写 hpa)
ingresses (缩写 ing)
limitranges (缩写 limits)
namespaces (缩写 ns)
networkpolicies (缩写 netpol)
nodes (缩写 no)
persistentvolumeclaims (缩写 pvc)
persistentvolumes (缩写 pv)
poddisruptionbudgets (缩写 pdb)
pods (缩写 po)
podsecuritypolicies (缩写 psp)
replicasets (缩写 rs)
replicationcontrollers (缩写 rc)
resourcequotas (缩写 quota)
serviceaccounts (缩写 sa)
services (缩写 svc)
statefulsets (缩写 sts)
storageclasses (缩写 sc)

自动补全

sudo apt install bash-completion

source /usr/share/bash-completion/bash_completion
source <(kubectl completion bash)

echo "source <(kubectl completion bash)" >> ~/.bashrc

/usr/bin/zsh /usr/bin/bash

cs(master节点)

componentstatuses

cs@debian:~$ kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok                   
controller-manager   Healthy   ok                   
etcd-2               Healthy   {"health": "true"}   
etcd-0               Healthy   {"health": "true"}   
etcd-1               Healthy   {"health": "true"}

node节点

cs@debian:~$ kubectl  get node
NAME       STATUS   ROLES    AGE    VERSION
master02   Ready    <none>   101d   v1.18.8
master03   Ready    <none>   101d   v1.18.8
node04     Ready    <none>   101d   v1.18.8
node05     Ready    <none>   101d   v1.18.8
node06     Ready    <none>   101d   v1.18.8

kubectl get node -o wide

空间

cs@debian:~$ kubectl get namespaces 
NAME                   STATUS   AGE
default                Active   101d
devops                 Active   100d
kube-node-lease        Active   101d
kube-public            Active   101d
kube-system            Active   101d
kubernetes-dashboard   Active   92d

pod

cs@debian:~$ kubectl get pod
No resources found in default namespace.

cs@debian:~$ kubectl get pod -n kube-system 
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-56ff7bc666-prc6l                     1/1     Running   4          11d
coredns-56ff7bc666-qwdsh                     1/1     Running   8          92d
traefik-ingress-controller-7769cb875-x76rs   1/1     Running   1          46h

-n 接namespaces的NAME值,省略为default

1 2	kubectl get pods -o wide kubectl get pods -A -o wide

-o wide 显示资源的额外信息

-A 所有pod资源

describe

ingress Tab提示

cs@debian:~$ kubectl get ingress -n devops 
ingressclasses.networking.k8s.io      ingresses.networking.k8s.io           ingressroutetcps.traefik.containo.us  
ingresses.extensions                  ingressroutes.traefik.containo.us     ingressrouteudps.traefik.containo.us  

cs@debian:~$ kubectl get ingressroutetcps.traefik.containo.us -n devops 
NAME    AGE
redis   47h

cs@debian:~$ kubectl describe ingressroutetcps.traefik.containo.us redis -n devops 
Name:         redis
Namespace:    devops
Labels:       <none>
Annotations:  API Version:  traefik.containo.us/v1alpha1
Kind:         IngressRouteTCP
Metadata:
  Creation Timestamp:  2022-07-19T13:01:37Z
  Generation:          1
  Managed Fields:
    API Version:  traefik.containo.us/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:entryPoints:
        f:routes:
    Manager:         kubectl
    Operation:       Update
    Time:            2022-07-19T13:01:37Z
  Resource Version:  261095
  Self Link:         /apis/traefik.containo.us/v1alpha1/namespaces/devops/ingressroutetcps/redis
  UID:               c79681ee-1bf1-4843-9666-457970d78f27
Spec:
  Entry Points:
    redis
  Routes:
    Match:  HostSNI(`*`)
    Services:
      Name:  redis-service
      Port:  6379
Events:      <none>

log

cs@debian:~$ kubectl logs  --tail=5 redis-app-1 -n devops
63:M 21 Jul 2022 12:15:34.720 * Synchronization with replica 121.21.25.3:6379 succeeded
63:M 21 Jul 2022 12:15:36.469 # Cluster state changed: ok
63:M 21 Jul 2022 12:15:40.604 * FAIL message received from 67f931358b8004268db0b57932293602ab3de629 about b49a123e2764b665ee898c21f983b9cda70cda00
63:M 21 Jul 2022 12:15:42.635 * Marking node a0e2f50ba382870da1ce4d23b66a1375826d6dc8 as failing (quorum reached).
63:M 21 Jul 2022 12:15:42.635 # Cluster state changed: fail

cs@debian:~$ kubectl logs  -f  --tail=5 redis-app-1 -n devops
63:M 21 Jul 2022 12:15:34.720 * Synchronization with replica 121.21.25.3:6379 succeeded
63:M 21 Jul 2022 12:15:36.469 # Cluster state changed: ok
63:M 21 Jul 2022 12:15:40.604 * FAIL message received from 67f931358b8004268db0b57932293602ab3de629 about b49a123e2764b665ee898c21f983b9cda70cda00
63:M 21 Jul 2022 12:15:42.635 * Marking node a0e2f50ba382870da1ce4d23b66a1375826d6dc8 as failing (quorum reached).
63:M 21 Jul 2022 12:15:42.635 # Cluster state changed: fail
^C

-f 类似 tail -f

-p, –previous[=false]: 如果为true，输出pod中曾经运行过，但目前已终止的容器的日志
–since=0: 仅返回相对时间范围，如5s、2m或3h，之内的日志。默认返回所有日志。只能同时使用since 和since-time中的一种
–since-time=””: 仅返回指定时间（RFC3339格式）之后的日志。默认返回所有日志。只能同时使用since和since-time中的一种
–tail=-1: 要显示的最新的日志条数。默认为-1，显示所有的日志

patch

更新容器的镜像

1
2
3

kubectl patch pod valid-pod -p '{"spec":{"containers":[{"name":"kubernetes-serve-hostname","image":"new image"}]}}'
或
kubectl patch pod valid-pod --type='json' -p='[{"op": "replace", "path": "/spec/containers/0/image", "value":"new image"}]'

设置服务对外的IP

1	kubectl patch svc <svc-name> -n <namespace> -p '{"spec": {"type": "LoadBalancer", "externalIPs":["192.168.31.241"]}}'

scale

对副本数进行扩展或缩小

前提条件校验；当前副本数量或 --resource-version

缩减副本数到2

1	kubectl scale rc rc-nginx-3 —replicas=2

当前副本数为2，则将其扩展至3

1	kubectl scale --current-replicas=2 --replicas=3 deployment/mysql

port-forward

https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#port-forward

把本地的port 映射到pod的port,便于测试效果

1	kubectl port-forward deployment/redis-master 6379:6379

重启

1	kubectl rollout restart deployment <deployment_name> -n <namespace>

选择器

–field-selector

status.podIP

1
2
3

cs@debian:~/oss/hexo$  kubectl  get pod --field-selector status.podIP=121.21.35.3 -o wide -n devops
NAME          READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
redis-app-1   1/1     Running   1          9d    121.21.35.3   node04   <none>           <none>

metadata.name=my-service

metadata.namespace!=default

status.phase=Pending

选择了所有**status.phase不为Running且spec.restartPolicy为Always**的Pod.

1	kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always

events

显示集群内的详细事件，如果最近出现故障，你可以查看集群事件以了解故障前后发生的情况。如果你知道只有特定名称空间中存在问题，你可以将事件过滤到该名称空间。

$ kubectl get events 
LAST SEEN   TYPE     REASON                    OBJECT     MESSAGE
2d18h       Normal   Starting                  node/k8s   Starting kubelet.
2d18h       Normal   NodeHasSufficientMemory   node/k8s   Node k8s status is now: NodeHasSufficientMemory
2d18h       Normal   NodeHasNoDiskPressure     node/k8s   Node k8s status is now: NodeHasNoDiskPressure
2d18h       Normal   NodeHasSufficientPID      node/k8s   Node k8s status is now: NodeHasSufficientPID
2d18h       Normal   NodeAllocatableEnforced   node/k8s   Updated Node Allocatable limit across pods
2d18h       Normal   Starting                  node/k8s   Starting kubelet.
2d18h       Normal   NodeHasSufficientMemory   node/k8s   Node k8s status is now: NodeHasSufficientMemory
2d18h       Normal   NodeHasNoDiskPressure     node/k8s   Node k8s status is now: NodeHasNoDiskPressure
2d18h       Normal   NodeHasSufficientPID      node/k8s   Node k8s status is now: NodeHasSufficientPID
2d18h       Normal   NodeAllocatableEnforced   node/k8s   Updated Node Allocatable limit across pods
2d18h       Normal   NodeReady                 node/k8s   Node k8s status is now: NodeReady
2d18h       Normal   RegisteredNode            node/k8s   Node k8s event: Registered Node k8s in Controller
2d18h       Normal   Starting                  node/k8s   Starting kube-proxy.
2d18h       Normal   RegisteredNode            node/k8s   Node k8s event: Registered Node k8s in Controller
23m         Normal   Starting                  node/k8s   Starting kubelet.
23m         Normal   NodeHasSufficientMemory   node/k8s   Node k8s status is now: NodeHasSufficientMemory
23m         Normal   NodeHasNoDiskPressure     node/k8s   Node k8s status is now: NodeHasNoDiskPressure
23m         Normal   NodeHasSufficientPID      node/k8s   Node k8s status is now: NodeHasSufficientPID
23m         Normal   NodeAllocatableEnforced   node/k8s   Updated Node Allocatable limit across pods
23m         Normal   Starting                  node/k8s   Starting kube-proxy.
22m         Normal   RegisteredNode            node/k8s   Node k8s event: Registered Node k8s in Controller

api-resources

1	kubectl api-resources -o wide --sort-by name

调试故障排除

启动参数--feature-gates=EphemeralContainers=true配置到kube-api和kubelet服务上重启

# 查看pod所在宿主及pod name
$ kubectl get po -o wide
# 根据pod name查看对应的docker 容器
$ docker ps | grep centos-687ff6c787-47gvh
# 根据输出的容器id，挂载容器网络并运行一个debug容器，使用 nicolaka/netshoot 这个镜像。这个镜像里集成了很多网络调试工具。
$ docker run -it --rm --name=debug --network=container:bb009aab414f nicolaka/netshoot bash
接下来就进入了与这个pod的相同的网络namespace，可以进行网络相关的调试了

for pod in $(kubectl get -o name pod  -n kube-system); 
do
    kubectl debug --image security/pod_scanner -p $pod /sanner.sh
done

批量跑某个命名空间下的安全扫描的脚本而不用干扰原容器

没有开启Ephemeral Containers

kubectl debug mypod -it \
--container=debug \
--image=busybox \
--copy-to=my-debugger \
--same-node=true \
--share-processes=true

–copy-to 指定新pod的名称
–replace=true 是否删除原容器
–same-node=true 是否调度到和原容器一样的node上
–share-processes=true 是否共享容器pid空间

利用Ephemeral Containers

1	kubectl run ephemeral-demo --image=k8s.gcr.io/pause:3.1 --restart=Never

kubectl debug

1 2	#kubectl v1.12.0 或更高的版本, 可以直接使用: kubectl debug -h

调试 Pod

❯ kubectl debug -it vault-1 -n vault --image=k8s.org/cs/netshoot  -- bash
Defaulting debug container name to debugger-z9zr4.
If you don't see a command prompt, try pressing enter.
vault-1:/root$

service ping不通

$  cat /var/lib/kubelet/config.yaml  | grep -A 1 DNS
clusterDNS:
- 10.96.1.10

$ zgrep "cluster-cidr\|cluster-ip" /etc/kubernetes/manifests/*
/etc/kubernetes/manifests/kube-apiserver.yaml:    - --service-cluster-ip-range=10.96.0.0/12
/etc/kubernetes/manifests/kube-controller-manager.yaml:    - --cluster-cidr=121.21.0.0/16
/etc/kubernetes/manifests/kube-controller-manager.yaml:    - --service-cluster-ip-range=10.96.0.0/16

dial tcp: lookup vault-2.vault-internal on 121.21.0.0:53: read udp 121.21.64.55:53568->121.21.0.0:53 i/o timeout

cat /etc/resolv.conf dns（service clusterIP跟serviceSubnet）

发现vault-1 跟 vault-0.vault-internal ，vault-2.vault-internal都ping不通

pod ping不通

1 2	kube-flannel-cfg

[ERROR] storage.raft: failed to make requestVote RPC: target=”{Voter vault-0 vault-0.vault-internal:8201}” error=”dial tcp 121.21.80.174:8201: connect: connection refused” term=679

[ERROR] storage.raft: failed to make requestVote RPC: target=”{Voter vault-2 vault-2.vault-internal:8201}” error=”dial tcp 121.21.48.151:8201: connect: connection refused” term=679

清理调试 Pod

1
2
3

Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node ....

❯kubectl delete pod node-debugger-mynode-pdx84

nsenter

1	sudo yum install -y util-linux

1
2
3

$ docker inspect -f {{.State.Pid}} nginx
#nsenter命令进入该容器的网络命令空间
$ nsenter -n -t6700

证书

检查有效期

1	kubeadm certs check-expiration

使用kubeadm搭建的集群，默认CA证书的有效期为10年，其他组件访问证书的有效期为1年

 #备份
 cp -R /etc/kubernetes/pki   /tmp/pki_backup
#在所有控制平面节点执行
[root@k8s01 ~]# kubeadm  certs renew all 
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

1
2
3

systemctl restart kubelet

cp /etc/kubernetes/admin.conf $HOME/.kube/config

memcache.go:265] couldn’t get current server API group list: the server has asked for the client to provide credentials
error: You must be logged in to the server (the server has asked for the client to provide credentials) 没有认证

❯ kubectl get nodes
E0422 14:21:12.120751  102845 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.126398  102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.130678  102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0422 14:21:12.132943  102845 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME    STATUS   ROLES           AGE    VERSION
k8s01   Ready    control-plane   399d   v1.26.1
k8s02   Ready    control-plane   399d   v1.26.1
k8s03   Ready    control-plane   399d   v1.26.1
k8s04   Ready    <none>          393d   v1.26.1
k8s05   Ready    <none>          393d   v1.26.1
k8s06   Ready    <none>          393d   v1.26.1
k8s07   Ready    <none>          179d   v1.26.1
k8s08   Ready    <none>          178d   v1.26.1
❯ kubectl  top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

https://github.com/kubernetes-sigs/metrics-server/issues/157

1 2	❯ kubectl get pods -o wide -A \| grep kubernetes-dashboard ❯ kubectl top nodes

备份

备份整个Kubernetes集群

1. 备份etcd

etcd是Kubernetes集群的核心，存储了所有的集群状态和配置信息。备份etcd是确保可以恢复集群状态的关键步骤。

使用etcdctl工具备份etcd数据：

1	etcdctl --endpoints <etcd-endpoints> snapshot save <backup-file-path>

备份脚本

backup_etcd.sh

backup_etcd.sh
#!/bin/bash
set -e
exec >> /var/log/backup_etcd.log


time=`date "+%Y-%m-%d %H:%M:%S #"`
Endpoints="https://192.168.56.101:2379,https://192.168.56.102:2379,https://192.168.56.103:2379"
EXEC="/usr/bin/etcdctl"
SSL=/opt/k8s/ssl
BackupDir="/home/www/backup/etcd"
BackupFile="snapshot.db.$time"


echo "$time backup etcd start..."


export ETCDCTL_API=3
$EXEC --endpoints $Endpoints \
     --cacert=$SSL/ca.pem \
     --cert=$SSL/kubernetes.pem \
     --key=$SSL/kubernetes-key.pem \
     snapshot save  $BackupDir/$BackupFile


echo  "$time backup etcd end!"


#find $BackupDir -name "snapshot.db.*" -type f -mtime +7 -exec rm -rf {} \; > /dev/null 2>&1
num=2
pre="$BackupDir/snapshot.db"
delf=`ls -lrt $pre.*  | awk '{print $9 }' | head -1`
count=`ls -lrt $pre.*  | awk '{print $9 }' | wc -l`
if [ $count -gt $num ] 
then 
  rm $delf  
  echo "$time delete $delf" 
fi

恢复脚本

restore.sh

restore
#!/bin/bash


# 定义备份文件路径
BACKUP_FILE="/var/backups/etcd/etcd_backup.db"
SSL=/opt/k8s/ssl


#获取网卡ip wlan0
currentip=`ip addr show wlan0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
etcd01=https://10.20.31.103
etcd02=https://10.20.31.104
etcd03=https://10.20.31.105


if [ "$currentip" == "$(echo $etcd01 | sed 's/https:\/\///')" ]; then
    currentname="etcd01"
elif [ "$currentip" == "$(echo $etcd02 | sed 's/https:\/\///')" ]; then
    currentname="etcd02"
elif [ "$currentip" == "$(echo $etcd03 | sed 's/https:\/\///')" ]; then
    currentname="etcd03"
else
    echo "currentip:$currentip mismatching."
    exit 1
fi


# 停止 etcd 服务
systemctl stop etcd


# 备份当前数据
[ -d "/var/lib/etcd" ] &&  {   echo "备份当前数据" &&  mv  /var/lib/etcd  /tmp/etcd_bak  ;} 
#mv /etc/kubernetes/manifests  /tmp/manifests_bak  #如apiserver、controller-manager、scheduler停止


# 恢复备份
ETCDCTL_API=3  etcdctl  \
     --cacert=$SSL/ca.pem \
     --cert=$SSL/kubernetes.pem \
     --key=$SSL/kubernetes-key.pem \
     snapshot restore "$BACKUP_FILE" \
     --name $currentname \
  --initial-cluster=etcd01=$etcd01:2380,etcd02=$etcd02:2380,etcd03=$etcd03:2380 \
--initial-cluster-token=etcd-cluster \
--initial-advertise-peer-urls=https://$currentip:2380 \
   --data-dir=/var/lib/etcd


# 检查恢复是否成功
if [ $? -eq 0 ]; then
    echo "Restore successful."
else
    echo "Restore failed."
    exit 1
fi


# 重新启动 etcd 服务
systemctl start etcd


#检查 etcd 集群状态
ETCDCTL_API=3  etcdctl --endpoints="$etcd01:2379,$etcd02:2379,$etcd03:2379" \
     --cacert=$SSL/ca.pem \
     --cert=$SSL/kubernetes.pem \
     --key=$SSL/kubernetes-key.pem \
endpoint status --write-out=table

2. 备份Kubernetes资源

备份集群中所有的Kubernetes资源对象，如Deployments、Services、ConfigMaps、Secrets等。

使用kubectl命令导出所有资源：

1	kubectl get --all-namespaces -o yaml > cluster-resources.yaml

备份某个ns

1	kubectl get all -n <namespace> -o yaml > namespace-backup.yaml

3. 备份持久化数据

对于StatefulSets和有状态的应用，需要备份持久化卷（PV）中的数据。

根据底层存储系统（如AWS EBS、Azure Disk、GCP Persistent Disk等）的文档进行备份。

4. 备份节点上的配置

备份每个节点上的Kubernetes配置文件，如kubelet配置文件、kube-proxy配置文件等。

5. 备份网络配置

备份集群的网络配置，如CNI插件配置、网络策略等。

6. 使用专门的备份工具

使用专门的Kubernetes备份工具，如Velero（Heptio Ark），可以提供自动化的备份和恢复流程。

安装Velero并配置备份策略：

1 2	velero install velero backup create <backup-name> --include-namespaces=<namespaces>

7. 定期测试恢复

定期执行恢复测试，确保备份数据的有效性和恢复流程的正确性。

8. 监控和日志

确保备份过程有适当的监控和日志记录，以便在出现问题时能够快速定位和解决。

注意事项：

确保备份操作不会干扰集群的正常运行。
考虑备份的安全性，确保备份数据的加密和安全存储。
考虑备份的存储成本，选择合适的存储解决方案。
考虑备份的恢复时间目标（RTO）和数据恢复点目标（RPO）

本文标题:k8s集群

文章作者: 阿帅

发布时间:2022年07月21日 - 20时24分

最后更新:2024年07月25日 - 17时06分

原始链接:https://chengshea.github.io/linux/k8s/k8s01/

许可协议: "署名-非商用-相同方式共享 3.0" 转载请保留原文链接及作者。

点击打赏