Kubernetes ships with the HPA (Horizontal Pod Autoscaler) for scaling workloads, with CPU and memory metrics supported out of the box. The original HPA was based on Heapster and has no built-in support for GPU metrics, but it can be extended through the Custom Metrics API. By deploying the Prometheus Adapter as a custom metrics server, we can register Prometheus metrics with the API aggregation layer so that the HPA can consume them. With the right configuration, the HPA then uses these custom metrics as its scaling signal, which makes elastic scaling on GPU metrics possible.
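Once everything below is in place, the HPA controller reads GPU metrics through the aggregated Custom Metrics API; you can hit the same endpoint yourself as a sanity check (a minimal sketch, python3 -m json.tool is only used for pretty-printing):
#List all metrics currently exposed by the custom metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | python3 -m json.tool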
Prepare the GPU servers in the Kubernetes cluster
#List the cluster nodes
kubectl get nodes
NAME                    STATUS   ROLES    AGE    VERSION
master-11               Ready    master   466d   v1.18.20
master-12               Ready    master   466d   v1.18.20
master-13               Ready    master   466d   v1.18.20
slave-gpu-103           Ready
slave-gpu-105           Ready
slave-gpu-109           Ready
slave-rtx3080-gpu-111   Ready
Label each GPU server and add a taint (the taint key must match the server_type tolerations used in the manifests below)
kubectl label node slave-gpu-103 aliyun.accelerator/nvidia_name=yes
kubectl taint node slave-gpu-103 server_type=moviebook:NoSchedule
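All four GPU nodes need the same label and taint; a small loop saves typing (a convenience sketch, assuming the node names listed above):
#Apply the label and taint to every GPU node
for n in slave-gpu-103 slave-gpu-105 slave-gpu-109 slave-rtx3080-gpu-111; do
  kubectl label node "$n" aliyun.accelerator/nvidia_name=yes --overwrite
  kubectl taint node "$n" server_type=moviebook:NoSchedule --overwrite
done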
Deploy the Prometheus GPU exporter, using hostNetwork for its networking
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: monitoring
  name: ack-prometheus-gpu-exporter
spec:
  selector:
    matchLabels:
      k8s-app: ack-prometheus-gpu-exporter
  template:
    metadata:
      labels:
        k8s-app: ack-prometheus-gpu-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_name
                operator: Exists
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-gpu-exporter
        image: registry.cn-hangzhou.aliyuncs.com/acs/gpu-prometheus-exporter:0.1-5cc5f27
        imagePullPolicy: Always
        ports:
        - name: http-metrics
          containerPort: 9445
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        resources:
          requests:
            memory: 50Mi
            cpu: 200m
          limits:
            memory: 100Mi
            cpu: 300m
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: docker-sock
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: File
        name: docker-sock
      tolerations:
      - effect: NoSchedule
        key: server_type
        operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  name: node-gpu-exporter
  namespace: monitoring
  labels:
    k8s-app: ack-prometheus-gpu-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: ack-prometheus-gpu-exporter
  ports:
  - name: http-metrics
    port: 9445
    targetPort: 9445
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ack-prometheus-gpu-exporter
  labels:
    release: ack-prometheus-operator
    app: ack-prometheus-gpu-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: ack-prometheus-gpu-exporter
  namespaceSelector:
    matchNames:
    - monitoring
  endpoints:
  - port: http-metrics
#Create the GPU exporter
kubectl apply -f gpu-exporter.yaml
#Check the Pod status
kubectl get pods -n monitoring -o wide | grep gpu-exporter
NAME                                READY   STATUS    RESTARTS   AGE   IP               NODE
ack-prometheus-gpu-exporter-c2kdj   1/1     Running   0          21h   10.147.100.111   slave-rtx3080-gpu-111
ack-prometheus-gpu-exporter-g98zv   1/1     Running   0          21h   10.147.100.105   slave-gpu-105
ack-prometheus-gpu-exporter-jn7rj   1/1     Running   0          21h   10.147.100.103   slave-gpu-103
ack-prometheus-gpu-exporter-tt7cg   1/1     Running   0          21h   10.147.100.109   slave-gpu-109
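Because the exporter uses hostNetwork, each instance listens directly on its node; you can spot-check one endpoint (a quick sanity check, assuming the node IPs above are reachable from where you run curl):
#Metric names such as nvidia_gpu_duty_cycle should appear in the output
curl -s http://10.147.100.103:9445/metrics | grep nvidia_gpu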
Add the GPU server instances to the Prometheus scrape configuration
- job_name: 'GPU服务监控'
  static_configs:
  #- targets: ['node-gpu-exporter.monitoring:9445']
  - targets:
    - 10.147.100.103:9445
    - 10.147.100.105:9445
    - 10.147.100.111:9445
    - 10.147.100.109:9445
#Restart Prometheus so the configuration takes effect
#Check the GPU metrics in Prometheus, e.g. nvidia_gpu_duty_cycle
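To confirm Prometheus is actually scraping the exporters, query its HTTP API directly (a sketch; the URL assumes the in-cluster Prometheus service that the adapter below points at, so adjust host and port for your setup):
#Query the GPU utilization series through the Prometheus API
curl -s 'http://prometheus-service.prometheus.svc.cluster.local:9090/api/v1/query?query=nvidia_gpu_duty_cycle'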
Prepare the certificates for the Prometheus Adapter
#Prepare the certificates
mkdir /opt/gpu/
cd /opt/gpu/
set -e
set -o pipefail
set -u
b64_opts='--wrap=0'
export PURPOSE=metrics
openssl req -x509 -sha256 -new -nodes -days 365 -newkey rsa:2048 -keyout ${PURPOSE}-ca.key -out ${PURPOSE}-ca.crt -subj "/CN=ca"
echo '{"signing":{"default":{"expiry":"43800h","usages":["signing","key encipherment","'${PURPOSE}'"]}}}' > "${PURPOSE}-ca-config.json"
export SERVICE_NAME=custom-metrics-apiserver
export ALT_NAMES='"custom-metrics-apiserver.kube-system","custom-metrics-apiserver.kube-system.svc"'
echo "{\"CN\":\"${SERVICE_NAME}\", \"hosts\": [${ALT_NAMES}], \"key\": {\"algo\": \"rsa\",\"size\": 2048}}" | \
cfssl gencert -ca=metrics-ca.crt -ca-key=metrics-ca.key -config=metrics-ca-config.json - | cfssljson -bare apiserver
cat <<-EOF > cm-adapter-serving-certs.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cm-adapter-serving-certs
data:
  serving.crt: $(base64 ${b64_opts} < apiserver.pem)
  serving.key: $(base64 ${b64_opts} < apiserver-key.pem)
EOF
#Create the Secret
kubectl -n kube-system apply -f cm-adapter-serving-certs.yaml
#Check the Secret
kubectl get secrets -n kube-system | grep cm-adapter-serving-certs
cm-adapter-serving-certs   Opaque   2   49s
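Optionally verify that the certificate stored in the Secret is the one you just issued (a sanity-check sketch; requires openssl on the machine running kubectl):
#Decode the served certificate and print its subject and expiry
kubectl -n kube-system get secret cm-adapter-serving-certs -o jsonpath='{.data.serving\.crt}' | base64 -d | openssl x509 -noout -subject -enddate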
Deploy the Prometheus custom-metrics adapter
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-system
  name: custom-metrics-apiserver
  labels:
    app: custom-metrics-apiserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-metrics-apiserver
  template:
    metadata:
      labels:
        app: custom-metrics-apiserver
      name: custom-metrics-apiserver
    spec:
      serviceAccountName: custom-metrics-apiserver
      containers:
      - name: custom-metrics-apiserver
        #image: registry.cn-beijing.aliyuncs.com/test-hub/k8s-prometheus-adapter-amd64
        image: quay.io/coreos/k8s-prometheus-adapter-amd64:v0.5.0
        args:
        - --secure-port=6443
        - --tls-cert-file=/var/run/serving-cert/serving.crt
        - --tls-private-key-file=/var/run/serving-cert/serving.key
        - --logtostderr=true
        - --prometheus-url=http://prometheus-service.prometheus.svc.cluster.local:9090/
        - --metrics-relist-interval=1m
        - --v=10
        - --config=/etc/adapter/config.yaml
        ports:
        - containerPort: 6443
        volumeMounts:
        - mountPath: /var/run/serving-cert
          name: volume-serving-cert
          readOnly: true
        - mountPath: /etc/adapter/
          name: config
          readOnly: true
        - mountPath: /tmp
          name: tmp-vol
      volumes:
      - name: volume-serving-cert
        secret:
          secretName: cm-adapter-serving-certs
      - name: config
        configMap:
          name: adapter-config
      - name: tmp-vol
        emptyDir: {}
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: custom-metrics-apiserver
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: custom-metrics-apiserver
  namespace: kube-system
spec:
  selector:
    app: custom-metrics-apiserver
  ports:
  - port: 443
    targetPort: 6443
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-server-resources
rules:
- apiGroups:
  - custom.metrics.k8s.io
  resources: ["*"]
  verbs: ["*"]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: kube-system
data:
  config.yaml: |
    rules:
    - seriesQuery: '{uuid!=""}'
      resources:
        overrides:
          node_name: {resource: "node"}
          pod_name: {resource: "pod"}
          namespace_name: {resource: "namespace"}
      name:
        matches: ^nvidia_gpu_(.*)$
        as: "${1}_over_time"
      metricsQuery: ceil(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[3m]))
    - seriesQuery: '{uuid!=""}'
      resources:
        overrides:
          node_name: {resource: "node"}
          pod_name: {resource: "pod"}
          namespace_name: {resource: "namespace"}
      name:
        matches: ^nvidia_gpu_(.*)$
        as: "${1}_current"
      metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-resource-reader
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  - services
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hpa-controller-custom-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-metrics-server-resources
subjects:
- kind: ServiceAccount
  name: horizontal-pod-autoscaler
  namespace: kube-system
#Apply the manifests
kubectl apply -f custom-metrics-apiserver.yaml
#Check the Pod status
kubectl get pods -n kube-system | grep custom-metrics
custom-metrics-apiserver-56777c5757-b422b   1/1   Running   0   64s
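If the Pod fails to start or the API never registers, the adapter's own log usually shows why (a debugging sketch):
#Tail the adapter log; TLS or Prometheus connectivity errors show up here
kubectl -n kube-system logs deploy/custom-metrics-apiserver --tail=50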
Role authorization
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  service:
    name: custom-metrics-apiserver
    namespace: kube-system
  group: custom.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics-resource-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-metrics-resource-reader
subjects:
- kind: ServiceAccount
  name: custom-metrics-apiserver
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: custom-metrics-apiserver
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: custom-metrics-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: custom-metrics-apiserver
  namespace: kube-system
#Apply the RBAC manifests
kubectl apply -f custom-metrics-apiserver-rbac.yaml
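Once the APIService is applied, the aggregation layer should report it as available (a quick check; the jsonpath filter prints True when registration succeeded):
#Check that the custom metrics APIService is available
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'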
Verify the deployment
#After deployment, call the custom metrics API to confirm that the Prometheus Adapter is working
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/temperature_celsius_current"
{"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/temperature_celsius_current"},"items":[]}
Check whether the following flag is enabled on the kube-controller-manager, and enable it if it is not
#Flag
--horizontal-pod-autoscaler-use-rest-clients=true
#Scaling metrics
Metric                      Description        Unit
duty_cycle_current          GPU utilization    percent
memory_used_bytes_current   GPU memory used    bytes
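With the adapter rules above, these names are queryable per Pod through the custom metrics API (a sketch; it assumes GPU Pods are already running in the alot-stream namespace used below):
#List the current GPU utilization reported for Pods in alot-stream
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/alot-stream/pods/*/duty_cycle_current"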
Deploy a test application
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: alot-stream-python
  name: alot-stream-python
  namespace: alot-stream
spec:
  replicas: 1
  selector:
    matchLabels:
      name: alot-stream-python
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: alot-stream-python
    spec:
      containers:
      - image: yz.harborxxx.com/alot/prod/000001-alot-ue4-runtime/python_test:20221114142453
        #- image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
        imagePullPolicy: IfNotPresent
        name: alot-stream-python-anmoyi
        resources:
          limits:
            nvidia.com/gpu: "1"
        securityContext:
          runAsGroup: 0
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /home/wkspace
          name: workdir
        - mountPath: /etc/localtime
          name: hosttime
      dnsPolicy: ClusterFirst
      initContainers:
      - image: yz.harbor.moviebook.com/alot/test/000001-init/node_test:20221114142453
        imagePullPolicy: IfNotPresent
        name: download
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /home/wkspace
          name: workdir
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: server_type
        operator: Exists
      volumes:
      - emptyDir: {}
        name: workdir
      - hostPath:
          path: /etc/localtime
          type: ""
        name: hosttime
kubectl apply -f test-gpu-bert-container-alot.yaml
#Check the Pod status
kubectl get pods -n alot-stream
NAME                                  READY   STATUS    RESTARTS   AGE
alot-stream-python-64ffc68756-fs8lw   1/1     Running   0          38s
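To confirm the Pod actually got a GPU allocated, run nvidia-smi inside it (a sketch that assumes nvidia-smi is present in the image):
#The allocated GPU should be listed
kubectl -n alot-stream exec deploy/alot-stream-python -- nvidia-smi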
Create an HPA that scales on GPU utilization
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-kangpeng
  namespace: alot-stream
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: alot-stream-python
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metricName: duty_cycle_current   #scale on average GPU utilization per Pod
      targetAverageValue: 20
#Create the HPA
kubectl apply -f test-hap.yaml
#Check the HPA
kubectl get hpa -n alot-stream
NAME               REFERENCE                       TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-kangpeng   Deployment/alot-stream-python   0/20      1         4         1          21s
#Describe the HPA and check for error messages
kubectl describe hpa gpu-hpa-kangpeng -n alot-stream
Name:               gpu-hpa-kangpeng
Namespace:          alot-stream
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Thu, 01 Dec 2022 15:04:25 +0800
Reference:          Deployment/alot-stream-python
Metrics:            ( current / target )
  "duty_cycle_current" on pods:  43750m / 10
Min replicas:       1
Max replicas:       4
Deployment pods:    4 current / 4 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from pods metric duty_cycle_current
  ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  2m5s  horizontal-pod-autoscaler  New size: 4; reason: pods metric duty_cycle_current above target
#Check the replica count; the Deployment has automatically scaled out to 4 replicas
kubectl get pods -n alot-stream
NAME                                  READY   STATUS    RESTARTS   AGE
alot-stream-python-64ffc68756-62rsw   1/1     Running   0          6m3s
alot-stream-python-64ffc68756-9xcgr   1/1     Running   0          6m3s
alot-stream-python-64ffc68756-drj96   1/1     Running   0          6m3s
alot-stream-python-64ffc68756-fs8lw   1/1     Running   0          8m15s
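While load rises and falls you can watch the HPA recompute in real time (a small convenience sketch):
#Watch the target value and replica count update
kubectl -n alot-stream get hpa gpu-hpa-kangpeng -w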
Create an HPA that scales on GPU memory usage
#HPA manifest for GPU memory
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-kangpeng-memory
  namespace: alot-stream
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: alot-stream-python
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: memory_used_bytes_current   #scale on average GPU memory used per Pod
      targetAverageValue: 1G
#Create the HPA
#Check the HPA
kubectl get hpa -n alot-stream
NAME                      REFERENCE                       TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-kangpeng-memory   Deployment/alot-stream-python   2562719744/1G   1         3         1          13s
#Watch the Pods scale out
NAME                                  READY   STATUS     RESTARTS   AGE
alot-stream-python-64ffc68756-9xcgr   1/1     Running    0          16m
alot-stream-python-64ffc68756-b8tdh   0/1     Init:0/1   0          16s
alot-stream-python-64ffc68756-pjr66   0/1     Init:0/1   0          16s
NAME                                  READY   STATUS     RESTARTS   AGE
alot-stream-python-64ffc68756-9xcgr   1/1     Running    0          16m
alot-stream-python-64ffc68756-b8tdh   1/1     Running    0          56s
alot-stream-python-64ffc68756-pjr66   1/1     Running    0          56s
#Once the service finishes handling its requests, the Deployment scales back down automatically
Q: How do I disable automatic scale-down?
A: Set scaleDown.selectPolicy to Disabled in the behavior section when creating the HPA; note that the behavior field requires the autoscaling/v2beta2 API or later (a scale-down policy with value: 0 fails API validation, since policy values must be greater than zero).
behavior:
  scaleDown:
    selectPolicy: Disabled
#A complete HPA manifest looks like this (written against autoscaling/v2beta2, since behavior is not available in v2beta1)
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: duty_cycle_current
      target:
        type: AverageValue
        averageValue: "20"
  behavior:
    scaleDown:
      selectPolicy: Disabled   #disable automatic scale-down
Q: How do I lengthen the scale-down window?
A: The default scale-down stabilization window (--horizontal-pod-autoscaler-downscale-stabilization) is 5 minutes. If you need a longer window to ride out short traffic spikes, set it per HPA via behavior, for example:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   #wait 10 minutes before starting to scale down
    policies:
    - type: Pods
      value: 5   #remove at most 5 Pods at a time
      periodSeconds: 60
#The example above means that when load drops, the system waits 600 seconds (10 minutes) before scaling down, and removes at most 5 Pods at a time.
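To see the cluster-wide default the controller is currently using, check the kube-controller-manager flags (a sketch that assumes a kubeadm-style static Pod manifest; adjust the path for your installation):
#If the flag is absent, the 5-minute default applies
grep horizontal-pod-autoscaler-downscale-stabilization /etc/kubernetes/manifests/kube-controller-manager.yaml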
Q: What should I do if the TARGETS column shows UNKNOWN after running kubectl get hpa?
A: Work through the following checks:
a. Confirm that the metric name in the HorizontalPodAutoscaler is correct.
b. Run kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" and check whether the metric appears in the result.
c. Check that the Prometheus URL configured in the metrics adapter (the --prometheus-url argument above; ack-alibaba-cloud-metrics-adapter on ACK) is correct.
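The three checks above condense into a short debugging sequence (a sketch, assuming the adapter Deployment from this walkthrough):
#1. Is the metric registered with the custom metrics API?
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep duty_cycle
#2. Is the adapter logging errors (bad Prometheus URL, TLS failures)?
kubectl -n kube-system logs deploy/custom-metrics-apiserver --tail=50 | grep -i error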