Implementing HPA Based on GPU Metrics in a Kubernetes Cluster

Preface

Kubernetes ships with the HPA (Horizontal Pod Autoscaler) for scaling workloads and supports CPU and memory metrics out of the box. The native HPA, historically backed by Heapster, has no notion of GPU metrics, but it can be extended through the Custom Metrics API. By deploying a Prometheus Adapter as a CustomMetricServer, Prometheus metrics are exposed through an API registered with the Kubernetes API server, where the HPA controller can query them. With the right configuration, the HPA then uses these custom metrics as its scaling targets, which enables elastic scaling on GPU metrics.
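
Concretely, the HPA controller reads GPU metrics through the aggregated custom.metrics.k8s.io API group, which the Prometheus Adapter serves by translating each request into a Prometheus query. Once the components described below are deployed, the exposed metrics can be listed with a plain raw API call (a quick sanity check, nothing more):

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"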

Alibaba Cloud Container Kubernetes monitoring: GPU monitoring

  • Prepare GPU servers in the Kubernetes cluster

    kubectl get node

    NAME STATUS ROLES AGE VERSION
    master-11 Ready master 466d v1.18.20
    master-12 Ready master 466d v1.18.20
    master-13 Ready master 466d v1.18.20
    slave-gpu-103 Ready 159d v1.18.20
    slave-gpu-105 Ready 160d v1.18.20
    slave-gpu-109 Ready 160d v1.18.20
    slave-rtx3080-gpu-111 Ready 6d3h v1.18.20

  • Label each GPU server and add a taint

    kubectl label node slave-gpu-103 aliyun.accelerator/nvidia_name=yes
    kubectl taint node slave-gpu-103 gpu_type=moviebook:NoSchedule
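
    To double-check that the label and taint landed on a node, describe it (an optional check using only standard kubectl):

    kubectl describe node slave-gpu-103 | grep -E "nvidia_name|Taints"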

  • Deploy the Prometheus GPU exporter, using hostNetwork for networking

    cat gpu-exporter.yaml

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      namespace: monitoring
      name: ack-prometheus-gpu-exporter
    spec:
      selector:
        matchLabels:
          k8s-app: ack-prometheus-gpu-exporter
      template:
        metadata:
          labels:
            k8s-app: ack-prometheus-gpu-exporter
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: aliyun.accelerator/nvidia_name
                    operator: Exists
          hostNetwork: true
          hostPID: true
          containers:
          - name: node-gpu-exporter
            image: registry.cn-hangzhou.aliyuncs.com/acs/gpu-prometheus-exporter:0.1-5cc5f27
            imagePullPolicy: Always
            ports:
            - name: http-metrics
              containerPort: 9445
            env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            resources:
              requests:
                memory: 50Mi
                cpu: 200m
              limits:
                memory: 100Mi
                cpu: 300m
            volumeMounts:
            - mountPath: /var/run/docker.sock
              name: docker-sock
          volumes:
          - hostPath:
              path: /var/run/docker.sock
              type: File
            name: docker-sock
          tolerations:
          - effect: NoSchedule
            key: server_type
            operator: Exists
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: node-gpu-exporter
      namespace: monitoring
      labels:
        k8s-app: ack-prometheus-gpu-exporter
    spec:
      type: ClusterIP
      ports:
      - name: http-metrics
        port: 9445
        protocol: TCP
      selector:
        k8s-app: ack-prometheus-gpu-exporter
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: ack-prometheus-gpu-exporter
      labels:
        release: ack-prometheus-operator
        app: ack-prometheus-gpu-exporter
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          k8s-app: ack-prometheus-gpu-exporter
      namespaceSelector:
        matchNames:
        - monitoring
      endpoints:
      - port: http-metrics
        interval: 30s

    #Create the GPU exporter
    kubectl apply -f gpu-exporter.yaml

    #Check Pod status

    kubectl get pod -n monitoring -o wide

    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    ack-prometheus-gpu-exporter-c2kdj 1/1 Running 0 21h 10.147.100.111 slave-rtx3080-gpu-111
    ack-prometheus-gpu-exporter-g98zv 1/1 Running 0 21h 10.147.100.105 slave-gpu-105
    ack-prometheus-gpu-exporter-jn7rj 1/1 Running 0 21h 10.147.100.103 slave-gpu-103
    ack-prometheus-gpu-exporter-tt7cg 1/1 Running 0 21h 10.147.100.109 slave-gpu-109
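
    Because the exporter runs with hostNetwork, its metrics endpoint is reachable directly on each GPU node at port 9445. A quick way to confirm that GPU metrics are being produced (the IP below is one of the nodes listed above):

    curl -s http://10.147.100.103:9445/metrics | grep nvidia_gpu_duty_cycle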

  • Add the GPU servers to the Prometheus scrape targets

    kubectl edit cm -n prometheus prometheus-conf

      - job_name: 'GPU服务监控'
        static_configs:
          #- targets: ['node-gpu-exporter.monitoring:9445']
          - targets:
            - 10.147.100.103:9445
            - 10.147.100.105:9445
            - 10.147.100.111:9445
            - 10.147.100.109:9445

    #Restart Prometheus so the new configuration takes effect

    #Check in Prometheus that GPU metrics such as nvidia_gpu_duty_cycle are now being collected
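
    For example, the following PromQL query in the Prometheus UI shows average GPU utilization per node (it assumes the exporter attaches a node_name label, which the adapter rules below also rely on):

    avg(nvidia_gpu_duty_cycle) by (node_name)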

Deploy the CustomMetricServer

  • Prepare the certificates for the Prometheus Adapter

    #Generate the serving certificates
    mkdir /opt/gpu/
    cd /opt/gpu/

    set -e
    set -o pipefail
    set -u
    b64_opts='--wrap=0'

    export PURPOSE=metrics
    openssl req -x509 -sha256 -new -nodes -days 365 -newkey rsa:2048 -keyout ${PURPOSE}-ca.key -out ${PURPOSE}-ca.crt -subj "/CN=ca"
    echo '{"signing":{"default":{"expiry":"43800h","usages":["signing","key encipherment","'${PURPOSE}'"]}}}' > "${PURPOSE}-ca-config.json"

    export SERVICE_NAME=custom-metrics-apiserver
    export ALT_NAMES='"custom-metrics-apiserver.monitoring","custom-metrics-apiserver.monitoring.svc"'
    echo "{\"CN\":\"${SERVICE_NAME}\", \"hosts\": [${ALT_NAMES}], \"key\": {\"algo\": \"rsa\",\"size\": 2048}}" | <br /> cfssl gencert -ca=metrics-ca.crt -ca-key=metrics-ca.key -config=metrics-ca-config.json - | cfssljson -bare apiserver

    cat <<-EOF > cm-adapter-serving-certs.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: cm-adapter-serving-certs
    data:
      serving.crt: $(base64 ${b64_opts} < apiserver.pem)
      serving.key: $(base64 ${b64_opts} < apiserver-key.pem)
    EOF
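
    Before loading the Secret, the generated serving certificate can optionally be inspected (apiserver.pem and apiserver-key.pem are produced by the cfssljson step above):

    openssl x509 -in apiserver.pem -noout -subject -dates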

    #Create the Secret
    kubectl -n kube-system apply -f cm-adapter-serving-certs.yaml

    #Check the Secret
    kubectl get secrets -n kube-system |grep cm-adapter-serving-certs
    cm-adapter-serving-certs Opaque 2 49s

  • Deploy the Prometheus custom metrics adapter

    cat custom-metrics-apiserver.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      namespace: kube-system
      name: custom-metrics-apiserver
      labels:
        app: custom-metrics-apiserver
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: custom-metrics-apiserver
      template:
        metadata:
          labels:
            app: custom-metrics-apiserver
          name: custom-metrics-apiserver
        spec:
          serviceAccountName: custom-metrics-apiserver
          containers:
          - name: custom-metrics-apiserver
            #image: registry.cn-beijing.aliyuncs.com/test-hub/k8s-prometheus-adapter-amd64
            image: quay.io/coreos/k8s-prometheus-adapter-amd64:v0.5.0
            args:
            - --secure-port=6443
            - --tls-cert-file=/var/run/serving-cert/serving.crt
            - --tls-private-key-file=/var/run/serving-cert/serving.key
            - --logtostderr=true
            - --prometheus-url=http://prometheus-service.prometheus.svc.cluster.local:9090/
            - --metrics-relist-interval=1m
            - --v=10
            - --config=/etc/adapter/config.yaml
            ports:
            - containerPort: 6443
            volumeMounts:
            - mountPath: /var/run/serving-cert
              name: volume-serving-cert
              readOnly: true
            - mountPath: /etc/adapter/
              name: config
              readOnly: true
            - mountPath: /tmp
              name: tmp-vol
          volumes:
          - name: volume-serving-cert
            secret:
              secretName: cm-adapter-serving-certs
          - name: config
            configMap:
              name: adapter-config
          - name: tmp-vol
            emptyDir: {}
    ---
    kind: ServiceAccount
    apiVersion: v1
    metadata:
      name: custom-metrics-apiserver
      namespace: kube-system
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: custom-metrics-apiserver
      namespace: kube-system
    spec:
      ports:
      - port: 443
        targetPort: 6443
      selector:
        app: custom-metrics-apiserver
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: custom-metrics-server-resources
    rules:
    - apiGroups:
      - custom.metrics.k8s.io
      resources: ["*"]
      verbs: ["*"]
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: adapter-config
      namespace: kube-system
    data:
      config.yaml: |
        rules:
        - seriesQuery: '{uuid!=""}'
          resources:
            overrides:
              node_name: {resource: "node"}
              pod_name: {resource: "pod"}
              namespace_name: {resource: "namespace"}
          name:
            matches: ^nvidia_gpu_(.*)$
            as: "${1}_over_time"
          metricsQuery: ceil(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[3m]))
        - seriesQuery: '{uuid!=""}'
          resources:
            overrides:
              node_name: {resource: "node"}
              pod_name: {resource: "pod"}
              namespace_name: {resource: "namespace"}
          name:
            matches: ^nvidia_gpu_(.*)$
            as: "${1}_current"
          metricsQuery: <<.Series>>{<<.LabelMatchers>>}
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: custom-metrics-resource-reader
    rules:
    - apiGroups:
      - ""
      resources:
      - namespaces
      - pods
      - services
      verbs:
      - get
      - list
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: hpa-controller-custom-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: custom-metrics-server-resources
    subjects:
    - kind: ServiceAccount
      name: horizontal-pod-autoscaler
      namespace: kube-system

    #Create the adapter resources
    kubectl apply -f custom-metrics-apiserver.yaml

    #Check Pod status

    kubectl get pod -n kube-system |grep custom-metrics-apiserver

    custom-metrics-apiserver-56777c5757-b422b 1/1 Running 0 64s
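
    If the adapter Pod is running but custom metrics never appear later, its logs are the first place to look (the app=custom-metrics-apiserver label comes from the Deployment above; --v=10 makes the output verbose):

    kubectl logs -n kube-system -l app=custom-metrics-apiserver --tail=50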

  • API registration and role authorization

    cat custom-metrics-apiserver-rbac.yaml

    apiVersion: apiregistration.k8s.io/v1beta1
    kind: APIService
    metadata:
      name: v1beta1.custom.metrics.k8s.io
    spec:
      service:
        name: custom-metrics-apiserver
        namespace: kube-system
      group: custom.metrics.k8s.io
      version: v1beta1
      insecureSkipTLSVerify: true
      groupPriorityMinimum: 100
      versionPriority: 100
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-metrics-resource-reader
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: custom-metrics-resource-reader
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-apiserver
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: custom-metrics:system:auth-delegator
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:auth-delegator
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-apiserver
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: custom-metrics-auth-reader
      namespace: kube-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: extension-apiserver-authentication-reader
    subjects:
    - kind: ServiceAccount
      name: custom-metrics-apiserver
      namespace: kube-system

    #Create the RBAC resources
    kubectl apply -f custom-metrics-apiserver-rbac.yaml
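
    The Available condition of the registered APIService shows whether the aggregated custom metrics API is reachable; it should report True:

    kubectl get apiservice v1beta1.custom.metrics.k8s.io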

  • Verify the deployment

    #After deployment, verify that the Prometheus Adapter works by calling the custom metrics API server

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/temperature_celsius_current"

    {"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/temperature_celsius_current"},"items":[]}

  • Check whether the following flag is enabled on the kube-controller-manager service, and enable it if it is not (on Kubernetes 1.12 and later it defaults to true)

    #Flag
    --horizontal-pod-autoscaler-use-rest-clients=true

    cat /etc/systemd/system/kube-controller-manager.service |grep horizontal-pod-autoscaler-use-rest-clients

    --horizontal-pod-autoscaler-use-rest-clients=true

Scaling metrics

#The custom metrics available for scaling are listed below.

Metric name                  Description          Unit
duty_cycle_current           GPU utilization      percent
memory_used_bytes_current    GPU memory used      bytes

Testing elastic scaling of the GPU service

Deploy the HPA

  • Deploy a test application

    cat test-gpu-bert-container-alot.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        name: alot-stream-python
      name: alot-stream-python
      namespace: alot-stream
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: alot-stream-python
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            name: alot-stream-python
        spec:
          containers:
          - image: yz.harborxxx.com/alot/prod/000001-alot-ue4-runtime/python_test:20221114142453
            #- image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
            imagePullPolicy: IfNotPresent
            name: alot-stream-python-anmoyi
            resources:
              limits:
                nvidia.com/gpu: "1"
            securityContext:
              runAsGroup: 0
              runAsUser: 0
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /home/wkspace
              name: workdir
            - mountPath: /etc/localtime
              name: hosttime
          dnsPolicy: ClusterFirst
          initContainers:
          - image: yz.harbor.moviebook.com/alot/test/000001-init/node_test:20221114142453
            imagePullPolicy: IfNotPresent
            name: download
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /home/wkspace
              name: workdir
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          tolerations:
          - effect: NoSchedule
            key: server_type
            operator: Exists
          volumes:
          - emptyDir: {}
            name: workdir
          - hostPath:
              path: /etc/localtime
              type: ""
            name: hosttime

    #Create the test application

    kubectl apply -f test-gpu-bert-container-alot.yaml

    #Check Pod status

    kubectl get pod -n alot-stream |grep -v "prod"

    NAME READY STATUS RESTARTS AGE
    alot-stream-python-64ffc68756-fs8lw 1/1 Running 0 38s

  • Create an HPA that scales on GPU utilization

    cat test-hap.yaml

    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa-kangpeng
      namespace: alot-stream
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: alot-stream-python
      minReplicas: 1
      maxReplicas: 4
      metrics:
      - type: Pods
        pods:
          metricName: duty_cycle_current #GPU utilization of the Pod.
          targetAverageValue: 10 #Scale out when average GPU utilization exceeds 10%.

    #Create the HPA
    kubectl apply -f test-hap.yaml

    #Check the HPA

    kubectl get hpa -n alot-stream

    NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
    gpu-hpa-kangpeng Deployment/alot-stream-python 0/20 1 4 1 21s

    #Describe the HPA and check for error messages

    kubectl describe hpa -n alot-stream gpu-hpa-kangpeng

    Name: gpu-hpa-kangpeng
    Namespace: alot-stream
    Labels:
    Annotations:
    CreationTimestamp: Thu, 01 Dec 2022 15:04:25 +0800
    Reference: Deployment/alot-stream-python
    Metrics: ( current / target )
    "duty_cycle_current" on pods: 43750m / 10
    Min replicas: 1
    Max replicas: 4
    Deployment pods: 4 current / 4 desired
    Conditions:
    Type Status Reason Message
    ---- ------ ------ -------
    AbleToScale True ScaleDownStabilized recent recommendations were higher than current one, applying the highest recent recommendation
    ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric duty_cycle_current
    ScalingLimited True TooManyReplicas the desired replica count is more than the maximum replica count
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Normal SuccessfulRescale 2m5s horizontal-pod-autoscaler New size: 4; reason: pods metric duty_cycle_current above target

    #Check the replica count; the Deployment has automatically scaled out to 4 replicas

    kubectl get pod -n alot-stream |grep -v "prod"

    NAME READY STATUS RESTARTS AGE
    alot-stream-python-64ffc68756-62rsw 1/1 Running 0 6m3s
    alot-stream-python-64ffc68756-9xcgr 1/1 Running 0 6m3s
    alot-stream-python-64ffc68756-drj96 1/1 Running 0 6m3s
    alot-stream-python-64ffc68756-fs8lw 1/1 Running 0 8m15s
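
    To follow the scaling decisions as they happen, the HPA can be watched with the standard -w flag:

    kubectl get hpa -n alot-stream -w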

  • Create an HPA that scales on GPU memory usage

    #HPA manifest based on GPU memory usage

    cat test-hap-kangpeng-memory.yaml

    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa-kangpeng-memory
      namespace: alot-stream
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: alot-stream-python
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metricName: memory_used_bytes_current #GPU memory used by the Pod, in bytes.
          targetAverageValue: 1G #Scale out when a Pod uses more than 1 GB of GPU memory.

    #Create the HPA

    kubectl apply -f test-hap-kangpeng-memory.yaml

    #Check the HPA

    kubectl get hpa -n alot-stream

    NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
    gpu-hpa-kangpeng-memory Deployment/alot-stream-python 2562719744/1G 1 3 1 13s

    #Check the Pod scale-out status

    kubectl get pod -n alot-stream |grep -v "prod"

    NAME READY STATUS RESTARTS AGE
    alot-stream-python-64ffc68756-9xcgr 1/1 Running 0 16m
    alot-stream-python-64ffc68756-b8tdh 0/1 Init:0/1 0 16s
    alot-stream-python-64ffc68756-pjr66 0/1 Init:0/1 0 16s

    kubectl get pod -n alot-stream |grep -v "prod"

    NAME READY STATUS RESTARTS AGE
    alot-stream-python-64ffc68756-9xcgr 1/1 Running 0 16m
    alot-stream-python-64ffc68756-b8tdh 1/1 Running 0 56s
    alot-stream-python-64ffc68756-pjr66 1/1 Running 0 56s

    #Once the service has finished handling requests and load drops, the HPA scales the Deployment back down automatically

FAQ

  • Q: How do I disable automatic scale-down?

    A: Configure the behavior field when creating the HPA. Note that behavior is only available in autoscaling/v2beta2 and later (not in autoscaling/v2beta1), and the upstream-documented way to turn off scale-down is selectPolicy: Disabled.

    behavior:
      scaleDown:
        selectPolicy: Disabled

    #A complete HPA manifest then looks like this
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-intent-detection
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: duty_cycle_current   #GPU utilization of the Pod.
          target:
            type: AverageValue
            averageValue: "20"         #Scale out when average GPU utilization exceeds 20%.
      behavior:
        scaleDown:
          selectPolicy: Disabled       #Disable automatic scale-down.
  • Q: How do I extend the scale-down time window?

    A: The default scale-down stabilization window (--horizontal-pod-autoscaler-downscale-stabilization on kube-controller-manager) is 5 minutes. If you need a longer window to ride out brief traffic spikes, set it explicitly through the behavior field, for example:

    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600 #Wait 10 minutes before starting to scale down.
        policies:
        - type: Pods
          value: 5 #Remove at most 5 Pods at a time.
          periodSeconds: 60 #Per 60-second period; periodSeconds is required, the value here is an example.

    #With this configuration, when load drops the system waits 600 seconds (10 minutes) before scaling down, and removes at most 5 Pods per period.

  • Q: What should I do if the TARGETS column shows unknown after running kubectl get hpa?

    A: Verify the following (a concrete command sketch follows this list):

    a. Confirm that the metric name in the HorizontalPodAutoscaler is correct.
    b. Run kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" and check whether the metric appears in the result.
    c. Check that the Prometheus URL configured in the metrics adapter (ack-alibaba-cloud-metrics-adapter on ACK, or the custom-metrics-apiserver deployed above) is correct.
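
    For the GPU utilization metric used in this article, steps a and b reduce to checking that duty_cycle_current is listed (the grep pattern is only an example):

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep duty_cycle_current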
