Simplifying Deep Learning Training and GPU Monitoring with GPU-Operator and KubeSphere

Published: 2021-06-24 12:51:12 · Source: 猿笔记 · Author: KubeSphere



This article walks through GPU-Operator concepts, installation and deployment, and a deep learning training test application, as well as GPU monitoring in [KubeSphere](

## Introduction to GPU-Operator

As is well known, the Kubernetes platform provides access to special hardware resources, such as NVIDIA GPUs, NICs, InfiniBand adapters, and other devices, through its device plugin framework. However, configuring and managing nodes with these hardware resources requires setting up multiple software components, such as drivers, container runtimes, and other libraries, which is difficult and error-prone.

[NVIDIA GPU Operator](

### GPU-Operator Architecture

As mentioned above, the NVIDIA GPU Operator makes managing GPU nodes as convenient as managing CPU nodes. How does it achieve this?

Let's take a look at the GPU-Operator runtime architecture diagram:

As the diagram shows, GPU-Operator works by providing an NVIDIA container runtime that wraps `runC`: it injects a script called `nvidia-container-toolkit` as a `prestart` hook in `runC`, and that script invokes the `libnvidia-container` CLI with the appropriate `flags`, so that the container has GPU capabilities once it is running.
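As an illustration, the injected hook ends up in the container's OCI runtime spec roughly as follows. This is a minimal sketch; the exact binary path and arguments are assumptions based on a typical `nvidia-container-toolkit` installation, not taken from this article:

```json
{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"]
      }
    ]
  }
}
```

When `runC` reaches the prestart phase, it executes this hook, which in turn sets up the GPU devices and driver libraries inside the container before the entrypoint starts.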

## Installing GPU-Operator

### Prerequisites

Before installing the GPU Operator, prepare the installation environment as follows:

* All nodes must **not** have NVIDIA components pre-installed (`driver`, `container runtime`, `device plugin`);

* All nodes must have `Docker`, `cri-o`, or `containerd` configured. For Docker, you can refer to [here](

* On Ubuntu 18.04 LTS with an HWE kernel (e.g. kernel 5.x), you need to blacklist the `nouveau` driver and update `initramfs`;

```shell
$ sudo vim /etc/modprobe.d/blacklist.conf
# append the blacklist entries at the end of the file
blacklist nouveau
options nouveau modeset=0

$ sudo update-initramfs -u
$ reboot
$ lsmod | grep nouveau   # verify nouveau is disabled
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
# the processor architecture codename in this test was Broadwell
     16  Intel Core Processor (Broadwell)
```

* Node Feature Discovery (NFD) needs to be configured on each node; by default it is installed automatically. If it is already configured, set `nfd.enabled` to `false` in the Helm chart values before installing.

* If you are using Kubernetes 1.13 or 1.14, you need to enable [KubeletPodResources](

#### Supported Linux Versions

| **OS Name / Version** | **Identifier** | **amd64 / x86_64** | **ppc64le** | **arm64 / aarch64** |
|:----------------------|:---------------|:-------------------|:------------|:--------------------|
| Amazon Linux 1        | amzn1          | X                  |             |                     |
| Amazon Linux 2        | amzn2          | X                  |             |                     |
| Amazon Linux 2017.09  | amzn2017.09    | X                  |             |                     |
| Amazon Linux 2018.03  | amzn2018.03    | X                  |             |                     |
| OpenSuse Leap 15.0    | sles15.0       | X                  |             |                     |
| OpenSuse Leap 15.1    | sles15.1       | X                  |             |                     |
| Debian Linux 9        | debian9        | X                  |             |                     |
| Debian Linux 10       | debian10       | X                  |             |                     |
| CentOS 7              | centos7        | X                  | X           |                     |
| CentOS 8              | centos8        | X                  | X           | X                   |
| RHEL 7.4              | rhel7.4        | X                  | X           |                     |
| RHEL 7.5              | rhel7.5        | X                  | X           |                     |
| RHEL 7.6              | rhel7.6        | X                  | X           |                     |
| RHEL 7.7              | rhel7.7        | X                  | X           |                     |
| RHEL 8.0              | rhel8.0        | X                  | X           | X                   |
| RHEL 8.1              | rhel8.1        | X                  | X           | X                   |
| RHEL 8.2              | rhel8.2        | X                  | X           | X                   |
| Ubuntu 16.04          | ubuntu16.04    | X                  | X           |                     |
| Ubuntu 18.04          | ubuntu18.04    | X                  | X           | X                   |
| Ubuntu 20.04          | ubuntu20.04    | X                  | X           | X                   |

#### Supported Container Runtimes

| **OS Name / Version** | **amd64 / x86_64** | **ppc64le** | **arm64 / aarch64** |
|:----------------------|:-------------------|:------------|:--------------------|
| Docker 18.09          | X                  | X           | X                   |
| Docker 19.03          | X                  | X           | X                   |
| RHEL/CentOS 8 podman  | X                  |             |                     |
| CentOS 8 Docker       | X                  |             |                     |
| RHEL/CentOS 7 Docker  | X                  |             |                     |

### Installing the Docker Environment

Refer to the [official Docker documentation](

### Installing NVIDIA Docker

Configure the stable repository and GPG key:

```shell
$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
```
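The `distribution` variable above is simply `$ID$VERSION_ID` sourced from `/etc/os-release`. As a quick sanity check (not part of the official steps), you can print the string the repository URL will use before running the command:

```shell
# Print the distribution string used in the nvidia-docker repo URL,
# e.g. "ubuntu18.04" on Ubuntu 18.04
. /etc/os-release
echo "$ID$VERSION_ID"
```

If the printed value does not match one of the supported identifiers in the table above, the repository URL will not resolve.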

Update the package repositories, then install `nvidia-docker2` and add the runtime configuration:

```shell
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
--
What would you like to do about it ? Your options are:
    Y or I : install the package maintainer's version
    N or O : keep your currently-installed version
      D    : show the differences between the versions
      Z    : start a shell to examine the situation
--
# On a first install, choose N at this interactive prompt.
# Choosing Y would overwrite some of your default configuration.
# After choosing N, add the following configuration to /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Restart `docker`:

```shell
$ sudo systemctl restart docker
```

### Installing Helm

```shell
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
```

Add the `helm` repository:

```shell
$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update
```

### Installing the NVIDIA GPU Operator

#### Docker as runtime

```shell
$ kubectl create ns gpu-operator-resources
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --wait
```

If you need to pin a specific driver version, refer to the following:

```shell
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
  --set driver.version="450.80.02"
```

#### crio as runtime

```shell
helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
  --set operator.defaultRuntime=crio
```

#### containerd as runtime

```shell
helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources \
  --set operator.defaultRuntime=containerd
```

Furthermore, when setting containerd as the defaultRuntime, the following options are also available:

```yaml
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: true
```

**Because the installed images are fairly large, the initial installation may time out. Check whether your images are actually being pulled! Consider using an offline installation to work around this kind of problem; see the offline installation link below.**

#### Installing with values.yaml

```shell
$ helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources -f values.yaml
```

### Consider [offline installation](

## Application Deployment

### Check the status of the deployed operator services

#### Check pod status

```shell
$ kubectl get pods -n gpu-operator-resources
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4gk78                                   1/1     Running     0          35s
gpu-operator-858fc55fdb-jv488                                 1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-master-7f9ccc4c7b-2sg6r   1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-cbkhn              1/1     Running     0          2m52s
gpu-operator-node-feature-discovery-worker-m8jcm              1/1     Running     0          2m52s
nvidia-container-toolkit-daemonset-tfwqt                      1/1     Running     0          2m42s
nvidia-dcgm-exporter-mqns5                                    1/1     Running     0          38s
nvidia-device-plugin-daemonset-7npbs                          1/1     Running     0          53s
nvidia-device-plugin-validation                               0/1     Completed   0          49s
nvidia-driver-daemonset-hgv6s                                 1/1     Running     0          2m47s
```

#### Check whether node resources are allocatable

```shell
$ kubectl describe node worker-gpu-001
Allocatable:
  cpu:                15600m
  ephemeral-storage:  82435528Ki
  hugepages-2Mi:      0
  memory:             63649242267
  nvidia.com/gpu:     1    # check here
  pods:               110
---
```

### Deploying the two examples from the official documentation

#### Example 1

```shell
$ cat <<EOF > cuda-load-generator.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF
```

#### Example 2

```shell
$ curl -LO https://nvidia.github.io/gpu-operator/notebook-example.yml
$ cat notebook-example.yml
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
```

### Running a deep learning training task in the Jupyter Notebook application

#### Deploy the applications

```shell
$ kubectl apply -f cuda-load-generator.yaml
pod/dcgmproftester created
$ kubectl apply -f notebook-example.yml
service/tf-notebook created
pod/tf-notebook created
```

Check whether the GPU is in the allocated state:

```shell
$ kubectl describe node worker-gpu-001
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1087m (6%)   1680m (10%)
  memory             1440Mi (2%)  1510Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  nvidia.com/gpu     1            1       # check this
Events:
```

When a GPU task is submitted to the platform, the GPU resource changes from the allocatable state to the allocated state. Tasks run in the order they were submitted, so the second task only starts once the first one has finished:

```shell
$ kubectl get pods --watch
NAME             READY   STATUS    RESTARTS   AGE
dcgmproftester   1/1     Running   0          76s
tf-notebook      0/1     Pending   0          58s
---
NAME             READY   STATUS      RESTARTS   AGE
dcgmproftester   0/1     Completed   0          4m22s
tf-notebook      1/1     Running     0          4m4s
```

Get the application's port information:

```shell
$ kubectl get svc   # get the nodePort of the svc: 30001
NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
gpu-operator-1611672791-node-feature-discovery   ClusterIP   10.233.10.222   <none>        8080/TCP       12h
kubernetes                                       ClusterIP   10.233.0.1      <none>        443/TCP        12h
tf-notebook                                      NodePort    10.233.53.116   <none>        80:30001/TCP   7m52s
```

Check the logs to get the login password:

```shell
$ kubectl logs tf-notebook
[I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
[I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
[I 21:50:23.391
[I 21:50:23.391
[I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 21:50:23.394 NotebookApp]
    To access the notebook, open this file in a browser:
    Or copy and paste one of these URLs:
```

#### Run a deep learning task

After entering the `jupyter notebook` environment, open a terminal and run a deep learning task:

In the `terminal`, pull the `tensorflow` test code and run it:

At the same time, open another terminal and run `nvidia-smi` to check GPU utilization:

## Monitoring GPUs with KubeSphere's Custom Monitoring

### Deploying a ServiceMonitor

`gpu-operator` provides the `nvidia-dcgm-exporter` exporter for us; we only need to add it to the set of targets Prometheus scrapes, that is, via a `ServiceMonitor`, to collect GPU monitoring data:

```shell
$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ff4ng                1/1     Running     2          15h
nvidia-container-toolkit-daemonset-2vxjz   1/1     Running     0          15h
nvidia-dcgm-exporter-pqwfv                 1/1     Running     0          5h27m   # here
nvidia-device-plugin-daemonset-42n74       1/1     Running     0          5h27m
nvidia-device-plugin-validation            0/1     Completed   0          5h27m
nvidia-driver-daemonset-dvd9r              1/1     Running     3          15h
```
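The article does not show the `ServiceMonitor` manifest itself, so here is a minimal sketch. The label selector (`app: nvidia-dcgm-exporter`) and the endpoint port name are assumptions; match them to the labels and port names on your actual `nvidia-dcgm-exporter` Service before applying:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator-resources
  labels:
    app: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # assumed label; check your Service
  endpoints:
  - port: gpu-metrics             # assumed port name on the Service
    interval: 15s
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
```

Once applied, the Prometheus instance managed by the Prometheus Operator (as deployed by KubeSphere) will pick up the exporter's 9400 port and start scraping the DCGM metrics.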

You can spin up a `busybox` pod to inspect the metrics exposed by this `exporter`:

```shell
$ kubectl get svc -n gpu-operator-resources
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
gpu-operator-node-feature-discovery   ClusterIP   10.233.54.111   <none>        8080/TCP   56m
nvidia-dcgm-exporter                  ClusterIP   10.233.53.196   <none>        9400/TCP   54m
$ kubectl exec -it busybox-sleep -- sh
$
$ cat metrics
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-eeff7856-475a-2eb7-6408-48d023d9dd28",device="nvidia0",container="tf-notebook",namespace="default",pod="tf-notebook\
```
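The output above is Prometheus exposition format: a metric name, a set of `key="value"` labels in braces, and a sample value. As a quick illustration of how to pick fields out of such a line with standard tools, here is a sketch; the sample line and its clock value are made up for the example, modeled on the `DCGM_FI_DEV_SM_CLOCK` output above:

```shell
# A hypothetical DCGM sample line (the value 1328 is invented for illustration)
line='DCGM_FI_DEV_SM_CLOCK{gpu="0",device="nvidia0",pod="tf-notebook"} 1328'

# Extract the pod label from the label set
echo "$line" | sed -E 's/.*pod="([^"]*)".*/\1/'

# Extract the sample value (the whitespace-separated field after the labels)
echo "$line" | awk '{print $2}'
```

The same pattern works for filtering the full `metrics` output, e.g. `grep DCGM_FI_DEV_SM_CLOCK metrics` before piping into `sed` or `awk`.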
