Verifying MIG and the UGE Job Scheduler on the DGX A100
At GDEP Advance, we used our in-house DGX A100 test-drive machine to verify how MIG (Multi-Instance GPU) works with UGE (Univa Grid Engine), one of the major job schedulers for Docker-based GPU clusters. The verification was carried out with the cooperation of ULGS Co., Ltd., which has an extensive track record of UGE deployments in Japan.
So let's test whether MIG-aware Docker jobs can be submitted through UGE on an NVIDIA DGX A100. Broadly speaking, there are two ways to partition a GPU into MIG instances; see the MIG user guide for details: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
For this test, we carve each GPU into the single largest GPU instance, MIG 7g.40gb, and then isolate just the SMs within it in units of 14 by creating seven 7_1_slice compute instances. Applying this to all 8 GPUs yields 56 instances. Running nvidia-smi -L lists their device UUIDs, so we register that UUID information with UGE. Specifically, we create an RSMAP complex named mig, as shown below, and register all 56 instances.
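The exact complex definition used in the test is not reproduced here, so the following is only a rough sketch of the idea, with hypothetical UUIDs (assuming the driver 450-era format MIG-GPU-&lt;gpu-uuid&gt;/&lt;gi&gt;/&lt;ci&gt;): mig is defined as an RSMAP complex via qconf -mc, and the execution host's complex_values lists all 56 MIG UUIDs, e.g. mig=56(UUID1 UUID2 ...). A small self-contained example of turning nvidia-smi -L style output into such an entry:

```shell
# Toy illustration with hypothetical UUIDs: turn `nvidia-smi -L`-style output
# into a UGE RSMAP complex_values entry like  mig=56(UUID1 UUID2 ...)
sample='GPU 0: A100-SXM4-40GB (UUID: GPU-fake0000)
  MIG 1c.7g.40gb Device 0: (UUID: MIG-GPU-fake0000/0/0)
  MIG 1c.7g.40gb Device 1: (UUID: MIG-GPU-fake0000/0/1)'

# Extract only the MIG device UUIDs, one per line (GPU-level UUIDs are skipped)
uuids=$(printf '%s\n' "$sample" | sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p')
count=$(printf '%s\n' "$uuids" | wc -l)
line=$(printf 'mig=%d(%s)' "$count" "$(printf '%s' "$uuids" | tr '\n' ' ')")
echo "$line"   # mig=2(MIG-GPU-fake0000/0/0 MIG-GPU-fake0000/0/1)
```

With the UUIDs registered this way, UGE can grant each job one id from the resource map.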
Now let's submit some jobs. As described in the documentation below,
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html
when using MIG instances inside a Docker container, you set NVIDIA_VISIBLE_DEVICES to the instance's UUID. Since the UUIDs are registered in the UGE complex, we let UGE hand each job a UUID so that no two jobs receive the same one. Let's submit 56 jobs as follows:
for i in `seq 0 55`
do
qsub -l mig=1,docker,docker_images="*nvcr.io/nvidia/cuda:11.0-sample*" -xd '--runtime=nvidia,-e NVIDIA_VISIBLE_DEVICES=${mig(0)}' ./test-bench.sh
done
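The ${mig(0)} placeholder in the -xd option expands to the RSMAP id (here, a MIG UUID) that UGE granted to the job. Conceptually, the RSMAP complex behaves like a pool of 56 ids from which each running job checks one out, so no two concurrent containers see the same UUID. A toy shell sketch of that bookkeeping (not actual UGE code):

```shell
# Toy simulation of RSMAP-style id hand-out (not actual UGE internals): each
# job takes the first free id from the pool, so concurrent jobs never share one.
pool="MIG-id-0 MIG-id-1 MIG-id-2"
for job in 1 2 3; do
  set -- $pool                # word-split the pool; $1 is the first free id
  echo "job $job gets NVIDIA_VISIBLE_DEVICES=$1"
  shift
  pool="$*"                   # the remaining ids stay available for later jobs
done
```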
The script itself simply runs the nbody benchmark:
# cat test-bench.sh
#!/bin/bash
#$ -S /bin/bash
/usr/local/cuda/samples/bin/x86_64/linux/release/nbody --benchmark -numbodies=409600
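The nbody path above comes from a CUDA samples tree built into the container image. The article does not show how the image was produced, so the following Dockerfile is purely a hypothetical sketch; the base image tag, package names, and sample paths are assumptions:

```dockerfile
# Hypothetical sketch only: the actual build of nvcr.io/nvidia/cuda:11.0-sample
# is not shown in the article. Packages and paths are assumptions.
FROM nvcr.io/nvidia/cuda:11.0-devel-ubuntu18.04

# nbody compiles against OpenGL/GLUT headers even when run in benchmark mode
RUN apt-get update && apt-get install -y --no-install-recommends \
        make g++ freeglut3-dev libglu1-mesa-dev libxi-dev libxmu-dev && \
    rm -rf /var/lib/apt/lists/*

# Copy in a CUDA samples tree (e.g. from a toolkit install) and build nbody;
# binaries land under samples/bin/x86_64/linux/release, matching the job script
COPY NVIDIA_CUDA-11.0_Samples/ /usr/local/cuda/samples/
RUN make -C /usr/local/cuda/samples/5_Simulations/nbody
```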
Note that nvcr.io/nvidia/cuda:11.0-sample is not a publicly available image; it is an image we built from the devel image by installing and building the CUDA samples so that nbody can be run. Running nvidia-smi while these jobs are executing shows:
clouduser@dgxa100-01:~$ nvidia-smi
Fri Sep 4 02:11:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | On |
| N/A 47C P0 243W / 400W | 990MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:0F:00.0 Off | On |
| N/A 45C P0 238W / 400W | 990MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:47:00.0 Off | On |
| N/A 38C P0 191W / 400W | 732MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:4E:00.0 Off | On |
| N/A 28C P0 43W / 400W | 87MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB On | 00000000:87:00.0 Off | On |
| N/A 32C P0 43W / 400W | 11MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB On | 00000000:90:00.0 Off | On |
| N/A 31C P0 45W / 400W | 0MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | On |
| N/A 31C P0 42W / 400W | 0MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | On |
| N/A 31C P0 45W / 400W | 0MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 0 0 0 | 990MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 0 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 1 0 0 0 | 990MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 1 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 2 0 0 0 | 990MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 2 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 3 0 0 0 | 990MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 3 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 4 0 0 0 | 522MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 4 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 5 0 0 0 | 399MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 5 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 6 0 0 0 | 39MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 6 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
| 7 0 0 0 | 73MiB / 40537MiB | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 1 1 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 2 2 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 3 3 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 4 4 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 5 5 | | 14 0 | 7 0 5 1 1 |
+------------------+ +-----------+-----------------------+
| 7 0 6 6 | | 14 0 | 7 0 5 1 1 |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 0 0 51971 C ...86_64/linux/release/nbody 141MiB |
| 0 0 1 52053 C ...86_64/linux/release/nbody 141MiB |
| 0 0 2 52050 C ...86_64/linux/release/nbody 141MiB |
| 0 0 3 52127 C ...86_64/linux/release/nbody 141MiB |
| 0 0 4 52742 C ...86_64/linux/release/nbody 141MiB |
| 0 0 5 52749 C ...86_64/linux/release/nbody 141MiB |
| 0 0 6 53150 C ...86_64/linux/release/nbody 141MiB |
| 1 0 0 52745 C ...86_64/linux/release/nbody 141MiB |
| 1 0 1 53147 C ...86_64/linux/release/nbody 141MiB |
| 1 0 2 54211 C ...86_64/linux/release/nbody 141MiB |
| 1 0 3 54214 C ...86_64/linux/release/nbody 141MiB |
| 1 0 4 55582 C ...86_64/linux/release/nbody 141MiB |
| 1 0 5 55575 C ...86_64/linux/release/nbody 141MiB |
| 1 0 6 55654 C ...86_64/linux/release/nbody 141MiB |
| 2 0 0 55661 C ...86_64/linux/release/nbody 141MiB |
| 2 0 1 55663 C ...86_64/linux/release/nbody 141MiB |
| 2 0 2 55813 C ...86_64/linux/release/nbody 141MiB |
| 2 0 3 55927 C ...86_64/linux/release/nbody 141MiB |
| 2 0 4 55894 C ...86_64/linux/release/nbody 141MiB |
| 2 0 5 55969 C ...86_64/linux/release/nbody 141MiB |
| 2 0 6 55972 C ...86_64/linux/release/nbody 141MiB |
| 3 0 0 55975 C ...86_64/linux/release/nbody 141MiB |
| 3 0 1 56048 C ...86_64/linux/release/nbody 141MiB |
| 3 0 2 56055 C ...86_64/linux/release/nbody 141MiB |
| 3 0 3 56061 C ...86_64/linux/release/nbody 141MiB |
| 3 0 4 56058 C ...86_64/linux/release/nbody 141MiB |
| 3 0 5 56064 C ...86_64/linux/release/nbody 141MiB |
| 3 0 6 56145 C ...86_64/linux/release/nbody 141MiB |
| 4 0 0 56172 C ...86_64/linux/release/nbody 141MiB |
| 4 0 1 56168 C ...86_64/linux/release/nbody 141MiB |
| 4 0 2 56197 C ...86_64/linux/release/nbody 141MiB |
| 4 0 3 56209 C ...86_64/linux/release/nbody 141MiB |
| 4 0 4 56352 C ...86_64/linux/release/nbody 141MiB |
| 4 0 5 56293 C ...86_64/linux/release/nbody 141MiB |
| 4 0 6 56359 C ...86_64/linux/release/nbody 141MiB |
| 5 0 0 56364 C ...86_64/linux/release/nbody 141MiB |
| 5 0 1 56355 C ...86_64/linux/release/nbody 141MiB |
| 5 0 2 56398 C ...86_64/linux/release/nbody 141MiB |
| 5 0 3 56373 C ...86_64/linux/release/nbody 141MiB |
| 5 0 4 56404 C ...86_64/linux/release/nbody 141MiB |
| 5 0 5 56401 C ...86_64/linux/release/nbody 141MiB |
| 5 0 6 56455 C ...86_64/linux/release/nbody 141MiB |
| 6 0 0 56661 C ...86_64/linux/release/nbody 141MiB |
| 6 0 1 56685 C ...86_64/linux/release/nbody 141MiB |
| 6 0 2 56605 C ...86_64/linux/release/nbody 141MiB |
| 6 0 3 56690 C ...86_64/linux/release/nbody 141MiB |
| 6 0 4 56707 C ...86_64/linux/release/nbody 99MiB |
| 6 0 5 56687 C ...86_64/linux/release/nbody 141MiB |
| 6 0 6 56665 C ...86_64/linux/release/nbody 141MiB |
| 7 0 0 56701 C ...86_64/linux/release/nbody 141MiB |
| 7 0 1 56679 C ...86_64/linux/release/nbody 141MiB |
| 7 0 2 56704 C ...86_64/linux/release/nbody 141MiB |
| 7 0 3 56693 C ...86_64/linux/release/nbody 141MiB |
| 7 0 4 56715 C ...86_64/linux/release/nbody 99MiB |
| 7 0 5 56722 C ...86_64/linux/release/nbody 99MiB |
| 7 0 6 56712 C ...86_64/linux/release/nbody 141MiB |
+-----------------------------------------------------------------------------+
As you can see, nbody is running on each of the 56 separate MIG instances. Checking the job status with qstat shows all 56 jobs running concurrently:
clouduser@dgxa100-01:~$ qstat
job-ID prior name user state submit/start at queue jclass slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
139 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
140 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
141 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
142 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
143 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
144 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
145 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
146 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
147 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
148 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
149 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
150 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
151 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
152 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
153 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
154 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
155 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
156 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
157 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
158 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
159 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
160 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
161 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
162 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
163 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
164 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
165 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
166 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
167 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
168 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
169 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
170 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
171 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
172 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
173 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
174 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
175 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
176 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
177 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
178 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
179 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
180 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
181 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
182 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
183 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
184 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
185 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
186 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
187 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
188 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
189 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
190 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
191 0.55500 test-bench clouduser r 09/04/2020 02:11:07 all.q@dgxa100-01 1
For example, here is the output of job ID 190:
cat test-bench.sh.o190
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.0
> Compute 8.0 CUDA device: [A100-SXM4-40GB MIG 1c.7g.40gb]
number of bodies = 409600
409600 bodies, total time for 10 iterations: 18523.258 ms
= 90.574 billion interactions per second
= 1811.476 single-precision GFLOP/s at 20 flops per interaction
Looking at the completed runs, each instance delivers roughly 1/7 to 1/8 of a full GPU, so for a workload like nbody the partitions can run simultaneously without interfering with one another.
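As a sanity check on the reported figures, the arithmetic can be reproduced directly (409600 bodies, 10 iterations, 18523.258 ms total, 20 flops per interaction):

```shell
# Reproduce nbody's reported throughput figures from its raw numbers:
# 409600^2 pairwise interactions per iteration, 10 iterations, 18.523258 s total
interactions_per_sec=$(awk 'BEGIN { n=409600; printf "%.3f", n*n*10/18.523258/1e9 }')
gflops=$(awk 'BEGIN { n=409600; printf "%.3f", n*n*10/18.523258*20/1e9 }')
echo "${interactions_per_sec} billion interactions per second"   # 90.574
echo "${gflops} single-precision GFLOP/s"                        # 1811.476
```

These match the 90.574 billion interactions per second and 1811.476 GFLOP/s that nbody printed.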
This verification was done with UGE 8.6.14, which does not yet support docker's --gpus option, so we specified --runtime=nvidia, i.e. the nvidia-docker v2 convention. The --runtime=nvidia style of nvidia-docker v2 was once expected to be phased out, but it now appears it will live on for compatibility with other OCI runtimes.
We did identify a few issues, but they have already been reported to Univa and will be resolved once a fixed version is released. Even so, we were able to demonstrate that, by using Docker containers, MIG instances can be shared out without jobs overlapping on the same instance. For simplicity, this test used uniform compute instances, but we also confirmed that using different instance profiles for GPU instances and compute instances works without problems.
Because the evaluation period was short, we did not get as far as examining how MIG instances behave under the two nvidia-capabilities modes (proc and device). Given another opportunity, we would like to look into those and further verify control of MIG instances through UGE.
This verification used the first DGX A100 test-drive machine in Japan. If you would like to see the DGX A100's performance for yourself, or run verification with your own real data, please feel free to contact GDEP Advance.
Copyright©GDEP ADVANCE,Inc. All Rights reserved.