using MPI
using CUDA
using Printf

function print_aligned_matrix_with_color(matrix::Array{String,2})
    @info "MPI GPU Unidirectional P2P Bandwidth Test [GB/s]"
    col_widths = [maximum(length.(matrix[:, i])) for i in 1:size(matrix, 2)]
    for row in 1:size(matrix, 1)
        for col in 1:size(matrix, 2)
            if row == 1 || col == 1
                print("\e[1;32m", lpad(matrix[row, col], col_widths[col]), "\e[0m ")
            else
                print(lpad(matrix[row, col], col_widths[col]), " ")
            end
        end
        println()
    end
end

# unidirectional p2p test
function main(gpu_num)
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm) # [0 1] only two ranks
    n = 16384
    nbench = 5
    if rank == 0
        result = Array{String, 2}(undef, gpu_num+1, gpu_num+1)
        result[1, :] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
        result[:, 1] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
    end
    # two ranks control two GPUs, each pair of GPUs will test `nbench` times (i.e. 5 times)
    for dev_src in 0:1:gpu_num-1, dev_dst in 0:1:gpu_num-1
        # prepare data on devices
        if rank == 0
            CUDA.device!(dev_src)
            send_mesg = CUDA.rand(Float32, n, n)
            datasize = sizeof(send_mesg)/1024^3
            trst = zeros(nbench) # save the runtime results of 5 tests
        elseif rank == 1
            CUDA.device!(dev_dst)
            recv_mesg = CUDA.zeros(Float32, n, n)
        end
        CUDA.synchronize()
        # start to test for 5 times
        for itime in 1:nbench
            if rank == 0
                tic = MPI.Wtime()
                MPI.Send(send_mesg, comm; dest=1, tag=666)
            elseif rank == 1
                rreq = MPI.Irecv!(recv_mesg, comm; source=0, tag=666)
                MPI.Wait(rreq)
            end
            CUDA.synchronize()
            MPI.Barrier(comm)
            if rank == 0
                toc = MPI.Wtime()
                trst[itime] = toc-tic
            end
        end
        # the average time will only take from the 2nd to the 5th
        # convert GiB to GB
        if rank == 0
            speed = @sprintf "%.2f" datasize/(sum(trst[2:end])/(nbench-1)) * (2^30)/(10^9)
            result[dev_src+2, dev_dst+2] = speed
        end
    end
    # print the results
    if rank == 0
        print_aligned_matrix_with_color(result)
    end
    return nothing
end

gpu_num = parse(Int, ARGS[1])
main(gpu_num)
Which module are you using on node40?
Also, can you try mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8 to enforce better binding given the NUMA crap?
Also, unless I miss some understanding, you're using CUDA-aware MPI to perform the copying, while the NVIDIA sample you're pointing at does not use MPI but some kind of direct peer access.
I guess if you want to compare, we would rather need to use a kind of fancy copyto! that CUDA.jl would overload to allow for peer access instead of CUDA-aware MPI.
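Something like the untested sketch below is what I have in mind: CUDA.jl overloads copyto! for CuArrays that live on different devices, so the copy goes through the driver's peer path (or staging via the host) without MPI. The array size and device ids simply mirror your script; the function name and timing details are my own placeholders:

using CUDA
using Printf

# Rough sketch: time a device-to-device copyto! (no MPI involved).
function copyto_bandwidth(dev_src, dev_dst; n=16384, nbench=5)
    CUDA.device!(dev_src)
    src = CUDA.rand(Float32, n, n)
    CUDA.device!(dev_dst)
    dst = CUDA.zeros(Float32, n, n)
    t = zeros(nbench)
    for i in 1:nbench
        CUDA.synchronize()
        tic = time()
        copyto!(dst, src)        # CUDA.jl handles the cross-device copy (peer access when possible)
        CUDA.synchronize()
        t[i] = time() - tic
    end
    avg = sum(t[2:end]) / (nbench - 1)   # skip the first (warm-up) run
    return sizeof(src) / 1e9 / avg       # GB/s
end

@printf "GPU 0 -> GPU 1: %.2f GB/s\n" copyto_bandwidth(0, 1)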
> Which module are you using on node40?
> Also, can you try mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8 to enforce better binding given the NUMA crap?
The modules I loaded on node40:
# !! User specific aliases and functions
module purge > /dev/null 2>&1
module load tbb/4.4
module load openmpi/gcc83-316-c112
module load cuda/12.1
module load matlab/R2020b
module load Qt/5.1.1

I tried your command; the bandwidth is a little bit larger, but it still has problems on the same device. Below is the result of running: mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8

> I guess if you want to compare, we would rather need to use a kind of fancy copyto! that CUDA.jl would overload to allow for peer access instead of CUDA-aware MPI.
Yes! Recently I checked the ways to manage multiple GPUs. It seems like for the V100s on Octopus, there are two ways to do it:
- CUDA p2p
- CUDA-aware MPI
To my understanding, CUDA p2p should be the fastest way to manage devices; however, I cannot do it in Julia easily (I don't know how to use the low-level API to do in-kernel p2p 😅). So I want to compare the bandwidth between these two methods.
CUDA.jl already provides copyto! to transfer data between different devices, which is what GPUInspector.jl uses. I checked their code: they return nothing when the data transfer happens on the same device. Below is the result from GPUInspector.jl:
8×8 Matrix{Union{Nothing, Float64}}:
nothing 22.5336 22.531 44.9758 44.9862 8.06306 8.4857 8.48199
22.5205 nothing 44.9897 22.5323 8.07263 45.0001 8.47995 8.46755
22.5236 45.0001 nothing 45.0001 7.86415 7.87654 22.5314 7.86654
44.9688 22.5319 44.9949 nothing 7.86638 7.87836 7.8814 22.5323
44.9949 8.0966 8.48224 8.48644 nothing 22.534 22.5332 44.9966
8.09356 44.9949 8.48428 8.46842 22.534 nothing 44.9914 22.5349
7.8758 7.875 22.5327 7.8798 22.5327 44.9879 nothing 44.9966
7.86239 7.87468 7.8799 22.5384 44.981 22.5292 44.9932 nothing
Although the result still has a problem on the same device, data transfer between different devices looks fine compared to CUDA p2p, so it's OK for my code. But if we could use CUDA in-kernel p2p it would be better, because in-kernel p2p allows us to do some computing during communication instead of copying the data first and then computing on it, which is good for small data transfers. Do you know of any examples of doing this with CUDA.jl? It should look like this:
using CUDA, KernelAbstractions

device!(0)
src = CUDA.rand(10)
device!(1)
dst = CUDA.rand(10)

@kernel function main(src, dst)
    ix = @index(Global)
    dst[ix] += src[ix]^2
    return nothing
end

main(CUDABackend())(src, dst; ndrange=10)

And I have two questions:
1. Normally, our workflow is to copy the data to the dst device and then do the computing on the dst device. This means that MPI only transfers the data across devices. Is that correct? Or can we also do some custom computing in CUDA-aware MPI while copying the data, like CUDA in-kernel p2p?
2. When we have some data for the simulation on multi-GPU, would you save it in a struct? For example, a struct called Particle which has some fields like velocity, mass, etc. If so, the struct which contains CuArrays must be immutable, so how do you change the size of the data? For example, the length of the velocity field was 100, but it changes to 157 in the next time step because the particles are moving across devices.
> MPI only transfers the data across devices.

Yes, we use MPI only to transfer data among devices, as ideally we do not need any offloading to the CPU during the GPU computation (our local model fits into GPU RAM).
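To illustrate the first point: with CUDA-aware MPI you cannot compute inside the copy the way in-kernel p2p would, but you can overlap communication with computation by posting non-blocking sends/receives and launching kernels on other data while the transfer is in flight. A rough, untested sketch of that pattern (halo_send, halo_recv, interior and compute_interior! are placeholder names, not from your code):

using MPI, CUDA

# Sketch: overlap a halo exchange with computation on the interior of the local domain.
# halo_send, halo_recv, interior and compute_interior! are placeholders for illustration.
function exchange_and_compute!(halo_send, halo_recv, interior, other, comm)
    CUDA.synchronize()                                    # halo_send must be ready before posting the send
    rreq = MPI.Irecv!(halo_recv, comm; source=other, tag=0)
    sreq = MPI.Isend(halo_send, comm; dest=other, tag=0)
    compute_interior!(interior)                           # kernel launch; runs while MPI moves the halo
    MPI.Wait(rreq)                                        # halo has arrived
    MPI.Wait(sreq)
    CUDA.synchronize()                                    # interior kernel has finished
    return nothing
end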
> simulation on Multi-GPU, will you save them in a struct

Struct or not, we will have to store some things, yes. Indeed, to get good performance in a distributed setting, we may need to be slightly more "static" than we otherwise would be. This approach is followed in JustPIC.jl. I guess one should do similar things, or maybe even use that package.
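By "static" I mean something like preallocating the particle buffers to a maximum capacity per device and only tracking how many slots are in use, instead of resizing the CuArrays whenever particles migrate. A toy sketch of that idea (the Particles struct and its fields are made up for illustration; this is not how JustPIC.jl is actually implemented):

using CUDA

# Toy "static" particle container: fields are preallocated to `capacity`
# and only the first `n[]` entries are considered active.
struct Particles{T}
    velocity::CuVector{T}
    mass::CuVector{T}
    n::Base.RefValue{Int}   # number of active particles; can change without reallocating
end

Particles{T}(capacity::Int) where {T} =
    Particles{T}(CUDA.zeros(T, capacity), CUDA.zeros(T, capacity), Ref(0))

# Going from 100 to 157 active particles only updates the counter
# (as long as 157 <= capacity); the CuArrays themselves are never resized.
p = Particles{Float32}(1024)
p.n[] = 100
# ... particles migrate in from a neighbouring device ...
p.n[] = 157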
Thanks, I will check this package~😊
And my question is:
Why is the bandwidth on the same GPU device only around half of that in the NVIDIA test?