@ZenanH
Created February 15, 2024 14:15
using MPI
using CUDA
using Printf

function print_aligned_matrix_with_color(matrix::Array{String,2})
    @info "MPI GPU Unidirectional P2P Bandwidth Test [GB/s]"
    col_widths = [maximum(length.(matrix[:, i])) for i in 1:size(matrix, 2)]
    for row in 1:size(matrix, 1)
        for col in 1:size(matrix, 2)
            if row == 1 || col == 1
                # header row/column in bold green
                print("\e[1;32m", lpad(matrix[row, col], col_widths[col]), "\e[0m ")
            else
                print(lpad(matrix[row, col], col_widths[col]), " ")
            end
        end
        println()
    end
end

# unidirectional P2P test
function main(gpu_num)
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm) # only two ranks: [0 1]
    n      = 16384
    nbench = 5
    if rank == 0
        result = Array{String, 2}(undef, gpu_num+1, gpu_num+1)
        result[1, :] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
        result[:, 1] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
    end
    # the two ranks drive two GPUs; each (src, dst) pair is tested `nbench` times (i.e. 5 times)
    for dev_src in 0:1:gpu_num-1, dev_dst in 0:1:gpu_num-1
        # prepare data on the devices
        if rank == 0
            CUDA.device!(dev_src)
            send_mesg = CUDA.rand(Float32, n, n)
            datasize = sizeof(send_mesg)/1024^3   # message size in GiB
            trst = zeros(nbench)                  # runtimes of the `nbench` repetitions
        elseif rank == 1
            CUDA.device!(dev_dst)
            recv_mesg = CUDA.zeros(Float32, n, n)
        end
        CUDA.synchronize()
        # run the transfer `nbench` times
        for itime in 1:nbench
            if rank == 0
                tic = MPI.Wtime()
                MPI.Send(send_mesg, comm; dest=1, tag=666)
            elseif rank == 1
                rreq = MPI.Irecv!(recv_mesg, comm; source=0, tag=666)
                MPI.Wait(rreq)
            end
            CUDA.synchronize()
            MPI.Barrier(comm)
            if rank == 0
                toc = MPI.Wtime()
                trst[itime] = toc-tic
            end
        end
        # average only over the 2nd to the `nbench`-th run (skip the warm-up run),
        # and convert GiB/s to GB/s
        if rank == 0
            speed = @sprintf "%.2f" datasize/(sum(trst[2:end])/(nbench-1)) * (2^30)/(10^9)
            result[dev_src+2, dev_dst+2] = speed
        end
    end
    # print the results
    if rank == 0
        print_aligned_matrix_with_color(result)
    end
    return nothing
end

gpu_num = parse(Int, ARGS[1])
main(gpu_num)
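
For reference, the script is launched with exactly two MPI ranks and the number of GPUs as its argument; the filename t5_mpi_p2p.jl and the command below are taken from the discussion further down. A CUDA-aware MPI build is required so that MPI.Send / MPI.Irecv! can operate directly on the GPU buffers:

mpirun -n 2 julia -O3 --color=yes t5_mpi_p2p.jl 8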
ZenanH (Author) commented Feb 15, 2024

And my question is:

Why is the bandwidth on the same GPU device only around half of what the NVIDIA test reports?

luraess commented Feb 16, 2024

Which module are you using on node40?

Also, can you try mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8 to enforce better binding given the NUMA crap?
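
If it helps to verify what the launcher actually does, Open MPI (which the module list below suggests is in use) can report the resulting binding; a possible variant of the command above would be:

mpirun -n 2 --bind-to socket --report-bindings julia -O3 --color=yes t5_mpi_p2p.jl 8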

luraess commented Feb 16, 2024

Also, unless I'm missing something, you're using CUDA-aware MPI to perform the copying, while the NVIDIA sample you're pointing at does not use MPI but some kind of direct peer access.

luraess commented Feb 16, 2024

I guess if you want to compare, we would rather need to use a kind of fancy copyto! that CUDA.jl would overload to allow for peer access, instead of CUDA-aware MPI.
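
For reference, a cross-device copy with CUDA.jl's copyto! could look roughly like the sketch below (a minimal sketch, assuming two visible devices; whether the transfer actually goes over peer access or falls back to staging through the host depends on the system and on whether peer access is enabled between the two GPUs):

using CUDA

# source array on device 0
CUDA.device!(0)
src = CUDA.rand(Float32, 1024, 1024)

# destination array on device 1
CUDA.device!(1)
dst = CUDA.zeros(Float32, 1024, 1024)

# device-to-device copy; CUDA.jl dispatches this to a GPU-side memcpy,
# which can use peer-to-peer access when the two devices support it
copyto!(dst, src)
CUDA.synchronize()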

ZenanH (Author) commented Feb 16, 2024

> Which module are you using on node40?
>
> Also, can you try mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8 to enforce better binding given the NUMA crap?

The modules I load on node40:

# !! User specific aliases and functions
module purge > /dev/null 2>&1
module load tbb/4.4
module load openmpi/gcc83-316-c112
module load cuda/12.1
module load matlab/R2020b
module load Qt/5.1.1

I tried your command; the bandwidth is a little larger, but the same-device case still has the problem. Below is the result of running mpirun -n 2 --bind-to socket julia -O3 --color=yes t5_mpi_p2p.jl 8:
[screenshot of the resulting bandwidth matrix]

ZenanH (Author) commented Feb 16, 2024

> I guess if you want to compare, we would rather need to use a kind of fancy copyto! that CUDA.jl would overload to allow for peer access, instead of CUDA-aware MPI.

Yes! Recently I have been checking the methods for managing multiple GPUs. It seems that for the V100s on Octopus there are two ways to do it:

  • CUDA p2p
  • CUDA-aware MPI

To my understanding, CUDA P2P should be the fastest way to manage the devices; however, I cannot do it easily in Julia (I don't know how to use the low-level API to do in-kernel P2P 😅). So I want to compare the bandwidth between these two methods.

CUDA.jl already provides copyto! to transfer data between different devices; that is what GPUInspector.jl uses. I checked their code: they return nothing when the transfer happens on the same device. Below is the result from GPUInspector.jl:

8×8 Matrix{Union{Nothing, Float64}}:
   nothing  22.5336    22.531     44.9758    44.9862     8.06306    8.4857     8.48199
 22.5205      nothing  44.9897    22.5323     8.07263   45.0001     8.47995    8.46755
 22.5236    45.0001      nothing  45.0001     7.86415    7.87654   22.5314     7.86654
 44.9688    22.5319    44.9949      nothing   7.86638    7.87836    7.8814    22.5323
 44.9949     8.0966     8.48224    8.48644     nothing  22.534     22.5332    44.9966
  8.09356   44.9949     8.48428    8.46842   22.534       nothing  44.9914    22.5349
  7.8758     7.875     22.5327     7.8798    22.5327    44.9879      nothing  44.9966
  7.86239    7.87468    7.8799    22.5384    44.981     22.5292    44.9932      nothing

Although the result still has a problem on the same device, the data transfer between different devices looks fine compared to CUDA P2P, so it is OK for my code. But it would be better if we could use CUDA in-kernel P2P, because in-kernel P2P allows us to do some computation during communication instead of copying the data first and then computing on it, which is good for small transfers. Do you know of any examples of doing this with CUDA.jl? It should look like this:

using CUDA
using KernelAbstractions

device!(0)
src = CUDA.rand(10)   # lives on device 0
device!(1)
dst = CUDA.rand(10)   # lives on device 1

@kernel function main(src, dst)
    ix = @index(Global)
    # the kernel runs on the currently active device (1) and reads `src`,
    # which lives on device 0; this only works if peer access
    # between the two devices is available
    dst[ix] += src[ix]^2
end

main(CUDABackend())(src, dst; ndrange=10)
CUDA.synchronize()

ZenanH (Author) commented Feb 16, 2024

And I have two questions:

  • Normally our workflow is to copy the data to the dst device and then do the computation on the dst device, which means MPI only transfers the data across devices. Is that correct? Or can we also do some custom computation while copying the data with CUDA-aware MPI, like CUDA in-kernel P2P?

  • When we have data for a simulation on multiple GPUs, do you store it in a struct? For example, a struct called Particle with fields like velocity, mass, etc. If so, the struct containing the CuArrays must be immutable, so how do you change the size of the data? For example, the velocity field has length 100, and in the next time step it becomes 157 because particles move across devices.

luraess commented Feb 20, 2024

> MPI only transfers the data across devices.

Yes, we use MPI only to transfer data among devices, since ideally we do not need any offloading to the CPU during the on-GPU compute (our local model fits into GPU RAM).

> simulation on Multi-GPU, will you save them in a struct

Struct or not, we will have to store some things, yes. Indeed, to get good performance in a distributed setting we may need to be slightly more "static" than we otherwise could be. This approach is followed in JustPIC.jl. I guess one should do similar things, or maybe even use that package.
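
To illustrate the "resize when particles migrate" concern from the question above, one possible pattern is a mutable wrapper whose CuArray fields are simply rebound when the local particle count changes. This is a minimal sketch; the Particles type and resize_particles! helper are hypothetical and not part of JustPIC.jl:

using CUDA

# hypothetical container for the per-rank particle data;
# mutable so the CuArray fields can be rebound when the count changes
mutable struct Particles
    velocity::CuVector{Float32}
    mass::CuVector{Float32}
end

# allocate new device arrays of length `n` and keep the data that still fits
function resize_particles!(p::Particles, n::Int)
    k = min(length(p.velocity), n)
    velocity = CUDA.zeros(Float32, n)
    mass     = CUDA.zeros(Float32, n)
    copyto!(velocity, 1, p.velocity, 1, k)   # device-to-device copy of the surviving entries
    copyto!(mass,     1, p.mass,     1, k)
    p.velocity = velocity
    p.mass     = mass
    return p
end

# usage: 100 particles locally, then 157 after migration
p = Particles(CUDA.rand(Float32, 100), CUDA.rand(Float32, 100))
resize_particles!(p, 157)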

ZenanH (Author) commented Feb 20, 2024

Thanks, I will check this package~😊
