@ZenanH
Created February 15, 2024 14:15
using MPI
using CUDA
using Printf

function print_aligned_matrix_with_color(matrix::Array{String,2})
    @info "MPI GPU Unidirectional P2P Bandwidth Test [GB/s]"
    col_widths = [maximum(length.(matrix[:, i])) for i in 1:size(matrix, 2)]
    for row in 1:size(matrix, 1)
        for col in 1:size(matrix, 2)
            if row == 1 || col == 1
                # header row/column in bold green
                print("\e[1;32m", lpad(matrix[row, col], col_widths[col]), "\e[0m ")
            else
                print(lpad(matrix[row, col], col_widths[col]), " ")
            end
        end
        println()
    end
end

# unidirectional p2p test
function main(gpu_num)
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm) # [0 1] only two ranks
    n = 16384
    nbench = 5
    if rank == 0
        result = Array{String, 2}(undef, gpu_num+1, gpu_num+1)
        result[1, :] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
        result[:, 1] .= ["GPU/GPU"; string.(0:1:gpu_num-1)]
    end
    # the two ranks drive the two GPUs of each pair; every pair is tested `nbench` times (here 5)
    for dev_src in 0:1:gpu_num-1, dev_dst in 0:1:gpu_num-1
        # prepare data on devices
        if rank == 0
            CUDA.device!(dev_src)
            send_mesg = CUDA.rand(Float32, n, n)
            datasize = sizeof(send_mesg)/1024^3
            trst = zeros(nbench) # runtimes of the `nbench` repetitions
        elseif rank == 1
            CUDA.device!(dev_dst)
            recv_mesg = CUDA.zeros(Float32, n, n)
        end
        CUDA.synchronize()
        # run the transfer `nbench` times
        for itime in 1:nbench
            if rank == 0
                tic = MPI.Wtime()
                MPI.Send(send_mesg, comm; dest=1, tag=666)
            elseif rank == 1
                rreq = MPI.Irecv!(recv_mesg, comm; source=0, tag=666)
                MPI.Wait(rreq)
            end
            CUDA.synchronize()
            MPI.Barrier(comm)
            if rank == 0
                toc = MPI.Wtime()
                trst[itime] = toc-tic
            end
        end
        # average over runs 2:nbench (the first run is treated as warm-up),
        # then convert GiB/s to GB/s
        if rank == 0
            speed = @sprintf "%.2f" datasize/(sum(trst[2:end])/(nbench-1)) * (2^30)/(10^9)
            result[dev_src+2, dev_dst+2] = speed
        end
    end
    # print the results
    if rank == 0
        print_aligned_matrix_with_color(result)
    end
    return nothing
end

gpu_num = parse(Int, ARGS[1])
main(gpu_num)
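
The script expects exactly two MPI ranks and takes the number of GPUs to test as its only argument, so it would typically be launched with something like `mpiexecjl -n 2 julia thisscript.jl 4` (the file name here is just whatever you saved the gist as). Before running, it is worth confirming that the MPI build is actually CUDA-aware, otherwise passing CuArrays to `MPI.Send` will fail; a quick check with MPI.jl:

using MPI
MPI.Init()
# true when MPI.jl detects a CUDA-aware MPI implementation,
# i.e. one that can send/receive device buffers directly
@show MPI.has_cuda()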
@ZenanH (Author) commented Feb 16, 2024

I have two questions:

Normally, our workflow is to copy the data to the destination device and then do the computation there. This means MPI only transfers the data across devices.

Is that correct? Or can we also do some custom computation while the data is being copied with CUDA-aware MPI, similar to CUDA in-kernel P2P?


When we have simulation data on multiple GPUs, do you store it in a struct? For example, a struct called Particle with fields such as velocity, mass, etc.

If so, a struct containing CuArrays is typically immutable, so how do you change the size of the data? For example, the velocity field has length 100 in one time step and 157 in the next, because particles move across devices.

@luraess commented Feb 20, 2024

MPI only transfers the data across devices.

Yes, we use MPI only to transfer data among devices, since ideally we do not need any offloading to the CPU during GPU compute (our local model fits into the GPU RAM).
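
(A minimal sketch of that point: the MPI copy itself cannot run user code, but with nonblocking calls one can overlap independent kernels with the transfer. This is illustrative only; it assumes two ranks and two GPUs as in the test above, and the array names are placeholders, not part of the script.)

using MPI
using CUDA

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
other = 1 - rank                      # assumes exactly two ranks
CUDA.device!(rank)

field     = CUDA.rand(Float32, 1024, 1024)
halo_send = field[:, end]             # boundary column sent to the neighbour
halo_recv = CUDA.zeros(Float32, 1024)

# Post nonblocking send/recv of the halo, run work that does not need it,
# then wait for the transfer and finish the boundary update.
sreq = MPI.Isend(halo_send, comm; dest=other, tag=0)
rreq = MPI.Irecv!(halo_recv, comm; source=other, tag=0)
field[:, 2:end-1] .*= 2f0             # stand-in "interior" compute, independent of the halo
MPI.Waitall([sreq, rreq])
field[:, 1] .= halo_recv              # boundary cells that needed the received data
CUDA.synchronize()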

simulation on Multi-GPU, will you save them in a struct

Struct or not, we will have to store some things, yes. Indeed, to get good performance in a distributed setting, we may need to be slightly more "static" than we otherwise would be. This approach is followed in JustPIC.jl. I guess one should do similar things, or maybe even use that package.
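
A minimal sketch of what "more static" can look like (this is an illustration, not the JustPIC.jl API): allocate the device arrays once with a fixed capacity and only track how many entries are currently active, so nothing needs to be resized when particles migrate between devices. The struct can simply be mutable, which also addresses the immutability concern above.

using CUDA

# Hypothetical container: fixed-capacity device buffers plus an active count.
mutable struct ParticleBuffers
    velocity::CuVector{Float32}   # length == capacity, allocated once
    mass::CuVector{Float32}
    nactive::Int                  # the only field that changes per time step
end

ParticleBuffers(capacity::Int) =
    ParticleBuffers(CUDA.zeros(Float32, capacity), CUDA.zeros(Float32, capacity), 0)

# Particles received from another device are written into the unused tail.
function append_particles!(p::ParticleBuffers, vel_in::CuVector{Float32},
                           mass_in::CuVector{Float32})
    n = length(vel_in)
    @assert p.nactive + n <= length(p.velocity) "capacity exceeded"
    p.velocity[p.nactive+1:p.nactive+n] .= vel_in
    p.mass[p.nactive+1:p.nactive+n]     .= mass_in
    p.nactive += n
    return p
end

# e.g. going from 100 to 157 active particles without reallocating:
p = ParticleBuffers(10_000)
p.nactive = 100
append_particles!(p, CUDA.rand(Float32, 57), CUDA.rand(Float32, 57))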

@ZenanH (Author) commented Feb 20, 2024

Thanks, I will check this package~😊
