Register Pressure
```julia
using KernelAbstractions
using CUDA
using BenchmarkTools

@inline function uGIMPbasis(Δx::T2, h::T2, lp::T2) where T2
    # integer type matching the floating-point precision
    T1 = T2 == Float32 ? Int32 : Int64
    absΔx = abs(Δx); signΔx = sign(Δx)
    # the three branches of the uGIMP shape function and its derivative
    c1  =  absΔx < (T2(0.5) * lp)
    c2  = (T2(0.5) * lp) ≤ absΔx < (h - T2(0.5) * lp)
    c3  = (h - T2(0.5) * lp) ≤ absΔx < (h + T2(0.5) * lp)
    Ni1 = T2(1.0) - ((T2(4) * (Δx ^ T1(2)) + lp ^ T1(2)) / (T2(4) * h * lp))
    Ni2 = T2(1.0) - (absΔx / h)
    Ni3 = ((h + T2(0.5) * lp - absΔx) ^ T1(2)) / (T2(2.0) * h * lp)
    dN1 = -((T2(8.0) * Δx) / (T2(4) * h * lp))
    dN2 = signΔx * (-T2(1.0) / h)
    dN3 = -signΔx * ((h + T2(0.5) * lp - absΔx) / (h * lp))
    N  = c1 * Ni1 + c2 * Ni2 + c3 * Ni3
    dN = c1 * dN1 + c2 * dN2 + c3 * dN3
    return T2(N), T2(dN)
end

@kernel inbounds = true function test1!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Global)
    for iy in Int32(1):Int32(64)
        p2n = mp_p2n[ix, iy]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1] - grid_pos[p2n, 1], Float64(2.0), Float64(1.0))
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2] - grid_pos[p2n, 2], Float64(3.0), Float64(2.0))
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3] - grid_pos[p2n, 3], Float64(4.0), Float64(3.0))
        mp_Ni[ix, iy]  = Nx * Ny * Nz
        mp_∂Nx[ix, iy] = dNx * Ny * Nz
        mp_∂Ny[ix, iy] = dNy * Nx * Nz
        mp_∂Nz[ix, iy] = dNz * Nx * Ny
    end
end

mp_p2n   = cu(rand(1:710733, 512000, 64) .|> Int64)
mp_Ni    = CUDA.zeros(Float64, 512000, 64)
mp_∂Nx   = CUDA.zeros(Float64, 512000, 64)
mp_∂Ny   = CUDA.zeros(Float64, 512000, 64)
mp_∂Nz   = CUDA.zeros(Float64, 512000, 64)
mp_pos   = CUDA.rand(Float64, 512000, 3)
grid_pos = CUDA.rand(Float64, 710733, 3)

# warm-up / compilation run
test1!(CUDABackend())(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos; ndrange = 512000)

# elapsed time in milliseconds
time = 1e3 * @belapsed begin
    test1!(CUDABackend())($mp_p2n, $mp_Ni, $mp_∂Nx, $mp_∂Ny, $mp_∂Nz, $mp_pos, $grid_pos; ndrange = 512000)
    CUDA.synchronize()
end
```
Hi @ZenanH! Thanks for the follow-up, cool that you could make progress with the implementation and get a speedup with loop unrolling =)
As for point 1, I guess it makes sense to figure out exactly what happens here. I've made a version of your example with some extra kernels:
which gives the following output on an NVIDIA A100 GPU:
Here's what happens in my opinion (ideally, one needs to profile the code with Nsight, which I didn't do, so take it with a grain of salt):
For optimal performance on GPUs it is crucial that global memory accesses are coalesced (see here for more info). To achieve that, we need to make sure that consecutive threads (or work items in AMD/KernelAbstractions jargon) access consecutive array indices (there's more to it, but I'm not really an expert in GPU optimisation, to be honest, and not the best person to explain these things). What happens in the transposed version of your code, i.e., `kernel2!`, is that, e.g., on the first iteration of the inner loop, when `iy = 1`, thread 1 accesses `a[1, 1]`, thread 2 accesses `a[1, 2]`, etc. Since Julia is column-major, that means that neighbouring threads access array elements that are 64 elements apart from each other. This is clearly not optimal for GPU access.
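The extra kernels from the linked example are not reproduced above, so here is a minimal sketch of what such a transposed variant could look like, assuming the `mp_*` arrays are allocated with the neighbour index first, i.e. size `(64, 512000)` instead of `(512000, 64)`; the name `test2!` and the launch details are illustrative only, not the exact `kernel2!`:

```julia
# Hedged sketch of a transposed variant (not the exact kernel2! from the linked gist).
# The mp_* arrays are assumed to be stored as (64, np), so with iy fixed the
# consecutive threads ix and ix + 1 touch elements that are 64 apart in memory.
@kernel inbounds = true function test2!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Global)
    for iy in Int32(1):Int32(64)
        p2n = mp_p2n[iy, ix]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1] - grid_pos[p2n, 1], 2.0, 1.0)
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2] - grid_pos[p2n, 2], 3.0, 2.0)
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3] - grid_pos[p2n, 3], 4.0, 3.0)
        mp_Ni[iy, ix]  = Nx * Ny * Nz   # strided (uncoalesced) writes across threads
        mp_∂Nx[iy, ix] = dNx * Ny * Nz
        mp_∂Ny[iy, ix] = dNy * Nx * Nz
        mp_∂Nz[iy, ix] = dNz * Nx * Ny
    end
end
```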
Why, then, is the first version (`kernel1!`) faster? I think that KernelAbstractions picks a block size (same as group size) that is proportional to the inner loop size, i.e., 64. Usually the preferred block size is proportional to the warp size, which is 32 on all NVIDIA devices. If the picked block size is equal to 64, then thread 1 accesses `a[1, 1]` and thread 2 accesses `a[2, 1]`, which is the optimal access pattern. I didn't think of it when we had this discussion at EGU, sorry =)

Now, one solution would be to always fix the number of threads in a block to 64 and use `kernel1!`, which is perfectly fine in your case. However, if the size of the inner loop is not 64 but something else, you might experience sudden performance issues on some architectures.
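In KernelAbstractions the work-group (block) size can be pinned when the kernel is instantiated, so a minimal sketch of that first option, reusing `test1!` from the code above, could look like this (the value 64 simply matches the inner loop length):

```julia
# Hedged sketch: fix the work-group size to 64 so the launch configuration no
# longer depends on whatever KernelAbstractions picks automatically.
kernel1_fixed = test1!(CUDABackend(), 64)   # static work-group size of 64 threads
kernel1_fixed(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos; ndrange = 512000)
CUDA.synchronize()
```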
Another possibility is to manually control the group size and enforce the correct memory access pattern. In kernels `kernel3!` and `kernel4!` I do exactly this by launching `n` groups, each having 64 threads. Then I use the local thread index instead of the inner loop. As you can see from the benchmarking, we now have exactly the opposite situation, with the transposed matrix being faster, and a bit faster than the old version as well.
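Again as a hedged sketch rather than the actual `kernel3!`/`kernel4!` (which are not shown above): one work group per material point, 64 threads per group, and the local thread index replacing the inner loop, assuming the transposed `(64, np)` storage for the `mp_*` arrays; `test4!` is an illustrative name:

```julia
# Hedged sketch in the spirit of the kernel3!/kernel4! description: n groups of
# 64 threads, local index instead of the inner loop, transposed (64, np) layout.
@kernel inbounds = true function test4!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Group, Linear)   # one work group per material point
    iy = @index(Local, Linear)   # one thread per neighbour weight (64 per group)
    p2n = mp_p2n[iy, ix]
    Nx, dNx = uGIMPbasis(mp_pos[ix, 1] - grid_pos[p2n, 1], 2.0, 1.0)
    Ny, dNy = uGIMPbasis(mp_pos[ix, 2] - grid_pos[p2n, 2], 3.0, 2.0)
    Nz, dNz = uGIMPbasis(mp_pos[ix, 3] - grid_pos[p2n, 3], 4.0, 3.0)
    # the threads of one group write a[1, ix], a[2, ix], ..., a[64, ix]: coalesced
    mp_Ni[iy, ix]  = Nx * Ny * Nz
    mp_∂Nx[iy, ix] = dNx * Ny * Nz
    mp_∂Ny[iy, ix] = dNy * Nx * Nz
    mp_∂Nz[iy, ix] = dNz * Nx * Ny
end

# launch 512000 groups of 64 threads, i.e. one thread per (particle, weight) pair
test4!(CUDABackend(), 64)(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos; ndrange = 512000 * 64)
```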
Finally, to decouple the work size from the inner loop size, we can use something similar to a "grid-stride loop" (see here for more info). In `kernel5` I apply this only to the first dimension of the array; however, it can be used for both dimensions, and then if for some reason you end up with more or fewer than 64 neighbour weights, you can change it independently of the block size.
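As a final hedged sketch (again, not the exact `kernel5`), a grid-stride-style loop over the particle dimension could look as follows; the number of launched groups `ngroups` and the particle count `np` are passed in explicitly here, and the transposed `(64, np)` layout from the previous sketches is assumed:

```julia
# Hedged sketch of a grid-stride-style loop over the particle (first) dimension.
# ngroups work groups of 64 threads are launched; each group then strides over
# particles ix, ix + ngroups, ix + 2*ngroups, ..., so the launch size no longer
# has to match the number of particles np.
@kernel inbounds = true function test5!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos, np, ngroups)
    iy = @index(Local, Linear)   # neighbour weight handled by this thread
    ig = @index(Group, Linear)   # first particle handled by this group
    for ix in ig:ngroups:np
        p2n = mp_p2n[iy, ix]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1] - grid_pos[p2n, 1], 2.0, 1.0)
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2] - grid_pos[p2n, 2], 3.0, 2.0)
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3] - grid_pos[p2n, 3], 4.0, 3.0)
        mp_Ni[iy, ix]  = Nx * Ny * Nz
        mp_∂Nx[iy, ix] = dNx * Ny * Nz
        mp_∂Ny[iy, ix] = dNy * Nx * Nz
        mp_∂Nz[iy, ix] = dNz * Nx * Ny
    end
end

# launch fewer groups than particles; each group processes several particles
ngroups = 4096
test5!(CUDABackend(), 64)(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos, 512000, ngroups; ndrange = ngroups * 64)
```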