@ZenanH
Last active May 8, 2024 16:37
Register Pressure
using KernelAbstractions
using CUDA
using BenchmarkTools

# uGIMP basis function: returns the weight N and its derivative dN for a
# particle-node distance Δx, cell spacing h, and particle size lp.
@inline function uGIMPbasis(Δx::T2, h::T2, lp::T2) where T2
    T1 = T2 == Float32 ? Int32 : Int64   # integer type matching the float precision
    absΔx = abs(Δx); signΔx = sign(Δx)
    c1 = absΔx<(T2(0.5)*lp)
    c2 = (T2(0.5)*lp)≤absΔx<(h-T2(0.5)*lp)
    c3 = (h-T2(0.5)*lp)≤absΔx<(h+T2(0.5)*lp)
    Ni1 = T2(1.0)-((T2(4)*(Δx^T1(2))+lp^T1(2))/(T2(4)*h*lp))
    Ni2 = T2(1.0)-(absΔx/h)
    Ni3 = ((h+T2(0.5)*lp-absΔx)^T1(2))/(T2(2.0)*h*lp)
    dN1 = -((T2(8.0)*Δx)/(T2(4)*h*lp))
    dN2 = signΔx*(-T2(1.0)/h)
    dN3 = -signΔx*((h+T2(0.5)*lp-absΔx)/(h*lp))
    # branch-free blend of the three cases
    N  = c1*Ni1+c2*Ni2+c3*Ni3
    dN = c1*dN1+c2*dN2+c3*dN3
    return T2(N), T2(dN)
end

# compute and store the basis functions of all 64 neighbouring nodes per particle
@kernel inbounds = true function test1!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Global)
    for iy in Int32(1):Int(64)
        p2n = mp_p2n[ix, iy]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1]-grid_pos[p2n, 1], Float64(2.0), Float64(1.0))
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2]-grid_pos[p2n, 2], Float64(3.0), Float64(2.0))
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3]-grid_pos[p2n, 3], Float64(4.0), Float64(3.0))
        mp_Ni[ix, iy] = Nx * Ny * Nz
        mp_∂Nx[ix, iy] = dNx * Ny * Nz
        mp_∂Ny[ix, iy] = dNy * Nx * Nz
        mp_∂Nz[ix, iy] = dNz * Nx * Ny
    end
end

# test data: 512000 particles, 64 neighbouring nodes each, 710733 grid nodes
mp_p2n   = cu(rand(1:1:710733, 512000, 64) .|> Int64)
mp_Ni    = CUDA.zeros(Float64, 512000, 64)
mp_∂Nx   = CUDA.zeros(Float64, 512000, 64)
mp_∂Ny   = CUDA.zeros(Float64, 512000, 64)
mp_∂Nz   = CUDA.zeros(Float64, 512000, 64)
mp_pos   = CUDA.rand(Float64, 512000, 3)
grid_pos = CUDA.rand(Float64, 710733, 3)

# warm-up launch, then benchmark (time in ms)
test1!(CUDABackend())(ndrange=512000, mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
time = 1e3 * @belapsed begin
    test1!($CUDABackend())(ndrange=$512000, $mp_p2n, $mp_Ni, $mp_∂Nx, $mp_∂Ny, $mp_∂Nz, $mp_pos, $grid_pos)
    CUDA.synchronize()
end

ZenanH commented Mar 6, 2024

On the RTX 3090, this code takes around 20 ms. However, if we add one line like this:

@kernel inbounds = true function test1!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Global)
    for iy in Int32(1):Int(64)
        p2n = mp_p2n[ix, iy]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1]-grid_pos[p2n, 1], Float64(2.0), Float64(1.0))
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2]-grid_pos[p2n, 2], Float64(3.0), Float64(2.0))
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3]-grid_pos[p2n, 3], Float64(4.0), Float64(3.0))
        Nx = dNx = Ny = dNy = Nz = dNz = Float64(2)
        mp_Ni[ix, iy] = Nx * Ny * Nz
        mp_∂Nx[ix, iy] = dNx * Ny * Nz
        mp_∂Ny[ix, iy] = dNy * Nx * Nz
        mp_∂Nz[ix, iy] = dNz * Nx * Ny
    end
end

then it only takes 1.24 ms


ZenanH commented Mar 6, 2024

It happens with both FP32 and FP64.


ZenanH commented Mar 6, 2024

On the RTX 3090:
[figure: 3d_collapse_benchmark]


ZenanH commented Mar 6, 2024

V100 is better:
[figure: profile_vis]


ZenanH commented Mar 12, 2024

😅 I can't tell from code_llvm whether the compiler skips the function invocations: the left one has Nx = dNx = Ny = dNy = Nz = dNz = Float64(2), the right one doesn't.

[figure: 2024-03-12_165529 (code_llvm comparison)]
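(As a side note, a sketch of how the device code could be dumped directly from Julia, assuming CUDA.jl's @device_code_* reflection macros intercept the KernelAbstractions launch:)

using CUDA

# dump the device-side LLVM IR for every kernel compiled inside the block
# (CUDA.@device_code_ptx works the same way for the PTX)
CUDA.@device_code_llvm begin
    test1!(CUDABackend())(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos; ndrange=512000)
end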


But if I save the results into an array (test) like this:

@kernel inbounds = true function test1!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos, test)
    ix = @index(Global)
    for iy in Int32(1):Int(64)
        p2n = mp_p2n[ix, iy]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1]-grid_pos[p2n, 1], Float64(2.0), Float64(1.0))
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2]-grid_pos[p2n, 2], Float64(3.0), Float64(2.0))
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3]-grid_pos[p2n, 3], Float64(4.0), Float64(3.0))

        # save the results
        test[ix] = Nx * Ny * Nz + dNx + dNy + dNz
        

        Nx = dNx = Ny = dNy = Nz = dNz = Float64(2)
        mp_Ni[ix, iy] = Nx * Ny * Nz
        mp_∂Nx[ix, iy] = dNx * Ny * Nz
        mp_∂Ny[ix, iy] = dNy * Nx * Nz
        mp_∂Nz[ix, iy] = dNz * Nx * Ny
    end
end

Then, it takes 20 ms.

That means, as you said, it simply didn't call the function uGIMPbasis().

With the dataset I used above, the kernel uses around 83 registers.
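(For reference, a minimal sketch of how the per-thread register count and an occupancy-based launch suggestion can be read out with plain CUDA.jl, using a toy kernel rather than the KernelAbstractions one above:)

using CUDA

# toy kernel, only to show the introspection calls
function saxpy!(y, x, a)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

y = CUDA.rand(Float32, 1024); x = CUDA.rand(Float32, 1024)
k = @cuda launch=false saxpy!(y, x, 1.0f0)
CUDA.registers(k)                  # registers per thread
CUDA.launch_configuration(k.fun)   # suggested block/grid size given register pressure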


ZenanH commented Mar 12, 2024

I changed uGIMPbasis() as follows:

@inline Base.@propagate_inbounds function uGIMPbasis1(Δx::T2, h::T2, lp::T2) where T2
    T1 = T2 == Float32 ? Int32 : Int64   # integer type matching the float precision
    absΔx = abs(Δx); signΔx = sign(Δx)
    if absΔx<(T2(0.5)*lp)
        N = T2(1.0)-((T2(4)*(Δx^T1(2))+lp^T1(2))/(T2(4)*h*lp))
        dN = -((T2(8.0)*Δx)/(T2(4)*h*lp))
    elseif (T2(0.5)*lp)≤absΔx<(h-T2(0.5)*lp)
        N = T2(1.0)-(absΔx/h)
        dN = signΔx*(-T2(1.0)/h)
    elseif (h-T2(0.5)*lp)≤absΔx<(h+T2(0.5)*lp)
        N = ((h+T2(0.5)*lp-absΔx)^T1(2))/(T2(2.0)*h*lp)
        dN = -signΔx*((h+T2(0.5)*lp-absΔx)/(h*lp))
    else
        N = T2(0.0)
        dN = T2(0.0)
    end
    return T2(N), T2(dN)
end

Now it takes 10 ms, half the time of the previous version, and it uses 74 registers.

I will keep trying to reduce the number of registers😎. @luraess


luraess commented Mar 12, 2024

I guess some of the time now goes into the branches. But cool that the timing improves!


ZenanH commented Apr 29, 2024

Hi @utkinis 👋! Thanks for the discussion during EGU.

If I understand correctly, two things are worth doing to improve the efficiency of my kernel function (which computes and saves the basis functions):

  1. transpose the arrays Ni, ∂Nx, ∂Ny, ∂Nz
  2. unroll the for loop

For point 1 ❓

I did a simple test below; however, it is slower than the original version. I tried the same method in my code, and it is also slower there.
By the way, I also tried a 2D block for the CUDA launch configuration; it is much slower than a 1D block.

using KernelAbstractions
using CUDA
using BenchmarkTools
using Printf

# without transpose
@kernel inbounds=true function kernel1!(a, b, c)
    ix = @index(Global)
    for iy in 1:64
        a[ix, iy] = b[ix, iy] + c[ix, iy] - a[ix, iy] ^ 2
        b[ix, iy] = a[ix, iy] + c[ix, iy] - b[ix, iy] ^ 2
        c[ix, iy] = a[ix, iy] + b[ix, iy] - c[ix, iy] ^ 2
    end
end

@kernel inbounds=true function kernel2!(a, b, c)
    ix = @index(Global)
    for iy in 1:64
        a[iy, ix] = b[iy, ix] + c[iy, ix] - a[iy, ix] ^ 2
        b[iy, ix] = a[iy, ix] + c[iy, ix] - b[iy, ix] ^ 2
        c[iy, ix] = a[iy, ix] + b[iy, ix] - c[iy, ix] ^ 2
    end
end

n = 512000
a = CUDA.rand(Float64, n, 64)
b = CUDA.rand(Float64, n, 64)
c = CUDA.rand(Float64, n, 64)
d = CUDA.rand(Float64, 64, n)
e = CUDA.rand(Float64, 64, n)
f = CUDA.rand(Float64, 64, n)
dev = CUDABackend()

rst1 = @belapsed begin 
    kernel1!($dev)(ndrange=$n, $a, $b, $c)
    KernelAbstractions.synchronize($dev)
end

rst2 = @belapsed begin 
    kernel2!($dev)(ndrange=$n, $d, $e, $f)
    KernelAbstractions.synchronize($dev)
end

@info """\t
rst1: $(@sprintf("%.2e ms", rst1*1e3))
rst2: $(@sprintf("%.2e ms", rst2*1e3))
"""

output:

┌ Info: 
│ rst1: 2.03e+00 ms
└ rst2: 2.93e+01 ms

For point 2

Unrolling the for loop with Base.Cartesian.@nexprs did reduce the computing time, by around 20%. 🎉
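(A minimal sketch of what that unrolling could look like; my own illustration, assuming the same fixed 64 neighbours and the hypothetical name test1_unrolled!, so the actual kernel may differ:)

@kernel inbounds = true function test1_unrolled!(mp_p2n, mp_Ni, mp_∂Nx, mp_∂Ny, mp_∂Nz, mp_pos, grid_pos)
    ix = @index(Global)
    # @nexprs pastes the body 64 times with iy replaced by the literals 1..64,
    # so no loop counter has to be kept around
    Base.Cartesian.@nexprs 64 iy -> begin
        p2n = mp_p2n[ix, iy]
        Nx, dNx = uGIMPbasis(mp_pos[ix, 1]-grid_pos[p2n, 1], Float64(2.0), Float64(1.0))
        Ny, dNy = uGIMPbasis(mp_pos[ix, 2]-grid_pos[p2n, 2], Float64(3.0), Float64(2.0))
        Nz, dNz = uGIMPbasis(mp_pos[ix, 3]-grid_pos[p2n, 3], Float64(4.0), Float64(3.0))
        mp_Ni[ix, iy] = Nx * Ny * Nz
        mp_∂Nx[ix, iy] = dNx * Ny * Nz
        mp_∂Ny[ix, iy] = dNy * Nx * Nz
        mp_∂Nz[ix, iy] = dNz * Nx * Ny
    end
end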


utkinis commented May 6, 2024

Hi @ZenanH! Thanks for the follow-up, cool that you could make progress with the implementation and get a speedup with loop unrolling =)

As for point 1, I guess it makes sense to figure out exactly what happens here. I've made a version of your example with some extra kernels:

using KernelAbstractions
using CUDA
using BenchmarkTools
using Printf

# without transpose
@kernel inbounds = true function kernel1!(a, b, c)
    ix = @index(Global, Linear)
    for iy in 1:64
        a[ix, iy] = b[ix, iy] + c[ix, iy] - a[ix, iy]^2
        b[ix, iy] = a[ix, iy] + c[ix, iy] - b[ix, iy]^2
        c[ix, iy] = a[ix, iy] + b[ix, iy] - c[ix, iy]^2
    end
end

@kernel inbounds = true function kernel2!(a, b, c)
    ix = @index(Global, Linear)
    for iy in 1:64
        a[iy, ix] = b[iy, ix] + c[iy, ix] - a[iy, ix]^2
        b[iy, ix] = a[iy, ix] + c[iy, ix] - b[iy, ix]^2
        c[iy, ix] = a[iy, ix] + b[iy, ix] - c[iy, ix]^2
    end
end

# with extra parallelism
# without transpose
@kernel inbounds = true function kernel3!(a, b, c)
    ix = @index(Group, Linear)
    iy = @index(Local, Linear)
    a[ix, iy] = b[ix, iy] + c[ix, iy] - a[ix, iy]^2
    b[ix, iy] = a[ix, iy] + c[ix, iy] - b[ix, iy]^2
    c[ix, iy] = a[ix, iy] + b[ix, iy] - c[ix, iy]^2
end

@kernel inbounds = true function kernel4!(a, b, c)
    ix = @index(Group, Linear)
    iy = @index(Local, Linear)
    a[iy, ix] = b[iy, ix] + c[iy, ix] - a[iy, ix]^2
    b[iy, ix] = a[iy, ix] + c[iy, ix] - b[iy, ix]^2
    c[iy, ix] = a[iy, ix] + b[iy, ix] - c[iy, ix]^2
end

# grid-stride version (or in this case, rather block-stride)
@kernel inbounds = true function kernel5!(a, b, c)
    ix = @index(Group, Linear)
    tx = @index(Local, Linear)
    stride = prod(@groupsize())
    for iy in tx:stride:64
        a[iy, ix] = b[iy, ix] + c[iy, ix] - a[iy, ix]^2
        b[iy, ix] = a[iy, ix] + c[iy, ix] - b[iy, ix]^2
        c[iy, ix] = a[iy, ix] + b[iy, ix] - c[iy, ix]^2
    end
end

n = 512000
a = CUDA.rand(Float64, n, 64)
b = CUDA.rand(Float64, n, 64)
c = CUDA.rand(Float64, n, 64)
d = CUDA.rand(Float64, 64, n)
e = CUDA.rand(Float64, 64, n)
f = CUDA.rand(Float64, 64, n)
dev = CUDABackend()

rst1 = @belapsed begin
    kernel1!($dev)($a, $b, $c; ndrange=$n)
    KernelAbstractions.synchronize($dev)
end

rst2 = @belapsed begin
    kernel2!($dev)($d, $e, $f; ndrange=$n)
    KernelAbstractions.synchronize($dev)
end

# group size is equal to number of elements in the first dimension
wg = 64

# worksize is total number of elements in the array
ws = wg * n

rst3 = @belapsed begin
    kernel3!($dev, $wg)($a, $b, $c; ndrange=$ws)
    KernelAbstractions.synchronize($dev)
end

rst4 = @belapsed begin
    kernel4!($dev, $wg)($d, $e, $f; ndrange=$ws)
    KernelAbstractions.synchronize($dev)
end

# groupsize is defined by ourselves (a bit more flexibility)
wg = 32

# worksize is defined by block size and number of elements
ws = wg * n

rst5 = @belapsed begin
    kernel5!($dev, $wg)($d, $e, $f; ndrange=$ws)
    KernelAbstractions.synchronize($dev)
end

@info """\t
rst1 (   loop,  no transpose): $(@sprintf("%.2e ms", rst1*1e3))
rst2 (   loop,     transpose): $(@sprintf("%.2e ms", rst2*1e3))
rst3 (no loop,  no transpose): $(@sprintf("%.2e ms", rst3*1e3))
rst4 (no loop,     transpose): $(@sprintf("%.2e ms", rst4*1e3))
rst5 (grid-stride, transpose): $(@sprintf("%.2e ms", rst5*1e3))
"""

which gives the following output on an NVIDIA A100 GPU:

┌ Info: 
│ rst1 (   loop,  no transpose): 1.22e+00 ms
│ rst2 (   loop,     transpose): 1.64e+01 ms
│ rst3 (no loop,  no transpose): 1.33e+01 ms
│ rst4 (no loop,     transpose): 1.18e+00 ms
└ rst5 (grid-stride, transpose): 1.18e+00 ms

Here's what happens in my opinion (ideally, one needs to profile the code with Nsight, which I didn't do, so take it with a grain of salt):

For optimal performance on GPUs it is crucial that global memory accesses are coalesced (see here for more info). To achieve that, we need to make sure that consecutive threads (or work items in AMD/KernelAbstractions jargon) access consecutive array indices (there's more to it, but I'm not really an expert in GPU optimisation and not the best person to explain these things). What happens in the transposed version of your code, i.e. kernel2!, is that, e.g., on the first iteration of the inner loop, when iy = 1, thread 1 accesses a[1,1], thread 2 accesses a[1,2], etc. Since Julia is column-major, that means that neighbouring threads access array elements that are 64 elements apart from each other. This is clearly not optimal for GPU access.
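(A tiny sketch of the column-major arithmetic behind that claim; linidx is just a helper for this illustration:)

# element (i, j) of an m-by-n column-major array sits at linear offset (j - 1) * m + i
linidx(i, j, m) = (j - 1) * m + i

# transposed layout (64 x n): thread 1 reads a[1, 1], thread 2 reads a[1, 2]
linidx(1, 1, 64), linidx(1, 2, 64)           # (1, 65): 64 elements apart, uncoalesced
# original layout (n x 64): thread 1 reads a[1, 1], thread 2 reads a[2, 1]
linidx(1, 1, 512000), linidx(2, 1, 512000)   # (1, 2): adjacent, coalesced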

Why, then, is the first version (kernel1!) faster? I think KernelAbstractions picks a block size (same as group size) that is proportional to the inner loop size, i.e., 64. Usually the preferred block size is proportional to the warp size, which is 32 on all NVIDIA devices. If the picked block size is 64, then thread 1 accesses a[1,1] and thread 2 accesses a[2,1], which is the optimal access pattern. I didn't think of it when we had this discussion at EGU, sorry =)

Now, one solution would be to always fix the number of threads per block to 64 and use kernel1!, which is perfectly fine in your case. However, if the size of the inner loop is not 64 but something else, you might experience sudden performance issues on some architectures. Another possibility is to manually control the group size and enforce the correct memory access pattern. In kernel3! and kernel4! I do exactly this by launching n groups of 64 threads each. Then I use the local thread index instead of the inner loop. As you can see from the benchmark, we now have exactly the opposite situation, with the transposed matrices being faster, and a bit faster than the old version as well.
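(For reference, a sketch of the first option, fixing the group size when instantiating the KernelAbstractions kernel:)

# fix the workgroup size to 64 threads when instantiating the kernel
kernel1_fixed = kernel1!(CUDABackend(), 64)
kernel1_fixed(a, b, c; ndrange=n)
KernelAbstractions.synchronize(CUDABackend())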

Finally, to decouple the work size from the inner loop size, we can use something similar to the "grid-stride loop" (see here for more info). In kernel5! I apply this only to the first dimension of the array; however, it can be used for both dimensions, and then if for some reason you have more or fewer neighbour weights than 64, you can change that independently of the block size.
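(A sketch of that idea, with the number of weights passed in as a hypothetical nw argument instead of the hard-coded 64:)

@kernel inbounds = true function kernel6!(a, b, c, nw)
    ix = @index(Group, Linear)
    tx = @index(Local, Linear)
    stride = prod(@groupsize())
    # block-stride loop over the first (contiguous) dimension of length nw
    for iy in tx:stride:nw
        a[iy, ix] = b[iy, ix] + c[iy, ix] - a[iy, ix]^2
        b[iy, ix] = a[iy, ix] + c[iy, ix] - b[iy, ix]^2
        c[iy, ix] = a[iy, ix] + b[iy, ix] - c[iy, ix]^2
    end
end

# e.g. 32 threads per group, one group per column:
# kernel6!(dev, 32)(d, e, f, 64; ndrange=32*n)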
