Troels Ynddal troelsy

M.Sc. Computer Science, University of Copenhagen

Denmark

troelsy / grid-stride-loop-reduction-cub.cu

Created November 4, 2025 11:39

CUDA CUB version of the implementation in https://gist.github.com/troelsy/fff6aac2226e080dcebf05531a11d44e

	#include <iostream>
	#include <cuda_runtime_api.h>
	#include <stdint.h>
	#include <cub/cub.cuh>


	template <uint32_t block_size>
	__global__ void grid_stride_reduce(uint32_t *array, uint32_t *tmp_array, uint32_t length){
		uint32_t thread_index = threadIdx.x;
		uint32_t global_index = blockIdx.x * (block_size * 2) + thread_index;

troelsy / grid-stride-loop-reduction.cu

Created June 10, 2025 13:03

I recently reread "Optimizing Parallel Reduction in CUDA" by Mark Harris and wondered whether it is still considered the best solution. As far as I can tell, the solution was written for compute capability (CC) ≤ 1.3, whereas today's GPUs are at CC 12.0.

	#include <iostream>
	#include <cuda_runtime_api.h>
	#include <stdint.h>


	#define FULL_MASK 0xFFFFFFFF
	#define WARP_SIZE 32

	__device__ __forceinline__ uint32_t lane_id(uint32_t tid){
		// https://stackoverflow.com/q/44337309