Troels Ynddal troelsy

#include <iostream>
#include <cuda_runtime_api.h>
#include <stdint.h>
#include <cub/cub.cuh>
template <uint32_t block_size>
__global__ void grid_stride_reduce(uint32_t *array, uint32_t *tmp_array, uint32_t length){
uint32_t thread_index = threadIdx.x;
uint32_t global_index = blockIdx.x * (block_size * 2) + thread_index;
@troelsy
troelsy / grid-stride-loop-reduction.cu
Created June 10, 2025 13:03
I recently reread "Optimizing Parallel Reduction in CUDA" by Mark Harris and wondered whether it is still considered the best solution. As far as I can tell, it was written for compute capability (CC) ≤ 1.3, whereas today's GPUs are at CC 12.0.
#include <iostream>
#include <cuda_runtime_api.h>
#include <stdint.h>
#define FULL_MASK 0xFFFFFFFF
#define WARP_SIZE 32
__device__ __forceinline__ uint32_t lane_id(uint32_t tid){
// https://stackoverflow.com/q/44337309