wuyakuma / GPUOptimizationForGameDev.md

Created June 11, 2025 07:38 — forked from silvesthu/GPUOptimizationForGameDev.md

GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview

2011 - A trip through the Graphics Pipeline 2011
2015 - Life of a triangle - NVIDIA's logical pipeline
2015 - Render Hell 2.0
2016 - How bad are small triangles on GPU and why?
2017 - GPU Performance for Game Artists
2019 - Understanding the anatomy of GPUs using Pokémon
2020 - GPU ARCHITECTURE RESOURCES

wuyakuma / InfiniteGrid.shader

Created October 13, 2023 01:22 — forked from bgolus/InfiniteGrid.shader

Infinite grid shader: a procedural grid with configurable divisions and major and minor line markings.

	Shader "Unlit/InfiniteGrid"
	{
	Properties
	{
	[Toggle] _WorldUV ("Use World Space UV", Float) = 1.0
	_GridScale ("Grid Scale", Float) = 1.0
	_GridBias ("Grid Bias", Float) = 0.5
	_GridDiv ("Grid Divisions", Float) = 10.0
	_BaseColor ("Base Color", Color) = (0,0,0,1)
	_LineColor ("Line Color", Color) = (1,1,1,1)

wuyakuma / Simulation_Projection.md

Created April 12, 2022 13:39 — forked from vassvik/Simulation_Projection.md

Realtime Fluid Simulation: Projection

The core of most real-time fluid simulators, like the one in EmberGen, is based on the "Stable Fluids" algorithm by Jos Stam, which to my knowledge was first presented at SIGGRAPH '99. This is a post about one part of this algorithm that's often underestimated: Projection.

[Video: MG4_F32.mp4]

Stable Fluids

The Stable Fluids algorithm solves a subset of the famous Navier-Stokes equations, which describe how fluids move and interact. In particular, it typically solves what are called the "incompressible Euler equations", in which viscous forces are ignored.
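
For reference, the incompressible Euler equations mentioned above are usually written as follows. The symbols are assumptions of this note rather than notation taken from the post: $\mathbf{u}$ is the velocity field, $p$ the pressure, $\rho$ a constant density, and $\mathbf{f}$ any external force per unit mass (buoyancy, for example).

$$
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u} = -\frac{1}{\rho}\nabla p + \mathbf{f},
\qquad
\nabla \cdot \mathbf{u} = 0
$$

The second equation is the incompressibility constraint. In Stable Fluids, the projection step is the part that enforces it.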

wuyakuma / FastUniformLoadWithWaveOps.txt

Created November 26, 2018 07:25 — forked from sebbbi/FastUniformLoadWithWaveOps.txt

Fast uniform load with wave ops (up to 64x speedup)

	In shader programming, you often need every pixel in a compute shader group (tile) to iterate over the same array in memory.
	Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.

	Simplified HLSL code looks like this:

	Buffer<float4> lightDatas;
	Texture2D<uint2> lightStartCounts;
	RWTexture2D<float4> output;

	[numthreads(8, 8, 1)]
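
The HLSL preview above is cut off, so here is a rough CPU-side C++ sketch of the access pattern the text describes, not the gist's shader. `Float4`, `StartCount`, `shadeTile`, and the "sum the lights" shading are all hypothetical stand-ins; only the 8x8 tile size and the start/count light range come from the excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side stand-ins for the shader resources in the excerpt.
struct Float4 { float x, y, z, w; };
struct StartCount { std::uint32_t start, count; };  // mirrors the uint2 in lightStartCounts

constexpr int kTileSize = 8;  // matches [numthreads(8, 8, 1)]

// Every pixel of one tile walks the same culled light range.
// The "shading" is a placeholder that just accumulates the light values.
std::vector<Float4> shadeTile(const std::vector<Float4>& lightDatas, StartCount range) {
    assert(static_cast<std::size_t>(range.start) + range.count <= lightDatas.size());
    std::vector<Float4> output(kTileSize * kTileSize, Float4{0.0f, 0.0f, 0.0f, 0.0f});
    for (std::size_t pixel = 0; pixel < output.size(); ++pixel) {
        for (std::uint32_t i = 0; i < range.count; ++i) {
            // The index depends only on the tile, never on the pixel,
            // so all 64 pixels read the same address on each iteration.
            const Float4& light = lightDatas[range.start + i];
            output[pixel].x += light.x;
            output[pixel].y += light.y;
            output[pixel].z += light.z;
            output[pixel].w += light.w;
        }
    }
    return output;
}
```

Because the load address is uniform across the tile, the same value is fetched once per pixel here. That redundancy is what the gist's title suggests wave ops can remove.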

wuyakuma / dfg_cloth.glsl

Created January 5, 2018 03:18

	vec2 DFG_Cloth(float roughness, float NoV) {
	const vec4 c0 = vec4(0.24, 0.93, 0.01, 0.20);
	const vec4 c1 = vec4(2.00, -1.30, 0.40, 0.03);

	float s = 1.0 - NoV;
	float e = s - c0.y;
	float g = c0.x * exp2(-(e * e) / (2.0 * c0.z)) + s * c0.w;
	float n = roughness * c1.x + c1.y;
	float r = max(1.0 - n * n, c1.z) * g;

wuyakuma / hash_fnv1a.h

Created November 6, 2017 05:35 — forked from ruby0x1/hash_fnv1a.h

FNV-1a C++11 constexpr compile-time hash functions, 32 and 64 bit

	#pragma once
	#include <stdint.h>

	//fnv1a 32 and 64 bit hash functions
	// key is the data to hash, len is the size of the data (or how much of it to hash against)
	// code license: public domain or equivalent
	// post: https://notes.underscorediscovery.com/constexpr-fnv1a/

	inline const uint32_t hash_32_fnv1a(const void* key, const uint32_t len) {
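
The preview stops before the constexpr variants named in the description, so the following is a minimal sketch of how a C++11 compile-time FNV-1a hash can be written, not the gist's actual code. The names `fnv1a_32` and `fnv1a_64` and the NUL-terminated string interface are assumptions; the offset basis and prime values are the standard FNV-1a parameters.

```cpp
#include <cstdint>

// Standard FNV-1a parameters (offset basis and prime).
constexpr std::uint32_t kFnv32Offset = 0x811c9dc5u;
constexpr std::uint32_t kFnv32Prime  = 0x01000193u;
constexpr std::uint64_t kFnv64Offset = 0xcbf29ce484222325ull;
constexpr std::uint64_t kFnv64Prime  = 0x100000001b3ull;

// C++11 constexpr functions are limited to a single return statement,
// so the per-byte loop is written as recursion over a NUL-terminated string.
// FNV-1a order: XOR the byte in first, then multiply by the prime.
constexpr std::uint32_t fnv1a_32(const char* s, std::uint32_t h = kFnv32Offset) {
    return *s == '\0'
        ? h
        : fnv1a_32(s + 1, (h ^ static_cast<std::uint8_t>(*s)) * kFnv32Prime);
}

constexpr std::uint64_t fnv1a_64(const char* s, std::uint64_t h = kFnv64Offset) {
    return *s == '\0'
        ? h
        : fnv1a_64(s + 1, (h ^ static_cast<std::uint8_t>(*s)) * kFnv64Prime);
}

// Evaluated by the compiler, so these hashes cost nothing at run time.
static_assert(fnv1a_32("") == 0x811c9dc5u, "empty input hashes to the offset basis");
static_assert(fnv1a_32("a") == 0xe40c292cu, "32-bit FNV-1a of \"a\"");
static_assert(fnv1a_64("a") == 0xaf63dc4c8601ec8cull, "64-bit FNV-1a of \"a\"");
```

A typical use is hashing string literals into `switch` case labels or template arguments. Long strings can hit the compiler's constexpr recursion limit with this recursive form.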

wuyakuma / d_ggx.glsl

Created September 6, 2017 02:47 — forked from romainguy/d_ggx.glsl

D_GGX in mediump/half float

	float D_GGX(float linearRoughness, float NoH, const vec3 h) {
	// Walter et al. 2007, "Microfacet Models for Refraction through Rough Surfaces"

	// In mediump, there are two problems computing 1.0 - NoH^2
	// 1) 1.0 - NoH^2 suffers floating point cancellation when NoH^2 is close to 1 (highlights)
	// 2) NoH doesn't have enough precision around 1.0
	// Both problems can be fixed by computing 1-NoH^2 in highp and providing NoH in highp as well

	// However, we can do better using Lagrange's identity:
	// ||a x b||^2 = ||a||^2 ||b||^2 - (a . b)^2
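
To make the identity concrete: for unit vectors n and h it reduces to ||n x h||^2 = 1 - (n . h)^2, so the troublesome `1.0 - NoH^2` can be replaced by a squared cross-product length. The C++ sketch below illustrates that substitution only. `Vec3` and the function names are hypothetical, and C++ `float` is 32-bit, so it does not reproduce mediump precision.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

// Direct form: subtracts two nearly equal numbers when NoH is close to 1.
inline float oneMinusNoHSquaredDirect(Vec3 n, Vec3 h) {
    float NoH = dot(n, h);
    return 1.0f - NoH * NoH;
}

// Lagrange form: ||n x h||^2. Equal to 1 - NoH^2 only when n and h are
// unit length, which the asserts check loosely.
inline float oneMinusNoHSquaredCross(Vec3 n, Vec3 h) {
    assert(std::fabs(dot(n, n) - 1.0f) < 1e-3f);
    assert(std::fabs(dot(h, h) - 1.0f) < 1e-3f);
    Vec3 nxh = cross(n, h);
    return dot(nxh, nxh);
}
```

Near the highlight (h almost parallel to n), the cross-product components are small numbers computed directly, so the result does not depend on subtracting a value close to 1 from 1.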

wuyakuma / gpu_arch_resources

Created March 6, 2017 06:12 — forked from jhaberstro/gpu_arch_resources

GPU Architecture Learning Resources

	http://courses.cms.caltech.edu/cs179/
	http://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
	https://community.arm.com/graphics/b/blog
	http://cdn.imgtec.com/sdk-documentation/PowerVR+Hardware.Architecture+Overview+for+Developers.pdf
	http://cdn.imgtec.com/sdk-documentation/PowerVR+Series5.Architecture+Guide+for+Developers.pdf
	https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
	https://www.imgtec.com/blog/the-dr-in-tbdr-deferred-rendering-in-rogue/
	http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-412605
	https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
	https://community.arm.com/graphics/b/documents/posts/moving-mobile-graphics#siggraph2015

wuyakuma / Tex2DCatmullRom.hlsl

Created September 20, 2016 03:33 — forked from TheRealMJP/Tex2DCatmullRom.hlsl

An HLSL function for sampling a 2D texture with Catmull-Rom filtering, using 9 texture samples instead of 16

	// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
	// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
	float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
	{
	// We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
	// down the sample location to get the exact center of our "starting" texel. The starting texel will be at
	// location [1, 1] in the grid, where [0, 0] is the top left corner.
	float2 samplePos = uv * texSize;
	float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;
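
The preview ends here. As a hedged illustration of the step just described, the C++ sketch below works through one axis: it finds the center of the "starting" texel, the fractional offset from it, and the four Catmull-Rom weights for that offset. The weights are the standard Catmull-Rom basis, not lines copied from the truncated HLSL, and `CatmullRomAxis` and `catmullRomAxis` are hypothetical names. The 9-tap bilinear trick from the description is not shown.

```cpp
#include <cassert>
#include <cmath>

struct CatmullRomAxis {
    float texPos1;  // center of the "starting" texel, in texel units
    float f;        // fractional offset from that center, nominally in [0, 1)
    float w[4];     // weights for the texels at offsets -1, 0, +1, +2
};

// One axis only: the filter is separable, so x and y are treated the same way.
inline CatmullRomAxis catmullRomAxis(float uv, float texSize) {
    CatmullRomAxis t;
    float samplePos = uv * texSize;
    // Round down to the exact center of the starting texel.
    t.texPos1 = std::floor(samplePos - 0.5f) + 0.5f;
    t.f = samplePos - t.texPos1;
    assert(t.f >= 0.0f && t.f <= 1.0f);

    // Standard Catmull-Rom basis evaluated at f; the four weights sum to 1.
    float f = t.f;
    t.w[0] = f * (-0.5f + f * (1.0f - 0.5f * f));
    t.w[1] = 1.0f + f * f * (-2.5f + 1.5f * f);
    t.w[2] = f * (0.5f + f * (2.0f - 1.5f * f));
    t.w[3] = f * f * (-0.5f + 0.5f * f);
    return t;
}
```

For example, `uv = 0.25` on a 64-texel axis gives `samplePos = 16.0`, `texPos1 = 15.5`, and `f = 0.5`, so the sample sits exactly between two texel centers and the two middle weights are equal.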