caffeine reset today, though I'll still try to write that section on compute shaders tonight :3

laundry to do, walk, maybe some cleaning too
compute shaders, how many things need explaining hm:
-kernels
-dispatches
-ram<>vram data formats, transfer & render textures
-render pipeline & command buffer stuff? might be specific to unity
-syntax? not for the thesis report, but could be useful for beginners
general theory is easy to explain, it's the syntax & engine specifics that get funky :>

a shader is a script that runs on a GPU: typically vertex & fragment shaders perform rendering operations
a *compute* shader, however, is a script that lets you run code on the gpu that isn't necessarily rendering-related
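to make that concrete, here's the smallest possible compute kernel. I'm sketching it in CUDA because it's the plainest way to show the idea (in Unity you'd write HLSL, but the concept is the same); the kernel name & buffer are made up for the example:

```cuda
// a kernel: a function that runs on the gpu, once per thread
// __global__ marks it as callable from the cpu side
__global__ void addOne(float* data, int n)
{
    // each thread works out which array cell is "its own"
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard threads that fall past the end
        data[i] += 1.0f;
}
```

no vertices, no pixels, just math on a buffer :3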

a kernel is what the functions inside compute shaders are called, and they're typically written to target array chunks instead of entire arrays
each chunk is defined by a group of indices used to access a contiguous run of buffer/array cells, typically to get the benefits of tiled rendering/computing
kernels aren't just run once per call; instead, many threads get "dispatched" with your kernel/function, each one handling a separate chunk of the data.
A typical example: you're doing an operation on the pixels of a texture that is 512x512. Your kernel might run on chunks of 16x16 pixels, which cuts the workload up into 1024 chunks, so 1024 thread groups get dispatched, one per chunk (with, typically, one thread per pixel inside each group) :3
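in CUDA terms, the host-side dispatch for that exact example would look something like this (hypothetical sketch, same numbers as above):

```cuda
#include <cuda_runtime.h>

// one thread per pixel; each 16x16 group handles one chunk
__global__ void processPixels(float* pixels, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    pixels[y * width + x] *= 0.5f;   // some made-up per-pixel operation
}

int main()
{
    float* pixels;
    cudaMalloc(&pixels, 512 * 512 * sizeof(float));  // texture lives in VRAM

    dim3 group(16, 16);             // a 16x16 chunk of pixels per group
    dim3 grid(512 / 16, 512 / 16);  // 32x32 = 1024 groups get dispatched
    processPixels<<<grid, group>>>(pixels, 512);
    cudaDeviceSynchronize();        // wait for all 1024 groups to finish

    cudaFree(pixels);
}
```

(512 is a multiple of 16 here, so no bounds check is needed)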
how much that actually helps depends on how many of those groups your specific GPU can run in parallel, but if it can chew through all 1024 at once, you just sped the code up ~1024x in the ideal case (memory access usually caps it before that).
Maybe your specific gpu does even better with chunks of 8x8, dispatching 4096 groups at once, which would speed it up even harder. Though on my own machine (1050Ti), I get best results with chunks of 32x32 pixels, i.e. 256 groups, for a 512x512 texture.
I think it's because it has 768 gpu cores, so asking for 1024 parallel chunks is too much? not completely sure x)
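if you'd rather not guess what your own GPU can do, the runtime can just tell you; a little CUDA sketch (Unity exposes similar info via SystemInfo, iirc):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // 0 = first gpu in the machine

    printf("gpu: %s\n", prop.name);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("max threads per group: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
}
```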
the last main thing I think I have to explain is that the data compute shaders run on has to be in gpu memory/VRAM, not cpu memory/RAM.
that has upsides and downsides:

if you NEED the data to go back and forth between ram & vram, you're going to need to transfer it, which is generally slow as hecc
but if you're in a scenario where you get to choose between going back & forth a lot VERSUS generating the data directly in vram with a compute shader, for exclusive use in later rendering, the second option can save a ton of perf
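here's roughly what that back & forth looks like (CUDA sketch again; in Unity this is ComputeBuffer.SetData / GetData territory, and those calls are the slow part):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float* hostData = new float[n];              // lives in RAM
    float* deviceData;
    cudaMalloc(&deviceData, n * sizeof(float));  // lives in VRAM

    // the slow part: shoving bytes across the pcie bus, both ways
    cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    // ... run kernels on deviceData here ...
    cudaMemcpy(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost);

    // the fast alternative: generate the data in VRAM with a kernel and
    // hand the buffer straight to the renderer, never touching RAM at all

    cudaFree(deviceData);
    delete[] hostData;
}
```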
I mentioned tiled rendering, so I might have to explain the idea. The thing to compare it to is "scanline rendering", which renders/processes one row of pixels at a time
math on image pixels that are close together typically requires similar data, while math on pixels at opposite ends of a row might be looking at completely different data
like, if the left side of the screen has a wall and the right side has the sky, you need less data loaded at once if you're JUST rendering a tile on the left or on the right, whereas rendering a whole row might mean loading/reloading both the wall data and the sky data
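this locality is also why compute APIs give each thread group a small piece of fast on-chip memory: a group working on one tile can load what the tile needs once, and every thread in the group reuses it. a hedged sketch (a made-up 1D blur, assuming the texture size is a multiple of 16):

```cuda
__global__ void blurTile(const float* src, float* dst, int width)
{
    // one 16x16 tile, loaded once into fast on-chip (shared) memory
    __shared__ float tile[16][16];

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = src[y * width + x];
    __syncthreads();  // wait until the whole group has loaded the tile

    // neighbours now come from the shared tile instead of slow VRAM
    // (pixels on the tile border just average fewer samples)
    float sum = tile[threadIdx.y][threadIdx.x];
    int count = 1;
    if (threadIdx.x > 0)  { sum += tile[threadIdx.y][threadIdx.x - 1]; count++; }
    if (threadIdx.x < 15) { sum += tile[threadIdx.y][threadIdx.x + 1]; count++; }
    dst[y * width + x] = sum / count;
}
```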
I also mentioned that compute shaders run on array chunks instead of entire arrays

that might confuse some beginners, because obviously not every function processes arrays in the first place
the /point/ of using the gpu as a non-graphics processing unit is that it's very good at processing tons of similar data in parallel, so running kernels over big arrays of homogeneous data that can be split cleanly into chunks is the main use case for compute shaders.
the cpu on the other hand has less of a ... hmrf, the explanation works better in french x) (mfw Google Translate pronounces "pelleteuse" (excavator) as "PLELTEUSE" for some reason ?w?)
the gpu is like an excavator that can dig out big volumes in one location, while the cpu is more like a team of 20 people with shovels: they can't shovel as hard, but you can move them between heterogeneous workloads faster
pic1: gpu
pic2: cpu
btw if there's anything wrong in this thread feel free to correct me :3