My problem is the following. I want to use GPU shaders to do some calculations on an image (Sobel operator, standard deviation, things like that). I'm not so much interested in displaying the image (although it can be a nice visualisation) as in calculating the value. I know I should really use compute shaders for this, but as far as I know OpenGL on macOS (my environment) has no support for compute shaders (at least in Panda3D), so I use a cascade of 3 fragment shaders instead. The output is a single float value that I want to pass back to Panda3D. My case is just a proof of concept to show how fast GPU computing is, so it doesn't have to be done entirely according to best practice.
In this case, the simplest option to get that value back seems to be to color a certain pixel with vec4(value) in the final shader and then sample that pixel. The problem is how to do it.
For me, it doesn't matter much whether I sample the texture or the already displayed image (I spread the texture on a quad that fills the entire screen). I just want it to be fast. Copying the entire texture to RAM makes no sense here - mind you, I only need the value of one pixel. But I want to sample it quickly, ideally once per frame.
Do you have any ideas?
I’d suggest you use imageStore to store to a separate 1x1 texture, but alas, that’s an OpenGL 4.2 feature.
Can’t you just use this shader with an offscreen buffer that has dimensions 1x1?
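Something along these lines, perhaps (just a sketch, not code from this thread; with to_ram=True the pixel is copied to RAM after each frame):

```python
from panda3d.core import Texture

# Minimal sketch: a 1x1 offscreen buffer for the final pass.  Passing
# to_ram=True asks Panda3D to copy the rendered pixel to RAM each frame,
# so no explicit screenshot call is needed afterwards.
result_tex = Texture()
result_buf = base.win.makeTextureBuffer("result", 1, 1, result_tex, True)

# A camera made with base.makeCamera(result_buf) would then render a
# fullscreen quad carrying the final shader into this buffer; the earlier
# passes can stay at full resolution, only the last stage is 1x1.
```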
Otherwise, you can create a DisplayRegion covering only one pixel of the buffer/window. Then call getScreenshot on that. That should get you just the one pixel.
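Roughly like this (a sketch; the choice of corner pixel is arbitrary):

```python
from panda3d.core import PNMImage

# A DisplayRegion covering a single pixel in the lower-left corner of the
# window (the coordinates are fractions of the window size).
w, h = base.win.getXSize(), base.win.getYSize()
dr = base.win.makeDisplayRegion(0, 1.0 / w, 0, 1.0 / h)

# Read back just that one pixel, e.g. once per frame.
img = PNMImage()
if dr.getScreenshot(img):
    value = img.getXel(0, 0)[0]  # red channel holds the encoded float
```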
Thank you for your answer.
I tried the second method (a DisplayRegion with size LVecBase2i(1, 1)). Generally, it worked. Unfortunately, it's not very fast: each getScreenshot() followed by getXel() takes about 0.4 ms in total. Not much by itself, but given that all my calculations on the GPU take 0.8 ms, the result download adds about 50% overhead.
I haven't tried the first suggestion (an offscreen buffer with dimensions 1x1). I don't know if it could be done at all - after all, my calculations start from the full image and go through a cascade of 3 fullscreen shaders; only the last shader computes a single float from the whole image. So I guess I have to keep the fullscreen buffer, right?
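For context, the cascade is wired up more or less like this (a simplified sketch with hypothetical shader file names, not my exact code):

```python
from direct.filter.FilterManager import FilterManager
from panda3d.core import Shader, Texture

manager = FilterManager(base.win, base.cam)
scene_tex = Texture()   # full-resolution scene image
inter_tex = Texture()   # intermediate fullscreen result

# The final fullscreen quad; its shader reduces the whole image to one
# value written into a single pixel.
final_quad = manager.renderSceneInto(colortex=scene_tex)

# An intermediate fullscreen pass (e.g. the Sobel operator); further
# passes are chained the same way.
inter_quad = manager.renderQuadInto(colortex=inter_tex)
inter_quad.setShader(Shader.load(Shader.SL_GLSL,
                                 vertex="quad.vert", fragment="sobel.frag"))
inter_quad.setShaderInput("tex", scene_tex)

final_quad.setShader(Shader.load(Shader.SL_GLSL,
                                 vertex="quad.vert", fragment="reduce.frag"))
final_quad.setShaderInput("tex", inter_tex)
```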
The graphics pipeline is heavily, well, pipelined. That means that any attempt to read back data requires flushing any pending commands, and draining the pipe, waiting for them to be done. Most of the time you’re seeing is probably not due to the transfer, but due to the pipeline stall.
The only way to get around this is to get the screenshot asynchronously, meaning it will be copied in the background and you get a notification (callback, event or resumption of a coroutine) when it’s done. This may be several frames later, however, so it’s only useful if you don’t need the result right away.
If you want to use this, you will have to use a development build of Panda, which has support for getting a screenshot asynchronously.
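Roughly like this in a coroutine task - a sketch only, assuming the development-build getAsyncScreenshot() mentioned above returns an awaitable future that resolves to a Texture (the exact signature is an assumption here):

```python
from panda3d.core import LColor

async def read_result():
    # 'dr' is the one-pixel DisplayRegion; getAsyncScreenshot() is the
    # development-build API, assumed here to yield a Texture when awaited.
    tex = await dr.getAsyncScreenshot()
    peeker = tex.peek()
    if peeker is not None:
        color = LColor()
        peeker.lookup(color, 0.5, 0.5)
        print("GPU result:", color[0])

base.taskMgr.add(read_result())  # coroutines can be scheduled as tasks
```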
As an aside, getting the screenshot as a Texture and using .peek() on that texture is more efficient than getting it via PNMImage.
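For example, something along these lines (a sketch; 'dr' is the one-pixel DisplayRegion from before):

```python
from panda3d.core import LColor

tex = dr.getScreenshot()   # framebuffer copy returned as a Texture
peeker = tex.peek()        # CPU-side lookup helper; None if no RAM image
if peeker is not None:
    color = LColor()
    peeker.lookup(color, 0.5, 0.5)  # sample the (only) texel, u/v in 0..1
    value = color[0]                # the float encoded by the final shader
```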
Thank you.
I understand.
Actually, for me it is enough to probe the colour directly from the texture; I don't need to make a screenshot.
The solution with .peek() you suggested is actually pretty fast. The problem is that .peek() returns a TexturePeeker object only when called on an original texture. For textures rendered into by FilterManager (e.g. by renderQuadInto(colortex=some_tex)), I get None instead (even though the textures display perfectly). Do you know what the reason could be?
When I say a “screenshot”, I mean a copy of the framebuffer contents in the form of a texture. It sounds like that is what you want. You can’t just “probe directly from the texture” on the CPU if the data is still on the GPU.
That means peek() will only work if the data is on the CPU. Without taking a screenshot, you won’t get any data. That’s why you need to use one of the methods to download the texture from GPU memory to CPU RAM, which include:
- getAsyncScreenshot() (highest throughput)
- addRenderTexture() with RTMCopyRam or RTMTriggeredCopyRam (lowest latency; see the sketch after this list)
- getScreenshot() (slower)
- Manual call to base.graphicsEngine.extractTextureData (slower)
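As a rough illustration of the addRenderTexture() route (a sketch, assuming result_buf is the offscreen buffer the final shader renders into):

```python
from panda3d.core import GraphicsOutput, LColor, Texture

# Attach an extra render target in copy-to-RAM mode, so the result is
# downloaded automatically at the end of each frame.
result_tex = Texture()
result_buf.addRenderTexture(result_tex, GraphicsOutput.RTMCopyRam)

# Once a frame has rendered, the RAM copy can be peeked on the CPU:
peeker = result_tex.peek()
if peeker is not None:
    color = LColor()
    peeker.lookup(color, 0.5, 0.5)
```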
Thanks for the clarification. I understand it now; it makes sense.
So for now, I'll leave it as is. As I mentioned, I'm thinking about switching to compute shaders (and other hardware) anyway, where the problem of data transfer is solved differently. For now, my work was mainly about evaluating processing speed, which I managed to estimate.
Thanks again for all your help!