Just How Slow is Copy-To-RAM?

So, it’s been established in other threads that creating an offscreen render-to-texture in copy-to-RAM mode is very slow.

However, trying it myself, I find myself nevertheless surprised at just how slow: I’m seeing ~10ms to copy a 16x16 texture!

Is… is that really how slow copy-to-RAM is? Or might I be doing something that’s making it slower than it might be? Something in my framebuffer properties, or texture setup, or something…?

(I do think that I want copy-to-RAM for the feature that I’m working on–at least if I continue with the approach that I’m taking. I’m reading from the texture in Python, currently every frame, and using the results for some game-logic.)

The actual transfer isn’t slow; it’s pretty fast. The problem is that it requires a so-called pipeline stall. Let me explain.

Think of the rendering pipeline as a production line with a long conveyor belt: on one end is your application, putting objects, their positions and textures on the belt. They pass through various rendering stages: Panda’s cull and draw, then the driver’s stages, and then the conveyor belt continues onto the GPU, which passes the work through its own stages (vertex shading, rasterization, fragment shading, blending, etc.) before the result appears on the user’s monitor at the very end.

This model generally works very well, because each stage has its own hardware (whether it be a CPU thread or a dedicated bit of GPU hardware designed for a single task), and the stages can all be working on different data at the same time. You just need to keep putting objects on the belt and they will appear on screen eventually. Maximum performance is achieved when the belt is constantly full and each piece of hardware is always busy, which may mean that the application can sometimes even be a few frames ahead of what’s currently shown on screen.

(This analogy holds for both the single-threaded and multi-threaded pipeline; the latter just splits the one Panda rendering stage into several that are on different threads.)

The frame time you see in PStats is actually the time it takes to finish putting a frame’s objects onto the belt. It has little to do with how long it takes for those things to actually appear on screen. The latter is known as the latency, and it may be significantly longer than the time it takes to render a frame, because there are many processes working in parallel, each on a different object.

You may now see the problem with a to-RAM copy that you then need to process on the CPU. You just finished putting the objects for a frame onto the belt, but then you immediately ask to see what’s shown on screen. But it’ll still take a long time for those objects to reach the end of the belt!

So Panda has no choice but to wait until the last object on the belt has completely passed through the entire factory to the end of the line, before some factory worker can look at what’s shown on screen and run back to the application with the results. So you have to wait for the entire latency time plus the time it takes to run back with the results. In the meantime, no new work is being placed onto the belt, and you’ve essentially just wasted a ton of processing power and time.

For some applications, the latency is relatively short, so it does not affect performance much. For most others, when there is a lot of rendering work to do, it can be devastating to performance.

The way to work around this is usually to try to make sure that whatever needs the rendered texture is further downstream on the conveyor belt, so that there is no such stall. Or, ensure that enough work can still be placed onto the belt in the meantime (like getting the results from a frame ago, so that the next frame can still be rendered, just based on old information).
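To put rough numbers on the conveyor-belt picture, here is a toy model in plain Python. It is not Panda3D code: the belt is a deque, a "tick" moves everything one stage along, and PIPELINE_DEPTH is an invented figure. It only illustrates why waiting for each frame's result before submitting the next costs roughly the full latency per frame, while picking up results whenever they arrive does not.

```python
from collections import deque

PIPELINE_DEPTH = 4  # hypothetical number of stages between the app and the screen


def run(frames, read_back_immediately):
    """Return (ticks, results) for submitting `frames` frames to a toy pipeline.

    One tick puts one item (or nothing) on the belt and moves every
    item already on the belt one stage forward.
    """
    belt = deque([None] * PIPELINE_DEPTH)  # belt[0] is the stage nearest the app
    ticks = 0
    results = []

    def tick(new_item):
        nonlocal ticks
        belt.appendleft(new_item)
        finished = belt.pop()  # whatever just fell off the far end
        ticks += 1
        if finished is not None:
            results.append(finished)

    for frame in range(frames):
        tick(frame)
        if read_back_immediately:
            # Stall: submit nothing new until this frame reaches the end.
            while not results or results[-1] != frame:
                tick(None)

    # Let the remaining frames drain so both variants return every result.
    while len(results) < frames:
        tick(None)
    return ticks, results


stalled_ticks, _ = run(10, read_back_immediately=True)
pipelined_ticks, _ = run(10, read_back_immediately=False)
print(stalled_ticks, pipelined_ticks)
```

In this model the stalling variant pays the whole belt length for every frame, whereas the pipelined variant pays it only once at the end. Real drivers and GPUs are far less tidy, so treat the figures as an illustration of the shape of the problem rather than a prediction.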

Could you perhaps clarify your use case so that maybe I can suggest a more specific solution?

Ah, I see–interesting! Thank you for that explanation!

As to my use-case, it is this:

In a certain region within my project, there’s a lot of foliage. (A lot.)

Now, I want that foliage to move as the player passes through it, and to that end I have a set of shaders that essentially render the player’s velocity to a texture, and then use that texture to move the vertices of the foliage.

Up to that point, all works well, I feel!

But note that this only covers the visual aspect of the foliage. And foliage is generally not silent when moved.

So I’ve been looking for a way to associate foliage movement with noise. And what I’ve come up with is this:

An additional render-to-texture is performed, rendering only the foliage, using a cut-down version of the foliage-shader that renders only the “intensity” of the movement, and that renders to a small texture. This texture is then sampled on the Python-side to look for intensities above a certain threshold, in response to which it spawns sounds at the indicated positions.
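As a rough illustration of the sampling step just described, the sketch below scans a small grid of intensities and reports where the threshold is exceeded. It assumes the texture has already been read into nested lists of floats, and that the texture covers an axis-aligned rectangle of the world; the function name, the grid layout and the world_min/world_max parameters are all invented for the example rather than taken from the actual project.

```python
def find_rustle_points(intensities, threshold, world_min, world_max):
    """Return (x, y, intensity) for every texel whose intensity exceeds threshold.

    intensities: list of rows, each a list of floats (hypothetical layout)
    world_min, world_max: (x, y) corners of the region the texture covers
                          (an assumption made for this sketch)
    """
    rows = len(intensities)
    cols = len(intensities[0])
    cell_w = (world_max[0] - world_min[0]) / cols
    cell_h = (world_max[1] - world_min[1]) / rows

    points = []
    for row, line in enumerate(intensities):
        for col, value in enumerate(line):
            if value > threshold:
                # Place the sound at the centre of the texel's world-space cell.
                x = world_min[0] + (col + 0.5) * cell_w
                y = world_min[1] + (row + 0.5) * cell_h
                points.append((x, y, value))
    return points


grid = [
    [0.0, 0.9],
    [0.2, 0.0],
]
hits = find_rustle_points(grid, threshold=0.5, world_min=(0.0, 0.0), world_max=(4.0, 4.0))
print(hits)
```

For a 16x16 texture this is only 256 comparisons per update, so the scan itself should be cheap next to the cost of getting the data back from the GPU.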

This works–it’s rough at the moment, in particular lacking a means to moderate how many sounds are spawned–but it is horribly slow.
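One possible way to moderate how many sounds are spawned is sketched below: cap the number of sounds per update, prefer the strongest movements, and give each texel a cooldown. The RustleLimiter class and its max_per_update and cooldown parameters are hypothetical, and the numbers are placeholders to tune by ear.

```python
class RustleLimiter:
    """Pick at most a few cells per update, skipping cells that played recently."""

    def __init__(self, max_per_update=3, cooldown=0.5):
        self.max_per_update = max_per_update
        self.cooldown = cooldown      # seconds before a cell may play again
        self._last_played = {}        # cell -> time it last triggered a sound

    def select(self, candidates, now):
        """candidates: iterable of (cell, intensity); returns the cells to play."""
        ready = [
            (intensity, cell)
            for cell, intensity in candidates
            if now - self._last_played.get(cell, float("-inf")) >= self.cooldown
        ]
        ready.sort(reverse=True)      # strongest movement first
        chosen = [cell for _, cell in ready[: self.max_per_update]]
        for cell in chosen:
            self._last_played[cell] = now
        return chosen


limiter = RustleLimiter(max_per_update=2, cooldown=1.0)
candidates = [((0, 0), 0.6), ((1, 0), 0.9), ((2, 0), 0.7)]
first = limiter.select(candidates, now=0.0)
second = limiter.select(candidates, now=0.5)
print(first, second)
```

Here the first update plays the two strongest cells, and the second update, half a second later, plays only the remaining cell because the other two are still cooling down.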

(I do have a backup approach on hold, which simply uses colliders to register the player’s presence “near foliage”. This also works, but only produces sound for the player’s position (i.e. ignoring propagation of movement within the foliage).)

Ah. For sound, you probably don’t need an immediate response. You could use the async screenshot feature. Don’t bind a texture with RTM_copy_ram; instead, call get_async_screenshot() on the buffer, which returns an AsyncFuture that will eventually resolve with the downloaded texture data. To get the result when it’s done, use it with await in a coroutine task, set a done event on it, or use add_done_callback. The result may be a few frames delayed, but this will probably not be noticeable for sound, and it won’t introduce a pipeline stall.

You could call it every frame (keeping in mind that you would have multiple requests running concurrently) or you could decide to call it periodically if you don’t need it to update every frame, which may be simpler because you could just have a loop like so:

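async def update():
    # Each iteration waits for one download to finish before requesting the next.
    while True:
        # Suspends this task (not the render loop) until the texture data arrives.
        tex = await buffer.get_async_screenshot()
        # determine_volume is a placeholder for your own analysis of the texels.
        volume = determine_volume(tex.peek())
        rustling_leaves_sound.set_volume(volume)

# The coroutine is handed straight to the task manager.
taskMgr.add(update())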

Hmm… An interesting idea, thank you! I’ll likely give it a shot!

[edit] Ah- Is “get_async_screenshot” a feature new to 1.11…? I’m still on 1.10, and it looks like the method isn’t available there…

In which case–for now, at least–it might be better that I fall back on my backup approach. I intend to save the code, in case I want it when 1.11 comes out!