The transfer itself isn’t slow; it’s actually pretty fast. The problem is that it requires a so-called pipeline stall. Let me explain.
You should see the rendering pipeline as a production line with a long conveyor belt: on one end is your application, putting objects, their positions and textures on the belt. They pass through various rendering stages: Panda’s cull and draw, then the driver’s stages, and then the conveyor belt continues onto the GPU, which also has the work pass through various stages (vertex shading, rasterization, fragment shading, blending, etc.) before the result appears on the user’s monitor at the very end.
This model generally works pretty well, because each stage has its own hardware (whether it be a CPU thread or a dedicated bit of GPU hardware designed for a single task), so the stages can all be working on different data at the same time. You just need to keep putting objects on the belt and they will appear on screen eventually. Maximum performance is achieved when the belt is constantly full and each piece of hardware is always busy, which may mean that the application is sometimes even a few frames ahead of what’s currently shown on screen.
(This analogy holds for both the single-threaded and multi-threaded pipeline; the latter just splits the one Panda rendering stage into several that are on different threads.)
The frame time you see in PStats is the time it takes to finish putting a frame’s objects onto the belt. It has little to do with how long it takes for those things to actually appear on screen. That second quantity is known as the latency, and it may be significantly longer than the time it takes to render a frame, because there are many processes working in parallel, each on a different object.
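To put some numbers on the difference between frame time and latency, here is a rough toy model in plain Python. It does not use any Panda3D API. The stage names and timings are invented for illustration, and it assumes an idealized belt where every stage works on a different frame at the same time.

```python
# Toy model of a pipelined renderer. The stage names and the millisecond
# values are made up for illustration; they are not measurements.
STAGE_MS = {"app": 2.0, "cull": 3.0, "draw": 4.0, "driver": 2.0, "gpu": 6.0}


def frame_interval_ms(stages):
    """Steady-state time between finished frames.

    With every stage busy on a different frame, the slowest stage
    sets the pace of the whole belt.
    """
    return max(stages.values())


def latency_ms(stages):
    """Time for a single frame to travel the whole belt, start to end."""
    return sum(stages.values())


interval = frame_interval_ms(STAGE_MS)
latency = latency_ms(STAGE_MS)
print(f"a new frame finishes every {interval} ms")
print(f"but each frame takes {latency} ms to reach the screen")
print(f"so roughly {latency / interval:.1f} frames are on the belt at once")
```

In this model the latency is several times the interval between frames, which is why the application can be a few frames ahead of the screen.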
You may now see the problem with a to-RAM copy that you then need to process on the CPU. You just finished putting the objects for a frame onto the belt, but then you immediately ask to see what’s shown on screen. But it’ll still take a long time for those objects to reach the end of the belt!
So Panda has no choice but to wait until the last object on the belt has completely passed through the entire factory to the end of the line, before some factory worker can look at what’s shown on screen and run back to the application with the results. So you have to wait for the entire latency time plus the time it takes to run back with the results. In the meantime, no new work is being placed onto the belt, and you’ve essentially just wasted a ton of processing power and time.
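As a back-of-the-envelope sketch of what that waiting costs, the toy calculation below compares a belt that is kept full with one that is drained after every frame. It is plain Python with invented stage timings and an assumed readback time, and it makes no claim about how Panda or any driver actually schedules work.

```python
def pipelined_total_ms(stage_ms, frames):
    """Total time when the belt is kept full.

    The first frame needs the full latency; every later frame finishes
    one bottleneck interval after the previous one.
    """
    return sum(stage_ms) + (frames - 1) * max(stage_ms)


def stalled_total_ms(stage_ms, frames, readback_ms):
    """Total time when every frame is read back right after submission.

    Each frame has to travel the entire belt, and the result has to be
    carried back, before the next frame can even be started.
    """
    return frames * (sum(stage_ms) + readback_ms)


# Hypothetical per-stage costs in milliseconds (app, cull, draw, driver, GPU).
stages = [2.0, 3.0, 4.0, 2.0, 6.0]
frames = 100

full_belt = pipelined_total_ms(stages, frames)
with_stall = stalled_total_ms(stages, frames, readback_ms=1.0)
print(f"belt kept full:     {full_belt} ms")
print(f"stall every frame:  {with_stall} ms")
```

Note that the readback itself is only a small part of the stalled total in this model; most of the extra time is the belt sitting empty.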
For some applications, the latency is relatively short, so it does not affect performance much. For most others, when there is a lot of rendering work to do, it can be devastating to performance.
The way to work around this is usually to try to make sure that whatever needs the rendered texture is further downstream on the conveyor belt, so that there is no such stall. Or, ensure that enough work can still be placed onto the belt in the meantime (like getting the results from a frame ago, so that the next frame can still be rendered, just based on old information).
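The second option (using the results from a frame ago) can be sketched as a small queue of frames that are still on the belt. This is only an illustration of the bookkeeping in plain Python: `run_frames` and its `delay` parameter are hypothetical names, and no real rendering or Panda3D call is involved.

```python
from collections import deque


def run_frames(frame_count, delay=1):
    """Return (submitted_frame, consumed_frame) pairs.

    Each frame is submitted without waiting for it. The application only
    consumes the result of a frame that was submitted `delay` frames
    earlier, which has had time to reach the end of the belt.
    """
    in_flight = deque()
    consumed = []
    for frame in range(frame_count):
        in_flight.append(frame)  # put this frame's work on the belt
        if len(in_flight) > delay:
            # The oldest frame is assumed to be finished by now, so
            # reading its result does not force the belt to drain.
            consumed.append((frame, in_flight.popleft()))
    return consumed


for submitted, ready in run_frames(4):
    print(f"while submitting frame {submitted}, use the result of frame {ready}")
```

The price is that the CPU-side logic always works with information that is `delay` frames old, so this only fits use cases that can tolerate that.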
Could you perhaps clarify your use case so that maybe I can suggest a more specific solution?