This is kind of a crunchy problem for kind of a crunchy application. Basically, I’m experimenting with multiple-rendering for a stereoscopic application (it’s kinda cool- I may try to sneak my code out of my workplace and back to the ETC if I get NDA clearance). I have 5 cameras rendering the same scene from different angles, and a shader compositing the camera buffers for final viewing (display-“proprietary” sub-pixel multiplexing of sorts).
Naturally, for any scene worth anything, the code chugs real bad. However, I note that only one of my CPU cores is at 100%, the others are mostly idle, making me think that I’m bottlenecking the CPU before the graphics card. The camera transformations and texture ops are likely the bulk of this overhead, and with the inherent ghosting/bleeding of the display, I’m guessing I could get away with the individual camera render buffers being a little out-of-sync so long as they always contained something the shader could access.
Is the Panda render pipeline at all compatible with asynchronous rendering, and if so, is there a straightforward way to make it happen without, e.g., hacking down into the core camera code and patching it up with my own haphazard threading?
[Panda3D 1.6.2; XP Pro SP2; Core2 quad 2.39GHz; 3.25GB RAM]
One of the goals of Panda’s pipeline design has always been to support asynchronous rendering of this form, and the groundwork has been laid for this for a long time, but the needed polish is not yet there.
Thomas’ point is well-made, though. Pegging your CPU utilization has little to do with proving your CPU is a bottleneck. You’ll need to do some in-depth analysis using PStats, and some basic detective work, to prove this definitively.
There is some code in Panda, particularly on the cvs trunk, to support sharing the same cull traversal between multiple points of view. This can be a significant savings on your CPU when you are rendering the same scene multiple times. It only works if (a) all of the involved points of view have exactly the same render state, and (b) they are largely looking in the same direction. This is usually true for left/right stereo pairs, for instance.
It works like this: Panda will automatically share the cull traversal for any two DisplayRegions that both use the same Camera node. The Camera, in turn, might then have a different lens for each DisplayRegion, which can be pointing in slightly different directions. On the 1.6.x branch, these different lenses must strictly be the left and right eyes of a stereo pair; but on the cvs trunk, they may be any two (or more) arbitrary lenses.
Alright- PStats is reporting roughly a 60/30/10 breakdown between Draw, Cull and App tasks. The time spent in Draw and Cull is more or less an even split between my 5 cameras/buffers, in dr_0 in all cases. Combining the culling would chop time by 10-15%, but I don’t know enough about the draw task to assess whether it’s spending the time doing computation, chewing on RAM, or trying to funnel stuff to the GPU. In the first two cases, however, getting different draw tasks onto different CPUs I’d think would speed things up (particularly if each core has enough of a cache to avoid constant accessing/blocking of system RAM).
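As a rough check on that estimate, here is a minimal sketch of the best-case saving from sharing one cull traversal across all views. It assumes cull cost is split evenly between views and that a shared traversal costs no more than a single view's traversal; the function name is made up for illustration. Under those assumptions the ceiling for a 30% cull share over 5 views is 24% of frame time, so a 10-15% guess sits below it, which is plausible if the shared traversal ends up costing more than one view's pass.

```python
def shared_cull_saving(cull_fraction, num_views):
    """Best-case fraction of frame time saved by sharing one cull pass.

    Assumptions (not measured): cull cost is split evenly across views,
    and one shared traversal costs the same as a single view's traversal.
    """
    if num_views < 1:
        raise ValueError("num_views must be at least 1")
    if not 0.0 <= cull_fraction <= 1.0:
        raise ValueError("cull_fraction must be between 0 and 1")
    per_view_cost = cull_fraction / num_views
    # All views but one no longer pay for their own traversal.
    return cull_fraction - per_view_cost


# 30% of the frame in Cull, 5 cameras, as in the breakdown above.
saving = shared_cull_saving(0.30, 5)
print(f"best-case saving: {saving:.0%} of frame time")
```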
Also, is there a way to control where pstats pops its window up? The difference between running my app in fullscreen and largest-comfortable-window mode is the difference between needing 1024-tall and 2048-tall buffers, which is certainly going to have an impact, but when I run in fullscreen, the pstats window opens on the same monitor as Panda, ‘behind’ the Panda output, and whenever I touch anything on the non-Panda screen, Panda minimizes and halts, never to return, meaning I can’t very well do dynamic load analysis.
Draw is designed to spend as much time as possible funneling stuff to the GPU. The split between cull and draw is that cull is all of the pre-processing work that needs to be done on the CPU before sending data to the card; and draw is actually sending data to the card. In any case, you wouldn’t be able to split draw up onto different CPUs and still draw them all to the same graphics context, since graphics contexts are single-threaded by nature. Splitting up draw properly really requires you to have separately-addressable graphics hardware for each separate draw thread.
Of course, it’s academic anyway, since the required support for this is not yet complete in Panda. You have to find a way to cram it all onto one CPU, sorry.
The easiest way to solve your desktop issue with pstats is to run pstats on a remote machine. This is usually better anyway, since it won’t compete for CPU resources with the host machine.
Alright, well since the company only sees fit to give me one machine, and I’m not allowed to connect any machines they haven’t given me to the network, I guess I’m stuck.
I’m also currently debating using stencil buffering somehow on a per-camera basis since certain pixels in each view are simply never used. I think I’ve mostly wrapped my head around how to properly attrib the scene to interact with a stencil buffer, but unless I’m mistaken, it doesn’t look like there’s a good way to write a custom per-pixel blitting pattern to a viewport/camera’s stencil buffer at the outset, only ways to have pieces of geometry write to any pixel they’re visible on.
Since 2/5 of any given camera’s buffer pixels are never used, masking them out of the render process seems like it would save some time on complex scenes, but I concede that if most of the draw time is just shoveling textures to/from GPU, reducing the time spent filling textures to begin with may not have much impact in my current situation.
Hm… It occurs to me that, for the most part, I shouldn’t need to send massive texturing to/from the GPU to begin with unless I’m running out of space on the GPU. When I’m in custom-view-compositing mode, my 5 camera buffers aren’t used for anything beyond piping back into the shader, so each buffer should just be resident on the card, filled by the card, then accessed by the shader on the card before finally maybe the shader output image has to be piped back to the CPU for tweaking as an output texture. Is there a way to inform Panda that a buffer shouldn’t need to be constantly sync’d between GPU and RAM, or a particular type of buffer that won’t get copied until actively requested, or is this sort of optimization already such a part of Panda and the OpenGL pipeline that it’s probably already happening unless I’m doing something really stupid on my end?
The CPU doesn’t do any processing of render buffers at all, unless you go out of your way to ask it to (and you write the code to do this). I’m not sure what you mean by “output texture”, but Panda certainly won’t automatically ship your textures back to the CPU just to put them onscreen.
Panda also doesn’t reload your textures or vertex buffers unless they’ve been modified, so I’m not sure what you’re thinking about with “shoveling textures to/from GPU”.
During a normal draw cycle, the CPU is alternately issuing state changes and draw-primitive calls. It doesn’t sound like much, but it’s enough: in a complex scene, this can easily consume all of your CPU resources. Usually, a high draw time is indicative of too many Geoms in your scene; a general rule of thumb is to aim for around 300 or so for a consistently good frame rate. Of course, since you are drawing the scene five times, you will have five times the number of Geoms you would have had otherwise. Pstats will tell you how many Geoms you are actually drawing.
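To make the arithmetic concrete, here is a small sketch of what the rule of thumb implies per view. It treats the 300-Geom figure as a total per-frame budget shared evenly by every render pass, which is a simplifying assumption; the names are hypothetical.

```python
RULE_OF_THUMB_GEOMS = 300  # approximate total per frame, from the post above
NUM_VIEWS = 5              # the scene is rendered once per camera


def geoms_per_pass(total_budget, num_passes):
    """Split a total Geom budget evenly across render passes."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    return total_budget // num_passes


per_view = geoms_per_pass(RULE_OF_THUMB_GEOMS, NUM_VIEWS)
print(f"about {per_view} Geoms per view")
```

In other words, a scene rendered five times has roughly a fifth of the usual Geom headroom per view.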
Stencil buffers might help if your bottleneck is pixel fill, particularly if you are using a complex pixel shader. An easy way to check for a pixel fill bottleneck is to see if the frame rate increases when you reduce the size of the window. If pixel fill is not your bottleneck, using a stencil buffer isn’t going to help.
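The window-size test can be turned into a rough number. The sketch below assumes a simple linear model, frame time = fixed cost + per-pixel cost × pixel count, and estimates what share of the full-size frame time scales with pixels. The model, the function name and the sample timings are all illustrative assumptions, not measurements from this thread.

```python
def fill_bound_fraction(time_full, time_small, pixel_ratio):
    """Estimate the share of full-size frame time that scales with pixels.

    time_full   -- frame time at the full window size (any unit)
    time_small  -- frame time at the reduced window size (same unit)
    pixel_ratio -- reduced pixel count / full pixel count, e.g. 0.25
    """
    if not 0.0 < pixel_ratio < 1.0:
        raise ValueError("pixel_ratio must be strictly between 0 and 1")
    if time_full <= 0:
        raise ValueError("time_full must be positive")
    # Solve the two-point linear model for the pixel-dependent part.
    pixel_time = (time_full - time_small) / (1.0 - pixel_ratio)
    # Clamp, since noisy timings can push the estimate out of range.
    return max(0.0, min(1.0, pixel_time / time_full))


# Hypothetical timings: 40 ms full size, 16 ms at a quarter of the pixels.
print(fill_bound_fraction(40.0, 16.0, 0.25))
```

A result near 1 suggests fill dominates, so masking out pixels could pay off; a result near 0 suggests the time is going elsewhere, such as the number of Geoms.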
Part of it does look to be a geoms issue- 400-800 total depending on the app- but render speed definitely changes with window size. I just thought of one way I can optimize my shader a little, but the shader is only being used on one buffer/object to composite the 5 views, and not on any of the chugging cameras rendering the scene proper. So if there is a quick way to access the stencil buffer contents directly, it would be useful.
The 400-800 is, I’m assuming, across all 5 renders, and the scene is minimal enough to begin with (mostly cards and Alice primitives), so I’m going to try to push refinements on the pipeline before trimming it down any more. I’d rather have a ~100 geom scene that is a little sluggish than a ~50 geom scene that’s smooth but entirely unimpressive. [Edit: Also, a number of my demos can’t reasonably be flattened since most nodepaths move independently, although flattening the one demo that could be flattened produced modest results. Further, quartering the window size produced a nearly proportional speedup, so I am likely hitting fill rate independently of any other issues…]
I cut a number of samplings and conditionals out of my shader, for a tangible improvement, but that wasn’t a complete fix.
If it would be possible to set the stencil buffer once from CPU and then have it persist rather than sending it every frame, that would be ideal. Otherwise, I suppose I could attempt a separate shader for use on each camera, somehow, but by the time the fragment shader gets a color input, isn’t it a little late to be deciding whether that particular pixel coordinate should even be processed? That’s fundamentally what I’m after- the ability to say “don’t even bother rendering these specific pixels,” or the next best thing, “if a geom would render to this specific pixel, stop bothering as soon as you realize that.”
Is there a way, short of writing a module in c/c++, to get/handle the sort of array of unsigned chars needed to directly manipulate a texture’s byte data? It would be awesome if I could procedurally write my blit pattern into the texture at runtime rather than loading a pre-made custom-to-current-window-size texture for each camera in each app.
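For the byte-array half of the question, plain Python can build a buffer of unsigned bytes without any C/C++. The sketch below fills a one-channel mask for one view at whatever size the window happens to be. The interleaving rule is a made-up placeholder, not the display's real sub-pixel pattern; it is only chosen so that 3 of every 5 pixels are kept, matching the "2/5 never used" figure. Handing the bytes to a texture is left out, since the exact call depends on the Panda version.

```python
def build_view_mask(width, height, view_index, num_views=5, used_per_cycle=3):
    """Return a bytearray of width*height bytes: 255 = used, 0 = masked.

    The diagonal rule below is a hypothetical stand-in for the real
    display pattern; replace it with the actual multiplexing rule.
    """
    mask = bytearray(width * height)  # starts zero-filled (all masked)
    for y in range(height):
        row_start = y * width
        for x in range(width):
            # Phase of this pixel within the repeating cycle for this view.
            phase = (x + y - view_index) % num_views
            if phase < used_per_cycle:
                mask[row_start + x] = 255
    return mask


# Build a small mask for view 0 and report how much of it is used.
mask = build_view_mask(10, 5, view_index=0)
used = mask.count(255)
print(f"{used} of {len(mask)} pixels used")
```

Because the mask is generated from the width and height, it can be rebuilt whenever the window size changes instead of loading a pre-made image per size.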
Cool- thanks. Turns out I’m not nearly as close as I thought I was on the stenciling in general, though (tried it with loaded textures to begin with), so I may need to poke at that some more before proceeding.