Increasing performance: "Primitives" in PStats

I’ve been experiencing a somewhat-low frame-rate in my current project. In the hopes of diagnosing the source of the issue, I turned to PStats, and I could now use some help with interpreting the results.

Note that the results described below were obtained with “gl-finish” set to “1”; without that I end up with a significant chunk of time spent in “clear”.

Looking at the graph produced by PStats, I seem to have a lot of time spent in “Draw->Primitive”, and specifically in its sub-graph labelled “Setup”. The “Primitive batches” graph showed a value of over one hundred, I think; flattening reduced that number to around fifty, but this doesn’t seem to have produced a significant increase in the frame-rate. I do note that the primitives in question seem to be predominantly categorised under “Triangles”, rather than “Triangle-strips”.

So, what is “Setup”, and is there anything that I might do to reduce the amount of time that it’s taking up? Are there still too many primitive batches? Are there any other values that I should look at?

Finally, I do note that reducing the complexity of my custom CG shaders seemed to help: while “Draw->Primitive->Setup” still predominated, the associated value was slightly reduced and the frame-rate increased a little. (I do realise that I should probably be using GLSL shaders; these shaders are hold-overs from before I picked up GLSL, and aren’t intended to be part of the final product.)

Instead of setting gl-finish 1, you may consider setting “pstats-gpu-timing 1” if your GPU supports it, and switching to the GPU results in the PStats menu. It may give more useful results.
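In PRC terms (whether in your Config.prc or via loadPrcFileData), that would look something like:

```
# Ask the driver for GPU timer queries instead of stalling the CPU:
pstats-gpu-timing 1
# gl-finish 1  <- no longer needed for this purpose
```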

I wonder, do you have a lot of geometry that is being modified each frame, for example animated models? You may benefit from enabling hardware skinning.
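If it does turn out to apply: note that hardware skinning also needs a shader that supports it. With the shader generator (setShaderAuto), I believe it should be enough to set something like:

```
# Allow the shader generator to animate (skin) vertices on the GPU:
hardware-animated-vertices true
```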

Ah, that’s an option that I wasn’t aware of! Interesting, and thank you! :slight_smile:

I presume that “the GPU results” refers to the graph produced by clicking on the name of my GPU in the main PStats window, and then selecting either “Frame” or “Piano Roll”. The results below are based on that presumption.

What I see now are the following:

  • “Frame” is nearly evenly split between “Draw” and, well, “Frame” (which I presume is just miscellaneous time spent in the frame).
  • Under “Draw”, most of the time appears to be spent in “window1->dr_0”.
    ◦ There is only one such “window” entry, and a total of three “dr_” entries; of the “dr_” entries, only “dr_0” seems to account for much time.
    ◦ The piano roll, if I’m reading it correctly, seems to bear this out: lots of time in “Frame” and “window1->dr_0”.
    ◦ If I set both “gl-finish 1” and “pstats-gpu-timing 1”, I seem to see less time in “Frame” and more in “Frame->Draw->Set State”, although “Frame->Draw->window1” still seems to account for more time.

I don’t believe so, no–not unless I’ve done something wrong somewhere. The scene is actually fairly simple: some terrain geometry, a small building with some interior objects and one Actor. There are a fair few Bullet objects, but PStats seems to indicate that Bullet isn’t a significant issue at this point. I also have a fair few logical objects, many of which have a few PandaNodes associated with them, but most of which don’t terminate in GeomNodes.

Hmm… Looking at the output of “render.analyze()”, I do see that I have:

101 total nodes (including 0 instances); 0 LODNodes.
54 transforms; 10% of nodes have some render attribute.
26 Geoms, with 22 GeomVertexDatas and 5 GeomVertexFormats, appear on 16 GeomNodes.

Does that seem like too many, or are those numbers reasonable?

Hmm, that’s not a lot of nodes. I’m still lacking information that might explain bad performance with this. Do you happen to have a complex GUI? Can you show the output of render2d.analyze()?

How complex is your geometry? Does this not show up in the analyze() output?

What is the actual frame time you’re seeing with this?

Before I go on, I feel that I should note that this analysis is being done on my development computer, which is somewhat slow. I’d been ignoring the framerate observed there until testing on a faster computer also showed a surprisingly low frame-rate.

It’s also worth noting that, on the test-computer at least, reducing the resolution at which the game runs has a significant effect on the frame-rate, which I imagine suggests that the problem is less likely one of scene complexity.

Here’s the full output of “render.analyze()” (including the above):

101 total nodes (including 0 instances); 0 LODNodes.
54 transforms; 10% of nodes have some render attribute.
26 Geoms, with 22 GeomVertexDatas and 5 GeomVertexFormats, appear on 16 GeomNodes.
26112 vertices, 26112 normals, 23158 colors, 26112 texture coordinates.
7274 normals are too long, 1356 are too short.  Average normal length is 1.11484
GeomVertexData arrays occupy 1443K memory.
GeomPrimitive arrays occupy 189K memory.
1 GeomPrimitive arrays are redundant, wasting 1K.
32137 triangles:
  108 of these are on 23 tristrips (4.69565 average tris per strip).
  32029 of these are independent triangles.
14 textures, estimated minimum 5624K texture memory required.
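As a back-of-the-envelope note on those numbers: an independent triangle costs three indices, while a strip of n triangles costs only n + 2, so the split above can be sketched as follows (using the figures from analyze()):

```python
def indices_for_independent_tris(num_tris):
    # Each independent triangle is sent as its own 3 indices.
    return 3 * num_tris

def indices_for_strips(num_tris, num_strips):
    # A strip of n triangles needs n + 2 indices; each strip pays the +2 once.
    return num_tris + 2 * num_strips

print(indices_for_independent_tris(32029))  # 96087 indices
print(indices_for_strips(108, 23))          # 154 indices for the stripped portion
```

(That said, I understand that modern drivers handle indexed triangle lists quite well, so this is likely more a matter of index-buffer size than the main cause of the “Setup” time.)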

Not really, I don’t believe. There’s some complexity to it, but most of the UI is hidden during the periods of gameplay that I’m looking at.

Here you go:

217 total nodes (including 0 instances); 0 LODNodes.
112 transforms; 28% of nodes have some render attribute.
16 Geoms, with 16 GeomVertexDatas and 2 GeomVertexFormats, appear on 16 GeomNodes.
180 vertices, 180 normals, 0 colors, 180 texture coordinates.
GeomVertexData arrays occupy 6K memory.
GeomPrimitive arrays occupy 1K memory.
13 GeomVertexArrayDatas are redundant, wasting 2K.
110 triangles:
  30 of these are on 15 tristrips (2 average tris per strip).
  80 of these are independent triangles.
14 textures, estimated minimum 7164K texture memory required.

Hmm… I’m actually surprised that I have quite so many nodes below render2d. Perhaps my UI is more complex than I thought. Nevertheless, most of the UI should be hidden at the point at which the above call to “analyze()” was made, if I’m not much mistaken.

On my development computer, with PStats running, “gl-finish 1”, and “pstats-gpu-timing 1”, and if I’m reading this correctly, I seem to have a frame-time of around 53ms.
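For context, that frame-time works out to roughly nineteen frames per second:

```python
frame_time_ms = 53
fps = 1000 / frame_time_ms
print(round(fps, 1))  # Roughly 18.9 frames per second
```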

It might be worth noting that the frame-rate varies significantly depending on where I look within the scene; the above frame-time was taken from a vantage that I’ve been using as a more-or-less constant for the purposes of testing.

(Is it feasible to run PStats with a distributable version? I’m guessing not, but it might be helpful to be able to see the relevant graphs on the testing computer, and I’d rather not install my development tools on that machine.)

Firstly, don’t set gl-finish. You’ll just get skewed readings since gl-finish severely impairs your performance.

Does the framerate improve roughly linearly when you resize the window to half the size? If so, you’re probably fill-rate bound, and your fragment shader might be causing the issue. I’d like to take a closer look at your shaders, if possible.

You could also use apitrace to collect a trace and then send that to me so I can take a look at which OpenGL calls your program is making and which textures/shaders are being used.

Well, my computer just overheated and cost me my original reply, so the following will likely be rather less detailed than it originally would have been. ^^; (I’ve been having some intermittent issues with overheating on my development machine, exacerbated, I daresay, by the fact that it’s summer in this part of the world.)

Fair enough. I was basing my use of it on this old thread; as in that thread, I was seeing a lot of time in “clear”, as I recall. However, I do now see that the GPU graph produced via “pstats-gpu-timing” seems to be largely unaffected by the removal of “gl-finish”–can I take it, then, that “pstats-gpu-timing” is a partial replacement for “gl-finish” in dealing with the issue of time spent in “clear”?

If I guess correctly that “half the size” should be interpreted as “half by area”, rather than “half by dimension” or “half by perimeter”, then on my development computer it does appear to be roughly linear, yes.
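To spell out the distinction that I was drawing (with a purely hypothetical 1280x720 window): halving each dimension quarters the number of pixels shaded, whereas “half by area” merely halves it.

```python
def pixel_count(width, height):
    # Number of pixels the fragment shader has to fill each frame.
    return width * height

full = pixel_count(1280, 720)      # Hypothetical full resolution
half_dims = pixel_count(640, 360)  # "Half the size" in each dimension
print(full // half_dims)           # 4: a quarter of the pixels, not half
```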

I don’t mind, but I’ll note here that I have four shader files (one of which is a “common” shader, included by the others), and would want to include some Python code as context. Should I perhaps PM the relevant code to you?

(For what it’s worth, when tinkering with these shaders I’ve found most of my gains seem to come from reducing the number of “tex2D” calls; even significantly simplifying my perhaps-slightly-arcane lighting shader seemed to have little effect.)

I’m not familiar with apitrace, so I hope that I’ve installed the correct program. ^^;

That said, it appears to want an executable to run, so I may get back to you on this when I next create a distributable build.

Yes, pstats-gpu-timing avoids the need to use gl-finish since it measures GPU time and not CPU time.

You can just point apitrace at the python executable. Since we’ve established that it’s probably the fragment shader, sending me the fragment shaders would work too.
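Something along these lines should do the trick (with “main.py” standing in for your actual entry-point script):

```
# Record every GL call made while the game runs:
apitrace trace python main.py

# Replay or inspect the resulting trace afterwards
# (the trace file is named after the traced program):
apitrace replay python.trace
```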

It’s not possible to profile a distributable application, because debugging and profiling information is compiled out of the runtime Panda3D distribution.

Ah, fair enough, and thanks!

Ah, right–I feel silly for not having thought of that. ^^;

You should have a PM containing both the trace and the shaders–use whichever you prefer.

Ah, I feel silly for not having thought of that, especially with regards to PStats usage! ^^; (I did think that apitrace might be able to work in a manner similar to a DLL-tracer, but presumably not.)