Frame rate performance drops with usage

My Panda3D (1.10.8, Python 8.7, Windows 10, Intel integrated GPU) application’s frame rate (as displayed by the FrameRateMeter) declines over time in the application. I narrowed in on the cause by disabling much of its functionality until I found that managing building-floor LoD while moving around caused the decline in frame rate. That is, with higher-level app functionality disabled:

  • Frame rate is constant if the camera doesn’t move.
  • If floor display is disabled and I move the camera throughout the world for 15 minutes, the frame rate may vary somewhat with displayed complexity, but returns to the vsynced 60 fps when display complexity is not high.
  • If floor display is enabled, the maximum achievable frame rate continuously declines as I move the camera through the world. If I then move the camera to a simple view and stop moving it, the frame rate remains degraded.

During three 10–15 minute tests, frame rate started at a vsynced 60 fps and declined to 35, 37, and 36 fps. Per Windows Task Manager statistics for the application:

  • CPU% started at about 19% and dropped by about 0.5% when resting at the end of the test.
  • Application memory started at 1825 MB, rose to 2750–2830 MB, and then fell back through 1800 MB after 10 minutes of inactivity.
  • GPU% started at (22.5, 25, 24)%, fell to (13, 16, 15)%, and did not recover.

I repeated the test on a Linux laptop with an NVIDIA GPU. The values were different (158 fps at start to 60 fps at finish, without vsync), but the same symptom appeared: frame rate declining with usage.

Output from render.ls() at the start and end of one of the runs is nearly identical, with a 6-node difference due to my not getting the camera back to exactly the starting point.

My Panda3D app has 100+ multi-floor buildings. Each has a transform to local building space; the building’s walls, floors, etc. are defined within building space. Each building exterior is displayed in one of two resolutions managed by a Panda3D LODNode.

Floors are viewed at highly oblique angles, and the detailed textures for several buildings are quite large, so I decided to manage floor texture resolution myself. Each floor in a building is displayed as a small array of rectangular tiles with applied textures. For each tile location in a floor array, I pre-built 6 rectangles that are identical except for the power-of-two resolution of their textures. As the camera approaches a building, I add the floor tiles to the building node and manage the resolution at each tile location by swapping out via detachNode() and swapping in via reparentTo(building_node) the tile with the appropriate-resolution texture. When the camera moves away from a building, I detachNode() all of the building’s floor tiles from the building_node.
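In sketch form, the swapping works something like this (a minimal illustration; the names and distance thresholds here are not my actual code):

    def pick_power(distance):
        # Nearer camera -> higher-resolution variant (index into the
        # 6 pre-built rectangles, ordered lowest to highest resolution).
        if distance < 50:
            return 5
        if distance < 100:
            return 3
        return 0

    def update_floor_tiles(building_node, tile_variants, current, cam_distance):
        # tile_variants[(i, j)] -> list of 6 NodePaths, identical except
        # for texture resolution; current[(i, j)] -> currently attached one.
        power = pick_power(cam_distance)
        for loc, variants in tile_variants.items():
            wanted = variants[power]
            if current.get(loc) is wanted:
                continue
            if loc in current:
                current[loc].detachNode()      # swap out the old resolution
            wanted.reparentTo(building_node)   # swap in the new one
            current[loc] = wanted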

The accumulating decline in frame rate is strongly associated with the accumulated volume of detachNode() and reparentTo() activity. Any ideas on how to maintain performance?

What happens if you display the floors, but disable the code that detaches and reparents the tiles?

There are two possibilities that I’m inclined to suspect right now:

  1. The program is continuously running “detachNode” and “reparentTo” calls, to such an extent that it’s slowing down your program.
    or
  2. Given that you have more than a hundred buildings, each with multiple floors, each of which in turn has multiple tiles (if I understand correctly), perhaps your program simply has too many nodes.

Which brings me back to my question, above: If you see the same performance degradation with the floors visible but the detach-and-reparenting code disabled, then I’m inclined to suspect option (2) above. If the performance degradation disappears, however, then perhaps (1) is the source of the problem, and it may be worth looking at the code that handles your detaching-and-reparenting.

All that said, if I may ask, does ordinary automatic mip-mapping not suffice for your purposes? It seems to me that what you’re doing is more or less what mip-mapping does: swapping in texture variants of a size appropriate to the camera’s distance and angle relative to the target surface, if I’m not much mistaken.
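For reference, requesting mipmapping (with anisotropic filtering, which should help at those oblique angles) might look something like the below, if I’m not much mistaken – the file-name and degree being placeholders, of course:

    from panda3d.core import SamplerState

    tex = loader.loadTexture("floor.png")  # 'loader' as in a ShowBase app
    tex.setMinfilter(SamplerState.FT_linear_mipmap_linear)  # trilinear mipmapping
    tex.setAnisotropicDegree(16)  # anisotropic filtering for oblique views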

My app’s init() sends a one-time message to a task that initializes the environment from several .bam files containing maybe 9000 ModelNodes with NodePaths. After initialization, the currently enabled code makes no NodePath method calls.

I instrumented my program with counters to investigate your possibilities. I increment detach and reparent counters with each detachNode() and reparentTo(), and every 10 seconds a task logs the counts for the last 10 seconds. In the first 10 seconds after initialization, the LoD task executes 317 reparentTo and 6815 detachNode calls to adjust for not being near any building and not needing full resolution on the large, similarly managed, ground plane of tiles.
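The instrumentation is essentially this (simplified; in my real code the increments live next to the detachNode()/reparentTo() calls):

    counters = {"reparent": 0, "detach": 0}  # incremented at each swap

    def log_counters(task):
        print("last 10s: %(reparent)d reparentTo, %(detach)d detachNode"
              % counters)
        counters["reparent"] = counters["detach"] = 0
        return task.again  # repeat after the same 10-second delay

    taskMgr.doMethodLater(10.0, log_counters, "log-swap-counters")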

A render.ls() listing at this point is about 1200 lines long – i.e. roughly 500 nodes.
The 10-second logging then reports 0 reparentTo() and 0 detachNode() as long as I don’t move the camera. When I enter my biggest building, it reports about 1800 reparentTo() and detachNode() calls over 10 seconds, and thereafter between 0 (if I don’t move) and 550 (if I race around in the building) reparentTo() and detachNode() calls every 10 seconds.

When I move away from all buildings, a render.ls() listing is again about 1200 lines long. From what I see, the number of nodes and NodePaths is about the same, both in my application overall and in render specifically, at the beginning and at any point at rest away from buildings. The application doesn’t call reparentTo or detachNode except as needed when moving through a space.

Mipmapping might have been simpler, but being somewhat new to this, I was concerned that I needed more control for several reasons:

  • My virtual reality has semi-transparent walls and floors. I thought that I might need more control over tile size to mitigate sorting issues amongst nearby opaque and transparent objects.
  • I understood that the shader would select the mipmap level by screen size, but didn’t know how the results would look with the expected high anisotropy. I might need to provide anisotropy hints on a tile-by-tile basis.
  • I was totally new to modeling packages and didn’t know of one that would slice and dice a texture into sub-units with mipmaps without artifacts at the interfaces between the sub-units – so I did it myself.
  • Two large adjacent buildings would need their floors displayed simultaneously. The total texture memory required, including other parts of the environment, might exceed the capacity of some target system’s integrated GPU (and I was thinking of X11-style forwarding across a network). Programmatically managing the resolution of each tile in the display tree reduces GPU texture-memory requirements and minimizes the instantaneous application-to-GPU bandwidth to just what is required to display the next few frames.

If I’m not much mistaken, five hundred nodes is quite a lot, presuming that they’re all visible. At the least, that may be contributing to your problem.

You do note that the number of nodes doesn’t seem to vary with your position in the level. However, if I’m not much mistaken, the number of nodes being actively rendered (i.e. not culled away) might still vary. I think that “ls” reports all (attached) nodes in the scene, not the number being actively rendered.

By the way, you might find the “analyze” method more useful for getting node-counts: while it still reports all nodes, I think, it actually reports the number of nodes (amongst other things), rather than listing the nodes themselves.

It’s used just like “ls”, I believe: myNodePath.analyze() (e.g. render.analyze())

As to the number of “detach” and “reparentTo” calls, I’m not sure – what you give there doesn’t seem like all that many per second, but I don’t know what demands those methods place upon the engine, and thus how many is too many.

Perhaps a forum-member better versed in such matters of performance will chime in!

Hmm… I wonder whether that might not be better handled via cull-binning. To the degree that it’s called for at all–Panda already has some handling for this, by virtue of automatically rendering transparent objects after opaque ones, and the former in distant-to-near order, if I’m not much mistaken.

You might–but you might not. I’m inclined to suggest trying the simpler solution first, and only going for the more-complex solution should the former turn out to be insufficient.

That’s fair–but you can also provide manually-made mip-maps to Panda to be used in automatic mip-mapping, if I recall correctly. I don’t know how it’s done offhand–I’ve never done it myself–but I seem to recall that it’s possible.

It might–but have you checked that it’s so? Again, this is something that I’m inclined to look to optimising only should it prove to be a problem.

You really ought to use PStats to tell which parts of the rendering process are being slow. Anything else is guessing.
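With the pstats server running, connecting from the application is a one-liner (or set “want-pstats true” in your Config.prc):

    from panda3d.core import PStatClient
    PStatClient.connect()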

As for the scene graph, if you have many objects, it is imperative that you don’t just have a flat list of them parented to render. Add hierarchy to your scene graph (with objects belonging to a particular room or house parented to a common parent) so that the cull process will be more effective. But again, you won’t know whether the cull process is the culprit until you profile it with PStats.
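For example (the container names here are just for illustration):

    # One parent per building and one per floor, so that the cull pass
    # can reject a whole building, and everything under it, in one test.
    buildings_root = render.attachNewNode("buildings")
    for b, building in enumerate(city):
        building_np = buildings_root.attachNewNode("building-%d" % b)
        for f, floor in enumerate(building.floors):
            floor_np = building_np.attachNewNode("floor-%d" % f)
            for tile_np in floor.tiles:
                tile_np.reparentTo(floor_np)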


That is a fair point–PStats may well give a much clearer indication of where the performance issue lies.

PStats was very helpful.
While I’m moving through the most complex scenes, the sum of Draw, ‘*’, and Cull is about 9 ms. At the more moderate complexity at the beginning and end of this testing, the sum of Draw, ‘*’, and Cull averages about 6 ms/frame.
My scene consists of:

render
    camera
    building*
        LOD with 2 building exteriors
        floor_tile*
        other stuff which is turned off
    ground_plane_tile*

The application has many additional rectangular textured floor and ground tiles in Python data structures which it swaps in and out of the scene depending on camera position.

The problem is time within App.

I drilled into App and saw that the time in my code is insignificant. If I fly through the most complex parts of my environment, I can push it up to 2 ms/frame, but when I stop moving, it has nothing to do and its time drops to unmeasurable.

“Show code” is taking essentially all of the time.
Within “Show code”, all of the time is in garbageCollectStates.
Prior to moving around through my environment, garbageCollectStates averages about 10 ms/frame with spikes to 17–19 ms/frame. Its time per frame grows with accumulated changes from moving the camera around through my environment, until garbageCollectStates averages about 27 ms/frame with spikes of 40–60 ms/frame.

At the end of the test, I returned the camera to about where it was at the beginning. After 15 minutes with no camera motion, and thus no changes to the render tree, garbageCollectStates doesn’t recover – even if I point the camera out into nothing.

render.analyze() at the end of the test produces:

At highest LOD:
1060 total nodes (including 0 instances); 260 LODNodes.
336 transforms; 18% of nodes have some render attribute.
456 Geoms, with 261 GeomVertexDatas and 3 GeomVertexFormats, appear on 331 GeomNodes.
156796 vertices, 150200 normals, 0 colors, 6600 texture coordinates.
GeomVertexData arrays occupy 3795K memory.
GeomPrimitive arrays occupy 457K memory.
31 GeomVertexArrayDatas are redundant, wasting 32K.
122 GeomPrimitive arrays are redundant, wasting 27K.
79484 triangles:
  76186 of these are on 37205 tristrips (2.04774 average tris per strip).
  3298 of these are independent triangles.
72 textures, estimated minimum 6469K texture memory required.


At lowest LOD:
1060 total nodes (including 0 instances); 260 LODNodes.
336 transforms; 18% of nodes have some render attribute.
456 Geoms, with 261 GeomVertexDatas and 3 GeomVertexFormats, appear on 331 GeomNodes.
156796 vertices, 150200 normals, 0 colors, 6600 texture coordinates.
GeomVertexData arrays occupy 3795K memory.
GeomPrimitive arrays occupy 0K memory.
31 GeomVertexArrayDatas are redundant, wasting 32K.
122 GeomPrimitive arrays are redundant, wasting 27K.
79484 triangles:
  76186 of these are on 37205 tristrips (2.04774 average tris per strip).
  3298 of these are independent triangles.
72 textures, estimated minimum 6469K texture memory required.


All nodes:
1189 total nodes (including 0 instances); 260 LODNodes.
336 transforms; 16% of nodes have some render attribute.
710 Geoms, with 390 GeomVertexDatas and 3 GeomVertexFormats, appear on 460 GeomNodes.
197278 vertices, 190682 normals, 0 colors, 6600 texture coordinates.
GeomVertexData arrays occupy 4783K memory.
GeomPrimitive arrays occupy 575K memory.
170 GeomVertexArrayDatas are redundant, wasting 46K.
191 GeomPrimitive arrays are redundant, wasting 44K.
99872 triangles:
  96574 of these are on 47256 tristrips (2.04363 average tris per strip).
  3298 of these are independent triangles.
72 textures, estimated minimum 6469K texture memory required.

The time spent in the garbage collector is moderate at the start and increases as I explore the environment, but it reaches a level at which further exploring does not increase garbage-collecting time. Finally, when the app is not doing anything, the garbage collector stays at the higher time per frame.

This seems correlated with the number of nodes that have at some time been in the scene graph. At start, the app loads many nodes for later use in the scene and quickly places some of them (corresponding to the camera view) into the scene. As the camera explores the environment, the app reparents nodes to the scene, then detaches them as the camera moves on elsewhere. After a couple of tours, most of the loaded nodes have at some point been in the scene. Further exploration does not introduce new nodes to render, and the garbage-collection time levels off.

It seems that every NodePath(GeomNode) that my app introduces into the scene leaves behind a component that the garbage collector must examine, but can never get rid of.

When I was writing the code for managing arrays of textured tiles, I designed one ModelRoot (GeomNode Plane) per location, backed by a number of textures that I would swap onto and off of the ModelRoot. I didn’t understand enough about Panda3D texture semantics to get the texture-setting to work, so I modified my code to create a separate ModelRoot for each texture and to dynamically reparentTo/detachNode the ModelRoots.

Given the performance association with the number of ModelRoots seen by render, I’ve now gone back to this code and modified it to create just a single ModelRoot GeomNode Plane for each location, backed by multiple textures that I dynamically set on a single TextureStage on the ModelRoot. This doesn’t change the number of textures that are or have been in the scene, but it significantly reduces the number of NodePath(ModelRoot)s that have at some time been in the scene.
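The revised scheme amounts to something like this (simplified, with illustrative names):

    from panda3d.core import TextureStage

    shared_stage = TextureStage("tile-stage")  # one stage, reused everywhere

    def set_tile_resolution(tile_np, textures_by_power, power):
        # Replace the texture on the existing node instead of swapping
        # whole nodes in and out of the scene graph.
        tile_np.setTexture(shared_stage, textures_by_power[power], 1)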

When I run the app with this modified code, garbageCollector time starts at 1.4 ms/frame (down from 10 ms) and climbs to 5 ms/frame (down from 27 ms). This restores enough performance headroom that I can proceed with functionality beyond just navigating a rich environment.

I’ll be dynamically adding and removing other objects from the scene. Is there a best practice to avoid a penalty in accumulating garbageCollector time/frame?

Time spent in garbageCollectStates indicates you have too many TransformState or RenderState objects in existence (you’d have to check one of the PStats menus to find out which).

A TransformState represents a unique transformation applied to a node (e.g. setPos, etc.). A RenderState is a unique combination of render attributes, including texture, material, color, shader-input, etc. parameters. If you do a lot of separate setTexture calls on separate nodes (as opposed to doing them on a higher node), or use unique TextureStage objects (rather than reusing them where possible), then you’ll see this count increase.

In principle, if you have many nodes but they all have the same transformation or render attributes applied, this shouldn’t cause an increase in garbageCollectStates (though there are other good reasons for reducing the node count).

You should be able to reduce the number of states by preventing many leaf nodes from having different states.
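For example (floor_np, stage, and the texture variables being placeholders):

    # Pushing shared state up the graph: one setTexture on the parent
    # creates a single RenderState that all of its children inherit...
    floor_np.setTexture(stage, floor_tex, 1)

    # ...whereas giving each leaf its own texture creates one distinct
    # RenderState per leaf:
    # for tile_np, tex in zip(tiles, unique_textures):
    #     tile_np.setTexture(stage, tex, 1)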

Thanks. To make sure that I understand this:

I can use e.g. loadModel to load a model and store its NodePath in a Python data structure. Even though the model has a unique texture and a unique setPos, this does not yet create a TransformState and/or RenderState. When I reparentTo the model into the scene graph, Panda3D (possibly in the Draw process) creates such state objects associated with the model. When I detachNode the model from the scene graph, the added TransformState and/or RenderState remain associated with the model. The states would be candidates for garbage collection, but can’t be collected because of their association with a model that still has a reference count from the NodePath in Python.

I have one common TextureStage that I use with each setTexture(). When modeling floor plans and aerial photography on the ground plane, each Geom will inherently have a unique texture – and thus a unique RenderState. I’m currently representing each building floor plan as an array of instances [i,j] (i.e. separate loadModel() calls) of a unit-square plane model, to which I setTexture, setPos, and setScale. If I need to further reduce states, I could reduce the number of TransformStates by constructing a unique GeomNode Plane for each [i,j] that incorporates its position and scale in its geometry coordinates: more geometry and coordinate storage, but fewer TransformStates and a graphics pipeline less interrupted by transform changes.

There may be a number of states already associated with the model when you load it (most likely, one RenderState per unique material). You can call nodepath.analyze() to print out the statistics for any particular subtree of the scene graph, including a loaded model.

When you call something like setTexture, you are creating a new RenderState on that node, which yields more work for Panda to garbage collect. Same with setPos and a TransformState. I’m not sure that detaching has any effect on the garbage collection process.

You can use flattenLight() (or heavier flatten commands) to flatten positions and certain render states onto the geometry. This is similar to what you are suggesting with GeomNode planes, except Panda would do it automatically.
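For example (floor_np being an illustrative parent of positioned tile nodes):

    floor_np.flattenLight()    # bakes node transforms into the vertices
    # floor_np.flattenStrong() # goes further, also merging Geoms and nodes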

However, instancing a single quad model for each floor tile may not be the right way to go, in my opinion. You end up with a lot of separate Geoms (with separate states), which are going to cause you performance issues down the road. It would be most efficient to learn the procedural-generation classes in Panda (documented in the manual) and build up a building structure from triangles.
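As a starting point, a single textured tile built with those classes might look something like this (a minimal sketch, with illustrative names and layout):

    from panda3d.core import (Geom, GeomNode, GeomTriangles, GeomVertexData,
                              GeomVertexFormat, GeomVertexWriter)

    def make_tile(x, y, size):
        # One textured quad, built directly at its world position so
        # that no per-node transform (TransformState) is needed.
        vdata = GeomVertexData("tile", GeomVertexFormat.getV3t2(),
                               Geom.UHStatic)
        vdata.setNumRows(4)
        vertex = GeomVertexWriter(vdata, "vertex")
        texcoord = GeomVertexWriter(vdata, "texcoord")
        for dx, dy in ((0, 0), (1, 0), (1, 1), (0, 1)):
            vertex.addData3(x + dx * size, y + dy * size, 0)
            texcoord.addData2(dx, dy)
        tris = GeomTriangles(Geom.UHStatic)
        tris.addVertices(0, 1, 2)
        tris.closePrimitive()
        tris.addVertices(0, 2, 3)
        tris.closePrimitive()
        geom = Geom(vdata)
        geom.addPrimitive(tris)
        node = GeomNode("tile")
        node.addGeom(geom)
        return node

    # Usage: render.attachNewNode(make_tile(10, 20, 2)).setTexture(my_tex)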