Actor LOD performance

I’ve been experimenting with the Actor LOD system and am having trouble with the performance. When I don’t use any LODs, the performance is better than when I do. Note that I’m just using a decimated version of the base model as my LOD model, and everything, including the animations, works, but the performance tanks when I add the LOD to the model. Am I doing something wrong?

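# Done inside a custom Actor class

m = ModelNodepath  # NodePath of the decimated model (loaded via panda3d-gltf)
# Switch-in distance of the last existing LOD level, reused as the new level's out distance
r = self.getLODNode().getIn(self.getLODNode().getNumSwitches() - 1)
name = 'new lod'
self.addLOD(name, 50, r)
self.loadModel(m, partName=name + '_lod_root', lodName=name, autoBindAnims=True)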

When I set the partName argument to “modelRoot”, the Actor’s animations no longer play, as if something broke. I think this has something to do with each LOD being its own individual character with its own individual skeleton. If this is the case, I would need to know how to bind the LOD model to the base skeleton. Note that I am using hardware skinning, which works perfectly fine, and that I’m using the panda3d-gltf library to load my models, so I pass the model as a NodePath into the self.loadModel function.

The documentation wasn’t too explicit about how the LODs actually work. The docs say that the animations of the model are bound to something, but they are not very specific. From my experiments, it looks like the animations are only synced, not bound through the skeleton.

Am I missing something?

To narrow down your guesses, you can use PStats.

Without using LODs: [PStats screenshot]

Using LODs: [PStats screenshot]

Hmm, I’m actually really surprised by how much idle time (Wait) there is in general across both tests. Could there be a bottleneck somewhere? Anyway, the test does confirm that the LODs are slowing down overall performance, at least on a large scale (900 active actors with hardware skinning). I’m losing about 1/4 of the performance with LODs. The code is very simple:

Non-LOD code:

from direct.actor.Actor import Actor as PandaActor

# Assumes a ShowBase instance already exists, so that `render` is available.
r = 30  # grid size: r * r = 900 actors
actors = []

for x in range(r):
    for y in range(r):
        a = PandaActor(models={'lod1': 'npc.bam',
                               #'lod2': 'npc1.bam',
                               #'lod3': 'npc2.bam'
                               },
                       anims={'ArmatureAction': 'npc.bam'},
                       mergeLODBundles=True)

        a.reparentTo(render)
        a.setPos(x * 2, 0.0, y * 2)
        a.setLOD('lod1', 1000, 0)
        #a.setLOD('lod2', 30, 10)
        #a.setLOD('lod3', 1000, 30)

        #a.loop('ArmatureAction', True)
        actors.append(a)

# Start the animation on every actor; the list can be iterated directly.
for actor in actors:
    actor.loop('ArmatureAction', True)

LOD code:

from direct.actor.Actor import Actor as PandaActor

# Assumes a ShowBase instance already exists, so that `render` is available.
r = 30  # grid size: r * r = 900 actors
actors = []

for x in range(r):
    for y in range(r):
        a = PandaActor(models={'lod1': 'npc.bam',
                               'lod2': 'npc1.bam',
                               'lod3': 'npc2.bam'
                               },
                       anims={'ArmatureAction': 'npc.bam'},
                       mergeLODBundles=True)

        a.reparentTo(render)
        a.setPos(x * 2, 0.0, y * 2)
        a.setLOD('lod1', 10, 0)
        a.setLOD('lod2', 30, 10)
        a.setLOD('lod3', 1000, 30)

        #a.loop('ArmatureAction', True)
        actors.append(a)

# Start the animation on every actor; the list can be iterated directly.
for actor in actors:
    actor.loop('ArmatureAction', True)

Sorry, I forgot to mention that I am using the multithreaded pipeline and that the idle wait time is caused by the threading. Here is some new data:

Non-LOD: [PStats screenshot]

With LOD: [PStats screenshot]

I guess I can’t use multithreading with PStats. Anyway, we now get a cleaner look at performance and can see all 3 stages of culling and drawing. In fact, it does look like the LODs give a slightly lower frame time than no LODs, but the difference is negligible. With 100 Actors and LODs I get 1.754386 ms, and without LODs I get 1.9047619 ms, which is not the performance gain I expected. These LOD models are highly decimated and have about a 10th of the triangle count of the base model. It looks like multithreading actually tanks the performance of LODs (although threading still outperforms non-threaded LODs).
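For reference, the two frame times quoted above can be converted to frame rates and a relative saving. This is only arithmetic on the numbers from the post; the helper names are made up for illustration.

```python
def frame_time_to_fps(frame_time_ms: float) -> float:
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


def relative_saving(before_ms: float, after_ms: float) -> float:
    """Fraction of frame time saved when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


with_lods_ms = 1.754386      # 100 actors, LODs enabled (from the post)
without_lods_ms = 1.9047619  # 100 actors, no LODs (from the post)

fps_with = frame_time_to_fps(with_lods_ms)
fps_without = frame_time_to_fps(without_lods_ms)
saving = relative_saving(without_lods_ms, with_lods_ms)

print(f"with LODs:    {fps_with:.1f} fps")
print(f"without LODs: {fps_without:.1f} fps")
print(f"frame time saved: {saving:.1%}")
```

That works out to roughly 570 fps versus 525 fps, a saving of about 8% of the frame time, despite the LOD models having around a tenth of the triangles.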

I think you also missed the fact that any optimization puts a strain on the CPU. A simple example is culling the geometry before sending it to the GPU. Since the system consists of queues of tasks, you have in effect optimized one task with the help of another, and that second task ended up taking the time you saved.

Are you saying that, by adding the LOD system to reduce overhead, the new optimization (the LOD system itself) causes a net performance loss because it costs more to run than it saves? If this is true, then I will have to look at different methods to optimize Actors. (I mean, there is instancing, but I want LODs, as my target is to render individual, unique actors.)

I don’t think it costs more specifically, but the overhead decreases in one place (GPU) and increases in another (CPU). As a result, things balance out, like on an old pair of scales.

Huh, I guess that is the case. I reckon that a classic LOD system, where I just disable and enable different actors, would probably be better. Instead of parenting skeletons and syncing different models to each other, it would be more performant to disable and re-enable them. I’m going to go and experiment with that.
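A rough sketch of that "classic" approach, assuming one separate Actor per detail level and reusing the 10/30/1000 switch distances from the LOD test above. The selection logic is plain Python; the names `LodLevel`, `select_level` and `update_visibility` are hypothetical, and the comment only hints at where Panda3D calls such as `NodePath.hide()` and `NodePath.show()` might go. Whether this is actually faster would need measuring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LodLevel:
    name: str    # e.g. 'lod1'
    near: float  # level is used when near <= distance < far
    far: float


# Switch distances taken from the setLOD() calls in the LOD test.
LEVELS = [
    LodLevel('lod1', 0.0, 10.0),
    LodLevel('lod2', 10.0, 30.0),
    LodLevel('lod3', 30.0, 1000.0),
]


def select_level(distance: float, levels: List[LodLevel]) -> Optional[str]:
    """Return the name of the level covering this distance, or None if out of range."""
    for level in levels:
        if level.near <= distance < level.far:
            return level.name
    return None


def update_visibility(distance: float, levels: List[LodLevel], current: Optional[str]):
    """Return (new_level, to_hide, to_show) so that only changed levels are touched."""
    new_level = select_level(distance, levels)
    if new_level == current:
        return current, None, None
    # In Panda3D this is where one might, hypothetically, call
    # hide() on the old level's Actor and show() on the new one.
    return new_level, current, new_level


print(select_level(5.0, LEVELS))     # lod1
print(select_level(25.0, LEVELS))    # lod2
print(select_level(2000.0, LEVELS))  # None
```

Returning the level to hide and the level to show separately means the scene graph is only touched when an actor actually crosses a switch distance, not every frame.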

Here is my understanding: The LOD system as you use it reduces the number of triangles per mesh, at the cost of a slight CPU overhead. On old (and I mean REALLY old) GPUs, that was helpful, because there was a relatively low limit on the number of triangles that could be drawn per frame. On modern GPUs, that is much less of an issue; the motto now is “A million triangles? That’s nothing!”

What you can do is to reduce the number of meshes, e.g. using RigidBodyCombiner to turn a whole group of trees into a single mesh that is displayed at a distance, where their motion is not visible to begin with. Or, given that you’re making a forest, you could use hardware instancing for much greater gains.

What @Baribal said. LOD only helps if you’re hitting triangle workload limits, but you have to go really crazy to do that on modern GPUs. What you’ve mostly added here is assorted scene traversal overhead, which is going to become a bottleneck much more quickly, so you see performance tank.

PStats works just fine with the multi-threaded rendering pipeline, but you have to keep in mind that there are now multiple Frame graphs, one for each thread. You can visit the graphs for other threads via the different thread menus in the menu bar.

You can get more fine-grained information by double-clicking the categories on the left side of the time graph.

Does anyone have some other alternatives? I was thinking that detaching an Actor LOD level from the scene when it is not in use could potentially work well, but I have yet to build and test that. The biggest problem with that would most likely be syncing up the animations.

What is the problem you are trying to solve in the first place?

I’m just trying to push as many actors into the scene at once as possible. In short, I want some form of aggressive Actor LOD optimization that allows me to render more actors. It’s like in those medieval war games where there are thousands of soldiers on the screen at once. I want to achieve that level of performance. I know instancing could go a long way, but what if I want the animations of each actor to be different or to play at a different time?

This is possible, you just need to provide each copy with an individual animation table, just like you do with transformations.

Hmm, should this be done in C++? Reading and sending new poses for the bones every frame sounds a little slow if done in Python, when accounting for thousands of animated models in the scene.

You need to approach this methodically. What is the actual bottleneck you’re hitting when scaling up the number of actors? You can figure that out with PStats. If it’s in vertex animation, you can optimize by enabling hardware vertex animation. If it’s in updating the character, then it may be advantageous to use instancing (maybe choose N different frame offsets for actors playing the same animation and distribute those around). If it’s in something else, then not even instancing may help.
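One way to read the "N different frame offsets" suggestion: split an animation into N evenly spaced start frames and assign each actor to one of them, so that actors in the same bucket share a pose and could, in principle, share instanced animation data. A minimal sketch in plain Python; the function names and the round-robin assignment are assumptions for illustration and are not part of Panda3D.

```python
from collections import defaultdict


def frame_offsets(num_frames: int, num_buckets: int) -> list:
    """Evenly spaced start frames across the animation, one per bucket."""
    return [(i * num_frames) // num_buckets for i in range(num_buckets)]


def assign_buckets(num_actors: int, num_buckets: int) -> dict:
    """Round-robin actors (by index) over buckets: bucket -> list of actor indices."""
    buckets = defaultdict(list)
    for actor_index in range(num_actors):
        buckets[actor_index % num_buckets].append(actor_index)
    return dict(buckets)


# 900 actors as in the stress test above; the 40-frame animation length and
# the choice of 8 buckets are made-up numbers.
offsets = frame_offsets(num_frames=40, num_buckets=8)
buckets = assign_buckets(num_actors=900, num_buckets=8)

print(offsets)                           # [0, 5, 10, 15, 20, 25, 30, 35]
print(len(buckets[0]), len(buckets[7]))  # 113 112
```

With this layout only 8 distinct poses would be in flight at any time instead of 900. Shuffling the actor indices before assignment might hide the regular pattern that round-robin produces on a grid.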