skinning slow - GeomVertexAnimationSpec.setHardware()?

Running pstats shows that our game spends 80% of its time on skinning, and I’m not really happy with the framerate.
So I thought I’d see if using hardware skinning would help. As I generate the Actors myself, I thought I could use setHardware() on my GeomVertexAnimationSpec.

I’m getting this error:

Assertion failed: _supports_vertex_blend at line 2344 of c:\buildslave\release_sdk_win32\build\panda3d\panda\src\glstuff\glGraphicsStateGuardian_src.cxx
Traceback (most recent call last):
  File "C:\Panda3D-1.8.0\direct\showbase\ShowBase.py", line 1844, in __igLoop
    self.graphicsEngine.renderFrame()
AssertionError: _supports_vertex_blend at line 2344 of c:\buildslave\release_sdk_win32\build\panda3d\panda\src\glstuff\glGraphicsStateGuardian_src.cxx
:task(error): Exception occurred in PythonTask igLoop
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Panda3D-1.8.0\direct\showbase\ShowBase.py", line 2910, in run
    self.taskMgr.run()
  File "C:\Panda3D-1.8.0\direct\task\Task.py", line 502, in run
    self.step()
  File "C:\Panda3D-1.8.0\direct\task\Task.py", line 460, in step
    self.mgr.poll()
  File "C:\Panda3D-1.8.0\direct\showbase\ShowBase.py", line 1844, in __igLoop
    self.graphicsEngine.renderFrame()
AssertionError: _supports_vertex_blend at line 2344 of c:\buildslave\release_sdk_win32\build\panda3d\panda\src\glstuff\glGraphicsStateGuardian_src.cxx

Also what are the arguments “num_transforms” and “indexed_transforms” for? You don’t need to specify them manually when you use setPanda().

Anyway, I hope this is not a hardware limitation; I had hoped Panda does hardware skinning with shaders. The game we are porting to Panda3D uses DirectX 8, though, and I’m not sure such shaders existed back then.
I also hope it’s not because of Panda not being very fast at skinning compared to the original engine of the game.

Maybe I could do some optimizations myself by merging the skeletons of character and clothing Actors into one. Not sure if that would really help, as the performance issue is caused by skinning, not bone transforms?
I also need to frequently change some of the player’s clothing, so if merging their bones will prevent me from doing that, then it won’t be a solution.

I remember reading that the hardware vertex animation is some sort of fixed-function pipeline thing, an ancient technique that was limited to only a few bones and never very popular. It’s not the shader-magic variant :(

You are right, the hardware skinning is not supported on most video cards.
If you expand Animation in pstats this will give you the breakdown of how much time is spent on bones vs. skinning.
The easiest way to reduce CPU used by skinning is to reduce the number of vertices being animated by using LODs, or simplifying your models if LODs are not possible (in a side-scrolling game for example).

Honestly, the original game engine doesn’t seem to be using any shaders or hardware skinning, yet it runs faster than Panda3D on my 2.27 GHz Core i5.
Like I said, we are porting an existing game, so we can’t make more assets ourselves (for LOD), but even if we could take the time and make LODs for around 500 models, it would still be pointless because of the type of the game.

So I’m hoping skinning is not a weak point of Panda3D and there are some optimizations I can make.

Here is some more pstats info:

One thing you could try is a custom Panda3D build with threads disabled and the eigen library enabled. If you are not using threads in your program this should give you a good performance boost without making any changes to your program.
Is it possible that in your original program the animation on the models was playing at a limited frame rate (commonly 24 or 30 fps), and in the Panda3D port it is animating every frame? That would result in much more calculation in the Panda3D version, reducing the overall frame rate. Not sure how you are doing your animation, so this would be a potential thing to watch out for. Also check the number of Geoms in pstats (in one of the menus at the top), if your models are made up of multiple objects it could quickly add up and become a bottleneck.

Never thought threading would make things slower instead of the opposite.
I tried 1.7.2, which I think didn’t use threading, to see if it’s any better, but as the game depends on many things added in 1.8.0 (buildbot), I had to disable all textures and sounds before trying it. Skinning did seem to take about a third of the time, but I’m not sure whether disabling all of the above affected the results.

I could try to compile Panda myself, but the last time I checked you couldn’t compile properly on a 64-bit machine.

Haven’t heard of the Eigen library, had a look and it seems like a replacement for Panda’s math engine. Would be interested to know if it’s faster.

I generate AnimBundles from the 3D animation files, like Panda does for EGG files. There you can specify the frame rate (which is 45), so I don’t think that’s the problem.

The Geom count peaks at around 160.
I don’t think any optimization can be made there, as each geom has its own texture. And things aren’t really bad if no animation is played.

Anyway, thanks for the tips.

PS. If threading can have the opposite effect on performance, maybe it would be a good idea to have two versions of the SDK? Maybe only for the official releases?

I’m not aware of any performance bottlenecks in Panda’s skinning, but it is true that the way the GeomVertexData objects are constructed is critical. If you’re constructing them by hand (instead of using the egg loader), how are you setting up the TransformBlendTables? Make sure that the total number of blends is not unreasonable; you should share the same transform for as many vertices as possible. Also, the order of vertices in the GeomVertexData is important; you should group vertices that share the same transform together consecutively in the GeomVertexData. If your GeomVertexData is inefficiently structured, your skinning performance can suffer by 10x or more.

Using Eigen should add a bit of performance gain, but it’s a nuisance to compile.

You can also try enabling LOD-based animation for another easy win. This avoids computing animation frames as frequently for actors that are farther away from the camera. See Actor.setLODAnimation().

Enabling threads does indeed sacrifice some performance in the single-threaded case; and this is inevitable in any system. This is why the 1.7.2 builds and earlier did not have threads enabled by default; but we finally enabled them because so many people wanted to use threads in their application these days. Providing dual builds is a possibility, but it sure does make things more complicated on our side, and we just don’t have the manpower.

David

I don’t think I understand. I’m reading a 3d file like egg, how can I know which vertices share the same weight or bone index at a reasonable speed? I don’t think I can. Or I completely misunderstood what you said.

I’m afraid LODs won’t make sense for this type of game; besides, the assets we have to work with are already finished. Still, I didn’t know about animation LODs, thanks.

I think that once the buildbot is set up to do it, it will be an automatic process. On the other hand, it will take more space and time, which might be a problem.

Aren’t you creating the TransformBlendTables? And creating the TransformBlend objects that go into those tables? It’s up to you to minimize the number of unique TransformBlend objects (this is the biggest performance bottleneck for skinning). The egg loader does this by sorting the TransformBlend objects into an STL map as it goes.
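To make the idea concrete, here is a rough pure-Python sketch of that deduplication, using a dict in place of the STL map. It is not the egg loader’s actual code; the function name, the (joint name, weight) input layout and the rounding to 4 decimals are all assumptions made for illustration. In a real loader each unique key would correspond to one TransformBlend in the table.

```python
def build_blend_table(vertex_influences):
    """Deduplicate per-vertex skinning influences.

    vertex_influences holds one entry per vertex; each entry is a list of
    (joint_name, weight) pairs (a hypothetical layout for this sketch).
    Returns (unique_blends, blend_index_per_vertex).
    """
    unique_blends = []          # one entry per distinct blend
    index_of = {}               # canonical key -> index into unique_blends
    blend_index_per_vertex = []
    for influences in vertex_influences:
        # Canonical key: drop zero weights, round, and sort by joint so
        # that the same blend always produces the same key.
        key = tuple(sorted(
            (joint, round(weight, 4))
            for joint, weight in influences
            if weight > 0
        ))
        idx = index_of.get(key)
        if idx is None:
            idx = len(unique_blends)
            index_of[key] = idx
            unique_blends.append(key)
        blend_index_per_vertex.append(idx)
    return unique_blends, blend_index_per_vertex


verts = [
    [("spine", 1.0)],
    [("arm", 0.5), ("spine", 0.5)],
    [("spine", 1.0)],
    [("spine", 0.5), ("arm", 0.5)],   # same blend as vertex 1, other order
]
blends, indices = build_blend_table(verts)
print(len(blends), indices)
```

The dict lookup is constant time on average, so the table is built in a single pass over the vertices instead of scanning a list of existing blends for every vertex.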

Animation LOD has nothing to do with creating new assets, but if you’re building an RTS with all objects at an equal distance from the camera, then yeah, LOD doesn’t make sense.

Yeah, but there’s already a whole bunch of different builds, and maintaining all of them isn’t free.

David

So, the egg loader actually checks if vertices share the same bones and weights and uses the same TransformBlend object if so?

Yes, absolutely. It also reorders vertices so that vertices that share the same TransformBlend object are consecutive.

For that matter, it also collapses identical vertices, and collects neighboring triangles into triangle strips. It does these and many similar such optimizations in all geometry it loads.
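The reordering step can be sketched in plain Python as well. Again this is only an illustration under assumed inputs (a list giving the blend index of each vertex, and triangles as tuples of vertex ids), not what the egg loader literally does: the vertices are sorted by blend index, and the triangle indices are remapped to the new positions.

```python
def group_vertices_by_blend(blend_index_per_vertex, triangles):
    """Reorder vertices so that equal blend indices are consecutive.

    Returns (order, new_triangles): order[i] is the old id of the vertex
    now stored at position i, and new_triangles uses the new positions.
    """
    # sorted() is stable, so vertices keep their relative order
    # within each blend group.
    order = sorted(range(len(blend_index_per_vertex)),
                   key=lambda v: blend_index_per_vertex[v])
    # Map each old vertex id to its new position.
    remap = {old: new for new, old in enumerate(order)}
    new_triangles = [tuple(remap[v] for v in tri) for tri in triangles]
    return order, new_triangles


blend_index = [0, 1, 0, 1]            # hypothetical per-vertex blend ids
tris = [(0, 1, 2), (1, 2, 3)]
order, new_tris = group_vertices_by_blend(blend_index, tris)
print(order, new_tris)
```

The vertex rows would then be written out in the order given by `order`, so each blend covers one contiguous run of vertices.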

David

Well, some of our loader code is in Python right now, so that’s why I wasn’t a fan of doing rendering optimizations at the expense of load time.

But I tried your suggestion: I used Python lists and checked each time whether an identical blend already existed. It was waay slower to load, but I don’t see any difference.
For one model I narrowed the BlendTable entries from 16000 to 500, but I see no difference in the framerate or the pstats graphs.
Is it possible Panda does the optimizations itself anyway?

Panda will not attempt to optimize an already-created TransformBlendTable; and I’m truly astonished that you didn’t see any performance difference at all after reducing the table size by 96%.

Would it be possible for you to use NodePath.writeBamFile() to save out a created Character and AnimBundleNode to a bam file you could send to me for my own inspection?

David

Yes, I also made screenshots of the pstats graphs and compared them side by side; they seem identical.

Okay, I’ll pm you the dumped bams.