Low performance until pstats is connected

eldee · November 18, 2020, 8:21pm

I discovered a weird behaviour recently in my app: the performances are sometimes awful, instead of taking around 5ms to render a frame, the time skyrockets to 30-50ms ! However, the moment I connect the app to pstats the performances are back as expected and the render time drops again to 5 ms. And if I disconnect the app from pstats, the problem occurs again.

This happens when the camera is close to a body, so there is lots of quadtree parsing and culling occurring, but the total time to do perform this is around 2ms, so, far from the tens of ms I observe. And, if I disable the quadtree traversal and update, the performance drop disappears too !

For the sake of completeness, I should mention that I’m using the official Panda3D SDK 1.10.7 on Ubuntu/Mint 20.04, I tested with both vsync enabled and disabled and that the quadtree implementation is in Python

I honestly can’t see what is the link between pstats, my code and this weird behaviour… nor where I should look at to find the root cause

eldee · November 18, 2020, 8:53pm

The problem still occurs even if I disable the spawning/removal of the patches in the scene, it seems the problem is linked with the processing time of the quadtree alone (But it’s about 2ms when pstats is active).

Also, when the problem occurs, the rendering time is no longer constants instead it varies wildly up to more than 100ms (at least if I can trust the render time returned by globalClock.get_dt() )

The scene itself is not really crowded :

88 total nodes (including 0 instances); 0 LODNodes.
45 transforms; 45% of nodes have some render attribute.
36 Geoms, with 36 GeomVertexDatas and 3 GeomVertexFormats, appear on 36 GeomNodes.
35830 vertices, 35822 normals, 8 colors, 35822 texture coordinates.
GeomVertexData arrays occupy 1960K memory.
GeomPrimitive arrays occupy 397K memory.
28 GeomPrimitive arrays are redundant, wasting 336K.
67592 triangles:
  8 of these are on 4 tristrips (2 average tris per strip).
  67584 of these are independent triangles.
0 lines, 8 points.
12 textures, estimated minimum 12417K texture memory required.

rdb · November 19, 2020, 1:45pm

Interesting. I have heard someone else report a similar counter-intuitive behaviour, namely that they were getting bad performance in an empty scene, which improved as they started adding objects. I wonder if these reports are related.

My only vague attempt at a hypothesis for this is that Panda is perhaps doing too little on either the CPU or GPU, perhaps causing some bizarre timing issues in the driver due to never yielding a timeslice or queuing up too many draw calls or something like that.

Does slowing down the program in general cause the performance to become more stable, such as by putting client-sleep 0.001 in Config.prc (or time.sleep(0.001) in a task)? You could also try clock-mode limited and clock-frame-rate 30.

Since PStats can’t be used to shed light on this, it may be helpful to use a C++ profiler (my favourite on Linux is perf, or RotateRight Zoom for a GUI app). This requires compiling Panda with export CXXFLAGS=-fno-omit-frame-pointer set.

eldee · November 19, 2020, 4:46pm

I guess it’s a case of Heisenbug ! I ran my app under perf and weirdly enough, the performance issue disappeared, the frame rate even jumped above 100.

Also, I tried running the app while recompiling Panda3D, and again there was no performance drop until the end of the build.

client-sleep, clock-mode and clock-frame-rate do not have any impact on the performance.

I tried adding a while i < xxx: i+=1 loop in a task to increase the processing load, but it did not change the performance level

eldee · November 19, 2020, 5:21pm

Actually, I have another asynchronous thread that is used to load data, textures, and so on in the background. If I deactivate its usage, the performance problem disappear.

Inside one of its task there is a queue.get_nowait() which cause that task to run endlessly. If I replace it with a queue.get(timeout=0.001) the problem disappears also!

The task chain is created with the default values, so I assume there is no priority conflict:

        self.base.taskMgr.setupTaskChain(name,
                                         numThreads = 1,
                                         tickClock = False,
                                         threadPriority = None,
                                         frameBudget = -1,
                                         frameSync = False,
                                         timeslicePriority = None)

But I guess the constantly running thread was enough to confuse the system scheduler and probably caused the main thread to no longer run smoothly

rdb · November 19, 2020, 5:39pm

Interesting—any reason why you won’t use the blocking form of get() if it’s in a thread anyway? (Assuming that that call releases the GIL while it’s blocking.)

I assume you’re using Python 3, right?

eldee · November 19, 2020, 9:56pm

Yes, I’m using Python 3.

I tested again (I wrote that part of the code a long time ago), there are two reasons I was using nowait() :

One of the queue is used in a task of the main task chain, and so keeping it blocking freezes the whole main task chain (and so the app). (and as it’s part of the main task chain, the task is only run one time per frame).
The other queue is used in it’s own async task chain. However, keeping it blocking causes the application to never stop if the main taskchain crash. I guess I could use a atexit() handler and kill the taskchain, but the nowait() did the job

Having a small timeout is anyway much better in all cases !

Melan · August 24, 2022, 8:37am

Hi

Sorry to bring this old topic back.

I have a similar issue where Pstats improves and stabilizes the frame rate: this is in particular true when I use a interrogate module aiming at managing object hw instances.

Adding simple (non instanced) objects to the scene also seems stabilizing the frame rate.

In addition to what was mentionned in this thread, have you identified any other clue?

Edit (food for thought…):

After a serie of tests, I found that the primary culprit was the threads initiated from my interrogate classes. I originally thought that it would help to improve the calculations in CPU… I rewrote my classes to not rely on any thread and I now have a steady frame rate (~15ms) without PSTATS.
I indeed tried several solutions (e.g. sleeping the threads, slowing down the task from where I call the interrogate methods…) but they all conduct to make the calculation too slow.
However, even with no more thread from interrogate classes, activating PSTATS (with sync-video True, statement not true if sync-video is False) still has an effect by making the frame rate fluctuating (but with a frame rate going down more to 8.33ms, which is better to a certain extent)

Not sure this is a high priority question but could be interesting to understand.