Reparenting constructed GeomNode to render causes ShowBase to freeze

logos · May 1, 2021, 4:44am

In a simple ShowBase app, I wrote and tested methods to construct a Geom, add it to a new GeomNode referenced by a new NodePath, and reparent the NodePath to render. I see exactly what I expect when I dig through the Geom with getGeom(i), getGeomState(i), getVertexData(), getPrimitive(i), and GeomVertexReaders. In interactive testing, every time I click the mouse, it correctly discards the previous GeomNode (if any), calls my Geom construction methods with some random coordinates, and reparents the resulting GeomNode to render. The constructed geometry displays fine.

When I copied the methods to a large ShowBase app, the Geom construction works as expected, but when I reparent the first resulting NodePath to render, the app locks up. If I comment out the nodePath.reparentTo(render), the app continues to display its virtual reality, including constructing additional Geoms as required (as long as it doesn’t try to display them).

What’s significantly different about the large app?

It includes an additional task chain (with 2 threads) in which a handful of tasks use the requests library to read data from APIs for the app to present in the environment. The requests library uses urllib3, which uses Python’s standard threading module. This aspect has been working fine and continues to work fine, as long as I don’t reparent the constructed GeomNode to render. I presume that, although direct.stdpy.threading is preferred, the standard Python threading should work reliably with Panda3D.
It typically includes several GB of loaded textures and GeomNodes. However, when running with a simple environment of less than 100 MB of these, the freeze still happens at the first reparentTo(render).
It uses popular libraries such as argparse, csv, datetime, json, math, os, pytz, re, requests, and time.
One class has a class-level threading.Lock and an instance-level threading.Lock to coordinate critical sections between asynchronous tasks in the chains. However, the locks are not used in executing the new methods.
The test program has only the main task chain. In the large app I generate the Geom in tasks scheduled in the additional task chain. I tried having this task leave the result GeomNode for a task in the main task chain to do the reparentTo, but this produces the same result – the first reparentTo starts the freeze.
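The handoff described in the last item (build on the additional task chain, attach on the main one) can be sketched with a thread-safe queue. This is a minimal illustration only, not the code from the app: build_item and attach are hypothetical stand-ins for the Geom construction methods and for nodePath.reparentTo(render). Note that, per the report above, this arrangement did not avoid the freeze.

```python
import queue
import threading


def worker(build_item, pending, count):
    """Runs off the main thread: builds items and queues them for later."""
    for i in range(count):
        # build_item is a hypothetical stand-in for the Geom construction.
        pending.put(build_item(i))


def drain(pending, attach):
    """Runs on the main thread (e.g. once per frame) and never blocks.

    attach is a hypothetical stand-in for nodePath.reparentTo(render).
    Returns the number of items attached during this call.
    """
    attached = 0
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            return attached
        attach(item)
        attached += 1
```

In a Panda3D app, drain would presumably be called from a task added to the default task chain, so that only the main thread ever touches the scene graph.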

In the freeze:

The app’s GPU usage drops from 10% to zero, and CPU usage drifts down over several seconds from 3-4% to almost 0%.
The executing task on the additional chain continues to run (issuing GETs to the APIs). On completion, it is never rescheduled.
The Panda display window becomes non-responsive, e.g. I can’t close it directly (but can from PyCharm).

Windows 10, PyCharm 2021.1, Python 3.8, Panda3d 1.10.8

Anyone seen this before and have suggestions as to where to look?

Simulan · May 1, 2021, 6:37pm

There seem to be a lot of moving parts, so I would suggest something simple. If you can, try disabling threading on your requests library to see if that makes a difference.

What are your load_prc_file_data config options?

logos · May 1, 2021, 11:24pm

my config settings are:
loadPrcFileData('my_app', 'clock-mode limited')
loadPrcFileData('my_app', 'clock-frame-rate 30.0')
loadPrcFileData('wirelessMap', 'loader-num-threads 2')

It makes sense to look at task scheduling, threading, events, etc., because the symptoms point to some type of deadlock. So I too started to investigate whether the use of threading by the requests library and its dependencies interacts badly with Panda3D. So far, I’ve modified 4 uses of threading in requests proper; verified that 27 of its imported external libraries are threading-free; identified 7 of its imported external libraries that use threading; and have a handful yet to investigate. If I continue on this hypothesis that threading is the culprit and direct.stdpy.threading is the solution, I’d need to edit and rebuild a fairly wide and deep ecosystem of libraries. That looks like a lot of work, many opportunities to make mistakes in unfamiliar asynchronous code, and a commitment to maintaining private forks. My understanding is that Panda3D is compatible with Python threading, so a priori the work seems unlikely to solve the problem.

I’ve used the requests library for years in unrelated production code that continues to be rock-solid through lots of external chaos. I’ve been using it successfully in Panda3D for about 6 months, with as many as 10 concurrent asynchronous streams of HTTPS GETs running for hours on deployments to Windows, AWS Windows, and Linux.

rdb · May 10, 2021, 9:06am

Are you setting threading-model with Panda3D? It may be a deadlock in Panda. We have seen these kinds of issues before, but I thought them fixed in the latest stable version. You can try upgrading to 1.10.9 to see if that makes a difference.

These are usually fairly easy to find, but doing so does require someone to poke around in your application using a C++ debugger to get a stack trace of the two deadlocked threads.

logos · May 10, 2021, 5:00pm

I tried threading-model cull/draw at one point a few months ago, but found it unnecessary for the performance that I wanted. So, no, I’m not using threading-model.

I’ve been testing with 1 building in my scene and a request to my task on the 2nd task chain to construct an additional Geom for the building. It executes 4 GETs, constructs a Geom, then reparents NodePath(Geom) to render.
I was getting the deadlock when running 1.10.8. This morning I updated to 1.10.9 and still reliably get a deadlock before Panda3D displays the created Geom.

Accidentally, I tested in a scene with 7 buildings, thus 7 Geoms to construct. For each, the task on the 2nd task chain issues a handful of GETs, constructs the Geom, reparents it to render, then repeats for the next building. In my log I can see that it completed reparenting all 7 over a period of about 5-10 seconds. Interestingly, each time I run this case, Panda3D becomes unresponsive to the mouse for about 5 seconds, then manages to display the first 5 of the 7 Geoms before deadlocking.

rdb · May 11, 2021, 9:55am

To investigate this further, I need to know where in Panda this problem occurs. I think that either requires me to have the code so I can reproduce the deadlock on my computer, or you need to use a debugger to get me a stack trace of the deadlocked threads.

logos · May 12, 2021, 4:28pm

I’d need to get permission to release the code, and I’d need to set up an account for you with permission to access the data that the app GETs. It’s probably better to run the debugger with the code on my workstation. I use the Python debugger, but I’m guessing that you need a lower-level one, e.g. a C++ debugger. Please point me to whatever tool(s) you need so that I can prepare them and either trigger the stack trace or be your hands in triggering the stack trace.
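As a possible first step before attaching a native debugger, the standard library’s faulthandler module can write the Python-level stack of every thread when the main loop stops making progress. The sketch below is an assumption about how that could be wired up, not something from this thread: arm_watchdog and disarm_watchdog are hypothetical helper names, and the output shows Python frames only, not the C++ frames inside Panda3D that a native debugger would reveal.

```python
import faulthandler


def arm_watchdog(log_file, timeout=5.0):
    """Start (or restart) a countdown of `timeout` seconds.

    If the countdown expires before this is called again, faulthandler
    writes the Python traceback of all threads to log_file. Calling it
    again replaces the previous countdown. log_file must be an open file
    with a real file descriptor and must stay open while armed.
    """
    faulthandler.dump_traceback_later(timeout, repeat=False, file=log_file)


def disarm_watchdog():
    """Cancel the pending dump, e.g. on a clean shutdown."""
    faulthandler.cancel_dump_traceback_later()
```

The idea would be to call arm_watchdog once per frame from a task on the main chain. While frames keep rendering, the countdown is reset each time; once the loop freezes, the dump shows which Python call each thread was in when progress stopped.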

logos · May 12, 2021, 9:12pm

I received permission to share the code, and I can arrange for an account for you to access the data while replicating and debugging. So, if you prefer, I can send you a distribution.