How to Catch a Crash?

So, just earlier I encountered a disturbing occurrence: Loading a save in my current project–something that usually works reliably–was met with a crash, and a report of “exit code 139”.

(Which I believe indicates an invalid memory access.)
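
For reference, that reading matches the usual shell convention: an exit status above 128 generally means the process was killed by a signal, with the signal number being the status minus 128. Here 139 minus 128 is 11, which is SIGSEGV on Linux. A minimal sketch of that arithmetic follows; the helper name `describe_exit_code` is made up for illustration.

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style exit status above 128, else None."""
    if code <= 128:
        return None
    try:
        # signal.Signals is the standard enum of signal numbers on this platform.
        return signal.Signals(code - 128).name
    except ValueError:
        # Not a signal number known on this platform.
        return None


print(describe_exit_code(139))
```

Note that this applies to the status a shell or launcher reports; Python's own `subprocess` module reports a signal death as a negative return code instead.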

Subsequent loads have not reproduced the error.

Now, I’ve encountered something similar before, in a previous project, as mentioned in this post.

But there the crash seems to have been consistent–here it happened just once, seemingly out of nowhere.

Further, it looks like I deleted the test-case that I came up with in that thread, and I don’t remember it offhand, meaning that I can’t test whether it’s the same issue. (I tried to create a similar program based on what I wrote there, to no avail.)

It’s tempting, then, to ignore the crash: it has happened just once. But conversely, if it’s happened once, it may well happen again–and potentially while someone is playing it, or when a potential publisher is testing it…

So! Does anyone have any ideas as to how I might go about uncovering the issue…?

[edit]
*Bump* Any thoughts…?

This is a pretty extensive problem. I think you can study the logs to begin with, as indicated here.

Thank you for that!

Hmm… Based on the article, I’ve had a look at my “/var/log/syslog” file, and I found the following entry for what seems like it might be the right time and date:

Jun 24 11:01:51 ThaumUbuntu kernel: [ 7397.044452] python3.10[9255]: segfault at 40 ip 00007fcf73f81920 sp 00007ffca039df98 error 4 in libpanda.so.1.10[7fcf73e8d000+96b000]

I’m not clear on whether that provides much to go on: it indicates an engine-level crash–but I’d guessed as much, since on my end I’m using Python, which is unlikely to cause a memory-fault.

Unless the numbers included specify a location that might be usefully looked up in the code…?
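
Possibly, at least in part. In a kernel line of this form, `ip` is the instruction pointer at the time of the fault, and the bracketed pair is the start address and size of the mapping that it fell in, so subtracting the two gives an offset into libpanda. The sketch below does that subtraction for the entry above. The function name is hypothetical, and the regular expression assumes the x86-64 message format shown here. Turning the offset into a source line would still need debug symbols for that exact build, and the reported mapping may not begin at the start of the library file.

```python
import re

# Pattern for the x86-64 kernel "segfault at ..." message format shown above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

SYSLOG_LINE = (
    "Jun 24 11:01:51 ThaumUbuntu kernel: [ 7397.044452] python3.10[9255]: "
    "segfault at 40 ip 00007fcf73f81920 sp 00007ffca039df98 error 4 "
    "in libpanda.so.1.10[7fcf73e8d000+96b000]"
)


def parse_segfault(line):
    """Extract the library, fault address and ip-minus-base offset from a log line."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    return {
        "library": match["lib"],
        "fault_address": int(match["addr"], 16),
        "error_code": int(match["err"]),
        "offset": ip - base,  # offset of the faulting instruction within the mapping
    }


info = parse_segfault(SYSLOG_LINE)
print(info["library"], hex(info["offset"]))
```

For the entry above this gives an offset of 0xf4920. The fault address of 0x40 together with error code 4 (a user-mode read of an unmapped page) would be consistent with reading a field through a null pointer, though that is an inference rather than something the log states.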

You can look into Linux core dumps. I don’t remember if they’re usually enabled by default, how long they’re retained, or how to view them (maybe gdb?). Unfortunately, the lack of debug symbols for Panda may make it hard to get something useful out of the core dump.

Well, if I’m understanding this AskUbuntu thread correctly, then it looks like I don’t have core dumps enabled. :/

(And indeed, the core-dump folder given in that same thread is empty.)

(That is, running “ulimit -c” prints a value of “0”, apparently indicating that core dumps won’t be generated.)
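
The same limit can be inspected from Python via the standard `resource` module (Unix only), which may be handy if the game is launched from a script. This is only a sketch: the function name is made up, and raising the limit this way affects just the current process and anything it starts afterwards. Where the core file ends up is decided separately by the system (on Linux, by `/proc/sys/kernel/core_pattern`).

```python
import resource


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit and return the new pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit as far as the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


# "ulimit -c" reports the soft limit; 0 means no core files are written.
print("before:", resource.getrlimit(resource.RLIMIT_CORE))
print("after: ", raise_core_limit())
```

A value equal to `resource.RLIM_INFINITY` means that the size is unlimited.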

Hmm… Troublesome!

I suppose, then, that there’s little for it but to leave the matter until another crash pops up, at which point perhaps I can determine what specific set of circumstances is prompting it.

If I can do that, then I can potentially create a reproducible test case… Until such potential eventuality, I see little means of investigating further…

You could see about enabling core dumps for if you encounter the crash again.

I have been hesitant to do that, to be honest.

But I’ve done it now, I believe.

(I’ll just want to keep an eye on the associated folder every so often, against non-dev-related dumps filling it up unseen…)

Okay, I think that I managed to get something!

To be specific, the crash happened again–and this time I had core dumps enabled.

Accessing the dump and running it through gdb, I get the following:

Core was generated by `./Moons in Crystal'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f86db992060 in NodePath::find_common_ancestor(NodePath const&, NodePath const&, int&, int&, Thread*) ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
[Current thread is 1 (Thread 0x7f86ddfcf180 (LWP 10850))]

So it looks like the issue is occurring within the “find_common_ancestor” method of NodePath, at address 0x00007f86db992060.

Is there a way to associate that address with a line of code in the engine…?

(I’ll note that it doesn’t look like I’m calling the method myself–I can only guess that it’s something being done by the engine.)

I do note that this crash seems to happen when I re-enter a “world”. I’m wondering, then, whether it might be related to the fact that, when this happens, I start a thread by which to repopulate the world with enemies…

(If so, perhaps it might be better to have this be non-threaded and just accept the slight loading-hit.)

I just realised today that I forgot to query gdb for a backtrace. Doing so now reveals the following:

(gdb) bt
#0  0x00007f86db992060 in NodePath::find_common_ancestor(NodePath const&, NodePath const&, int&, int&, Thread*) ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#1  0x00007f86db9a25ca in NodePath::get_transform(NodePath const&, Thread*) const ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#2  0x00007f86dbb2fd97 in CollisionEntry::get_wrt_mat() const ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#3  0x00007f86dbb45a8a in CollisionPolygon::test_intersection_from_line(CollisionEntry const&) const ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#4  0x00007f86dbb2fe22 in CollisionLine::test_intersection(CollisionEntry const&) const ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#5  0x00007f86dbb30cb0 in ?? ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#6  0x00007f86dbb37db4 in CollisionTraverser::compare_collider_to_node(CollisionEntry&, GeometricBoundingVolume const*, GeometricBoundingVolume const*, GeometricBoundingVolume const*) ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#7  0x00007f86dbb4f680 in ?? ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#8  0x00007f86dbb4efa0 in ?? ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#9  0x00007f86dbb51753 in CollisionTraverser::traverse(NodePath const&) ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#10 0x00007f86dcc672d5 in ?? ()
   from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/panda3d.core.so
#11 0x00000000004de3cd in ?? ()
#12 0x000000000042468e in _PyEval_EvalFrameDefault ()
#13 0x000000000058ca94 in ?? ()
#14 0x000000000042468e in _PyEval_EvalFrameDefault ()
#15 0x000000000058ca94 in ?? ()
#16 0x000000000042468e in _PyEval_EvalFrameDefault ()
#17 0x000000000058ca94 in ?? ()
#18 0x000000000042468e in _PyEval_EvalFrameDefault ()
#19 0x000000000058ca94 in ?? ()
#20 0x000000000042468e in _PyEval_EvalFrameDefault ()
#21 0x000000000058ca94 in ?? ()
#22 0x00000000004267e0 in _PyEval_EvalFrameDefault ()
#23 0x000000000058ca94 in ?? ()
#24 0x00000000004267e0 in _PyEval_EvalFrameDefault ()
#25 0x000000000058ca94 in ?? ()
#26 0x000000000042468e in _PyEval_EvalFrameDefault ()
#27 0x000000000058ca94 in ?? ()
#28 0x00000000004d755b in ?? ()
#29 0x00007f86dcfea765 in ?? () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/panda3d.core.so
#30 0x00007f86dcff0417 in ?? () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/panda3d.core.so
#31 0x00007f86dcff0df9 in ?? () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/panda3d.core.so
#32 0x00007f86dbbc1749 in AsyncTask::unlock_and_do_task() () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#33 0x00007f86dbbcb1ca in AsyncTaskChain::service_one_task(AsyncTaskChain::AsyncTaskChainThread*) () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#34 0x00007f86dbbcb5df in AsyncTaskChain::do_poll() () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#35 0x00007f86dbbcbb58 in AsyncTaskManager::poll() () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/libpanda.so.1.10
#36 0x00007f86dccce7c3 in ?? () from /home/thaumaturge/Desktop/mew/Moons in Crystal-0.7.5_manylinux2014_x86_64/Moons in Crystal/panda3d.core.so
#37 0x00000000004df00b in ?? ()
#38 0x000000000042468e in _PyEval_EvalFrameDefault ()
#39 0x000000000058ca94 in ?? ()
#40 0x000000000042468e in _PyEval_EvalFrameDefault ()
#41 0x000000000058ca94 in ?? ()
#42 0x000000000042468e in _PyEval_EvalFrameDefault ()
#43 0x000000000058ca94 in ?? ()
#44 0x000000000042468e in _PyEval_EvalFrameDefault ()
#45 0x000000000058ca94 in ?? ()
#46 0x000000000058cc4b in PyEval_EvalCode ()
#47 0x000000000042bf16 in ?? ()
#48 0x000000000042d592 in PyImport_ImportFrozenModuleObject ()
#49 0x000000000042d7cc in PyImport_ImportFrozenModule ()
#50 0x0000000000427ed0 in Py_FrozenMain ()
#51 0x000000000041d92a in ?? ()
#52 0x00007f86ddff9d90 in __libc_start_call_main (main=main@entry=0x41d740, argc=argc@entry=1, argv=argv@entry=0x7ffe5d9f0918) at ../sysdeps/nptl/libc_start_call_main.h:58
#53 0x00007f86ddff9e40 in __libc_start_main_impl (main=0x41d740, argc=1, argv=0x7ffe5d9f0918, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7ffe5d9f0908) at ../csu/libc-start.c:392
#54 0x0000000000427ad5 in _start ()
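
A long backtrace like this can also be captured without gdb's interactive pager by running gdb in batch mode, which is convenient if dumps need to be processed more than once. The following is a rough sketch using gdb's `--batch` and `-ex` options; the function names are invented for illustration, and the two paths in the usage comment are placeholders rather than the real locations.

```python
import shutil
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a gdb command line that prints a backtrace and then exits."""
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def run_backtrace(executable, core_file):
    """Run gdb non-interactively and return whatever it printed to stdout."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb was not found on PATH")
    result = subprocess.run(
        gdb_backtrace_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,  # gdb's exit status is not treated as an error here
    )
    return result.stdout


# Hypothetical usage; both paths are placeholders:
# print(run_backtrace("./game_executable", "/path/to/core"))
```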

So, it looks like the crash is occurring due to some action of the traversal system.

Of interest is that it appears to involve the CollisionLine class–and I’m using that in only two places, both within the same class and for the same essential purpose.

(That being to detect which room an object or point lies within, which is achieved by colliding a vertical CollisionLine against the CollisionPolygons that represent the room’s floorplan.)

Now, that still may be in some way related to my repopulating the world with enemies.

That said, the actual traversals performed by the class that is presumably using these CollisionLines should not be threaded, and the collision-objects are managed by that class.

However, it does occur to me that the method that sets up the collision-objects may well be called from within a thread. (As it’s done as part of object-spawning.)

Could it be that, if the timing is just wrong, and a collision-object is added during traversal, that causes the crash? Maybe due to the object not yet being fully set up or attached, or some such…?

(Also, with the new information recently discovered I’ve tried again to create a game-state that reliably reproduces the crash–thus far to no avail.)

If it _is_ related to thread-prompted addition of a collision-object during traversal, can anyone think of a way to create a reliable test-program for that…?

(I mean, creating a program that adds collision-objects in a thread is easy enough, I daresay–it’s getting their addition to coincide with non-threaded traversal that I’m uncertain about…)

[edit]
Okay, I think that I’ve created a test-program that reliably segfaults in a manner that at least seems similar to what I’m observing in my main program.

The exact method in which the segfault occurs varies from run to run–but then I’ve only seen a stack trace for a single crash in my main program, and so don’t know whether it varies there, too. That said, I have seen it occur within “find_common_ancestor”–or, in this case, a method called by that method.

The test-program is as follows:

from panda3d.core import loadPrcFile, loadPrcFileData
loadPrcFileData("", "show-frame-rate-meter #t")
loadPrcFileData("", "frame-rate-meter-milliseconds #t")

from direct.showbase.ShowBase import ShowBase

from panda3d.core import CollisionNode, CollisionLine, CollisionTraverser, CollisionHandlerQueue, BitMask32
from direct.stdpy import threading

from panda3d import __version__ as pandaVersion
print (pandaVersion)

import sys
print (sys.version)

class TestThread(threading.Thread):
    def __init__(self, game):
        threading.Thread.__init__(self)
        self.game = game
    
    def run(self):
        print ("Running")
        while (self.game.runThread):
            if self.game.line is None:
                self.game.addLine()
            else:
                self.game.removeLine()


class Game(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        
        self.generalTraverser = CollisionTraverser()
        self.line = None
        self.queue = None
        
        self.taskMgr.add(self.update, "update")
        
        self.testThread = TestThread(self)
        self.testThread.start()
        
        self.accept("space", self.stopThread)
        self.runThread = True
        
    def stopThread(self):
        self.runThread = False
        
    def update(self, task):
        self.generalTraverser.traverse(render)
        
        if self.queue is not None:
            print (self.queue.getNumEntries())
        
        return task.cont
    
    def addLine(self):
        ray = CollisionLine(0, 0, 0, 0, 0, -1)
        rayNode = CollisionNode("room detector")
        rayNode.addSolid(ray)
        rayNode.setIntoCollideMask(0)
        mask = BitMask32(1)
        rayNode.setFromCollideMask(mask)
        rayNP = render.attachNewNode(rayNode)

        queue = CollisionHandlerQueue()

        self.generalTraverser.addCollider(rayNP, queue)
        
        self.line = rayNP
        self.queue = queue
        
        #print ("Adding line")
        
    def removeLine(self):
        if self.line is not None:
            self.line.removeNode()
            self.generalTraverser.removeCollider(self.line)
            self.line = None
        if self.queue is not None:
            self.queue = None
            
        #print ("Removing line")

app = Game()
app.run()

I get this error by running your code.

1.10.15.dev25
3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
Known pipe types:
  wglGraphicsPipe
(all display modules loaded.)
Running
0
0
0
0
0
0
0
0
0
0
Assertion failed: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx
Assertion failed: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx
Assertion failed: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx
Assertion failed: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx
:thread(error): Exception occurred within PythonThread Thread-1
Traceback (most recent call last):
  File "D:\Panda3D-1.10.15-x64\direct\stdpy\threading.py", line 113, in call_run
    self.run()
  File "main.py", line 25, in run
    self.game.addLine()
  File "main.py", line 68, in addLine
    self.generalTraverser.addCollider(rayNP, queue)
AssertionError: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx

Intriguing! I don’t get that error at all!

Hmm… I see that you’re running 1.10.15–a development build. I’m still on 1.10.14–a stable build. Perhaps this reflects a guard implemented between the two versions?

[edit]
Although it’s odd that it’s finding one or the other to be invalid while adding a NodePath. Could it be that the call to “addCollider” (or some part thereof) is done outside the thread, meaning that the thread continues in the meanwhile, and so removes the NodePath before “addCollider” is done…?

Also, I note that the assertion in question is, once again, in “find_common_ancestor”.

I’m definitely thinking that I might want to rework my enemy-repopulation code to be non-threaded…

[edit 2]
Hmm… I’m not sure which was the last commit for 1.10.14, but picking a plausible-enough date of November last year, I don’t see a change to a likely-looking file that would account for such a difference…

Perhaps we just had different timing on our machines…?

If I remember correctly, this build was provided to me by rdb in order to test a commit that fixed the getter for the current window size.

1.10.14
3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
Known pipe types:
  wglGraphicsPipe
(all display modules loaded.)
Running
0
Assertion failed: !is_empty() at line 228 of c:\buildslave\sdk-windows-amd64\build\built1.10\include\nodePath.I
:thread(error): Exception occurred within PythonThread Thread-1
Traceback (most recent call last):
  File "D:\Panda3D-1.10.14-x64\direct\stdpy\threading.py", line 113, in call_run
    self.run()
  File "main.py", line 25, in run
    self.game.addLine()
  File "main.py", line 70, in addLine
    self.generalTraverser.addCollider(np, queue)
AssertionError: !is_empty() at line 228 of c:\buildslave\sdk-windows-amd64\build\built1.10\include\nodePath.I

1.10.14
3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
Known pipe types:
  wglGraphicsPipe
(all display modules loaded.)
Running
0
Assertion failed: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx
:thread(error): Exception occurred within PythonThread Thread-1
Traceback (most recent call last):
  File "D:\Panda3D-1.10.14-x64\direct\stdpy\threading.py", line 113, in call_run
    self.run()
  File "main.py", line 25, in run
    self.game.addLine()
  File "main.py", line 70, in addLine
    self.generalTraverser.addCollider(np, queue)
AssertionError: !a.is_empty() && !b.is_empty() at line 5940 of panda/src/pgraph/nodePath.cxx

Of course, I have the stable version as well, and there the error sometimes changes from run to run. The fact that I see an error at all, rather than a crash, may be related to the threads being scheduled in a different order on Windows. That is, if an error occurs in one thread, the other thread is apparently not blocked. I am not able to explain it more precisely than that.

Hmm… I’m still intrigued that you’re seeing different errors to me–and that it seems so much more stable for you than for me!

(I mean, it’s still crashing for you, but that’s a bit more stable than an outright core dump!)

Maybe it’s a platform difference? I am on Ubuntu Linux, and you seem to be on Windows–maybe there’s a difference in the code path, or the way that the engine is compiled on each, or the way that threading is handled between them…

Anyway, I’ve decided now that my best course–for the moment, at least–seems to be to drop the threading from this feature.

I’ve done that, and it seems to work–granted that there’s now a minor hitch that was previously covered by threading.
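
One possible middle ground, offered only as a sketch and not tested against Panda3D itself: keep the slow preparation in a thread, but have the thread merely queue up requests, and let the main thread apply a few of them per frame before it runs the traversal. That way the scene graph and the traverser are only ever modified from the main thread. The example below is plain Python so that it runs on its own; `SpawnQueue`, `request` and `apply_some` are hypothetical names, and the lambda stands in for whatever attaches nodes and adds colliders.

```python
import queue
import threading


class SpawnQueue:
    """Collects spawn requests from any thread; applies them on the main thread only."""

    def __init__(self):
        self._pending = queue.Queue()  # queue.Queue is safe to share between threads

    def request(self, spawn_fn):
        """Called from a worker thread: record the work without touching the scene."""
        self._pending.put(spawn_fn)

    def apply_some(self, max_per_frame=5):
        """Called from the main thread, e.g. at the top of the update task."""
        applied = 0
        while applied < max_per_frame:
            try:
                spawn_fn = self._pending.get_nowait()
            except queue.Empty:
                break
            spawn_fn()  # the real scene-graph changes would happen here
            applied += 1
        return applied


# Stand-in demonstration: a worker queues twelve "enemies", the main loop applies them.
spawned = []
spawns = SpawnQueue()
worker = threading.Thread(
    target=lambda: [spawns.request(lambda i=i: spawned.append(i)) for i in range(12)]
)
worker.start()
worker.join()

frames = 0
while spawns.apply_some(max_per_frame=5):
    frames += 1
print(frames, len(spawned))
```

Capping the number applied per frame spreads the cost over several frames, which might soften the hitch, at the price of enemies appearing a few frames later.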

Of course, we won’t know whether it’s actually solved the problem until either the problem never occurs again–or it does…

As an experiment, I added a print statement for the rayNP node. The error is different…

render/room detector
render/room detector
Assertion failed: !update_bounds || _cdata->_last_bounds_update == _cdata->_next_update at line 4164 of c:\buildslave\sdk-windows-amd64\build\panda\src\pgraph\pandaNode.cxx
:thread(error): Assertion failed: _cdata->_last_update == _cdata->_next_update at line 1271 of c:\buildslave\sdk-windows-amd64\build\built1.10\include\pandaNode.I
Exception occurred within PythonThread Thread-1
Traceback (most recent call last):
  File "D:\Panda3D-1.10.14-x64\direct\stdpy\threading.py", line 113, in call_run
    self.run()
  File "main.py", line 25, in run
    self.game.addLine()
  File "main.py", line 61, in addLine
    rayNode.setIntoCollideMask(0)
AssertionError: !update_bounds || _cdata->_last_bounds_update == _cdata->_next_update at line 4164 of c:\buildslave\sdk-windows-amd64\build\panda\src\pgraph\pandaNode.cxx
Traceback (most recent call last):
  File "D:\Panda3D-1.10.14-x64\direct\showbase\ShowBase.py", line 2158, in __igLoop
    self.graphicsEngine.renderFrame()
AssertionError: _cdata->_last_update == _cdata->_next_update at line 1271 of c:\buildslave\sdk-windows-amd64\build\built1.10\include\pandaNode.I
:task(error): Exception occurred in PythonTask igLoop
Traceback (most recent call last):
  File "main.py", line 88, in <module>
    app.run()
  File "D:\Panda3D-1.10.14-x64\direct\showbase\ShowBase.py", line 3331, in run
    self.taskMgr.run()
  File "D:\Panda3D-1.10.14-x64\direct\task\Task.py", line 553, in run
    self.step()
  File "D:\Panda3D-1.10.14-x64\direct\task\Task.py", line 504, in step
    self.mgr.poll()
  File "D:\Panda3D-1.10.14-x64\direct\showbase\ShowBase.py", line 2158, in __igLoop
    self.graphicsEngine.renderFrame()
AssertionError: _cdata->_last_update == _cdata->_next_update at line 1271 of c:\buildslave\sdk-windows-amd64\build\built1.10\include\pandaNode.I

Well, I’ve had the exact location of the segfault (as it is on my end) vary from run-to-run.

The fact that such an addition as you made likewise varies the location of the issue does seem to support the idea that it’s a threading issue: the location at which it causes problems depends on just where the main program is when the thread does something bad, I’d guess.