Running Panda3D on multiple cores

Hi everyone,

I’d like to know your suggestions on which approach would best fit my performance problem, since I’m a newbie to multi-threading and multi-processing. :slight_smile:

My last machine (an Intel i5 with 2 cores and an integrated graphics card) took several seconds to perform a single time step (taskMgr.step()), even after several optimizations. I had used all my tricks, but performance was still far from ideal. So last month, in order to speed things up, I bought a “monster” machine to run my AI simulations, with the goal of keeping the simulation running around the clock (24/7). Things got better with the new machine, but not better enough. I suspect the main reason is that my program uses only a single processor/thread instead of the many available cores, because of the GIL among other things.

Basically, my program gets information from cameras, Bullet collisions, etc., to feed a neural network with senses of vision, touch, and so on. Each sense instance is represented by a node whose class handles the sense-specific work. Example:

class Eye:
    def __init__(self, args):
        # Configure eye parameters like name, relative position, etc.
        # Create a camera spot in the scene
        pass

    def process(self):
        # Get the camera spot's rendered image, representing the image reaching the retina
        pass


class Skin:
    def __init__(self, args):
        # Configure skin parameters like name, relative position, etc.
        # Create feelers (neurons that detect the depth of a touch using Bullet ray collisions)
        pass

    def process(self):
        # Determine which feelers were touched by an object in the scene. A skin has several
        # feelers, so rayTestAll(origin, tip) must be called for each feeler to detect the
        # touch depth -- this is slow and there is no alternative!
        pass


eye1 = Eye(...)
eye2 = Eye(...)
skin1 = Skin(...)
skin2 = Skin(...)
sense_nodes = [eye1, eye2, skin1, skin2]

class Simulation(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        loadPrcFileData("", "threading-model Cull/Draw")  # I feel no performance difference with this option. Is it enabled by default?

        self.physics_manager = BulletWorld()
        self.start = time.time()
        self.step = 1
        self.taskMgr.add(self.update, "update")

    def update(self, task):
        for node in sense_nodes:
            node.process()

        # My attempt using Python multiprocessing. It didn't work. :-(
        # processes = []
        # for node in sense_nodes:
        #     p = multiprocessing.Process(target=node.process)
        #     processes.append(p)
        #     p.start()
        # for process in processes:
        #     process.join()

        self.saveState(self.step)  # Save the objects' states to disk
        if self.step == 50:
            print(time.time() - self.start)  # Print the time taken to process 50 time steps
        self.step += 1

        return Task.cont

The new machine’s configuration:

    - 2 Xeon E5-2678 v3 chips (2.30 GHz, each with 12 cores and 24 threads)
    - 1 Radeon RX 570 graphics card with 8 GB
    - 32 GB RAM (server memory)
    - 1 SSD storage unit

I get 48 logical processors when I run:

import multiprocessing
print(multiprocessing.cpu_count())
So the question is: which approach (task chains, threading2, Python multiprocessing, etc.) should I use to get the most out of the 48 logical processors? Note that the data of each node doesn’t depend on any other node’s data, i.e. they don’t need to be synchronized, but each node needs to finish its processing before the default task manager iterates again (taskMgr.step()).

Any of the threading options will be limited by the GIL unless you call C code (the GIL gets released upon entering C code). According to all the Python documentation I’ve read over the years, you need to use either separate processes or C code in order to take full advantage of multi-core CPUs.
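To illustrate that last point: the `multiprocessing` pattern does scale across cores, but only when the work and its arguments are picklable and self-contained — which Panda3D scene objects generally are not, and that is likely why the attempt in the original post failed. A minimal sketch with a stand-in CPU-bound function (`heavy` is a hypothetical placeholder, not part of the project):

```python
import multiprocessing

def heavy(n):
    # Hypothetical stand-in for one sense node's CPU-bound work
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Each call to heavy() runs in a separate worker process,
    # so the main interpreter's GIL is not a bottleneck
    with multiprocessing.Pool() as pool:
        results = pool.map(heavy, [100_000] * 4)
    print(results)
```

The catch for this project is that the worker processes don’t share the scene graph, so each sense would have to receive plain, picklable data (e.g. arrays) rather than live Panda3D nodes.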

So… what is the purpose of task chains, then?

You’re setting threading-model too late. It needs to be set before initializing ShowBase.
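Concretely (a minimal sketch, assuming a standard setup), the variable can go in your Config.prc file:

```
threading-model Cull/Draw
```

or be set with `loadPrcFileData("", "threading-model Cull/Draw")` at module level, before `ShowBase.__init__` runs, so that the graphics pipeline is created with the threaded model.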

Setting threading-model will tell Panda to use two additional cores for parts of the rendering process. If you want to utilize more cores, you need to use some combination of asynchronous operations (explained in the manual), threading (though subject to the limitations of the GIL unless most of their time is spent making I/O calls that release the GIL), custom C++ extensions (to avoid the GIL entirely, at expense of needing to use synchronization primitives yourself), or multiprocessing (which avoids the GIL too, but doesn’t share memory with the main process).

Task chains are a convenient high-level interface around threading. Unless you implement your tasks in C++, they don’t inherently avoid the problems of the GIL.
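For reference, a task chain is set up with `taskMgr.setupTaskChain` and tasks are assigned to it by name. A minimal sketch (the chain name and thread count here are arbitrary examples):

```python
from direct.showbase.ShowBase import ShowBase
from direct.task import Task

class Simulation(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        # Create a chain with its own pool of threads; frameSync=True makes
        # its tasks finish within the frame, matching the per-step requirement
        self.taskMgr.setupTaskChain("senses", numThreads=4, frameSync=True)
        self.taskMgr.add(self.sense_task, "sense", taskChain="senses")

    def sense_task(self, task):
        # Pure-Python work here still contends for the GIL with other threads
        return Task.cont
```

This gives concurrency (useful if the tasks spend time in C/C++ calls that release the GIL, such as rendering or I/O), but not true parallelism for pure-Python workloads.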


For those who are interested: I was finally able to use parallelism in my project when I ported it to Cython and used the prange function, which can release the GIL to run the same loop on multiple cores. The whole process takes some work, but it has been painless so far, because I’m porting only what was the bottleneck (the rest of the code remains in pure Python). In addition to parallelism, I also got performance gains from C/C++ typed variables thanks to Cython. I recommend Cython to everyone who wants performance gains in their projects!
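For anyone curious what that looks like, here is a hedged sketch of the prange pattern (the module name, function, and per-feeler work are made-up placeholders; it must be compiled with Cython and OpenMP enabled):

```cython
# feelers.pyx -- hypothetical sketch, compiled via cythonize with OpenMP
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def process_feelers(double[:] depths):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = depths.shape[0]
    # nogil=True releases the GIL, so iterations run on multiple cores
    for i in prange(n, nogil=True):
        depths[i] = depths[i] * 0.5  # placeholder for the per-feeler computation
```

The key constraint is that everything inside the prange body must be GIL-free: typed C variables and memoryviews only, no Python objects.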