Threading -- How to squeeze more power out of Panda3D

This is a small primer on how to squeeze more CPU cycles out of a Panda3D program
that is being driven from Python. There is a lot of information scattered across
the forum on how to set up threading and task chains, but there is very little
information on the idiosyncrasies of Python threading, which are important to know
if you really want to squeeze more CPU cycles out of Panda3D.

First, let's talk about OS threading in C/C++, which is the most basic level of threading.

OS threads allow you to run two blocks of C/C++ code in parallel on multiple CPU cores.
If your code is entirely C/C++, you can generally schedule the workload however you like.
This includes running two instances of the same block of C++ code on separate CPU cores.

<All C code, 2 threads>

         Core 1              Core 2             Time
    Thread1 -- C Code    Thread2 -- C Code       |
    Thread1 -- C Code    Thread2 -- C Code       |
    Thread1 -- C Code    Thread2 -- C Code       V
Python is essentially a C program. However, it's a special C program that can
only be instanced once, no matter how many threads are used to run it. For example,
if I have two C/C++ threads trying to run Python code, whenever either thread is accessing
the Python interpreter, the other thread may not access it. This is the infamous
Python GIL (Global Interpreter Lock), and the reason why Python is usually
thought of as a single-threaded language.

Thus, if your entire codebase is all Python, there is no advantage to running
multiple threads. In fact, your code can be up to ~2x as slow because of thread-contention
issues.

<All Python code, 2 threads>

                                                Time
          Core 1         Core 2                   |
    Thread1 - Python                              |
                      Thread2 Python              |
    Thread1 - Python                              |
                      Thread2 Python              V

*The cost of Thread 2 wrestling the Python interpreter away from Thread 1
is extremely high, because Python's thread scheduling is poorly designed.
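
To make this concrete, here is a minimal timing sketch (plain CPython, nothing
Panda3D-specific) you can run yourself. The numbers are illustrative; the point is
that two threads of pure-Python work are no faster than one, and often slightly slower:

import threading
import time

def busy_loop(n):
    # Pure-Python CPU work; the GIL is held the whole time, except at
    # the interpreter's periodic switch points.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

# Single-threaded baseline: the two calls run back to back.
start = time.perf_counter()
busy_loop(N)
busy_loop(N)
print(f'1 thread : {time.perf_counter() - start:.2f}s')

# Two threads, same total work: the GIL serializes them, and the
# contention overhead can make this slower than the baseline.
start = time.perf_counter()
t1 = threading.Thread(target=busy_loop, args=(N,))
t2 = threading.Thread(target=busy_loop, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print(f'2 threads: {time.perf_counter() - start:.2f}s')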

However, there is a special circumstance where threading under Python is beneficial:
when there is a lot of C/C++ code interlaced with Python code. When a thread
running Python code hits a Python function which calls into a C/C++ block, that thread has the
option of relinquishing control of the Python interpreter (giving up the GIL)
while it's running the C/C++ code, allowing another thread to run some Python code.
This effectively allows the Python interpreter to be run 'twice' as much as you would
expect. Note that not all C/C++ code is written to give up the Python GIL. It must
be explicitly programmed to do so. (The function call to do this is trivial, though.)

          Core 1            Core 2
    Thread1 - Python                            Time
              C/C++      Thread2 Python           |
    Thread1   Python                              |
              C/C++      Thread2 Python           V


In this last scenario, Core 1 is always fully occupied, while Core 2 can
maintain ~50% occupancy.
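
The same toy experiment shows this overlap if the per-thread work is done by a C
call that drops the GIL. Here hashlib stands in for any GIL-releasing C/C++
extension (CPython's hashlib releases the GIL when hashing large buffers); with
such work, two threads genuinely run in parallel:

import hashlib
import threading
import time

DATA = b'x' * (64 * 1024 * 1024)  # 64 MB buffer; hashing it runs in C

def hash_work():
    # sha256 over a large buffer executes in C with the GIL released,
    # so another thread is free to run in the meantime.
    hashlib.sha256(DATA).hexdigest()

start = time.perf_counter()
hash_work()
hash_work()
print(f'1 thread : {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
threads = [threading.Thread(target=hash_work) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(f'2 threads: {time.perf_counter() - start:.2f}s')  # roughly half the single-threaded time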
In Panda, the actual rendering is all done in C/C++. Usually, you use Python
to set up the scene, and then (behind the scenes) call graphicsEngine.renderFrame(),
a C++ function, to do the actual rendering. If you have a lot of shaders and models in
your scene, your FPS may be very low, and most of the time is spent in C++ uploading
data to the GPU or waiting for the GPU to render. This time is a great opportunity
to run some game/AI code in a second Python thread (using threading, or a task chain with
numThreads > 0; a sketch of the task-chain setup follows the frame diagrams below).

        Core 1                                Core 2
   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)      Thread2 Python ( AI/Code )  
--Frame 0 End------------------------------------------ 17ms/Frame
   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)      Thread2 Python ( AI/Code )
--Frame 1 End---------------------------------------------------------------

If you don't use a second thread to run your AI/code, and instead use a normal task,
then the time distribution would look like:

   Thread1  - Python 1ms (setup scene)
              C/C++ 16ms (renderFrame)      
              Python 16ms ( AI/Code) 
--Frame 0 End------------------------------------------ 32ms/Frame
Thus, using a second thread could effectively double your game speed, even if you
write all of your game code in Python. The situation depicted above is ideal; in my
own case, the gain is closer to 50%, i.e., no threading uses ~25% CPU while a secondary thread
boosts CPU utilization to ~40%.
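
Here is a hedged sketch of that setup using Panda3D's task chains (the chain and
task names are made up for illustration, and a Panda3D build with true threading
support is assumed):

from direct.showbase.ShowBase import ShowBase
from direct.task import Task

class Game(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        # A task chain backed by one dedicated OS thread.
        self.taskMgr.setupTaskChain('aiChain', numThreads=1)
        # Ordinary task: runs on the main thread every frame.
        self.taskMgr.add(self.sceneTask, 'sceneTask')
        # AI task: runs on the aiChain thread, overlapping renderFrame().
        self.taskMgr.add(self.aiTask, 'aiTask', taskChain='aiChain')

    def sceneTask(self, task):
        # The ~1 ms of Python scene setup per frame goes here.
        return Task.cont

    def aiTask(self, task):
        # Game/AI logic in Python; overlaps the C++ render on the main thread.
        return Task.cont

Game().run()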

If you take this dual-threading path, there are some additional caveats to worry about.
As I hinted earlier, having two threads fight for control of the Python interpreter is very
costly, and you may end up with a situation like this:

   Thread1  - Python 0.1ms (setup scene)
                                          Thread2 Python ( AI/Code )
              Python 0.1ms (setup scene)
                                          Thread2 Python ( AI/Code )
                      ....
          Each switch of which thread controls the Python interpreter may cost 0.1 ms
                      ....
               C/C++ 16ms (renderFrame) 

--Frame 0 End------------------------------------------ 27ms/Frame
Ideally, we would schedule Thread 2's Python code to run only after Thread 1's Python
code is completely finished. We can influence Python's thread scheduling in at least two ways:

a) when a thread hits a C/C++ block that gives up the GIL, i.e., during rendering
b) after a set number of Python opcodes (related to the number of statements), as set by
sys.setcheckinterval(5000)

The default setting of setcheckinterval is, I believe, 100, which leads to a lot of ping-ponging
between different threads trying to acquire the interpreter, and a lot of time lost in the process.
Setting setcheckinterval to a very high number, e.g., 50000, essentially tells the interpreter:
“do not try to give up control of Python until you hit a C/C++ block which gives up the GIL”
(see the snippet below).
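
For reference, as a snippet (note that sys.setcheckinterval() is Python 2; the new
GIL in Python 3.2+ switched to a time-based knob, sys.setswitchinterval(), so the
opcode-count tuning above only applies to old interpreters):

import sys

if sys.version_info[0] == 2:
    sys.setcheckinterval(5000)    # switch threads only every 5000 bytecodes
else:
    sys.setswitchinterval(0.05)   # Python 3: time-based; 10x the 5 ms default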

If you naively set this too high, however, the AI/code thread may not relinquish
control back to the rendering thread fast enough, and thus slow down the rendering.
Alternatively, one could sprinkle some dummy C/C++ blocks that give up the GIL into the AI code
to judiciously force a switch, as sketched below.
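
A cheap, pure-Python way to approximate such a dummy block: time.sleep() is a C
call that releases the GIL, and sleeping for zero seconds still releases and
reacquires it, so it acts as a voluntary yield point you control (Entity and its
think() method are made up for illustration):

import time

class Entity:
    def think(self):
        # Stand-in for some pure-Python AI work that holds the GIL.
        sum(i * i for i in range(1000))

def ai_update(entities):
    for i, entity in enumerate(entities):
        entity.think()
        if i % 100 == 0:
            time.sleep(0)  # C call that releases/reacquires the GIL: a chosen switch point

ai_update([Entity() for _ in range(1000)])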

In all likelihood, however, your AI code will also rely on some C/C++ for things
like pathfinding, etc., where opportunities naturally exist for the GIL to be given up.

How many threads should one use? It depends on how many elements of your code give up
the GIL. You should have as many threads as code modules that give up the GIL, plus 1. In the
example above, you should only use two threads, as only the renderer gives up the GIL.
If you have a separate module which perhaps does some facial recognition and image processing
in C, that would be a candidate to be put into its own thread. Physics should also run in its own
thread. (Does the current Bullet/ODE implementation release the GIL? I'll have to check the source
code!) A sketch of this layout follows.
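
As a hedged sketch of the "one thread per GIL-releasing module, plus one" rule (the
chain and task names are hypothetical, and the comments mark where the C/C++-backed
work would go):

from direct.showbase.ShowBase import ShowBase
from direct.task import Task

class MultiChainApp(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        # One dedicated thread per module that can give up the GIL...
        self.taskMgr.setupTaskChain('physicsChain', numThreads=1)
        self.taskMgr.setupTaskChain('visionChain', numThreads=1)
        self.taskMgr.add(self.physicsTask, 'physics', taskChain='physicsChain')
        self.taskMgr.add(self.visionTask, 'vision', taskChain='visionChain')
        # ...plus the main thread (+1), which runs scene setup and renderFrame().

    def physicsTask(self, task):
        # The C++ physics step would go here -- only worth its own thread if
        # the binding actually releases the GIL, which needs checking as noted above.
        return Task.cont

    def visionTask(self, task):
        # Hypothetical C-backed facial recognition / image processing.
        return Task.cont

MultiChainApp().run()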

Well, anyway, those are all of my thoughts on this subject. In the end, Python is a great
language for debugging, prototyping, meta-programming, etc. Hopefully, this overview of threading
will be useful to people who want to squeeze more CPU cycles out of their Python code.

Very interesting article. I hope you add some practical approaches, i.e., code snippets, or possibly even some stats.

Keep up the good work!

A little hint: You would have to check the generated code. All of the Python/C++ interaction code is generated during the build process when invoking interrogate & interrogate_module.

I felt like checking whether or not this is still the case. It seems to be so. Here is some code for the threaded variant of my check:

import random
from math import sqrt
from direct.stdpy import threading
from timeit import default_timer as timer
from direct.showbase.ShowBase import ShowBase
from multiprocessing import cpu_count


class MyApp(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        start = timer()

        np = cpu_count()
        print(f'You have {np} cores')

        n = 100_000_000

        # Split the samples evenly, one chunk per core.
        part_count = [n // np for i in range(np)]

        threads = []
        for i in range(np):
            threads.append(threading.Thread(target=pi_part, args=(part_count[i],)))
            threads[-1].start()

        for i in range(np):
            threads[i].join()

        end = timer()

        print(f'elapsed time: {end - start}')


def pi_part(n):
    # Monte Carlo pi estimate: count random points inside the unit circle.
    print(n)

    count = 0

    for i in range(int(n)):
        x, y = random.random(), random.random()
        r = sqrt(pow(x, 2) + pow(y, 2))
        if r < 1:
            count += 1

    return count


app = MyApp()
app.run()

Why is n 100 million? 10 million is long enough; I thought it was stuck. But yeah, 1 core was the fastest.

edit: I looked at some Python threading tutorials and I'm pretty sure you need to use the multiprocessing module instead. Then you should get better results with more cores.

import random
from math import sqrt
import multiprocessing
from timeit import default_timer as timer
from direct.showbase.ShowBase import ShowBase
from multiprocessing import cpu_count


class MyApp(ShowBase):
    def __init__(self):
        ShowBase.__init__(self)
        start = timer()
        np = cpu_count()
        print(f'You have {np} cores')
        n = 10_000_000
        part_count = [n // np for i in range(np)]

        # Each chunk runs in its own process, so each gets its own interpreter
        # (and its own GIL), allowing true parallelism for pure-Python work.
        with multiprocessing.Pool() as pool:
            pool.map(pi_part, part_count)

        end = timer()
        print(f'elapsed time: {end - start}')

def pi_part(n):
    print(n)
    count = 0
    for i in range(int(n)):
        x, y = random.random(), random.random()
        r = sqrt(pow(x, 2) + pow(y, 2))
        if r < 1:
            count += 1
    return count

# The __main__ guard is required with multiprocessing: worker processes
# re-import this module and must not construct another MyApp themselves.
if __name__ == '__main__':
    app = MyApp()
    app.run()