Questions on proper use of panda task-chain threading

Hi,
I’d like to bring up an issue I’m encountering with threading through task_chains in Panda.

I have a Panda C++ app that makes intensive use of video, sound, actor animation, networking (UDP) and Bullet physics modelling.

In order to substantially increase the FPS (i.e. from roughly 8 fps to more than 50 fps), I’ve been using task_chains.

While digging into some unexpected behaviour in Bullet (i.e. a lag in object movement, …), I investigated the cycle time of each task_chain a little more.

Here is a sketch of the code:

	load_prc_file_data("", "lock-to-one-cpu 0"); // unlocks threading possibility
	load_prc_file_data("", "support-threads 1");
	// we'll use 3 task chains
	// ----------------------- 
	// use task 1 for various house keeping (effects,...)
	
	// set up a new task chain to be used in thread mode for audio & video processing:
	task_chain_2 = taskMgr->make_task_chain("task_chain_2");
 	task_chain_2->set_num_threads(1);
	task_chain_2->set_frame_sync(true);

	// set up a new task chain to be used for collisions and actor update:
	task_chain_3 = taskMgr->make_task_chain("task_chain_3");
 	task_chain_3->set_num_threads(1);
	task_chain_3->set_frame_sync(true);
	
	// create the asynchronous tasks and dispatch them among the 3 task chains
	
	// tasks to be performed under task_chain_1
	update_clouds_task = new GenericAsyncTask("UpdateClouds", &UpdateClouds, NULL);
   	taskMgr->add(update_clouds_task);
   	... + add other tasks
   	
	// tasks to be performed under task_chain_2
	audio_task = new GenericAsyncTask("AudioUpdate", &AudioTask, (void*) AM);
	audio_task->set_task_chain("task_chain_2");
	taskMgr->add(audio_task);
   	... + add other tasks
   	
	// tasks to be performed under task_chain_3 (including BULLET processing)
	coll_task = new GenericAsyncTask("Collisions_update", &CollisionsUpdate, NULL);
	coll_task->set_task_chain("task_chain_3");
	taskMgr->add(coll_task);
   	... + add other tasks

Let’s now try to monitor the ordering and the cycle time of each chain:

// add 3 spy tasks, one per task chain (just to check)
AsyncTask::DoneStatus SPY_1(GenericAsyncTask* task, void* data) {
	static double LastTimeExerciced = 0.0; // static, so the value survives between calls
	double CurrTime = globalClock->get_real_time();
	double dt = CurrTime - LastTimeExerciced;
	LastTimeExerciced = CurrTime;
	std::cout << "## TASK 1 re-exerciced after " << dt << " seconds\n";
	return AsyncTask::DS_cont;
}

	PT(GenericAsyncTask) spy1_task = new GenericAsyncTask("spy1_ck", &SPY_1, NULL);
	taskMgr->add(spy1_task);

	PT(GenericAsyncTask) spy2_task = new GenericAsyncTask("spy2_ck", &SPY_2, NULL);
	spy2_task->set_task_chain("task_chain_2");
	taskMgr->add(spy2_task);

	PT(GenericAsyncTask) spy3_task = new GenericAsyncTask("spy3_ck", &SPY_3, NULL);
	spy3_task->set_task_chain("task_chain_3");
	taskMgr->add(spy3_task);

This gives (notice the ordering of the tasks):

## TASK 1 re-exerciced after 0.00924089 seconds
## TASK 2 re-exerciced after 0.00923157 seconds
## TASK 1 re-exerciced after 0.00816875 seconds
## TASK 2 re-exerciced after 0.00819397 seconds
## TASK 1 re-exerciced after 0.0102592 seconds
## TASK 2 re-exerciced after 0.0102539 seconds
## TASK 2 re-exerciced after 0.00984192 seconds
## TASK 1 re-exerciced after 0.0111123 seconds
## TASK 2 re-exerciced after 0.00886536 seconds
## TASK 1 re-exerciced after 0.0194617 seconds
## TASK 2 re-exerciced after 0.0118713 seconds
## TASK 1 re-exerciced after 0.00945009 seconds
## TASK 2 re-exerciced after 0.00939941 seconds
## TASK 3 re-exerciced after 0.239088 seconds
## TASK 1 re-exerciced after 0.0108739 seconds
## TASK 2 re-exerciced after 0.0109558 seconds
## TASK 1 re-exerciced after 0.0103214 seconds
## TASK 2 re-exerciced after 0.0102692 seconds

Now, not using threading kills the performance in an unacceptable way :cry:

So the issues are:
(1) how to ensure data coherence between these Panda threads?
(2) how to force a sync? (maybe based on the slowest Panda thread?)
(3) overall, how to take advantage of Panda threading and use it properly?

The first followup question to ask is: which version of Panda are you using? Panda version 1.7.2 and earlier were compiled by default to use a “simple threads” model, which doesn’t actually take advantage of operating-system threading, but instead alternates tasks on the same CPU. The advantage of this model is that it runs 10% faster on a single-CPU computer than actual threads; the disadvantage is that it doesn’t take advantage of multiple CPUs and will therefore usually not yield a performance benefit over not using threads in the first place. Another disadvantage is that you may need to insert explicit calls to Thread::consider_yield() if you want to yield the context in the middle of a task.
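As an illustration of that last point, here is a minimal sketch (the loop bound num_items and process_item() are hypothetical placeholders) of yielding inside a long-running task under the simple-threads model:

```cpp
// Sketch only: a long task that periodically offers to yield the context
// so other simple-threads tasks get a chance to run.
AsyncTask::DoneStatus LongTask(GenericAsyncTask *task, void *data) {
	for (int i = 0; i < num_items; ++i) {
		process_item(i);           // hypothetical per-item work
		Thread::consider_yield();  // a cheap no-op under true OS threading
	}
	return AsyncTask::DS_cont;
}
```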

Panda version 1.8 (i.e. the buildbot version) is compiled by default to use the standard threading model, which means actually using operating system threading as you might expect.

Of course, if you are using your own Panda build, you can specify one or the other explicitly, regardless of the version.

You are responsible for ensuring data protection using, for instance, Panda’s Mutex and ConditionVar classes.

Not sure what you mean by forcing sync. Isn’t the point of threading that you want it to run asynchronously? If you mean you want thread A to wait for thread B to achieve some task before continuing, that’s the sort of thing ConditionVars are for; but this is a very big topic.

Again, a big topic. Panda’s threading library isn’t very different than any other threading library; and there are entire courses given in how to use threading properly.

David


Thanks David,

(0) I’m using a 1.8.0 build taken directly from CVS. My config is Windows 7 / Intel i5 (4 cores).

(1) In terms of ensuring data protection, the point is that I really don’t know which of Panda’s data (apart from the few objects I’m creating in my C++ app) are likely to be “shared” between threads launched with task_chains. Some sharing may be implicit in the Panda engine? (i.e. I assume that some objects are inherently thread-safe in the Panda engine.)
(e.g. what about node positions, textures, sound buffers…)

(1’) In particular, besides running Bullet in a threaded way, I’ve found it quite efficient to have a specific task_chain for the CollisionTraverser.

(1’’) Specifically with Bullet, I noticed that the debug boxes lag behind nodes that are animated by actor animation.

(2) Maybe the way here would merely be to ensure that the task chains (1, 2, 3) run in parallel, but get synchronized at the frame level.

Is there any documentation to get a better understanding of Panda’s ConditionVars?

(3) Overall, I’ve been using 3 task_chains for several months with decent performance, i.e. getting in excess of 50 fps (average) versus 5 fps when merely using one task_chain. So this is a no-brainer: for this kind of target app, threading is a must.

The drawback is that some strange behaviours pop up (especially in Bullet) that I’d like to be able to understand and clear up.

(1) Ah, OK. Most Panda objects should be automatically protected. Node positions, textures, sounds, yes, all perfectly safe for multithreaded access. There are a few more exotic objects within the Panda codebase–especially those added by students or other developers less directly connected with the core Panda team–that might not be adequately protected against mutual access. For instance, I think the particle engine might be unsafe (but generated particles are safe). Sorry, I don’t have a complete list of what is and isn’t safe. I don’t know about the Bullet objects.

(1’) Running collisions or physics on a separate thread is always a bit problematic, because usually you want the main thread to respond to physics events synchronously. You don’t, for instance, want to allow an object to penetrate into a wall for a frame before you get a chance to discover and react to the collision.

(2) You can use AsyncTaskChain::set_frame_sync(true) to make the chain synchronize with the main thread chain at a frame level. With this flag set true, the (threaded) task chain will run in parallel with the main task chain, but the frame will not advance until the main task chain also is ready to advance the frame.

However, this won’t help you if you have a specific task on the main task chain (like igloop, or any task that responds to the results of your physics) that requires the physics to be completed first. For this, you will need to use your own synchronization primitives, such as ConditionVar.

Condition variables are a standard synchronization primitive; Panda’s ConditionVar model follows the standard convention. Try Googling for “condition variable”.

David


Thanks for your guidance.

However related to point (2)

This is the behaviour I expected by setting (as described above):

   // set up a new task chain to be used in thread mode for audio & video processing:
   task_chain_2 = taskMgr->make_task_chain("task_chain_2");
   task_chain_2->set_num_threads(1);
   task_chain_2->set_frame_sync(true);
   ... same for task_chain_3 ...

But in this case the delta timings for task_1, task_2 and task_3 should be aligned, i.e. sync’d with the slowest task_chain thread?! Which is not what seems to happen (see above). :cry:

Aha, I see: in order to get a full sync, the slowest task should be assigned to task_chain_1, which appears to be the “master task_chain”…
This was not obvious…

Hmm, I don’t think this is intended. Are you saying that if the main task chain is not the slowest one, the frame sync fails, as if the main task chain isn’t waiting for the sub-thread task chains?

Hmm, on reflection, that might indeed be the way it’s built. But I don’t think that’s desirable behavior. Let me think on that some more.

David

Well you see, in my example all task_chains were set up with set_frame_sync(true), but apparently the “master task_chain” was not waiting for completion of the slow guy (task_chain_3)… Had that been the case, the delta times should have been roughly equal, I suppose…

## TASK 1 re-exerciced after 0.0194617 seconds
## TASK 2 re-exerciced after 0.0118713 seconds
## TASK 1 re-exerciced after 0.00945009 seconds
## TASK 2 re-exerciced after 0.00939941 seconds
## TASK 3 re-exerciced after 0.239088 seconds
## TASK 1 re-exerciced after 0.0108739 seconds
## TASK 2 re-exerciced after 0.0109558 seconds 

JFYI.
I’ve just tried to rebalance the task_chain load so that the slowest one is task_chain_1.
Here are the results:

## TASK 1 re-exerciced after 0.0794035 seconds
## TASK 3 re-exerciced after 0.020194 seconds
## TASK 4 re-exerciced after 0.0202004 seconds
## TASK 2 re-exerciced after 0.0201275 seconds
## TASK 4 re-exerciced after 0.0191932 seconds
## TASK 3 re-exerciced after 0.0191997 seconds
## TASK 2 re-exerciced after 0.0191966 seconds
## TASK 3 re-exerciced after 0.0226837 seconds
## TASK 4 re-exerciced after 0.0226882 seconds
## TASK 2 re-exerciced after 0.022687 seconds
## TASK 4 re-exerciced after 0.0195121 seconds
## TASK 3 re-exerciced after 0.0195166 seconds
## TASK 2 re-exerciced after 0.0195143 seconds
## TASK 1 re-exerciced after 0.0855763 seconds
## TASK 4 re-exerciced after 0.0227592 seconds
## TASK 3 re-exerciced after 0.0227775 seconds
## TASK 2 re-exerciced after 0.0227659 seconds
## TASK 3 re-exerciced after 0.0202487 seconds
## TASK 4 re-exerciced after 0.0202973 seconds
## TASK 2 re-exerciced after 0.020325 seconds
## TASK 3 re-exerciced after 0.0192568 seconds
## TASK 4 re-exerciced after 0.0192333 seconds
## TASK 2 re-exerciced after 0.0191966 seconds
## TASK 3 re-exerciced after 0.0184327 seconds
## TASK 4 re-exerciced after 0.0184335 seconds
## TASK 2 re-exerciced after 0.0184457 seconds
## TASK 1 re-exerciced after 0.0802387 seconds

giving something around 42 fps… with spy printing enabled, and roughly 60 fps with no spy printing.

I have one (more!) additional question related to task_chain threading.

Bear with me, consider the following sketch code:

// globals
CollisionTraverser *C_trav        = NULL;		
PT(CollisionHandlerPusher) Pusher = NULL;

// call back for pusher events
void bump(const Event *pTheEvent, void *pData) {
	... whatever actions
}

AsyncTask::DoneStatus Collisions(GenericAsyncTask* task, void* data) {
	...
	C_trav->traverse(render);
	...
	return AsyncTask::DS_cont;
}

...
C_trav = new CollisionTraverser();		
Pusher = new CollisionHandlerPusher();
Pusher->add_in_pattern("BumpIntoStuff");
EventHandler *app_eventHandler    = EventHandler::get_global_event_handler();
app_eventHandler->add_hook("BumpIntoStuff", &bump, "n"); // bump new


T_task2 = new GenericAsyncTask("Task2", &Collisions, NULL);
T_task2->set_task_chain("task_chain_2");
taskMgr->add(T_task2);
...

Question: which thread executes the pusher callback event? Is it the one running task_chain_2?
And is this callback guaranteed to have completed after the C_trav->traverse(render) call returns?

Additional question:
I suppose task->set_sort(n) prioritizes tasks belonging to the same task_chain, correct?

Actually, the EventHandler by default will queue up events and deliver them to the main thread (the queue is emptied by the task named “event” which PandaFramework adds to the main task chain). So, that being the case, the bump() function will be called in the main thread, even though you are running the collisions in a secondary thread. And the function won’t necessarily have been called yet by the time you return from traverse().

Correct.

David

Thanks, I’m getting better visibility into task_chain ordering now. It’s getting tricky indeed!

Sibling question, consider:

framework.get_event_handler().add_hook("button_down", event_button, NULL);

static void event_button(const Event* event, void* data) {
	std::string button_code = event->get_parameter(0).get_string_value();
	if (button_code == "f8")          { actionF8();  return; }
	else if (button_code == "escape") { actionESC(); return; }
	...
}

When is the button event supposed to be treated? I suppose by the main task chain?

Is the task named “event” performed at the very beginning of the main task chain, i.e. at the start of a new frame?

Right, by the main task chain.

You can see this task being created if you look in pandaFramework.cxx. It doesn’t specify any particular task sort, so it gets added with a sort value of zero. So that means there’s no particular guarantee that it will be the first task executed within a frame. You can, of course, set its task sort to the negative value of your choosing if you wish to guarantee this.

David

Thanks a lot David, this is food for thought.

I’ll go over all this and try to figure out the logical ordering and dependency of events and treatment performed in my code dispatched among task_chains.

These few days have been very instructive on the subject!!

Included :

// ensure the event task is done at the very beginning of each frame
AsyncTask *event_task = taskMgr->find_task("event");
event_task->set_sort(-1000);

This gives indeed better predictability!

BTW, about AsyncTaskChain::set_frame_sync(true):

Did you get a chance to have a look at this one? Thanks!

Additional detailed question (sorry for the rate of the questions…)

If I remember correctly, when the pusher event callback procedure is called, the pusher has already moved back the colliding node (or its parent), i.e. changed its coordinates.
The question then is where and when this coordinate change occurs: in which task_chain? The one running C_trav->traverse(render), or the main task_chain?

SUGGESTION. I tend to think that in any case it would be preferable for an action initiated in a task_chain to have (if possible) its ‘consequences’ take place in the same task_chain. Otherwise it is really difficult to figure out how to synchronize events properly. Maybe this should be considered within Panda’s core design?

Still trying to cope with the smart use of Panda threads, here is what I tried in order to ensure the pusher operates right after the collision traverser, and also to balance the load on each thread.

// reassign "event" to appropriate task_chain
AsyncTask *event_task = taskMgr->find_task("event");
event_task->set_task_chain("task_chain_2");
event_task->set_sort(+1000);

std::cout << "all tasks : " << taskMgr->get_tasks() << "\n";

for (int tchain_id=0; tchain_id<taskMgr->get_num_task_chains(); tchain_id++) {
	AsyncTaskChain *task_chain = taskMgr->get_task_chain(tchain_id);
	std::cout << "task_chain " << tchain_id 
		<< " name: " << task_chain->get_name()
		<< " : " << task_chain->get_tasks() << "\n";
	AsyncTaskCollection task_collection = task_chain->get_tasks();
	for (int ntask=0; ntask < task_chain->get_num_tasks(); ntask++) {
		AsyncTask *curr_task = task_collection.get_task(ntask);
		std::cout << "\ttask " << curr_task->get_name() << "\t sort: " << curr_task->get_sort() << "\n"; 
	}
}
taskMgr->start_threads();

This nicely gives:

all tasks : 11 AsyncTasks
task_chain 0 name: default : 4 AsyncTasks
        task data_loop                   sort: -50
        task update_joints_from_tracker  sort: 1
        task igloop                      sort: 50
        task garbageCollectStates        sort: 46
task_chain 1 name: loader : 0 AsyncTasks
task_chain 2 name: task_chain_2 : 2 AsyncTasks
        task Collisions_and_Gravity_update    sort: 5
        task event                            sort: 1000
task_chain 3 name: task_chain_3 : 4 AsyncTasks
        task update_visitors       sort: 3
        task AudioUpdate           sort: 14
        task updateOpenCVVideo     sort: 15
        task update_sky            sort: 17
task_chain 4 name: task_chain_4 : 1 AsyncTask
        task simulationTaskBullet  sort: 7

ooooh, just giving a healthy 85 fps and no crashes/freezes (yet??)…

I hope I won’t get hung from Panda’s pirate pole for such hack activity!!!

All this raises some questions:

(1) regarding what remains in task_chain 0:
what is the role of data_loop and garbageCollectStates?
where should their sort values optimally sit relative to the other tasks?

(2) where are the animations treated?

(3) I suppose that in the following, other_code() is performed on task_chain 0 (default)…

while ( (framework.do_frame(current_thread)) && (!exit_application) ) {
	// Step the interval manager (advance animations)
	CIntervalManager::get_global_ptr()->step();
	// other code
	other_code();
}

(4) what happens if the number of task_chains exceeds the number of processors?
Are the related execution priorities based on sort numbers?

(5) what about the reported FPS and the reference clock (wall clock?)?
If I insert a spy task into each task_chain (i.e. a task that simply spies on the recurring
activity of the related task_chain), this is what I get,

with the reported FPS having dropped to roughly 40 fps due to the spy print activity:

## TASK 0 re-exerciced after 0.100136 seconds
## TASK 2 re-exerciced after 0.0246729 seconds
## TASK 4 re-exerciced after 0.0253347 seconds
## TASK 2 re-exerciced after 0.0253542 seconds
## TASK 4 re-exerciced after 0.0249861 seconds
## TASK 2 re-exerciced after 0.0249715 seconds
## TASK 3 re-exerciced after 0.225628 seconds
## TASK 4 re-exerciced after 0.0250216 seconds
## TASK 2 re-exerciced after 0.025024 seconds
## TASK 0 re-exerciced after 0.100156 seconds
## TASK 4 re-exerciced after 0.0248161 seconds
## TASK 2 re-exerciced after 0.0247877 seconds
## TASK 4 re-exerciced after 0.0253014 seconds
## TASK 2 re-exerciced after 0.0253261 seconds
## TASK 4 re-exerciced after 0.025075 seconds
## TASK 2 re-exerciced after 0.0250721 seconds
## TASK 4 re-exerciced after 0.0250834 seconds
## TASK 2 re-exerciced after 0.0250709 seconds
## TASK 4 re-exerciced after 0.0250344 seconds
## TASK 0 re-exerciced after 0.100516 seconds 

So:
a- how can the FPS be reported as 40 fps when TASK 0 recurs at 0.1 s intervals??
b- why do the other tasks recur several times per TASK 0 cycle (some at 0.02 s), even though they are declared sync’d with the main task chain?

Things are getting clearer, but … still scratching my head!!

The pusher really does all of its work in the collision task chain. By the time your traverse() call is finished, all of the collisions have been detected, the pusher has been activated, and has pushed all objects out of their walls. The only thing that is deferred is the event system, which is used to tell the application code that a collision has occurred.

If you require an immediate notification within the same thread, you will have to subclass CollisionHandlerPusher and override handle_entries(), which is called when the pusher is activated (within the thread).
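A sketch of that subclass (react_to_collisions() is a hypothetical application hook; this assumes handle_entries() as the virtual override point):

```cpp
// Sketch: get an immediate, same-thread notification when the pusher fires,
// instead of waiting for the deferred event queue on the main thread.
class ImmediatePusher : public CollisionHandlerPusher {
public:
	virtual bool handle_entries() {
		// Let the pusher do its normal work first (pushing nodes out of walls).
		bool okflag = CollisionHandlerPusher::handle_entries();
		// This code runs in the thread that called traverse(), right after
		// the push; protect any data shared with other chains with a Mutex.
		react_to_collisions();  // hypothetical application hook
		return okflag;
	}
};
```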

To answer your remaining questions:

(1) data_loop reads the keyboard and mouse inputs. It should probably be run early in the frame, before you query mouse position or anything. garbageCollectStates is a cleanup function that ensures that unused RenderState and TransformState objects are appropriately deleted. It doesn’t really matter much when in the frame it executes, but it should execute at least once per frame.

(2) Animations are generally handled by the cull traversal, unless they are explicitly invoked by the application code. When handled by the cull traversal, they will be taken care of by the igloop task, which should always be on the main thread. If you are running with graphics-threading-model set, of course the cull traversal will be handled by whatever custom thread you created for cull.

(3) Correct. This is main thread stuff.

(4) Of course there is no requirement that you have fewer task chains than processors. We rely on the operating system to balance threads appropriately across the number of processors that are available. We don’t have a lot of control over relative execution priorities, but if you wish to control this, you may specify a suitable priority per task chain with AsyncTaskChain::set_thread_priority(). Your choices are TP_low, TP_normal, TP_high, and TP_urgent. I don’t recommend using TP_urgent unless you know what you’re doing.
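A sketch of that, reusing the chain names from earlier in the thread:

```cpp
// Run the Bullet chain at low priority so the main/render chain wins
// whenever runnable threads outnumber cores.
AsyncTaskChain *bullet_chain = taskMgr->make_task_chain("task_chain_4");
bullet_chain->set_num_threads(1);
bullet_chain->set_thread_priority(TP_low);
```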

(5) The reported FPS is based on the cycle time of the igloop task, which should be in the main thread. (On some platforms you may be able to move this task to another task chain, but this is a bad idea because it isn’t portable, and could cause parts of the system to break in subtle ways. If you wish to take the igloop processing off of the main thread, it’s better to use graphics-threading-model to achieve this instead.)

David


Thank you so much David for such detailed and helpful info,

I still have difficulty understanding the reported FPS and the cycle time of each task_chain whenever I add a SPY to it, i.e. the delta-time figures I reported in my previous post simply make no sense.

Spy task included in each task_chain n:

AsyncTask::DoneStatus SPY_n(GenericAsyncTask* task, void* data){
	static double LastTimeExerciced = 0.0;
	double CurrTime = globalClock->get_real_time(); // *** I suppose wall clock
	double dt = CurrTime - LastTimeExerciced;
	LastTimeExerciced = CurrTime;
	std::cout << "## TASK n re-exerciced after " << dt << " seconds\n";
	return AsyncTask::DS_cont;
}

So I’m still puzzled with this:

I don’t know. You haven’t set tick_clock on any of your other task chains, have you?

Strictly speaking, the FPS (and other things) is based on the rate at which global_clock->tick() is called. This function is normally called once per frame by igloop, but it can also be called by any task chain with tick_clock set on it. If you accidentally call it twice per frame for some reason (by having igloop and some other task chain call it, for instance), then FPS calculations and all sorts of other low-level things will start to go screwy.
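A quick diagnostic sketch (assuming the taskMgr pointer from the earlier snippets, and that AsyncTaskChain exposes get_tick_clock() alongside set_tick_clock()) to check that no secondary chain is ticking the clock:

```cpp
// Normally no chain should have tick_clock set: igloop already ticks the
// global clock once per frame on the main chain.
for (int i = 0; i < taskMgr->get_num_task_chains(); ++i) {
	AsyncTaskChain *chain = taskMgr->get_task_chain(i);
	std::cout << "chain " << chain->get_name() << " tick_clock: "
	          << (chain->get_tick_clock() ? "yes" : "no") << "\n";
}
```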

How does your frame reporting look in PStats? That also depends on tick() being called once per frame, but looking at its graph may help clarify what’s going on. Note that PStats reports a different graph for each thread.

David
