cull problems in cvs trunk

treeform · December 24, 2008, 3:24am

I have moved to developing my game on cvs head to get the new features that are rolling out.

I have run into problems with the Cull process: it appears to be taking a very long time now, almost as much as the draw and Python tasks. This was not a problem before, so I think some sort of bug has crept into the cull process. I would also be interested in controlling the cull process a little more.

bugs.launchpad.net/panda3d/+bug/311028

drwr · December 24, 2008, 6:32am

It is not universally true that cull is slower now than it used to be. In many cases (in particular, the cases I have been working with), it is faster than it used to be.

So, I need more information to track this down. Is it possible to put together a small sample application that performs better in 1.5.x than on the cvs head?

David

treeform · December 24, 2008, 6:51am

I am still investigating why it's slow. It would help a lot if we knew more about how cull works. I can't find any docs explaining what it does, what algorithms it uses, and what affects it.

Here is the render.ls() dump:
dpaste.com/102100/

As you can see, I use hidden and LOD nodes a lot.
Basic outline:

root
- galaxy
- startsystem

galaxy is hidden when the star system is shown, and vice versa.

The star system contains, most importantly, the ships, which are basically:

ship with LOD node
- hull
- shields (hidden)
- enge points (hidden)
- weapon system LOD node + RigidBody
  - weapon
    - rotation
    - gun itself

I also have a far-system display region and a UI display region.

drwr · December 24, 2008, 7:06am

Cull is the entire process of preparing the scene for drawing. It includes the traversal of the scene graph from the root, the accumulation of state and transforms, the computation of animation, the sorting of nodes into bins, and the assembling of a list of drawable objects.

The input to cull is a scene graph. The output is a list of renderable objects and their associated state, in renderable order.
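As a rough illustration of that input/output relationship, here is a pure-Python toy traversal. It is not Panda's implementation: the `Node` class, the bin names, and the one-number "transform" are all hypothetical simplifications, and animation is left out entirely. It only shows how a single walk can accumulate transform and state, skip hidden subtrees, and emit drawables in draw order.

```python
# Toy model only: Node, the bins and the 1-D "transform" are made up for illustration.
class Node(object):
    def __init__(self, name, offset=0.0, state=None, geom=None,
                 hidden=False, bin_name="opaque"):
        self.name = name
        self.offset = offset        # stand-in for a local transform
        self.state = state or {}    # stand-in for render attributes
        self.geom = geom            # None for purely structural nodes
        self.hidden = hidden
        self.bin_name = bin_name
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


BIN_ORDER = {"background": 0, "opaque": 1, "transparent": 2}


def cull(root):
    """Walk the graph once and return (geom, net_offset, net_state) in draw order."""
    drawables = []

    def visit(node, net_offset, net_state):
        if node.hidden:
            return  # a hidden subtree is skipped entirely
        net_offset = net_offset + node.offset
        merged = dict(net_state)
        merged.update(node.state)   # child state overrides parent state
        if node.geom is not None:
            drawables.append((BIN_ORDER[node.bin_name], node.geom,
                              net_offset, merged))
        for child in node.children:
            visit(child, net_offset, merged)

    visit(root, 0.0, {})
    # Stable sort: bin order first, traversal order within a bin.
    drawables.sort(key=lambda d: d[0])
    return [(geom, offset, state) for _, geom, offset, state in drawables]


root = Node("root")
ship = root.add(Node("ship", offset=5.0, state={"color": "red"}))
ship.add(Node("glow", offset=1.0, geom="glow_geom", bin_name="transparent",
              state={"color": "white"}))
ship.add(Node("hull", geom="hull_geom"))
ship.add(Node("shields", geom="shield_geom", hidden=True))
render_list = cull(root)
```

In this toy, `render_list` holds the hull before the glow (opaque bin before transparent bin), each with its accumulated offset and state, and the hidden shields never appear.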

Since “Draw” is the process of issuing graphics commands as rapidly as possible with as little CPU required as possible, “Cull” is the process of doing all the CPU work on the scene in preparation for drawing.

In many other graphics engines, “Cull” and “Draw” are not separate tasks, but are done more or less simultaneously: the engine will traverse the scene graph and issue graphics commands as it goes. Panda’s design is intended to pave the way for pipelining of the tasks, so that Cull and Draw may theoretically be performed in parallel: we Cull frame n while simultaneously Drawing frame (n - 1). While not yet fully implemented, when it is ready, this will provide the maximum theoretical throughput for Panda’s rendering.
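The frame overlap can be pictured with a small sequential simulation. This is only a sketch of the scheduling idea, not Panda's threading model: `cull_frame` and `draw_frame` are hypothetical stand-ins, and a real pipeline would run the two stages at the same time on separate threads rather than one after the other inside a loop.

```python
def cull_frame(n):
    # Stand-in for the CPU work: produce the render list for frame n.
    return "render list %d" % n


def draw_frame(render_list):
    # Stand-in for issuing graphics commands from an already-built list.
    return "drew " + render_list


def run_pipelined(num_frames):
    """Each step culls frame n and draws the list built for frame n - 1."""
    log = []
    pending = None  # result of the previous step's cull, waiting to be drawn
    for n in range(num_frames + 1):
        drawn = draw_frame(pending) if pending is not None else None
        pending = cull_frame(n) if n < num_frames else None
        log.append((n, pending, drawn))
    return log


steps = run_pipelined(3)
```

Each entry of `steps` is `(step, list being culled, list being drawn)`. The first step has nothing to draw yet, and one extra step at the end drains the last render list, which is the one-frame latency that pipelining trades for throughput.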

But the short answer to your question is, Cull performs a lot of things. It is not simply view-frustum culling (although that is part of it). It’s impossible to speculate on what precisely may be slower now than it used to be. I’d have to have some actual code to run and profile A/B.

David

treeform · December 24, 2008, 7:48am

Here is the basic layout of my scene, approximated with boxes.

In this program cull runs very fast, so I guess triangle count has an effect on the cull step. I'll try to get more appropriate models next.

from pandac.PandaModules import *
loadPrcFileData("", """
want-pstats 1
#task-timer-verbose 1
#pstats-tasks 1
show-frame-rate-meter 1
""") 
from random import random
import direct.directbase.DirectStart
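a = 10
# Build a 20 x 20 grid; each cell mimics one "ship" from the outline above.
for x in range(-a,a):
    for y in range(-a,a):
        # Top-level LOD: its child is shown only within 100 units of the camera.
        mainBoxLod = NodePath(LODNode("lod"))
        mainBoxLod.node().addSwitch(100,0)
        mainBoxLod.reparentTo(render)
        mainBoxLod.setPos(x*5,y*5,0)

        # Stand-in for the hull.
        p = loader.loadModel("box")
        p.setColor(1,0,0)
        p.reparentTo(mainBoxLod)

        # Nested LOD for the weapon system, shown only within 30 units.
        w = NodePath(LODNode("lod"))
        w.reparentTo(p)
        w.node().addSwitch(30,0)
        wrbc = NodePath(RigidBodyCombiner("rbc"))
        wrbc.reparentTo(w)
        for n in range(10):
            # Stand-in for one weapon mount...
            t = loader.loadModel("box")
            t.setScale(.3)
            t.setPos(0,0,n*.5+2)
            t.setColor(.5,.5,.5)
            t.reparentTo(wrbc)

            # ...with a child box standing in for the gun.
            t2 = loader.loadModel("box")
            t2.setScale(.3)
            t2.setColor(1,.5,.5)
            t2.setPos(0,2,0)
            t2.reparentTo(t)
        # Combine the children of the RigidBodyCombiner into fewer batches.
        wrbc.node().collect()
#render.ls()
print("done building!")
run()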

rdb · December 24, 2008, 8:20am

Actually, with that sample code, I get 460 fps on 1.5.3, and 500 fps on 1.6.0.

treeform · December 24, 2008, 8:22am

Yeah, it has a similar node arrangement to my game, yet my game runs super slow!

The program is a counterexample to my problem.

I am still scratching my head.

… I am wondering if the Flip phase of the graphics card somehow shifted into cull.

treeform · December 24, 2008, 8:25am

from pandac.PandaModules import *
loadPrcFileData("", """
want-pstats 1
#task-timer-verbose 1
#pstats-tasks 1
show-frame-rate-meter 1
""") 
from random import random
import direct.directbase.DirectStart


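def box(trycount):
    # Build one flattened node with roughly `trycount` triangles by copying
    # the box geometry (assuming the box model has 12 triangles).
    n = NodePath("lets kook a box")
    p = loader.loadModel("box")
    for i in range(trycount//12):
        for geom in p.findAllMatches('**/+GeomNode'):
            geom.setPos(random(),0,0)
            geom.copyTo(n)
    n.flattenStrong()
    return n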
a = 10
for x in range(-a,a):
    for y in range(-a,a):
        mainBoxLod = NodePath(LODNode("lod"))
        mainBoxLod.node().addSwitch(100,0)
        mainBoxLod.reparentTo(render)
        mainBoxLod.setPos(x*5,y*5,0)
        
        p = box(12000)
        p.setColor(1,0,0)
        p.reparentTo(mainBoxLod)
        
        w = NodePath(LODNode("lod"))
        w.reparentTo(p)
        w.node().addSwitch(30,0)
        wrbc = NodePath(RigidBodyCombiner("rbc"))
        wrbc.reparentTo(w)
        for n in range(10):
            t = box(100)
            t.setScale(.3)
            t.setPos(0,0,n*.5+2)
            t.setColor(.5,.5,.5)
            t.reparentTo(wrbc)
            
            t2 = box(100)
            t2.setScale(.3)
            t2.setColor(1,.5,.5)
            t2.setPos(0,2,0)
            t2.reparentTo(t)
        wrbc.node().collect()  
#render.ls()
print("done building!")
run()

This program has a similar polycount to my game. It loads very slowly, though, because it needs to flatten so much.

But it also runs fast when it finishes loading!

drwr · December 24, 2008, 3:14pm

Cull has nothing to do with polycount. The things that make cull expensive in general are complicated node structures, and/or too many nodes, state changes, or geoms.

“Flip” is the time spent waiting for the graphics card to finish drawing the previous frame, so that time doesn’t directly migrate into Cull, which is entirely CPU-based. (However, if Cull becomes more expensive for other reasons, you might see the Cull bar increase in PStats while you see the Flip bar decrease by the same amount, because the CPU takes longer to start waiting for the graphics card, and therefore has less time to wait. But if you see this, it is a side-effect, not a cause.)
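The trade-off described in the parenthetical can be checked with a simplified timing model. The assumption here (a simplification, not a statement about how PStats measures things) is that a frame lasts as long as the slower of the CPU and the graphics card, and that Flip is whatever time the CPU spends waiting. All numbers are made up.

```python
def frame_breakdown(cull_ms, other_cpu_ms, gpu_ms):
    """Return (frame_ms, flip_ms) under the simplified model."""
    cpu_ms = cull_ms + other_cpu_ms
    frame_ms = max(cpu_ms, gpu_ms)   # the slower side sets the frame time
    flip_ms = frame_ms - cpu_ms      # time the CPU waits for the card
    return frame_ms, flip_ms


before = frame_breakdown(cull_ms=4.0, other_cpu_ms=3.0, gpu_ms=12.0)
after = frame_breakdown(cull_ms=6.0, other_cpu_ms=3.0, gpu_ms=12.0)
```

With these numbers Cull grows by 2 ms, Flip shrinks by 2 ms, and the frame time stays at 12 ms, so a shrinking Flip bar here is a symptom of the extra CPU work rather than its cause.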

David

treeform · December 24, 2008, 5:45pm

Would node motion affect cull? Do nodes not in render affect cull? What are some node configurations one should avoid?

drwr · December 24, 2008, 6:01pm

It’s difficult to generalize. Node motion doesn’t affect cull, directly, but repeatedly moving lots of nodes can bloat your TransformState cache, which can in turn impact cull performance. (You can determine whether this is a problem by disabling the cache with transform-cache 0, to see if that makes a difference.)
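A minimal sketch of that experiment, using the same loadPrcFileData pattern as the test programs earlier in the thread. Only `transform-cache 0` comes from the post above; the helper name `apply_cull_debug_config` is hypothetical, and the sketch assumes it is called before DirectStart is imported, as the earlier programs do with their own settings.

```python
# PRC overrides for an A/B test: run once with this applied and once without,
# then compare the Cull bar in PStats.
CULL_DEBUG_PRC = """
want-pstats 1
transform-cache 0
"""


def apply_cull_debug_config():
    # Imported lazily so this file can be loaded without Panda3D installed.
    from pandac.PandaModules import loadPrcFileData
    loadPrcFileData("", CULL_DEBUG_PRC)
```

If the Cull time drops noticeably with the cache disabled, that points at cache bloat; if nothing changes, the cause is elsewhere.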

Nodes not in render don’t affect cull. Stashed nodes don’t affect cull. Hidden nodes generally don’t affect cull, but that’s more complicated because we have the showThrough() method which can undo the effects of hide(), and there’s a system of bitmasks that implements this complexity.

It’s hard to recommend specific actions not to do. There are lots of silly things you might do, of course, like having an unnecessarily deep hierarchy with long chains of one-child nodes, or conversely having an unnecessarily wide hierarchy with many hundreds of children to one parent. You could also overuse transforms with a transform or state change on every node, or fill the scene graph with billboards or compass effects, or abuse LOD’s to the point of making every leaf on a tree its own separate LOD. All of these things increase the work that must be done in the cull task. But that’s not to say that you shouldn’t use LOD’s or billboards, because used wisely, they usually decrease the work that must be done in the cull task. The general idea is just to minimize the number of nodes that should be visited during the traversal, and the amount of work that must be done for each node.
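One way to look for the shapes mentioned above is to walk the graph and collect a few figures. The helper below is a hypothetical sketch, not a Panda tool: it only relies on each node having a `getChildren()` method that returns an iterable of children, which NodePath provides, and it uses a tiny `FakeNode` stand-in so it can run without Panda3D. What counts as "too deep" or "too wide" is left to judgment.

```python
def hierarchy_stats(root):
    """Report node count, depth, widest fan-out and one-child nodes below root."""
    stats = {"nodes": 0, "max_depth": 0, "max_children": 0, "one_child_nodes": 0}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        children = list(node.getChildren())
        stats["nodes"] += 1
        stats["max_depth"] = max(stats["max_depth"], depth)
        stats["max_children"] = max(stats["max_children"], len(children))
        if len(children) == 1:
            stats["one_child_nodes"] += 1   # link in a one-child chain
        stack.extend((child, depth + 1) for child in children)
    return stats


class FakeNode(object):
    """Tiny stand-in so the sketch runs without Panda3D."""
    def __init__(self, *children):
        self._children = list(children)

    def getChildren(self):
        return self._children


# A chain of three one-child nodes ending in a node with four leaf children.
demo = FakeNode(FakeNode(FakeNode(
    FakeNode(FakeNode(), FakeNode(), FakeNode(), FakeNode()))))
demo_stats = hierarchy_stats(demo)
```

In a running application the same function could be pointed at `render` or `render2d` to compare the two scene graphs.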

But I want to get back to your earlier thesis: you said that cull was more expensive on the cvs head than it was on the 1.5.x branch. Have you found definitive evidence of that? I’d love to be able to say that Panda gets faster with every release, and in general I do think it’s true; but in order to justify this I need to be able to analyze aberrant cases like you’re reporting. Is it only in your full application that you observe a performance difference between 1.5.x and the cvs head?

David

treeform · December 24, 2008, 6:10pm

It looks like a significant part is taken up by display region 2 ~ the ui

ui off
affuniverse.com/artdepot/pub … pix/38.jpg
ui on
affuniverse.com/artdepot/pub … pix/39.jpg

I'm pretty sure dr_4 is the background display region (background stars and planets + depth clear), but what is dr_2?
affuniverse.com/artdepot/pub … pix/43.jpg

Next, my glow post-process is taking up almost as much time as the real scene!
affuniverse.com/artdepot/pub … pix/40.jpg
It has only one display region dr_0
affuniverse.com/artdepot/pub … pix/41.jpg

all in perspective:
affuniverse.com/artdepot/pub … pix/42.jpg
without many of the python tasks:
affuniverse.com/artdepot/pub … pix/42.jpg

drwr · December 24, 2008, 7:01pm

Yeah, that’s huge. What’s going on in your render2d scene graph?

Well, dr_2 is base.win.getActiveDisplayRegion(2), whichever one that is. It doesn’t appear to be significant in your graphs; it might be the fps meter.
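To find out which region a given index refers to, the active display regions can be listed with their sort value and camera. This is a sketch under assumptions: `describe_display_regions` is a hypothetical helper, the `Fake...` classes and their camera paths are made-up stand-ins so the code runs without opening a window, and in a real session the argument would be `base.win`.

```python
def describe_display_regions(win):
    """List (index, sort, camera) for every active display region of `win`.

    `win` is expected to behave like base.win: it needs
    getNumActiveDisplayRegions() and getActiveDisplayRegion(i), and each
    region needs getSort() and getCamera().
    """
    rows = []
    for i in range(win.getNumActiveDisplayRegions()):
        region = win.getActiveDisplayRegion(i)
        rows.append((i, region.getSort(), str(region.getCamera())))
    return rows


# Hypothetical stand-ins so the sketch runs without a graphics window.
class FakeRegion(object):
    def __init__(self, sort, camera):
        self._sort, self._camera = sort, camera

    def getSort(self):
        return self._sort

    def getCamera(self):
        return self._camera


class FakeWindow(object):
    def __init__(self, regions):
        self._regions = regions

    def getNumActiveDisplayRegions(self):
        return len(self._regions)

    def getActiveDisplayRegion(self, i):
        return self._regions[i]


fake_win = FakeWindow([FakeRegion(0, "render/camera/cam"),
                       FakeRegion(10, "render2d/camera2d")])
rows = describe_display_regions(fake_win)
```

Run against `base.win`, the row with index 2 would be the region that shows up as dr_2, and its camera path should say which scene graph it renders.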

Does your glow post process require traversing the same scene graph again? It certainly looks like it does.

David

treeform · December 24, 2008, 7:08pm

The same UI problem I was having before. I think I'll try a different approach: drawing all windows and icons in one geom object.
— edit —

I have figured out what was really wrong here. I used lots of icons, and icons inherited from buttons, which have a label. So every icon ended up with a label, and since not a single icon used it, there were a bunch of blank TextNodes.

Second, I use clip planes to contain elements inside windows. It looks like the clip planes spread to other, higher-level components in much the same way the labels did.

The GUI is faster now, but I still have the problem with the number of primitive batches.

— edit —

Yes it does. Is there a way to cache the previous cull process and just issue all the vertex draw commands again with a different shader?

drwr · December 24, 2008, 7:29pm

Not really, because some of the data that cull accumulates is the render state, which includes the shader.

We have allowed a special exception to share a single cull traversal between the left and right eye of a stereo pair, though, because in that case we know that the states are identical, and the transform is only slightly different; we might be able to make a similar special exception in the case of a post-process like this, where we know that the transform is identical and the state is only slightly different. It will require some new code, though, and you’ll have to know what you’re doing to enable it. I’ll look into this.

David

treeform · December 24, 2008, 7:37pm

I think the case of sharing the same cull process with a different shader will come into play more and more as people try to use shaders with post-processing in their code, like I have.

In the future I plan two more post-processing effects: one for “disturbance”, to simulate shockwaves and water/lens-like properties, and one for fuzzy particles.

If no cull optimization is done, I will end up with 4 cull processes on the same nodes, which would probably rule out using them in real time.