Question concerning culling and geometry instancing.

Hi everyone, I’ve been considering switching from Torque 3D to Panda 3D for a while now, and I’m leaning in that direction as Panda seems much more flexible, but I have some questions about the rendering pipeline.

Right now I’m envisioning a use case that looks something like this:

  • I have many objects in my world with shared geometry and textures, but different colors blended over (presumably stored as a per-node shader constant?)
  • After culling, I would like to batch these objects into as few draw calls as possible, preferably using hardware mesh instancing (as described here: panda3d.org/blog/?p=44)

Unfortunately, I don’t quite understand the rendering pipeline model used by Panda3D, and the documentation I’ve read through hasn’t been terribly enlightening. So my question is this: is the use case I’m describing a viable one, and if so, where should I start looking if I wanted to put it together? Can someone name-drop the relevant Panda-jargon or provide some high-level pseudocode to put me on the right track? Is there any information on minimum hardware or software requirements for certain types of features (for example, hardware mesh instancing)?

Related, but less pressing, questions:

  • Has anyone looked at implementing CHC++ in Panda’s culling system? Is the current system flexible enough to try a drop-in replacement? Is there any really good documentation of the pipeline on the current system?
  • Are there any decent references on networked multiplayer in Panda? Most of the demos seem to be concerned with single-player or same-machine multiplayer.

Thanks!

p.s. Sorry if this is the wrong subforum, but I think it fits.

You could use hardware instancing for what you have in mind. I don’t know which video cards do and don’t support it, but we use the extension GL_ARB_draw_instanced/GL_EXT_draw_instanced for this.

To use instancing, you need to load the model and place it in the scene as if you just wanted one instance. Then, either set a custom bounding volume on your model (since Panda will otherwise cull the whole thing away when your primary model is no longer visible) or set an OmniBoundingVolume to disable culling. Then, you use setInstanceCount(n) and apply a shader that uses the INSTANCEID input integer to select a proper position (e.g. from an array of positions that you pass to the shader).

Thanks for the info, that is definitely helpful. One question I have is whether or not it’s possible to hijack the rendering queue AFTER culling is performed and perform the setInstanceCount then (in the way that is done by other renderers which perform batching)? If I have a very large number of instances and only a small fraction might be on screen at any one time, it seems like it would be of benefit to not try and draw them all at once.

You can simply use custom bounding volumes for this, or is there a custom culling algorithm that you want to implement? In that case, you could create a CallbackNode to create a custom cull callback for your model.

I think that either I’m misunderstanding how the custom bounding volumes work or I’ve communicated poorly what I have in mind. Let me try to articulate this a little bit better.

Let’s say I have a scene that looks something like this (where the blue represents my camera’s field of view, and green is a bounding volume encompassing all of my objects with the same mesh).

As far as I understand what you’re saying, I would include only one of those in my scenegraph, but call setInstanceCount(30) and set the bounding volume to approximate the green box. However, as I understand the rendering process, this would lead to wasted effort executing the shader, as only 8/30 of the instances actually appear within the visible cone.

What seems to me to be the better option, and is similar to implementations of software-batched rendering I’ve seen in other engines, is to wait until AFTER the culling has been performed and use that information to decide to call setInstanceCount(8) (instead of 30) on one, delete the others from the queue of objects being rendered, and set up the shader with the transforms of only those 8 instances which weren’t culled.
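To make the idea concrete, here is a rough, engine-agnostic sketch in plain Python. It is not Panda code: `visible_instances` is a made-up helper, and a flat top-down view cone with circular bounds stands in for a real frustum test. The point is only that the instance count and the transform list are both derived from what survives culling.

```python
import math

def visible_instances(cam_pos, cam_dir, half_fov_deg, positions, radius):
    """Return the positions whose bounding circle touches a 2D view cone.

    Hypothetical helper: cam_dir must be a unit vector, and half_fov_deg
    is half the opening angle of the cone in degrees.
    """
    half_fov = math.radians(half_fov_deg)
    visible = []
    for (x, y) in positions:
        dx, dy = x - cam_pos[0], y - cam_pos[1]
        dist = math.hypot(dx, dy)
        if dist <= radius:
            # The camera sits inside the bounding circle.
            visible.append((x, y))
            continue
        # Angle between the view direction and the direction to the instance.
        cos_angle = (dx * cam_dir[0] + dy * cam_dir[1]) / dist
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        # Seen from the camera, the circle covers asin(radius / dist)
        # to either side of its centre.
        if angle - math.asin(radius / dist) <= half_fov:
            visible.append((x, y))
    return visible

positions = [(10, 0), (10, 6), (0, 10), (-10, 0)]
visible = visible_instances((0, 0), (1, 0), 30, positions, 0.5)

# These two values are what would be handed to the renderer:
instance_count = len(visible)   # the argument for setInstanceCount
shader_positions = visible      # the array the shader indexes by instance id
```

Only the instances in `visible` cost any shader work, which is the saving I’m after.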

Hope this clears up any confusion (or that it will allow you to clear up any on my part).

This sounds like it might be what I want, unless the newly clarified first part of my post led you to a different conclusion/suggestion.

Hmm, you do raise a good point. I suppose the tricky part is that you need to know which nodes are in view so that you can pass the correct matrices to the shader - a properly integrated solution for hardware instancing would be nice for a future version of Panda, e.g. a built-in shader variable that contains a list of model projection matrices for all visible instances.

However, you can work around this without using any kind of cull callback. For these objects, you can make your own mini-cull traversal that runs every frame, which then sets the shader input to the appropriate array of matrices.

Something like this:

from panda3d.core import BoundingSphere, Mat4, OmniBoundingVolume, Point3

master = loader.loadModel(...)
master.setShader(...)
master.node().setBounds(OmniBoundingVolume())
master.node().setFinal(True)

# Specify the bounds and positions for the instances
instanceBounds = [BoundingSphere(Point3(0, 0, 0), 0.5), BoundingSphere(Point3(0, 0, 0), 0.5)]
instancePositions = [Point3(1, 2, 3), Point3(4, 5, 6)]

# Now in a task, iterate over the instances, and select the ones that are in view
visiblePositions = []
lensBounds = base.camLens.makeBounds()
lensBounds.xform(base.cam.getMat(master))
for bounds, pos in zip(instanceBounds, instancePositions):
    bounds = bounds.makeCopy()
    bounds.xform(Mat4.translateMat(pos))
    if lensBounds.contains(bounds):
        visiblePositions.append(pos)

master.setInstanceCount(len(visiblePositions))
master.setShaderInput("positions", visiblePositions)

You could probably make a better approach that uses dummy NodePaths to position the instances, but I hope you get the point.

Okay, I’m glad we’re on the same page here now.

The example you provided seems like it would work fairly well in the simple case. If I wanted to integrate with the existing cull system, I think I could do something like this (extending the example of software instancing shown here):

from panda3d.core import CallbackNode, OmniBoundingVolume, PythonCallbackObject

master = loader.loadModel(...)
master.setShader(...)
master.node().setBounds(OmniBoundingVolume())
master.node().setFinal(True) 

transforms = []

def cull_cb(cbdata):
   oldInstanceCount = len(transforms)
   master.setInstanceCount(oldInstanceCount+1)
   transforms.append(cbdata.getData().getNetTransform())
   
   # Move to the setup code if setShaderInput
   # is pass-by-reference
   master.setShaderInput("transforms",transforms)
   if not oldInstanceCount:
      cbdata.upcall() # Only do this once!

for i in range(50):
   cbnode = CallbackNode('instancing-placeholder')
   cbnode.setCullCallback(PythonCallbackObject(cull_cb))
   cbnp = render.attachNewNode(cbnode)
   cbnp.setPos(10*(i//10), 10*(i%10), 0)
   master.instanceTo(cbnp)

Does this look correct/reasonable? Do you happen to know if setShaderInput is pass-by-value or pass-by-reference (i.e., would I need to call it to update on every callback, or only during the setup)?
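To spell out the distinction I’m asking about, here it is in plain Python. This says nothing about what Panda actually does; both classes are made up purely to show the two possible behaviors.

```python
class CopiesInput:
    """Hypothetical receiver that snapshots the list (pass-by-value)."""
    def set_input(self, values):
        self.values = list(values)   # later changes to the caller's list are not seen

class KeepsReference:
    """Hypothetical receiver that keeps the caller's list (pass-by-reference)."""
    def set_input(self, values):
        self.values = values         # later changes to the caller's list are seen

transforms = ["t0"]

by_value = CopiesInput()
by_value.set_input(transforms)

by_ref = KeepsReference()
by_ref.set_input(transforms)

# Simulate a later cull callback adding another visible instance.
transforms.append("t1")

print(by_value.values)   # still the snapshot taken at set_input time
print(by_ref.values)     # follows the caller's list
```

If setShaderInput behaves like the first class, I have to call it again after every change; if it behaves like the second, one call during setup would be enough.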

The documentation says that upcall is needed to “continue” normal behavior, but it isn’t immediately clear if the normal behavior would have been to include or to exclude the child-nodes. If it’s the former, I might have a bit more work on my hands.

I don’t know if setShaderInput for lists is by-value or by-reference. The person who wrote the array input support was a CMU ETC student who I believe is no longer around.

From a quick glance, it seems that your code might work, assuming that you clear the “transforms” list every frame as well. It seems a bit hacky, though, and you should keep in mind that using a cull callback can be slower because of the overhead associated with calling into Python code from C++. Doing so will block the cull thread until your Python callback has finished. In your case, this would happen up to fifty times per frame, which could impose a considerable performance overhead.
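As a rough illustration of the clearing I mean, here it is in plain Python rather than Panda code. The class and method names are made up for the example; the only point is that the list has to be emptied at the start of every frame, otherwise the instance count keeps growing.

```python
class InstanceBatch:
    """Hypothetical per-frame accumulator for visible instance transforms."""

    def __init__(self):
        self.transforms = []

    def begin_frame(self):
        # Forget last frame's visible instances before culling starts again.
        self.transforms.clear()

    def add(self, transform):
        # Called once per instance that survives culling this frame.
        self.transforms.append(transform)
        return len(self.transforms)

batch = InstanceBatch()
never_cleared = []
cleared_counts = []
stale_counts = []

for frame in range(3):
    batch.begin_frame()
    for transform in ("a", "b"):      # two instances visible every frame
        batch.add(transform)
        never_cleared.append(transform)
    cleared_counts.append(len(batch.transforms))
    stale_counts.append(len(never_cleared))
```

With the reset, the count stays at two per frame; without it, the count climbs by two every frame and the shader is handed stale transforms.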

Long story short, I think you’re better off using a task.

I suspect you’re right about the overhead issue. I might also consider writing some of the code in C++ and doing it that way now that I have a feel for some of the relevant APIs (and investigate whether the setShaderInput is by value or by reference while I’m at it). Thanks for all your help!