Example: Hardware and Geometry Shader Instancing

I am posting a sample shader that uses both hardware and geometry shader instancing.

Basically this is what is happening:

  • Only one instance of the model is drawn by Panda3D.
  • For each vertex, the vertex shader is called n times, as determined by the call node.setInstanceCount(n); in this case n=2.
  • Each vertex is piped from the vertex shader into the geometry shader, which generates four new shifted triangles for each incoming triangle.
  • The fragment shader simply calculates the lighting and outputs the final color.
//Cg profile gp4vp gp4fp

struct VertexDataOUT {
  float4 o_position  : POSITION;
  float2 o_texcoord0 : TEXCOORD0;
  float3 o_normal    : TEXCOORD1;
  float3 o_objpos    : TEXCOORD2;
};

struct VertexDataIN {
  float4 vtx_position  : POSITION;
  float2 vtx_texcoord0 : TEXCOORD0;
  float3 vtx_normal    : NORMAL0;
  int l_id             : INSTANCEID;
};

//vertex shader
void vshader(VertexDataIN IN,
             out VertexDataOUT OUT,
             uniform float4 instances_position[2],
             uniform float4x4 mat_modelproj)
{
  float4 vpos = IN.vtx_position + instances_position[IN.l_id];
  OUT.o_position  = mul(mat_modelproj, vpos);
  OUT.o_objpos    = IN.vtx_position;
  OUT.o_texcoord0 = IN.vtx_texcoord0;
  OUT.o_normal    = IN.vtx_normal;
}
//geometry shader
TRIANGLE void gshader(AttribArray<float4> position  : POSITION,
                      AttribArray<float4> texcoords : TEXCOORD0,
                      AttribArray<float4> normals   : TEXCOORD1,
                      AttribArray<float4> objpos    : TEXCOORD2,
                      uniform float4x4 mat_modelproj)
{
  // emit four shifted copies of the incoming triangle
  for (int r = -2; r < 2; r += 1) {
    float4 offset = mul(mat_modelproj, float4(r * 2, 0, 0, 1));
    for (int i = 0; i < position.length; i++) {
      float4 npos = position[i] + offset;
      emitVertex(npos         : POSITION,
                 normals[i]   : TEXCOORD1,
                 objpos[i]    : TEXCOORD2,
                 texcoords[i] : TEXCOORD0);
    }
    restartStrip();
  }
}

//fragment shader
struct FragmentDataIN {
  uniform float4 mspos_light;
  uniform float Ka;
  uniform float Kd;
  uniform sampler2D tex_0;
};

void fshader(VertexDataOUT  vIN,
             FragmentDataIN fIN,
             out float4 o_color : COLOR)
{
  float3 P = vIN.o_objpos.xyz;
  float3 N = normalize(vIN.o_normal);

  // compute the ambient term
  float ambient = fIN.Ka;
  // compute the diffuse term
  float3 L = normalize(fIN.mspos_light.xyz - P);
  float diffuseLight = max(dot(N, L), 0);
  float3 diffuse = fIN.Kd * diffuseLight;

  o_color = tex2D(fIN.tex_0, vIN.o_texcoord0);
  o_color.xyz = o_color.xyz * (ambient + diffuse);
}

Things to note are:

  • The number of instances is determined by calling node.setInstanceCount(n).
  • Object instancing should be done in hardware; I am using the geometry shader only to show its potential.
  • As I am currently testing some updates that are not yet available in the shader system, the code will not work out of the box, especially since I am passing an array of float4 and I got rid of the k_ prefix for the uniform parameters. However, we hope to have these updates committed to the source soon.

Hope it helps.


Thanks for sharing this!

I’ve moved the thread into the Code Snippets forum, as it didn’t appear to me that there were particular scripting issues with this code; correct me if I’m wrong.

No problem with the script. Thanks for moving it.

I’m still a noob in Cg (and therefore have little use for the code), but things like this push the whole project forward, which is very motivating.

Yeah, very cool. I actually saw your posting. So far I have been reading your shader, and your geometry shader is exactly the thing I need.

I am glad it’s helpful. Remember that for instancing you should consider using setInstanceCount and :INSTANCEID instead of the geometry shader. We talked about it a bit in this post:

discourse.panda3d.org/viewtopic … highlight=

- As I am currently testing some updates that are not yet available in the shader system, the code will not work out of the box, especially since I am passing an array of float4 and I got rid of the k_ prefix for the uniform parameters. However, we hope to have these updates committed to the source soon.

We just committed these new features to CVS; with them, this shader will work as-is.

Please refer to this link for more information:


Can you post the accompanying Python code that runs this shader? I’m confused about how to set the “instances_position” in the shader.



I am having trouble getting this to work. Could someone put together a full example (Python code + shader code)?

Note that the hardware instancing only runs on NVIDIA 8-series cards or newer. It will not work on ATI since the drivers do not support the necessary Cg profile.
It does work on ATI cards if you use a GLSL shader and apply this patch: gp4vp shader profile not available - Win7, Radeon HD4890.
Maybe we could get that included in the official build?


from pandac.PandaModules import loadPrcFileData
loadPrcFileData('', 'basic-shaders-only 0')
loadPrcFileData('', 'sync-video 0')
loadPrcFileData('', 'show-frame-rate-meter 1')
from direct.actor.Actor import Actor
from direct.showbase.DirectObject import DirectObject
from direct.interval.IntervalGlobal import Sequence
import direct.directbase.DirectStart
from pandac.PandaModules import Point3, Vec4, PTAVecBase4, Shader

class World(DirectObject):
    def __init__(self):
        self.accept("escape", __import__("sys").exit, [0])

        self.model = Actor('panda-model', {'walk': 'panda-walk4'})
        self.model.reparentTo(render)
        self.model.loop('walk')
        interval = self.model.posInterval(20, Point3(-2.7, 200, -5), startPos=Point3(-2.7, 300, -5))
        sequence = Sequence(interval)
        sequence.loop()

        k = 256
        offsets = PTAVecBase4.emptyArray(k)
        count = 0
        for i in range(10):
            for j in range(k / 10):
                offsets[count] = Vec4(i * 3, j * -8, 0, 0)
                count += 1
        self.model.setShaderInput('offsets', offsets)
        # load the Cg shader below (filename assumed) and draw k instances
        self.model.setShader(Shader.load('instancing.sha'))
        self.model.setInstanceCount(k)

w = World()
run()


//Cg profile gp4vp gp4fp

void vshader(float4 vtx_position : POSITION,
             float2 vtx_texcoord0 : TEXCOORD0,
             uniform float4x4 mat_modelproj,
             int l_id : INSTANCEID,
             uniform float4 offsets[256],
             out float4 l_position : POSITION,
             out float2 l_texcoord0 : TEXCOORD0)
{
  l_position = mul(mat_modelproj, vtx_position + offsets[l_id]);
  l_texcoord0 = vtx_texcoord0;
}

void fshader(float2 l_texcoord0 : TEXCOORD0,
             uniform sampler2D tex_0 : TEXUNIT0,
             out float4 o_color : COLOR)
{
  o_color = tex2D(tex_0, l_texcoord0);
}

Heh, thanks for the help, but I still get an error; it looks like the shader won’t accept an array.

 self.model.setShaderInput('offsets', offsets)
TypeError: Arguments must match one of:
setShaderInput(non-const NodePath this, const ShaderInput inp)
setShaderInput(non-const NodePath this, non-const InternalName id)

You might need to get one of the snapshot builds; I don’t know if arrays as shader inputs are in the official release yet.

OK, with the newest buildbot release it works. Thank you very much for writing the example!

Is hardware instancing considered an advanced shader in Panda3D terms? (So will it work only on newer hardware?)
I’m going with geometry batching for my forest, but having one geom is always better.

BTW, I’ve heard you have to implement your own culling. How can you do that? Should you just attach a bounding box to each instance somehow?

I’d consider it an advanced shader, but it is still supported by many GPU generations.

The method you suggested won’t work as there will only be one node to attach a bounding volume to.
You could write a geometry shader that does a frustum intersection test for every triangle and discards the triangle if it’s out of view. If you can’t afford a geometry shader, you could assign a position to every instance and a global radius for the object, and do an intersection test between the frustum and the sphere defined by that instance’s position and the global radius.
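The per-instance sphere test can be sketched in plain Python (the plane representation and the toy frustum below are made up for illustration; in a real scene you would extract the frustum planes from the camera lens each frame and rebuild the offsets array from the surviving positions):

```python
# Sketch of per-instance sphere-vs-frustum culling on the CPU.
# Assumption: each frustum plane is given as (normal, d) with an
# inward-facing normal, so a point p is inside when dot(normal, p) + d >= 0.

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sphere_in_frustum(center, radius, planes):
    """True if the sphere touches the frustum (conservative test)."""
    for normal, d in planes:
        if dot(normal, center) + d < -radius:
            return False  # completely behind one plane -> culled
    return True

def visible_offsets(instance_positions, radius, planes):
    """Keep only instances whose bounding sphere touches the frustum."""
    return [p for p in instance_positions
            if sphere_in_frustum(p, radius, planes)]

# Toy "frustum": an axis-aligned box from -10..10 on each axis,
# written as six inward-facing planes.
planes = [((1, 0, 0), 10), ((-1, 0, 0), 10),
          ((0, 1, 0), 10), ((0, -1, 0), 10),
          ((0, 0, 1), 10), ((0, 0, -1), 10)]

positions = [(0, 0, 0), (9, 0, 0), (15, 0, 0)]  # the third is outside
print(visible_offsets(positions, 2.0, planes))  # -> [(0, 0, 0), (9, 0, 0)]
```

Each frame you would then upload only the surviving positions as the shader input array and call setInstanceCount with the new length.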

You’ll need to write that code yourself, in the shader.

I can’t do that.
Without it, this shader is pretty useless, at least in my case.

Well, you would probably still get some performance improvement from using instancing and doing no culling at all (apply an OmniBoundingVolume to the model). It still reduces the number of geoms, which is usually the performance bottleneck, in exchange for more triangles drawn, which is usually OK.
I use it for moving objects which I cannot otherwise combine.

Well, I was thinking of using this for my forest. But while it will decrease the geom count to a few, it will have the opposite effect on the vertex count, which can go over a million this way.

And I think if I had the knowledge to write a shader to handle the culling, I could write an instancing shader myself.

BTW, for moving objects, have you tried the RigidBodyCombiner? It basically makes a single geom, attaches a joint to each object, and moves the joints instead.

Many modern graphics cards can handle millions of vertices without sweating. You should try it and see what kind of performance you get.

The RigidBodyCombiner helps for small-to-moderate scenes, but because it’s entirely implemented on the CPU, it doesn’t do a good job with lots of vertices.


You could divide your forest into groups of trees, each of which is flattened or uses hardware instancing. If you use hardware instancing, create a bounding sphere or box enclosing each group of trees. This way you get some rough culling while still limiting the number of geoms.
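The grouping step can be sketched as a simple grid partition (plain Python; the cell size and tree positions are made up for illustration). Each resulting cell would become one instanced node with its own bounding volume:

```python
# Sketch: partition tree positions into square grid cells so that each
# cell can become one hardware-instanced node with its own bounding volume.

def group_by_cell(positions, cell_size):
    """Map (cell_x, cell_y) -> list of (x, y) positions in that cell."""
    groups = {}
    for x, y in positions:
        key = (int(x // cell_size), int(y // cell_size))
        groups.setdefault(key, []).append((x, y))
    return groups

# Hypothetical tree positions in world units.
trees = [(1, 1), (2, 3), (45, 2), (47, 8), (90, 90)]
groups = group_by_cell(trees, cell_size=40)
for cell, members in sorted(groups.items()):
    print(cell, len(members))
```

Per cell, you would set the instance count to the number of members, pass the members as the offsets array, and give the node a bounding volume enclosing the cell so Panda3D can cull whole groups at once.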