9 lights

I'd like to render more than 8 lights in one scene. If I assign e.g. 8 lights to one node and 8 lights to another node, everything works more or less as expected. If I want to assign lights to nodes dynamically, I need to calculate the radius of each light and cull away the nodes that lie outside this radius. Problem: the attenuation of a point light is:

a(x) -> 1 / (k0 + k1 * x + k2 * x^2)

so every light has an infinite radius. The only solution I can see so far is to say that once the attenuation drops below some epsilon (e.g. 0.01), the light can no longer be seen.
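For example, solving 1 / (k0 + k1 * r + k2 * r^2) = epsilon for r gives a usable cutoff radius. A small sketch of that calculation (the function name and the epsilon default are mine, not Panda3D API):

import math

def lightRadius(k0, k1, k2, epsilon=0.01):
    # Distance beyond which 1 / (k0 + k1*x + k2*x^2) drops below epsilon.
    c = k0 - 1.0 / epsilon
    if c >= 0.0:
        return 0.0                      # already dimmer than epsilon at x = 0
    if k2 > 0.0:
        # positive root of k2*r^2 + k1*r + c = 0
        return (-k1 + math.sqrt(k1 * k1 - 4.0 * k2 * c)) / (2.0 * k2)
    if k1 > 0.0:
        return -c / k1
    return float("inf")                 # constant-only attenuation never falls off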

Another possibility may be to create a shader with a "better" attenuation equation, e.g.

a(x) -> (-x / r + 1)^n

where r is the light radius and n < 1.0. Or maybe I'm completely wrong and someone has a better idea :slight_smile:
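For comparison, here is that proposed falloff as a plain Python function (just to illustrate its shape; in a real scene it would live in the shader):

def finiteAttenuation(x, r, n):
    # (1 - x/r)^n: reaches exactly zero at x = r, unlike the standard equation.
    return max(0.0, 1.0 - x / r) ** n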

You need to look into a deferred shading pipeline then. Look at the Fireflies sample; it can draw hundreds of lights.

Sorry, I forgot to write that I already considered deferred shading, but that is perhaps a bit overkill (and I am not sure I can easily integrate directional lights). AFAIK most games today do not use deferred shading, so I thought I should stick to a "traditional" solution.

Your solution seems perfectly reasonable. There is a radius beyond which a PointLight is effectively invisible (and you don’t need to fiddle with the attenuation algorithm to change this).

So all you need to do is find that radius for each light, and cleverly assign the lights only to the geometry that they can usefully illuminate.
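For illustration, a minimal sketch of that assignment in Panda3D terms (the helper name and the nodes/lights/radii lists are assumptions, not code from this thread):

def assignLights(nodes, lights, radii):
    # Give each node only the lights whose effective radius reaches it.
    for node in nodes:
        node.clearLight()                       # drop the previous assignment
        for lightNP, radius in zip(lights, radii):
            if node.getDistance(lightNP) <= radius:
                node.setLight(lightNP)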

David

Thanks for your answer.

I tried to recreate the same scene with a shader (not because of the attenuation equation, just for fun).

Python:

world.setShader(shader)
world.setShaderInput("light0", light)
world.setShaderInput("light1", light)
world.setShaderInput("light2", light)
world.setShaderInput("light3", light)
world.setShaderInput("light4", light)
world.setShaderInput("light5", light)
world.setShaderInput("light6", light)
world.setShaderInput("light7", light)
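(Presumably each input gets its own light node in the real scene; the same calls could also be written as a loop over a hypothetical lights list:)

for i, lightNP in enumerate(lights):
    world.setShaderInput("light%d" % i, lightNP)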

Shader:

void vshader(
    uniform float4 mspos_light0,
    uniform float4 mspos_light1,
    uniform float4 mspos_light2,
    uniform float4 mspos_light3,
    uniform float4 mspos_light4,
    uniform float4 mspos_light5,
    uniform float4 mspos_light6,
    uniform float4 mspos_light7,

I don't do anything with the mspos_* inputs inside the vertex/fragment shader (I tried both). When I compare the framerates I'm a bit surprised:

scene without lighting: 120 fps
scene with per vertex lighting without shader: 100 fps
scene without lighting and a constant color shader: 120 fps
scene with 8 shader inputs and a constant color shader: 60 fps

When I analyze it with pstats I can only see that the time spent in

Draw -> window1 -> dr0 -> opaque

is increasing.

(the fps drops down to 30 fps if the shader calculates all 8 lights correctly, but maybe that’s my fault).

Hmm. At first I thought the problem was that Panda has to recalculate the light position for every node in every frame, because the vertex coordinates in the vertex shader are in the model's local space. To test this I replaced every mspos_* with a k_*, but it was as slow as before (of course without correct lighting).

Then I tried the scene with setAutoShader. In this case the frame rate drops to ~30 fps (as with my own shader). One might think that is because we have per-pixel lighting now, but that is not the case (at least on my graphics card).

I switched back to my own shader but replaced it with a dummy shader that does nothing. The framerate increased slightly, but not beyond 34 fps.

flattenStrong/Medium/Light doesn't help at all because every node ends up with a different state (which happens when there are more than 8 lights distributed over the scene).

So far I think the bottleneck is that Panda needs to set the shader inputs for every node, no matter whether they are constant values.

What I don't understand is why the application is roughly 4 times faster with vertex lighting than with lots of shader inputs but no lighting at all. IMO I should be able to do vertex lighting in a shader as fast as with "old" TnL.

Any hints and/or ideas?

It certainly does sound suspicious. How many nodes are in your scene?

world.analyze()

David

650 total nodes (including 0 instances); 0 LODNodes.
324 transforms; 0% of nodes have some render attribute.
324 Geoms, with 1 GeomVertexDatas and 1 GeomVertexFormats, appear on 324 GeomNodes.
28 vertices, 28 normals, 0 colors, 0 texture coordinates.
GeomVertexData arrays occupy 1K memory.
GeomPrimitive arrays occupy 1K memory.
5832 triangles:
  5832 of these are on 1944 tristrips (3 average tris per strip).
  0 of these are independent triangles.
0 textures, estimated minimum 0K texture memory required.

All nodes are visible and the scene is always the same (with/without shader). You may argue that this is too many nodes, which may be true, but I don't understand the speed difference. Or is it true that shaders (that do the same operations as the fixed-function pipeline) are always slower?

Maybe I have to create a simpler sample first. I just disabled lighting completely and only added this shader to world:

l_position = mul(mat_modelproj, vtx_position);
o_color = float4(0.0, 0.0, 0.0, 1.0);

And the framerate drops to 80 fps.

What kind of graphics card do you have? Some graphics cards, like the GeForce 5200, are fine performers in fixed-function mode, but run shaders like an old, three-legged dog.

David

First of all, thanks for trying to help me :slight_smile:

It is an ATI Radeon X1300 (Intel Core Duo 2.13 GHz CPU).

I have assembled a simple application:

Python:

import sys
import direct.directbase.DirectStart
base.setBackgroundColor(0.0, 0.0, 0.2)
base.disableMouse()
SIZE = 20
camera.setHpr(0.0, -90.0, 0.0)
camera.setPos((SIZE - 1) * 0.5, (SIZE - 1) * 0.5, 50.0)
world = render.attachNewNode("World")
for y in range(SIZE):
    for x in range(SIZE):
        planeNode = loader.loadModel("plane.egg")
        planeNode.reparentTo(world)
        planeNode.setPos(x, y, 0.0)
        planeNode.setColor(0.0, 0.0, 1.0, 1.0)
shader = loader.loadShader("example-shader.sha")
base.accept("escape", sys.exit)
base.accept("a", render.analyze)
base.accept("o", base.oobe)
base.accept("1", world.setShader, [shader])
base.accept("2", world.clearShader)
run()

Shader:

//Cg

void vshader(
    uniform float4x4 mat_modelproj,
    in float4 vtx_position : POSITION,
    out float4 l_position : POSITION)
{
    l_position = mul(mat_modelproj, vtx_position);
}

void fshader(
    out float4 o_color : COLOR)
{
    o_color = float4(0.0, 0.0, 0.5, 1.0);
}

Egg:

<CoordinateSystem> { Z-up }
<Group> Plane {
  <VertexPool> Plane {
    <Vertex> 0 {
      0.45 0.45 0.0
    }
    <Vertex> 1 {
      -0.45 0.45 0.0
    }
    <Vertex> 2 {
      -0.45 -0.45 0.0
    }
    <Vertex> 3 {
      0.45 -0.45 0.0
    }
  }
  <Polygon> {
    <Normal> { 0.0 0.0 1.0 }
    <VertexRef> { 0 1 2 3 <Ref> { Plane } }
  }
}

194 fps without shader
141 fps with shader

The X1300 is part of the Radeon Express series, an integrated graphics card designed specifically for laptops and the like. As such, it's not likely to be a screaming performer for shaders (though it does appear to be fairly highly-rated among other integrated graphics cards in the same class for basic fixed-function rendering).

Still, I do see a similar drop on my GeForce 7900 GTX when shaders are enabled. It does bear investigation to make sure we aren’t doing something stupid in regards to setting up the shader parameters. I’ll look more deeply.

David

Thanks for your investigations.

At home I have a GeForce 7900 GTX => 183/133 fps (without/with shader).
The results from a friend (sorry, I don't know the GPU) => 195/146 fps.

Just for comparison, on my GeForce 8600 I get these results:
Without shader: 230 FPS. With shader: 150 FPS!
Sounds suspicious indeed, since my card is very well capable of running shaders. I never really noticed this large drop because my framerates are high enough anyway, but still, for older cards or more expensive scenes I think this is critical.

OK, here’s what’s going on.

First, this is a bit of a contrived test. The scene in question has 400 different nodes with 400 different transforms on them, each of which is drawing one quad (which is trivial, of course). So we’re not measuring the graphics card’s performance at all; what we’re actually measuring here is the time it takes Panda to issue new transform calls.

Also, let me point out that the difference between 194 fps and 141 fps is 2 ms; over 400 nodes, that’s 5 microseconds per node. So, it takes Panda 5 microseconds longer to issue the transform when a shader is applied than when it is not.

Why is this? Well, simply because when a shader is in effect, Panda has to re-issue all the shader parameters whenever the transform changes, in case any of them depend on the transform. In this example, in fact, none of them do; but Panda doesn’t currently have any logic to detect that case.

Now, I could write some code to detect when none of the shader parameters depend on the transform, and avoid re-issuing the shader parameters in that case. That would help this particular example, but I don’t know how useful it would be in reality. I suspect most shaders actually do depend on the transform for several of their shader parameters (the original example that prompted this test–passing in the light nodes–certainly does).

And in any case, we're talking about 5 microseconds here. The real answer, I think, is simple: as always, if you're concerned about performance, use flatten to reduce the number of nodes, transforms, and state changes. It's going to take a certain amount of time to issue each change, so if you have hundreds of needless state changes, it's going to be needlessly slow.
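For this particular test scene that could be as simple as the following sketch (it works here because every quad carries the same state; world and shader are the ones from the sample above):

world.clearModelNodes()   # let flatten combine across the loaded model roots
world.flattenStrong()     # collapse the 400 quads into as few Geoms as possible
world.setShader(shader)   # the shader parameters are now issued far fewer times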

As for actually drawing with a shader vs. drawing fixed-function, well, that performance characteristic is up to your graphics card. If you wanted to measure this performance, you would have just one node and one transform, with a lot of pixels and/or vertices underneath it.

David

It's true this test is contrived. But if you modify the test slightly, e.g. colorize each quad differently, you can see a similar divergence between shader and fixed-function rendering.
5 µs to set parameters is IMHO too much. On a 1 GHz CPU that is 5000 cycles (and with wide dynamic execution up to 4 times more).
I have not yet analyzed the Panda source, but there is one thing that comes to mind. OpenGL has a function to get the id of a shader parameter (e.g. glGetUniformLocation). One could call glGetUniformLocation before every glUniform* call, or only as often as needed (when a parameter name changes, which shouldn't happen often). I know Panda3D uses Cg, but maybe the problem is the same.
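In raw GL terms, the caching pattern I mean looks like this (only a sketch; program is the linked shader program as in the pyglet test further down, and this is an assumption about what could be cached, not a claim about Panda's internals):

# At startup, right after glLinkProgram: look the location up once and keep it.
colorLoc = glGetUniformLocation(program, "myColor")

# Per draw call: reuse the cached location, no per-frame string lookup.
glUniform4f(colorLoc, 1.0, 0.0, 1.0, 1.0)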

Anyway thanks for your reply.

Edit: Are all 5 µs spent in CLP::issue_parameters?

I’ve rewritten this “test” in pyglet. I know it’s Python, there is no vertex buffer, etc…

import sys
import time
import ctypes
import pyglet
pyglet.options['debug_gl'] = False
from pyglet.gl import *
from pyglet.app import *
VERTEX_SHADER = """
uniform vec4 myColor;
void main() {
    //gl_FrontColor = myColor;
    //gl_FrontColor = gl_Color;
    //gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}
"""
FRAGMENT_SHADER = """
void main() {
    //gl_FragColor = gl_Color;
    //gl_FragColor = vec4(1.0, 0.0, 1.0, 1.0);
}
"""
SIZE = 20
fpsStart = time.time()
fpsCount = 0
window = pyglet.window.Window(width = 800, height = 600, vsync = False)
def compile(source, type):
    shader = glCreateShader(type)
    buff = ctypes.create_string_buffer(source)
    c_text = ctypes.cast(ctypes.pointer(ctypes.pointer(buff)), ctypes.POINTER(ctypes.POINTER(GLchar)))
    glShaderSource(shader, 1, c_text, None)
    glCompileShader(shader)
    return shader
vertexShader = compile(VERTEX_SHADER, GL_VERTEX_SHADER)
fragmentShader = compile(FRAGMENT_SHADER, GL_FRAGMENT_SHADER)
program = glCreateProgram()
glAttachShader(program, vertexShader)
glAttachShader(program, fragmentShader)
glLinkProgram(program)
p = glGetUniformLocation(program, "myColor")
#glUseProgram(program)
def on_draw():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()
    glTranslatef((SIZE - 1) * -0.5, (SIZE - 1) * -0.5, -50.0)
    for y in range(SIZE):
        for x in range(SIZE):
            glPushMatrix()
            glTranslatef(x, y, 0.0)
            #glColor4f(x / (SIZE - 1.0), y / (SIZE - 1.0), 1.0, 1.0)
            #glUniform4f(p, x / (SIZE - 1.0), y / (SIZE - 1.0), 1.0, 1.0)
            glBegin(GL_QUADS)
            glVertex3f(-0.45, 0.45, 0.0)
            glVertex3f(0.45, 0.45, 0.0)
            glVertex3f(0.45, -0.45, 0.0)
            glVertex3f(-0.45, -0.45, 0.0)
            glEnd()
            glPopMatrix()
    global fpsCount, fpsStart
    fpsCount += 1
    fpsCurrent = time.time()
    fpsDelta = fpsCurrent - fpsStart
    if (fpsDelta) > 1.0:
        print "FPS", fpsCount / fpsDelta
        fpsStart = fpsCurrent
        fpsCount = 0
def on_resize(width, height):
    glClearColor(0.0, 0.0, 0.2, 0.0)
    glEnable(GL_DEPTH_TEST)
    glViewport(0, 0, width, height)
    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()
    gluPerspective(45.0, float(width) / float(height), 0.1, 100.0)
    return pyglet.event.EVENT_HANDLED
on_resize(800, 600)
for i in range(1000):
    on_draw()
    window.flip()

There are several lines commented out inside the Python and GLSL code. If you uncomment the right ones it is possible to test the different cases.

On my NVIDIA card I get the following result:

Fixed Function, No Colors: 202
Fixed Function, Colors: 150
Shader, No Colors: 200
Shader, (glColor4f) Colors: 158
Shader, (glUniform4f) Colors: 142

It may be OK that the framerate drops when I have to call glColor4f (Python overhead, …). But why there is a difference between glColor4f and glUniform4f I really, really don't understand (bashing my head on the desk).

I first tested it on the ATI card, but the results were worse. There the framerate drops if I merely activate the shader, without doing anything in it.

Maybe time to say once more that Panda3D makes the most of this disaster.

I was looking a little more closely, myself. Yes, almost all of those 5 usec are spent in GLShaderContext::issue_parameters(). Of that time, roughly half of it represents Panda’s efforts–it’s composing some matrices, etc., to compute the world transform to pass to the shader for mat_modelproj. Looking at the code now, I think it could be made a bit more efficient (but probably not vastly more efficient).

The bad news, though, is that the other half of those 5 usec is spent within cgGLSetMatrixParameterfc(). I have no idea what that function is doing internally, but whatever it is, it isn’t cheap–and it’s certainly necessary.

It’s possible this is just a flaw in Cg, and perhaps the equivalent calls in GLSL and HLSL wouldn’t suffer from this same cost. But I doubt it.

But again, this is consistent with the general trend of 3-D graphics technology. Individual operations (load state, load transform, issue draw command) haven’t gotten much faster over the years. What’s gotten better is the complexity of each operation. That is, modern graphics cards can process many thousands of vertices per Geom, vastly more than they could handle just a few years ago; and they can handle more and more complex shader programs every year. But they don’t really handle any more individual Geoms than they used to.

So, optimizing for the modern graphics card generally means limiting the number of Geoms and transforms in your scene. This is, of course, difficult to do, because it is precisely those qualities–lots of Geoms and lots of transforms–that make a scene dynamic and interesting. Panda provides some tools to try to help with this (flattenMedium and flattenStrong, egg-palettize, RigidBodyCombiner, PStats, etc.), but it is always going to be an effort.
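As a rough sketch of one of those tools (hypothetical usage, reusing SIZE, loader and render from the earlier sample): a RigidBodyCombiner lets many independently moving quads be drawn as a single Geom, trading per-node transform issues for a vertex recomputation when they move.

from pandac.PandaModules import RigidBodyCombiner

rbc = RigidBodyCombiner("quads")
rbcNP = render.attachNewNode(rbc)
for y in range(SIZE):
    for x in range(SIZE):
        plane = loader.loadModel("plane.egg")
        plane.reparentTo(rbcNP)
        plane.setPos(x, y, 0.0)
rbc.collect()   # analyze the children and build the combined Geom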

David