performance issues: strategies needed

Hi.
I’ve come looking for advice this time.

For some time now, we’ve been trying to improve the performance of our program. It has grown quite big, and quite slow.

I’ve tried to improve it with different techniques so far, but it is still lacking in a lot of areas.
My scene is a big city with a lot of elements, which can be seen either from a really close view or from a top-down view (the most used).
It can use more than 50,000 independent objects (we had to stop there because the program crashed / became unusable).

Here is a render.analyze() (at 3 fps):

> 41516 total nodes (including 0 instances); 0 LODNodes.
> 13996 transforms; 33% of nodes have some render attribute.
> 14144 Geoms, with 416 GeomVertexDatas and 4 GeomVertexFormats, appear on
> 13972 GeomNodes.
> 41851 vertices, 41851 normals, 0 colors, 41851 texture coordinates.
> 10018 normals are too long, 0 are too short.  Average normal length is
> 1.00531
> GeomVertexData arrays occupy 2285K memory.
> GeomPrimitive arrays occupy 143K memory.
> 19 vertices are unreferenced by any GeomPrimitives.
> 70 GeomVertexArrayDatas are redundant, wasting 203K.
> 39 GeomPrimitive arrays are redundant, wasting 17K.
> 1227407 triangles:
>    167782 of these are on 49711 tristrips (3.37515 average tris per strip).
>    1059625 of these are independent triangles.
> 153 textures, estimated minimum 245653K texture memory required.

One thing I tried to improve performance is flattenStrong(). Since I still need interactivity with some of the meshes, I can’t really flatten at the root, but I can still flatten heavily in some parts.
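Roughly, it looks like this (a minimal sketch, assuming the static parts are grouped under their own nodes; the "static_*" naming is made up):

for group in render.findAllMatches('**/static_*'):
    group.flattenStrong()   # merge each static district into as few Geoms as possible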

Then I get (10 fps):

> 9833 total nodes (including 0 instances); 0 LODNodes.
> 173 transforms; 1% of nodes have some render attribute.
> 5179 Geoms, with 4765 GeomVertexDatas and 3 GeomVertexFormats, appear on
> 4815 GeomNodes.
> 1789297 vertices, 1789297 normals, 0 colors, 1789297 texture coordinates.
> GeomVertexData arrays occupy 98000K memory.
> GeomPrimitive arrays occupy 6841K memory.
> 3953 GeomVertexArrayDatas are redundant, wasting 9415K.
> 4643 GeomPrimitive arrays are redundant, wasting 5077K.
> 1208379 triangles:
>    137984 of these are on 39757 tristrips (3.47068 average tris per strip).
>    1070395 of these are independent triangles.
> 156 textures, estimated minimum 250261K texture memory required.

It’s a first step which would be great, except for this loss of memory. On other scenes, the loss gets larger and starts to make things unusable.

With PStats (without the flatten), I see this:

What are my options for improving performance?

Since I mostly use the top-down view, I don’t think a quadtree can be that helpful…
I could use LODNodes of course, but it seems like a lot of work, and I’m not sure it would improve performance that much, since my meshes are already low-poly.
So I’m not sure what I can do…

Yes, first off: why do you use independent objects? The per-object overhead could be one of the problems. Cluster them as far as you can. If you spend most of your time in a top view, you could do some render-to-texture (render your environment onto a plane), updated not every frame but only when it changes; see the sketch below.
And so on… there are lots of things you can do to reach better performance.
At your vertex count you don’t need LODs. If you have collision detection in there, make collision meshes (reduced meshes) for your root geometry.
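A minimal sketch of that render-to-texture idea, assuming a running ShowBase app; staticEnvRoot, the viewpoint, and the card extents are all made up:

from panda3d.core import NodePath, CardMaker

# Move the static environment into its own scene graph and render it
# into an offscreen buffer; the main scene then draws a single card.
envScene = NodePath('envScene')
staticEnvRoot.reparentTo(envScene)          # hypothetical static-city root

buf = base.win.makeTextureBuffer('env-buffer', 1024, 1024)
envCam = base.makeCamera(buf)
envCam.reparentTo(envScene)                 # a camera renders the graph it is parented into
envCam.setPos(0, 0, 500)                    # made-up top-down viewpoint
envCam.setHpr(0, -90, 0)

cm = CardMaker('env-card')
cm.setFrame(-100, 100, -100, 100)           # made-up city footprint
card = render.attachNewNode(cm.generate())
card.setP(-90)                              # lay the card flat for the top view
card.setTexture(buf.getTexture())

# Once the buffer has rendered a frame, freeze it until the city changes:
#   buf.setActive(False)    # and buf.setActive(True) to refresh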

greetz
dirk

Thanks for the tips.
I think there’s something I can do with the render-to-texture, good idea.

I don’t think I can cluster them more; I need the current node hierarchy to hide/show parts of the world or even to select things.
(My app is more a level editor than a game, so I need individual objects, and I don’t know what kind of objects the user will use (I have restrictions so that my shader system works, but that’s all).)

There is no collision in this part, so that’s fine.

The bad news for you is, you have to find a way to flatten more. The PStats graph shows you drawing over 11,000 Geoms per frame; that is simply too many for almost any hardware (you should be targeting on the order of 300). You don’t show the equivalent PStats with that flattenStrong() applied, but your render.analyze() output there shows that you have reduced the total Geom count to about 5,000, an improvement, but you really must do better.

Of course you also want to maintain interactivity. This is the fundamental battle of performance tuning: you need small, individual pieces to maintain interactivity, but your graphics hardware needs large monolithic chunks to maintain performance. (Ever wonder why graphics card manufacturers usually just brag about their card’s performance in terms of vertices per second? They’re just counting the vertices as one big chunk. It’s a very different picture in any realistic environment.)

You could try using the RigidBodyCombiner over large areas of your scene. This may reduce Geom count while maintaining interactivity, but it comes at the cost of additional CPU to animate vertices, so I’m not sure how far it will get you. Certainly worth investigation, though.

Does your scene have lots of the same model repeated many times? If so, you could benefit from hardware instancing, which is not yet easily supported by Panda, but which we hope to add in the future. This would limit you to running on fairly recent hardware, though.

Other than that, I suspect you’ll have to go to some clever tricks with geometry swapping. For instance: take a large section of geometry, make a copy of it, and then flattenStrong() that copy. Put the copy onscreen, and keep the original tucked away. The user can’t move around the pieces of the copy, because they’re all cemented together; but that’s why you keep the original around. When the user starts to interact with it, swap the original back into the scene.
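A sketch of that swap, assuming districtRoot is a made-up NodePath for one editable section of the city:

# Show a cemented copy; keep the editable original stashed away.
flat = districtRoot.copyTo(render)
flat.clearModelNodes()             # let flattenStrong() cross ModelNode boundaries
flat.flattenStrong()
districtRoot.stash()               # stays loaded, but skipped by the draw traversal

def beginEditing():
    # When the user wants to interact with this section, swap back.
    flat.removeNode()
    districtRoot.unstash()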

David

Keep in mind that if you have many models loaded from disk and want to flatten them together, you really need to call clearModelNodes() first, or flatten won’t have any effect.
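For example (a sketch; modelPaths is a hypothetical list of filenames, and loader/render come from ShowBase):

city = render.attachNewNode('city')
for path in modelPaths:
    loader.loadModel(path).reparentTo(city)
city.clearModelNodes()   # strip the ModelNodes that loadModel() inserts
city.flattenStrong()     # now the models can actually be merged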

Thanks for the ideas.
I’m looking at the RigidBodyCombiner for now, since it’s an idea I haven’t tried yet.
I’m not sure I understand how to use it. I tried putting one RigidBodyCombiner at the root of my scene and calling collect(), but it didn’t change anything.

rbc = NodePath(RigidBodyCombiner('test'))
render.getChildren().reparentTo(rbc)
rbc.reparentTo(render)
rbc.collect()

Then I tried it on a different part of my scene that I can easily identify, but that didn’t change anything either.

Am I doing something wrong?

Did it not? It won’t change the results of analyze(), but it should change the results of PStats.

David

Yeah. Since neither the analyze() output nor the fps changed, I forgot to check PStats.

But they aren’t the same:

Right, so, it’s working; but it’s just trading GPU time for CPU time, and it appears to be about an even trade. The RigidBodyCombiner is more useful for small-scale things than for an entire scene like this.

You’ll probably have to try one of the other suggested approaches.

David

Hi!
I’m back after a lot of optimisation of the code.
Most of it is still in progress, but the results are not bad!

Current results:

217 total nodes (including 0 instances); 0 LODNodes.
75 transforms; 0% of nodes have some render attribute.
81 Geoms, with 73 GeomVertexDatas and 1 GeomVertexFormats, appear on 63 GeomNodes.
1779014 vertices, 1779014 normals, 0 colors, 1779014 texture coordinates.
GeomVertexData arrays occupy 97433K memory.
GeomPrimitive arrays occupy 6854K memory.
12 GeomPrimitive arrays are redundant, wasting 1476K.
1207652 triangles:
  130092 of these are on 36550 tristrips (3.55929 average tris per strip).
  1077560 of these are independent triangles.
136 textures, estimated minimum 225142K texture memory required.

81 Geoms, that should be good, shouldn’t it?
I think I can find a way to divide that by 4 if it’s still important!

I’m at 34 fps while looking at the whole city.
That’s not bad, but since it’s still a ‘small’ level, I want better results.

Should I divide the number of Geoms again? (By merging 4 textures into one big texture and changing the texture coordinates in the meshes, so that 4 meshes can be drawn as one; see the sketch below.)
How effective can a method like that be?
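The UV half of that could look like this (a sketch using Panda’s stock GeomVertexRewriter; the quadrant offsets are illustrative, and it only works if the original UVs stay inside [0, 1], i.e. no tiling):

from panda3d.core import GeomVertexRewriter

def remapToAtlasQuadrant(nodePath, offsetU, offsetV):
    # Scale this model's UVs into one quadrant of a 2x2 atlas.
    for np in nodePath.findAllMatches('**/+GeomNode'):
        geomNode = np.node()
        for i in range(geomNode.getNumGeoms()):
            vdata = geomNode.modifyGeom(i).modifyVertexData()
            rewriter = GeomVertexRewriter(vdata, 'texcoord')
            while not rewriter.isAtEnd():
                u, v = rewriter.getData2f()
                rewriter.setData2f(u * 0.5 + offsetU, v * 0.5 + offsetV)

# e.g. four meshes land in the four quadrants:
# remapToAtlasQuadrant(meshA, 0.0, 0.0)
# remapToAtlasQuadrant(meshB, 0.5, 0.0), and so on.

Once the meshes share one texture, they also share one RenderState, so flattenStrong() can merge them into a single Geom.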

Strange fact: if I add render.setShaderOff(1000), the fps rises to 48!!
But if I just simplify my shaders (to the point that they only draw a texture, without any other operation), the fps doesn’t go up more than 2 or 3.
I thought that nowadays the graphics pipeline used shaders in all cases.
Why does disabling them improve rendering that much?

Are you using the shader-generator somewhere? It might be that it generates a lot of shaders every frame.

If it’s your own shaders, optimize them. For example, ‘if’ statements are expensive; replace them with other equivalents, optimize your memory usage, consider using halfs instead of floats, remove unnecessary texture samples, etc.

When programming shaders it’s very important to make them as optimal and fast as possible, keep in mind that these shaders have to be executed millions of times every frame!

Also, a few graphics cards, especially integrated graphics cards, or cards a few years old, simply don’t render shaders as quickly as they render the fixed-function pipeline. The GeForce 5200 was particularly notorious for this problem. What kind of card do you have?

81 Geoms is great; bringing it lower probably won’t help. In fact, it’s possible you’ve brought it down too low (!), which can impede the effectiveness of view-frustum culling, the step that removes pieces of geometry completely outside the view frustum. If you flatten things together too aggressively, then your entire scene is always partially intersecting the view frustum, and nothing can be culled. This can lead to a performance degradation when you start to run into the vertex processing limit of your GPU.

Of course, if you’re viewing the entire scene from above, everything is within the view frustum and nothing can be culled anyway, so that’s not an issue for you in this particular case. But improving the culling effectiveness might help when you’re within the scene. You can visualize the culling effectiveness by using base.oobeCull(), and then using the trackball mode to pull your point of view outside of your body while you walk around–Panda will then cull the scene as if you were still at your original point of view, and you can see how much stuff is drawn outside of the view frustum.

David

Thanks for your answers.

@drwr

  • I have an 8800GT, so I don’t think that’s the problem. (The application won’t run on old graphics cards anyway.)

  • Yes, culling might be a problem; I will deal with that. But as I said, about 75% of the time the whole scene, or nearly the whole scene, will be in view.

@pro-rsoft
No, I don’t use the shader generator, only custom shaders (4 of them, with one used on 90% of the meshes).
They aren’t computationally heavy shaders, just normal map / light map / diffuse with Phong shading…
Like I said, if I strip out almost all of the operations I don’t see any real difference in the fps; I just found that strange.

Anyway, I will post PStats as soon as I can (tomorrow morning, GMT-wise, probably), but any ideas on what I can improve?
I should be getting more than 60 fps for this scene! Or is it a hardware/Panda limit that can’t easily be improved?

Are you seeing a nearly-constant 60 fps? Then you are probably just running into vsync; try putting this in Config.prc:

sync-video 0
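(Equivalently, from code, before ShowBase is created:)

from panda3d.core import loadPrcFileData
loadPrcFileData('', 'sync-video 0')   # must run before the window opens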

No, I’m at 300 fps before loading the scene.

OK, here are the complete stats.

I tried a bigger city, so we’re talking about 25,000 meshes and 3,500,000 triangles now, for less than 15 fps.
(It’s still flattened at the root for now; I’m trying to get the maximum fps possible for viewing the whole city, even if I can’t keep that in a more standard situation.)

250 total nodes (including 0 instances); 0 LODNodes.
106 transforms; 0% of nodes have some render attribute.
114 Geoms, with 102 GeomVertexDatas and 1 GeomVertexFormats, appear on 64 GeomNodes.
3488821 vertices, 3488821 normals, 0 colors, 3488821 texture coordinates.
GeomVertexData arrays occupy 191068K memory.
GeomPrimitive arrays occupy 13372K memory.
37 GeomPrimitive arrays are redundant, wasting 4624K.
2358527 triangles:
  254116 of these are on 69669 tristrips (3.64748 average tris per strip).
  2104411 of these are independent triangles.
136 textures, estimated minimum 225142K texture memory required.


It seems that my time is essentially spent in Draw now, so it makes sense that my shaders cost too much.
I don’t think that many meshes are occluded, so I don’t think a quadtree or occlusion algorithm can help me. Maybe fewer/smaller textures?

Hmm, that “draw” graph is surprisingly spiky for a static scene. I wonder why.

If it’s 15 fps with your shader, how is it with the fixed-function pipeline? (Is it also spiky there?)

How about if you disable textures? In the fixed-function pipeline, you can easily do this with base.textureOff(). In your shader version, of course, you have to modify your shader.

Let’s see if we can figure out where the time is being spent.

3.5 million vertices is quite a few; it seems likely that you’re spending most of your time transforming vertices. If so, then disabling lighting should help. In the fixed-function pipeline, use something like render.setLightOff(100). You could also try disabling backface culling, which improves transform time at the cost of fill time: base.backfaceCullingOff(). If either or both of these makes a difference (or, in the shader case, simplifying your vertex shader makes a difference), then transform time is your bottleneck.

Your screenshot shows mostly gray background pixels, so it doesn’t seem likely that you’re fill-limited, but for fun you can try reducing the size of your window to see if that improves your framerate. If it does, it suggests that your pixel shader (and your textures and such) is your bottleneck.
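To make these A/B tests quicker, the toggles can be bound to hotkeys and flipped while PStats is running; a sketch (the key choices are arbitrary, the calls are stock Panda3D):

base.accept('f1', render.setLightOff, [100])     # lighting off
base.accept('f2', render.clearLight)             # lighting back to normal
base.accept('f3', base.backfaceCullingOff)       # draw both faces
base.accept('f4', base.backfaceCullingOn)
base.accept('f5', render.setTextureOff, [100])   # textures off
base.accept('f6', render.clearTexture)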

David

all on   800/600                        : 15 fps
all on   1600/1200                      :  2 fps

shaderOff    800/600                    : 20 fps
shaderOff    1600/1200                  : 16 fps

shaderOff TextureOff lightOff 800/600   : 60 fps
shaderOff TextureOff lightOff 1600/1200 : 48 fps

TextureOff lightOff 800/600             : 50 fps
TextureOff lightOff 1600/1200           :  5 fps !!

lightOff    same as all_on

TextureOff  same as TextureOff LightOff

It does seem that my problem is with my shader. Is it really that expensive?
My shader (for 90% of the meshes):

//Cg
//
//Cg profile arbvp1 arbfp1



void vshader(   in float4 vtx_position                    : POSITION,
                in float2 vtx_texcoord                    : TEXCOORD0,
                in float3 vtx_tangent0                    : TANGENT,
                in float3 vtx_binormal0                   : BINORMAL,
                in float3 vtx_normal                      : NORMAL,

                in uniform float4 mspos_view,
                in uniform float4 k_light0,
                in uniform float4x4 mat_modelproj,
                in uniform float4x4 trans_world_to_model,

                out float4 l_position                     : POSITION,
                out float2 l_texcoord                     : TEXCOORD0,
                out float3 l_light0                       : TEXCOORD1,
                out float3 l_pos0                         : TEXCOORD2
            )
{
  l_position   = mul(mat_modelproj, vtx_position);
  l_texcoord   = vtx_texcoord;

  //transform into object (model) space
  l_light0     = mul(trans_world_to_model,-k_light0 ).xyz; 
  l_pos0       = mspos_view.xyz - vtx_position.xyz;

  //transform into tangent (texture) space
  float3x3 TBN = float3x3(vtx_tangent0, -vtx_binormal0, vtx_normal);

  l_pos0       = mul(TBN,l_pos0) ;
  l_light0     = mul(TBN,l_light0) ;
}

void fshader(   in float2 l_texcoord       : TEXCOORD0,
                in float3 l_light0         : TEXCOORD1,
                in float3 l_pos0           : TEXCOORD2,

                uniform sampler2D tex_0    : TEXUNIT0, //diffuse texture
                uniform sampler2D tex_2    : TEXUNIT1, //normal map
                uniform sampler2D tex_1    : TEXUNIT2, //light map

                out float4 o_color : COLOR
            )
{
    half3 normalTex         = ( tex2D(tex_2, l_texcoord).xyz - 0.5 ) * 2;
    half3 HalfAngle         = (l_light0 + l_pos0)/2.0 ;

    half factor_diffuse    = max(0.2,saturate(dot(normalTex, l_light0)));
    half diffuse           = half(tex2D(tex_1,l_texcoord) * factor_diffuse) ;

    half specular          = half(saturate(dot(normalTex,HalfAngle)));
    specular               = pow(specular,16) ;

    half4 color            = tex2D(tex_0,l_texcoord);

    half brightness = max (0.1, diffuse + 0.3 * specular );
    o_color          = brightness * color ;
    o_color.w        = color.w;
}

I’m continuing my quest to find my bottleneck.

I tried changing all the textures so that only one set is used (one diffuse, one normal map, and one light map).

Then I don’t get the slowdown (no difference in fps between low and high resolution).

It seems that this line from analyze() shows the problem:

123 textures, estimated minimum 207222K texture memory required.

Are my textures too big (nearly all of them are 1024 by 1024)?

Will the system be faster with fewer textures? Or smaller textures?
Or fewer but bigger textures?

Well, you could at least try to pack them together a bit. For example, by packing gloss/glow/light maps in the alpha channel of the texture map. (If you’re using the ShaderGenerator, you can do something similar using the MModulateGlow/MModulateGloss modes.)
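Since the posted shader already passes the diffuse texture’s alpha through for transparency, the normal map’s unused alpha channel is a natural home for the light map. A sketch with Panda’s PNMImage (file names made up; both images are assumed to be the same size):

from panda3d.core import PNMImage, Filename

normal = PNMImage(Filename('building_normal.png'))
light = PNMImage(Filename('building_light.png'))

# Copy the grayscale light map into the normal map's alpha channel.
normal.addAlpha()
for y in range(normal.getYSize()):
    for x in range(normal.getXSize()):
        normal.setAlpha(x, y, light.getGray(x, y))
normal.write(Filename('building_normal_packed.png'))

The fragment shader then needs only two tex2D() calls instead of three, reading the light-map factor from the .w of the normal-map sample.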