Trying to understand Performance

Hi,

I am writing a simple simulation where I spread over 400 sheep and around 30 shepherds across a geomipmapped terrain; the job of the shepherds is to walk around and tag the sheep.

I am using the Ralph model as the shepherd and the panda as the sheep for now. One problem I am running into is that my frame rate drops drastically as I zoom out and see more of the terrain.

I am pasting below what the text-pstats shows me

  Draw = 74.0662 ms
    Flush = 0 ms
    Clear = 0.198364 ms
    Transfer data = 0.167847 ms
      Texture = 0 ms
      Index buffer = 0 ms
      Vertex buffer = 0.167847 ms
    Make current = 4.32587 ms
    window1 = 69.3741 ms
      dr_0 = 69.0384 ms
        transparent = 0.488281 ms
        opaque = 68.4967 ms
      dr_1 = 0.0457764 ms
        unsorted = 0 ms
      dr_2 = 0.0152588 ms
      dr_3 = 0.0991821 ms
        transparent = 0.0915527 ms
  Cull = 2.02942 ms
    Sort = 0.0152588 ms
      transparent = 0 ms
      opaque = 0.00762939 ms
    Setup = 0.0457764 ms
    window1 = 1.96838 ms
      dr_0 = 1.6861 ms
      dr_1 = 0.038147 ms
      dr_2 = 0.0152588 ms
      dr_3 = 0.0152588 ms
  App = 41.7175 ms
    Show code = -0.495911 ms
      General = -3.24249 ms
      ivalLoop = 0.801086 ms
        pandaPace = 0 ms
          LerpPosInterval = 0 ms
          LerpHprInterval = 0 ms
        LerpHprInterval = 0.717163 ms
      DIRECTContextTask = 0.0228882 ms
      audioLoop = 0.0228882 ms
      igLoop = 1.69373 ms
      collisionLoop = 0.0228882 ms
      dataLoop = 0.0534058 ms
      resetPrevTransform = 0.00762939 ms
      eventManager = 0.0610352 ms
      SpinCameraTask = 0 ms
      gameLoop = 0.0152588 ms
      updateTerrain = 0.00762939 ms
      dragTask = 0.038147 ms
      renderStats = 0 ms
    Collisions = 6.75201 ms
      Reset = 0.343323 ms
      ctrav = 6.40869 ms
        pass1 = 6.19507 ms
      DirectSelection = 0 ms
        pass1 = 0 ms
    render_frame = 1.91498 ms
    Yield = 0 ms
    World = 11.6425 ms
      update = 11.5204 ms
      handleCollisions = 0.12207 ms
    Terrain = 3.92151 ms
      clampToValidPos = 3.92151 ms
    sheep = 17.1585 ms
      doAutoMove = 11.6043 ms
      doAutoFollow = 1.2207 ms
      move = 4.3335 ms
    shepherd = 0.823975 ms
      move = 0.823975 ms
  Wait = 0.900269 ms
    Flip = 0 ms
    Thread sync = 0 ms
    Thread block = 0.900269 ms
  * = 12.6877 ms
    Animation = 0.892639 ms
      panda_walk_character = 0.534058 ms
        Joints = 0.137329 ms
        Skinning = 0.221252 ms
      Ralph = 0.358582 ms
        Joints = 0.0839233 ms
        Skinning = 0.198364 ms
    PStats = 0.686646 ms
    Munge = 9.25446 ms
      Geom = 9.25446 ms
    Show fps = 0.038147 ms
    Bounds = 1.64032 ms
    do_frame = 0.175476 ms
    Premunge = 0 ms
    Generate Text = 0 ms
  Nodes = 169 
    GeomNodes = 69 
  Geoms = 73 
  SW Sprites = 0 K
  RenderStates = 66 
    Cached = 17 
    On nodes = 13 
  TransformStates = 949 
    Cached = 249 
    On nodes = 697 
  Occlusion results = 0 
    Not tested = 0 
    Visible = 0 
    Occluded = 0 
  Occlusion tests = 0 
  System memory = 76.7745 MB
    MMap = 14.125 MB
      NeverFree = 9.16952 MB
        Unused = 0.0671616 MB
        Active = 9.10236 MB
          GLVertexBufferContext = 0.114609 MB
          GLIndexBufferContext = 0.0602913 MB
          PythonTask = 0.00196838 MB
          NodePathComponent = 0.0588608 MB
          CullableObject = 0.0030632 MB
          GeomTextGlyph = 0.0105057 MB
          BoundingSphere = 0.233459 MB
          GeomVertexData = 0.321316 MB
          GeomVertexArrayData = 0.386505 MB
          GeomPrimitive = 0.0951309 MB
          Geom = 0.199013 MB
             = 0.0856018 MB
              CacheEntry = 0.0856018 MB
          TransformState = 0.220829 MB
          RenderState = 0.0256805 MB
          CopyOnWriteObj<7pvectorIN8GeomNode9GeomEntryEE> = 0.103203 MB
          CopyOnWriteObj1<6ov_setIN9PandaNode12UpConnectionESt4lessIS1_EE> = 0.313545 MB
          CopyOnWriteObj1<11ov_multisetIN9PandaNode14DownConnectionESt4lessIS1_EE> = 0.62709 MB
          pset = 0.103157 MB
          pmap = 6.14476 MB
          NodeReferenceCount = 0.0389328 MB
          Other = 0.0179787 MB
          LVecBase3f = 0.0149689 MB
          GLGeomMunger = 0.00508881 MB
          GLTextureContext = 0.00240326 MB
        Inactive = 0 MB
          Other = 0 MB
    Heap = 62.6495 MB
      External = 0 MB
      Array = 52.0351 MB
        Other = 0.00133514 MB
        pvector = 21.7857 MB
        ov_set = 5.86221 MB
        Texture = 0.0944452 MB
        AnimChannelMatrixXfmTable = 0.0372505 MB
        VertexDataBuffer = 23.6103 MB
        int = 0.0301857 MB
        pdeque = 0.00786209 MB
        PandaNode = 0.0498695 MB
        PartGroup = 0.131023 MB
      Single = 10.6145 MB
        Other = 0 MB
      Overhead = 0 MB
  Collision Volumes = 12120 
    CollisionGeom = 0 
    CollisionInvSphere = 0 
    PandaNode = 12120 
    CollisionPlane = 0
    CollisionPolygon = 0 
    CollisionSolid = 0 
    CollisionSphere = 0 
    CollisionNode = 0 
    GeomNode = 0 
    Geom = 0 
    CollisionTube = 0 
  Collision Tests = 0 
    CollisionGeom = 0 
    CollisionInvSphere = 0 
    CollisionPlane = 0 
    CollisionPolygon = 0 
    CollisionSolid = 0 
    CollisionSphere = 0 
    CollisionTube = 0 
  Vertex Data = 23.6103 MB
    Disk = 0 MB
      Used = 0 MB
      Unused = 0 MB
    Compressed = 0 MB
    Resident = 0 MB
    Pending = 0 MB
    Small = 0.0289268 MB
    Independent = 23.5814 MB
  Vertex buffer switch = 0 
    Index = 0 
    Vertex = 0 
  Data transferred = 0.239166 MB
  Primitive batches = 0 
    Display lists = 0 
    Triangle strips = 248 
    Triangle fans = 0 
    Triangles = 39 
    Other = 0 
  Vertices = 228.204 K
    Immediate mode = 0 K
    Display lists = 0 K
    Triangle strips = 10.014 K
    Triangle fans = 0 K
    Triangles = 218.19 K
    Other = 0 K
  State changes = 145 
    Textures = 24 
    Transforms = 69 
  Graphics memory = 31.1806 MB
    context1 = 31.1806 MB
      Active = 6.90921 MB
        texture = 5.13118 MB
        vbuffer = 1.63936 MB
        ibuffer = 0.138668 MB
      Thrashing = 0 MB
        texture = 0 MB
        vbuffer = 0 MB
        ibuffer = 0 MB
      Inactive = 24.2714 MB
        texture = 3.11458 MB
        vbuffer = 20.2683 MB
        ibuffer = 0.888512 MB
      Nonresident = 0 MB
        texture = 0 MB
        vbuffer = 0 MB
        ibuffer = 0 MB
  Geom cache size = 1584 
    Active = 73 
  Geom cache operations = 0 
    evict = 0 
    erase = 0 
    record = 0 

</snip>

The part I want to understand is, first, why "Draw" takes so much time, and second, why the following part takes about 18 ms; I would expect it to be far less.

    sheep = 17.1585 ms
      doAutoMove = 11.6043 ms
      doAutoFollow = 1.2207 ms
      move = 4.3335 ms

Further, these functions are called irrespective of whether a sheep is visible or not, so I would expect the time spent here to be constant, but it seems to go up a lot when more sheep are visible.

All the doAutoMove code does is the following:

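    @pstat_obj("sheep")
    def doAutoMove(self):
        elapsedTime = globalClock.getDt()
        # Row 1 of the net transform matrix is the actor's Y (forward) axis in world space
        posTransform = self.actor.getNetTransform().getMat().getRow3(1)
        # Flatten the direction onto the ground plane and make it unit length
        posTransform.setZ(0)
        posTransform.normalize()
        newPos = self.actor.getPos() - posTransform * (elapsedTime * self.speed)
        # Keep the sheep inside the map and snap it to the terrain height
        self.actor.setPos(self.world.terrain.clampToValidPos(newPos))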

clampToValidPos does the following:
    @pstat_obj("Terrain")
    def clampToValidPos(self,pos):
        if (pos.getX()<0):
            pos.setX(0)
        if (pos.getY()<0):
            pos.setY(0)
        if (pos.getX() > self.mapSizeX):
            pos.setX(self.mapSizeX-1)
        if (pos.getY() > self.mapSizeY):
            pos.setY(self.mapSizeY-1)
        elevation = self.terrain.getElevation(pos.getX(),pos.getY()) * self.mapScaleZ
        pos.setZ(elevation)
        return pos

Am I doing something wrong?

How much memory do you have? Is your graphics card good? Are any new drivers available for your graphics card?

It’s a MacBook Pro 2.4 GHz. I haven’t checked the drivers… will update on that. Any ideas about the doAutoMove code, though? Why is it expensive? That shouldn’t be affected by graphics card performance at all; it should be purely CPU bound.

Some hardware specs I forgot to give:

  • MacBook Pro 2.4 GHz dual core
  • NVIDIA 8600M GT 256 MB
  • 4 GB RAM

400 sheep is a lot of individual objects to draw, though I’m surprised I’m not seeing that count in the PStats (it only shows 73 Geoms in your scene). So either we are not seeing all of your sheep onscreen at once, or you are doing something like a RigidBodyCombiner to flatten them together.

18ms to compute 400 sheep means only about 0.05ms per sheep. That function could easily consume 0.05ms (actually, I’m surprised it’s not much higher). To reduce the cost, you can either vastly simplify this function, or reduce the number of times you have to call it (for instance, stagger it per sheep so that you only call it once every five frames for each sheep).
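A minimal sketch of the staggering idea in plain Python. The `StaggeredUpdater` and `DummySheep` names are hypothetical stand-ins, not part of Panda3D or the poster's code. Each object is updated once every `interval` frames, offset by its index so the work is spread evenly, and it receives the time accumulated since its last update so movement speed stays the same.

```python
class StaggeredUpdater:
    """Updates each object once every `interval` frames, spread across frames."""

    def __init__(self, objects, interval=5):
        self.objects = list(objects)
        self.interval = interval
        self.frame = 0
        # Time accumulated since each object's last update
        self.pending_dt = [0.0] * len(self.objects)

    def step(self, dt):
        """Advance one frame; return how many objects were updated."""
        updated = 0
        for i, obj in enumerate(self.objects):
            self.pending_dt[i] += dt
            # The index offset puts each object on its own "slot" in the cycle
            if (self.frame + i) % self.interval == 0:
                obj.update(self.pending_dt[i])
                self.pending_dt[i] = 0.0
                updated += 1
        self.frame += 1
        return updated


class DummySheep:
    """Hypothetical stand-in for a sheep; it only records what it was given."""

    def __init__(self):
        self.calls = 0
        self.total_dt = 0.0

    def update(self, dt):
        self.calls += 1
        self.total_dt += dt


flock = [DummySheep() for _ in range(400)]
updater = StaggeredUpdater(flock, interval=5)
per_frame = [updater.step(0.016) for _ in range(10)]
```

With 400 sheep and an interval of 5, each frame runs the expensive update for 80 sheep instead of 400. In a real task the `update` call would be something like the poster's doAutoMove, taking the accumulated time instead of `globalClock.getDt()`.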

You have a lot of vertices in your scene: 228 thousand. I guess your sheep have a lot of vertices in each one? Normally, vertex count doesn’t matter that much, but perhaps this is a key factor here. What happens to your frame rate if you use a simpler model for your sheep?

David

The reason it’s showing 73 is probably that I was not zoomed out all the way, so not all were visible. I am not doing any flattening. I am using the default panda model that comes with the demo; it has about 2000 triangles, though none of them are in a strip.

The Ralph model that comes with the Ralph tutorial has, I think, about 3500 triangles on average.

I am using a 256*256 height map for GeoMipTerrain with a block count of 32.

Will try and stagger the updates and see if that helps

Hmm, I am horrible with Blender. These were the two models I could find that came with Panda itself :). Is there a simple model you can point me to that I can just plug in? How do I make a character out of, say, a simple box model? When I try to load the box model as an actor, Panda complains it’s not a character.

Further, the Mac build seems to be broken: quite a lot of utilities complain about missing dynamic libraries because the rpath in the binary is not set correctly, and punzip refuses to work. If I try the latest snapshot the utilities work, but Panda breaks.

A question I have: one alternative I am thinking of is to not render full actors when the camera is zoomed out beyond a certain distance.

I am wondering if there is an easy way to create a billboard out of an actor model.

What I want to do is something like this:

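if cameraDist > maxDist and not self.showingBillboard:
    # Far away: swap the full actor for the billboard
    self.actor.stash()
    self.billboardobj.unstash()
    self.showingBillboard = True
elif cameraDist <= maxDist and self.showingBillboard:
    # Close again: swap the billboard back for the actor
    self.billboardobj.stash()
    self.actor.unstash()
    self.showingBillboard = False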

Correction: stash()/unstash() will probably have to be hide()/show(), as I still want them to be collidable.

Much easier to use an LODNode, which is exactly for this purpose. You can take your two different models and parent them both to the LODNode, and then set the switch distance appropriately. But note that billboards can be expensive too; it might be better just to have a lower-polygon model.

This is, of course, assuming that the vertex count really is the issue. It’s worth some tests to prove this first.
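As a rough mental model of what the switch distances mean, here is a plain-Python sketch. It is not Panda3D's implementation. Each child of an LODNode is given a far ("in") and a near ("out") distance via `addSwitch`, and is shown while the camera distance falls between them. The `pick_lod` helper, the example distances, and the handling of exact boundary values are assumptions for illustration.

```python
def pick_lod(distance, switches):
    """Return the index of the first switch whose band contains `distance`.

    `switches` is a list of (far, near) pairs, in the same order as the
    children they describe. Returns None if no band matches, i.e. nothing
    would be drawn at that distance.
    """
    for index, (far, near) in enumerate(switches):
        if near <= distance < far:
            return index
    return None


# Hypothetical setup: child 0 is the full actor, child 1 a cheap stand-in.
SWITCHES = [
    (50.0, 0.0),     # full model from 0 up to 50 units
    (1000.0, 50.0),  # low-detail model from 50 up to 1000 units
]

close_up = pick_lod(10.0, SWITCHES)
zoomed_out = pick_lod(300.0, SWITCHES)
too_far = pick_lod(5000.0, SWITCHES)
```

The appeal over a hand-written if/elif is that the node does this selection during traversal, so there is no per-sheep Python code running every frame just to decide which model to show.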

David

I really doubt it’s the vertex count. I have the same GPU, and it can render millions of em without breaking a sweat.

Blender’s exporter supports LOD. I don’t know if you can do LODing for actors, though…

So I changed my code to do the following: when zoomed in, display the actor; when the camera distance > maxDist, hide the actor and just display a TextNode which shows its name.

This seems to help a bit, but not as much as I would expect. The culling time is still quite high (though the window render() time goes down quite a bit).

I guess I might have to restrict the camera to not be zoomed out too much.

I am also thinking of implementing a way to do partial world updates. Rather than update the whole world in every frame, I would remember state and update only parts of the world every frame.

Btw does threading the code make use of 2 cores? Or is it still cooperative threading?

I do not think vertex count is the main issue; more likely the culling plus rendering of 400 individual models involves too many state changes, maybe? I ran it on Windows while watching the graph view of PStats and here’s what I observed.

My app would spend about 17-20ms per frame.
When zoomed in, cull/window times would be a few ms in total.

As I zoom out, cull/window times would go up almost 1:1, so my time would be split like 20-30ms app / 20-30ms cull / 20-30ms window.

If anyone is interested in playing with the code, I can upload a tar of it.

Right, if you have a high Cull time, you probably have too many individual nodes, with too many transforms, and a too-flat hierarchy.

400 sheep moving around independently is going to cause you this kind of trouble. You can reduce this by grouping some sheep together under a common node, according to their proximity; but it will still cause the same problems when you zoom out to see the whole scene. Unless you can then use LODNodes to render a particle system or something that resembles sheep from a distance, instead of an actual cluster of individual sheep.
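One simple way to decide which sheep belong under a common node is to bucket them by grid cell. The sketch below is plain Python; `group_by_cell` and the cell size are hypothetical choices for illustration, not anything from Panda3D. In a real scene each cell would presumably get its own parent node, with sheep reparented as they wander between cells.

```python
from collections import defaultdict


def group_by_cell(positions, cell_size):
    """Bucket (x, y) positions into square grid cells of side `cell_size`.

    Returns a dict mapping (cell_x, cell_y) to the list of indices of the
    positions that fall inside that cell.
    """
    groups = defaultdict(list)
    for index, (x, y) in enumerate(positions):
        cell = (int(x // cell_size), int(y // cell_size))
        groups[cell].append(index)
    return dict(groups)


# Four sheep on a 256x256 map, grouped into 32-unit cells
sheep_positions = [(1.0, 1.0), (2.0, 3.0), (40.0, 5.0), (45.0, 50.0)]
cells = group_by_cell(sheep_positions, 32.0)
```

Sheep 0 and 1 end up in the same cell and would share a parent, so a cell that is entirely off-screen can be rejected with one bounding-volume test instead of one per sheep. As the reply notes, this does not help once the whole map is in view.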

David

Hmm, maybe not a bad idea. Will give that a shot. Thanks for your suggestions. Though I still need to figure out a way to cut down the CPU time for the move function. Given that I am having trouble with just a simple move taking so much time with 400 actors, I wonder how it will scale for a large-scale RTS where I have to do much more than just move randomly.

If your scene isn’t too geometrically challenging, why not just disable all culling and let the graphics card sort it out?

The category is called “Cull” in PStats, but it’s misnamed. It’s not just time spent culling the scene; that’s trivial. The time spent in Cull is time spent visiting all of the nodes, discovering what is to be rendered, accumulating state and transforms, building up the render list in order, and generally doing all of the work on the CPU that must be done before geometry can be rendered. It’s not optional.

You can disable culling, but that won’t have much impact on the “Cull” time in PStats. (In fact, disabling culling often increases the “Cull” time, because it means more nodes must be visited.)

David

For now, I think I will just restrict how far a user can zoom out and do what Supreme Commander does, where it replaces models with 2D icons as you zoom out beyond a certain point.