Yet Another Performance question: a looooot of cubes

Hi guys, this is my second question.

I’m using Panda3D to create a visualizer for scientific data.
Now I have to draw a lot of cubes; the number ranges from 5,000 to 10,000.

After reading the manual and similar questions, I tried two ways to draw them:

  • Instancing + RigidBodyCombiner
  • Creating a single geometry containing all of them using the GeomVertexWriter

I attached the example code I wrote, with a CMakefile. It should compile fine if you want to try it. I did my best to comment it as much as possible, so I think it should be understandable.
Both approaches are quite slow, around 10 fps.
What am I doing wrong?
About the GeomVertexWriter method: is there a way to create a huge vertex vector once, then tell Panda to use only the range [0, max)?
With the RigidBodyCombiner it seems that I can change the properties of individual cubes (position, color, etc.), but I can no longer hide them once they have been added.
In the example I force the last 20 frames to draw only 10 cubes, but all the previous ones are still visible, even if I call hide() on them.

My main issue is that the data comes from a hardware sensor looking at the real world, and this data is refreshed frame by frame.
So I cannot make any assumptions about what the next frame will look like, how many cubes there will be, or where they will be.
Basically each frame is a brand new one and I have to draw what I get.

I’m only interested in position, size and color of cubes, no other things like textures, lights and so on.
I also don’t care about detecting collisions between cubes (or between cubes and other stuff).
My understanding is that collisions are not checked by default, so I shouldn’t need to do anything in particular to disable them. Is that right? (The only collision I may be interested in is to allow selection by mouse click, but I think it won’t be required in the end.)

Later I’ll have to draw a similar number of billboard cards and other stuff, so I really need the cubes not to eat up all the frame time.

Any tips about how to speed up things are very welcome.
performance_test.zip (7.2 KB)

The RigidBodyCombiner is not very efficient. It would be better to use hardware instancing. This requires a shader. If you feel adventurous, though, you can try the master branch of Panda and use InstancedNode, which is a more automatic way to do instancing (sample here).

The GeomVertexWriter approach should also be effective, though, and will be more widely supported. There are a couple of important optimizations you could apply in your code:

  • You didn’t call set_cube_number to preallocate the array at the right size, so each call to add_data# may resize the array.
  • set_cube_number is implemented incorrectly: the < should be a >.
  • If you presized the array correctly, you can use set_data# instead of add_data#. The former is the same as the latter except it doesn’t check whether it’s running off the end of the array, so do be careful that you presized it correctly.
  • You should create the GeomVertexWriter etc. every frame after setting the number of rows. Not important for performance, but the writer is invalidated when you change the number of rows, and keeping the writer around will cause deadlocks if you try to use the multithreaded render pipeline.
  • Use UH_stream for data that is destroyed and recreated every frame.
  • Not sure why you are allocating a new GeomVertexData in add_cube, and then not using it. You should remove those lines.
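
The presize-then-overwrite idea from the list above can be sketched in plain C++. This is only an illustration: std::vector stands in for Panda's vertex array, and CubeBuffer, set_cube_number and set_cube are hypothetical names, not Panda3D API. The corner layout (vertices 0-3 on the bottom face, 4-7 on the top face) is an assumption chosen to match the index order used later in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a presized vertex array; not Panda3D API.
struct CubeBuffer {
    static constexpr std::size_t kVertsPerCube = 8;
    static constexpr std::size_t kFloatsPerVert = 3;

    std::vector<float> positions;     // x, y, z for every vertex
    std::size_t capacity_cubes = 0;   // how many cubes the buffer can hold

    // Grow only when the request exceeds the current capacity (note the '>').
    void set_cube_number(std::size_t n) {
        if (n > capacity_cubes) {
            positions.resize(n * kVertsPerCube * kFloatsPerVert);
            capacity_cubes = n;
        }
    }

    // Overwrite cube i in place. No resizing happens here, so the caller
    // must have presized the buffer first (checked only in debug builds).
    void set_cube(std::size_t i, float cx, float cy, float cz, float half) {
        assert(i < capacity_cubes);
        float *p = positions.data() + i * kVertsPerCube * kFloatsPerVert;
        // Signs of the x/y offsets for corners 0..3; corners 4..7 reuse them.
        static const float sx[4] = {-1.0f, 1.0f, 1.0f, -1.0f};
        static const float sy[4] = {-1.0f, -1.0f, 1.0f, 1.0f};
        for (std::size_t v = 0; v < kVertsPerCube; ++v) {
            *p++ = cx + sx[v % 4] * half;
            *p++ = cy + sy[v % 4] * half;
            *p++ = cz + (v < 4 ? -half : half);   // bottom face, then top face
        }
    }
};
```

With this shape, a frame with fewer cubes than the previous one never shrinks or reallocates the storage; it just overwrites the first part of it.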

After making those changes, it was still running slowly, so I took a look with perf (a sampling profiler; Visual Studio would also work if you’re on Windows). Two things jump out: (1) a significant amount of overhead is in the thread locking code in Panda’s allocation system, and (2) most of the cost is now in the GeomPrimitive construction:

-   92.48%     0.12%  test_perf_write  test_perf_writer  [.] draw_stuff_writer
   - 92.36% draw_stuff_writer
      - 89.77% CubeList::draw
         - 89.73% CubeList::set_vertex_vector
            - 89.54% GeomPrimitive::add_vertices
               + 88.86% GeomPrimitive::add_vertex
      + 2.02% CubeList::add_cube

So I rewrote set_vertex_vector like this:

    void set_vertex_vector()
    {
        auto handle = prim->modify_vertices_handle(Thread::get_current_thread());
        handle->unclean_set_num_rows(m_cube_num * 12 * 3);

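        // Note: this cast assumes the primitive uses 16-bit indices, which can
        // only address 65,536 vertices (8,192 cubes at 8 vertices each).
        uint16_t *ptr = (uint16_t *)handle->get_write_pointer();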

        size_t idx = 0;         // idx is the vertex index at which the next cube starts

        // Is there a way to create a huge vertex vector once, then tell Panda to
        // use it only in the range [0, m_cube_num*8) ?
        for(size_t cube=0; cube<m_cube_num; cube++)
        {
            //            vertex_array;
            // bottom face
            *ptr++ = idx+2; *ptr++ = idx+1; *ptr++ = idx+0;
            *ptr++ = idx+3; *ptr++ = idx+2; *ptr++ = idx+0;

            // upper face
            *ptr++ = idx+4; *ptr++ = idx+5; *ptr++ = idx+6;
            *ptr++ = idx+4; *ptr++ = idx+6; *ptr++ = idx+7;

            // front face
            *ptr++ = idx+0; *ptr++ = idx+1; *ptr++ = idx+4;
            *ptr++ = idx+1; *ptr++ = idx+5; *ptr++ = idx+4;

            // back face
            *ptr++ = idx+3; *ptr++ = idx+6; *ptr++ = idx+2;
            *ptr++ = idx+3; *ptr++ = idx+7; *ptr++ = idx+6;

            // left face
            *ptr++ = idx+4; *ptr++ = idx+3; *ptr++ = idx+0;
            *ptr++ = idx+4; *ptr++ = idx+7; *ptr++ = idx+3;

            // right face
            *ptr++ = idx+1; *ptr++ = idx+2; *ptr++ = idx+5;
            *ptr++ = idx+2; *ptr++ = idx+6; *ptr++ = idx+5;

            idx+=8;
        }
    }

Now it’s running dramatically faster. If you need it to be even faster, you could avoid the overhead of GeomVertexWriter altogether by also directly writing to the GeomVertexData by raw pointer, but that gets slightly more complicated.

As for creating one huge vertex vector up front and using only part of it: yes, that is a good idea. You can pre-create the oversized GeomPrimitive as desired, and then call modify_vertices(num) where num is the number of vertices that you want Panda to use.
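
A rough sketch of the "build once, use a prefix" idea in plain C++, not tied to Panda's classes. build_index_table and active_index_count are hypothetical helper names; the per-cube index order is copied from set_vertex_vector above. The use of 32-bit indices is my own assumption, made because 10,000 cubes times 8 vertices is 80,000, which does not fit in 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Triangle indices for one cube, relative to its first vertex.
// Same ordering as in set_vertex_vector above.
static const std::uint32_t kCubeIndices[36] = {
    2, 1, 0,  3, 2, 0,   // bottom face
    4, 5, 6,  4, 6, 7,   // upper face
    0, 1, 4,  1, 5, 4,   // front face
    3, 6, 2,  3, 7, 6,   // back face
    4, 3, 0,  4, 7, 3,   // left face
    1, 2, 5,  2, 6, 5,   // right face
};

// Build the index list once, for the largest cube count expected.
std::vector<std::uint32_t> build_index_table(std::size_t max_cubes) {
    std::vector<std::uint32_t> indices(max_cubes * 36);
    std::uint32_t *out = indices.data();
    for (std::size_t cube = 0; cube < max_cubes; ++cube) {
        const std::uint32_t base = static_cast<std::uint32_t>(cube * 8);
        for (std::uint32_t rel : kCubeIndices) {
            *out++ = base + rel;
        }
    }
    return indices;
}

// Number of leading indices to draw for a frame containing num_cubes cubes.
std::size_t active_index_count(std::size_t num_cubes, std::size_t max_cubes) {
    assert(num_cubes <= max_cubes);
    return num_cubes * 36;
}
```

Because the indices of cube k depend only on k, the table never changes between frames; only the count passed to the renderer does, and that count is the num mentioned above for modify_vertices(num).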

As for collisions, you don’t really pay for what you don’t use. You can efficiently do collisions with huge voxel geometries in Panda by creating an octree of CollisionNode objects, with eight CollisionBox objects at the leaves. By setting it up as an octree, Panda’s collision traverser will be able to test quite efficiently.

Thanks @rbd for your valuable info.
Sorry for the errors in the example code; I copy/pasted it from my real application and didn’t check it carefully.

I do feel adventurous and I’ll definitely give it a try. I’m on a tight schedule, so I guess I’ll do it after the first deadline has been reached. I know next to nothing about shader programming. Can you point me to some resources where I can learn the basics I’d need for this application?

I’ll try your advice asap. From my understanding, the main optimization was removing the overhead of primitive->add_vertices(...), is that right? This will reduce the CPU time required to describe the scene, but it should not affect the drawing time, is that right? I’m asking to better understand how things work.
If this is the case, then calling modify_vertices(num) would be the best option.

Since this is going to be roughly 20% of the total workload, my target is 5 ms to 8 ms to draw all the cubes. I don’t know if that is reasonable.

I’ve added a link to the sample that contains the info you should need to implement InstancedNode, including a shader. I’m afraid I don’t know of any resources that teach GLSL shader programming, but try Googling around.

The main optimizations that I applied are (1) presizing both the vertex data and primitive arrays, and (2) optimizing the process of writing data to them.

Yes, it’s not going to reduce the overhead of drawing the geometry, but drawing 10 000 cubes should not be an issue for performance. A modern GPU can draw millions of triangles without breaking a sweat.
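
To put rough numbers on that, here is a back-of-the-envelope sketch for the upper bound of 10,000 cubes, using 12 triangles and 8 vertices per cube as in the code earlier in this thread. The 3-floats-per-position figure is my own assumption about the vertex format, and the constant names are made up for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kCubes = 10000;           // upper bound from the question
constexpr std::size_t kTrianglesPerCube = 12;   // 6 faces x 2 triangles
constexpr std::size_t kVerticesPerCube = 8;

constexpr std::size_t kTriangles = kCubes * kTrianglesPerCube;
constexpr std::size_t kIndices = kTriangles * 3;
constexpr std::size_t kVertices = kCubes * kVerticesPerCube;

// Assumption: positions only, stored as 3 floats per vertex.
constexpr std::size_t kPositionBytes = kVertices * 3 * sizeof(float);

static_assert(kTriangles == 120000, "120,000 triangles per frame");
static_assert(kIndices == 360000, "360,000 indices per frame");
static_assert(kVertices == 80000, "80,000 vertices per frame");
```

So the worst case is about 120,000 triangles, well below the "millions" mentioned above, which is why the CPU-side cost of filling the buffers matters more here than the draw itself.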

The target sounds attainable to me.

With your version of set_vertex_vector the framerate rose to 40 fps; then, by pre-computing it and using modify_vertices(num), it went to 60 fps.
Total time is 10 ms, and for now I’m happy with it.

Once I have completed the application, I’ll see where to optimize. If I do need to optimize again, I’ll look at hardware instancing.

Thank you so much.
I think I’ll post the final code, after a cleanup.