Background upload of textures

I will check it for sure tomorrow. I’ve been reading the code and your sample script. When you say “setup_async_transfer(3)”, does that set the number of simultaneous uploads to 3 system-wide, or just for that Texture()?

If it’s for the Texture(), I guess it caches them sequentially, and then you can access them sequentially with, let’s say, 3 clear_ram_image()/set_ram_image() combos? Or how do you drain the 3 simultaneous uploads on the Texture()?

Or maybe that is for stereo/cube textures, and they are only ready to access once uploaded?

All of this I am going to play with and test. I will report back with conclusions on my own questions and all the amazing success this will give, even for a single async upload!

The number of buffers is per-texture (sharing buffers system-wide would be an interesting idea, but they are sized specifically for the texture, so it would be difficult to implement).

An upload still gets triggered at the normal time, which is:

  1. The next time an object with that texture is rendered
  2. If prepare() is called, at the beginning of the next call to render_frame()

It can’t start it right away due to the need for an active OpenGL context, though there may still be room for future optimization here.

Normally, when this upload is triggered, the copy will be done synchronously in the middle of the render, causing lag. With async buffers, it will instead grab one of the available buffers and pass it off to another thread, which does the copy to the buffer in the background, and then passes it back to the draw thread, which tells the driver to use that buffer as the source for the texture data.
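
For illustration, here is a minimal sketch of opting into the async path (assuming a running ShowBase app; the texture name and the black placeholder frame are just examples, not from the thread):

from direct.showbase.ShowBase import ShowBase
from panda3d.core import Texture

base = ShowBase()

tex = Texture("video-frame")
tex.setup_2d_texture(1920, 1080, Texture.T_unsigned_byte, Texture.F_rgb)
tex.setup_async_transfer(1)                        # reserve one background upload buffer

frame_bytes = bytes(tex.expected_ram_image_size)   # placeholder black frame
tex.set_ram_image(frame_bytes)

# The copy into the async buffer happens on another thread, starting at the
# next render_frame() after prepare().
future = tex.prepare(base.win.gsg.prepared_objects)
base.run()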

While the async transfer is underway, the card will still be rendered with the old version of the texture, so the render will be “behind” a bit compared with the Texture object. There is no way to “queue up” frames on the Texture right now, but that would be an interesting suggestion if you find you need such a feature. (You can achieve this right now by using multiple textures, each with a single async buffer, calling prepare() to get the next one ready, and swapping between them; see the sketch below.)
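
A rough sketch of that double-texture swap, using the callback-registration option on the returned future (all names here are placeholders, and the card/GSG handles are assumed to exist in your app):

from panda3d.core import Texture

width, height = 1920, 1080              # hypothetical frame size

textures = [Texture("slot0"), Texture("slot1")]
for tex in textures:
    tex.setup_2d_texture(width, height, Texture.T_unsigned_byte, Texture.F_rgb)
    tex.setup_async_transfer(1)         # one background buffer each

current = 0

def show_next_frame(frame_bytes, card, gsg):
    # Upload into the texture that is not currently shown, then swap the
    # card over once the background copy has finished.
    global current
    nxt = 1 - current
    textures[nxt].set_ram_image(frame_bytes)
    future = textures[nxt].prepare(gsg.prepared_objects)

    def swap(fut):
        global current
        card.set_texture(textures[nxt], 1)
        current = nxt

    future.add_done_callback(swap)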

Sometimes, when the draw process wants to upload a new image, the previous image is still being processed by the other thread. This could happen if the copy operation takes longer than your graphics frame takes to render, as was the case for my test case. With a single buffer, this would cause Panda to block, waiting for the last copy to be done so that it can copy the new image into the buffer. To avoid this you can ask for 2 or more buffers, so that Panda can already start uploading the next image even while the previous image is not done yet. Panda will “ping pong” or “cycle” between the buffers.

(You can see this in effect in the screenshot I showed earlier: the three horizontal sections are three different threads, the top one being the main thread that also does rendering. The pink blocks in the second and third thread sometimes take longer than the duration of a render frame, so they get spread across two threads and overlap.)

You can also specify -1 buffers to ask Panda to dynamically create and destroy buffers for every individual upload. That is mainly useful for doing a one-time async upload. However, in my experience, the savings of doing this are not that great since allocating space for a large buffer can be pretty slow.
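
Continuing the sketch above, that one-off case would look like this (tex and image_bytes are placeholders):

tex.setup_async_transfer(-1)   # temporary buffer, created and freed for this single upload
tex.set_ram_image(image_bytes)
tex.prepare(base.win.gsg.prepared_objects)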

I made some more changes, including fixes to make prepare() work (and the AsyncFuture object that it returns now allows you to know when a given frame is done uploading, by registering a callback, awaiting it in a coroutine or polling it with .done()). Also, I implemented the copy-on-write protection for modify_ram_image(), though I still recommend using set_ram_image() instead.
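
For example, the coroutine form looks something like this (a sketch assuming a running ShowBase app; tex and frame_bytes are placeholders):

async def upload_frame(tex, frame_bytes):
    tex.set_ram_image(frame_bytes)
    # prepare() returns an AsyncFuture; awaiting it resumes this coroutine
    # once the background upload has completed.
    await tex.prepare(base.win.gsg.prepared_objects)
    print("frame uploaded")

base.taskMgr.add(upload_frame(tex, frame_bytes))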

New builds will appear shortly here.

Changes are merged to the master branch now; new builds are here, which also recognize the following Config.prc variables (current defaults are shown):

# how many extra threads to create for texture transfers, 2 allows
# simultaneously copying into 2 async buffers. note that too many
# threads won't help due to hitting bandwidth limits
gl-texture-transfer-num-threads 2

# scheduling priority of these threads, low/normal/high
gl-texture-transfer-thread-priority normal
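
You can also set these from code instead of editing Config.prc, as long as it happens before the window is opened, for example:

from panda3d.core import load_prc_file_data

load_prc_file_data("", """
gl-texture-transfer-num-threads 2
gl-texture-transfer-thread-priority normal
""")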

I am thinking about further extending it with a queue of RAM images on the texture so you can have more control over when a particular RAM image is actually shown.

@rdb Thanks so much again, and sorry for the delay in the feedback. Not only did I have to restructure my entire software to make proper use of the new texture load process, but I also had to design a smart handler for the multithreaded timing, from reading the image from disk all the way to creating the texture in the Qt widget I am using… all in nice, perfect sync.

Long story short it DOES WORK WONDERFULLY!

The upload of the texture does not interrupt the task.step(), and the application runs smoothly until the async .done() reports the texture ready for the geometry.

I still can’t fully understand how gl-texture-transfer-num-threads 2 would work (or how I would pick the right texture once ready), so I used gl-texture-transfer-num-threads 1 and wrote a handler for the uploads that tracks which texture is ready.

I have to do it carefully so it does not clog the graphics card or interrupt the timing of downloading the buffer to draw the Qt widget.

So there were a bunch of crazy, interesting technologies I had to put together. Here is a quick breakdown:

- Run Python multiprocessing with shared memory to use the OIIO image reader. The shared memory contains the bytes that I later use for the texture (plus the texture file path and other bits of info that the application uses). A rough sketch of this handoff follows the list.
- The multiprocessing side handles the reading of the file, the color and resolution transformations, and packaging into a format that matches the texture. I learned the hard way how careful you have to be about closing the spawned Python interpreters and cleaning up the shared memory once the application is closed.
- The Qt application has a timer that runs at 1/48th of a second and alternates between preparing the textures and preparing the buffer, so textures are read in their own “period”, and the buffer memory download for the Qt widget has its own “period”. I found that if I .step() the prepare() of the texture at the same .step() time as the .trigger_copy() of the graphics_output for Qt, they clog… so again, orchestration was the trick.
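
Roughly, the handoff looks something like this (a simplified sketch with placeholder names, not my actual code):

from multiprocessing import shared_memory

# Worker process: decode the image (OIIO in my case) and write the raw
# pixels into the shared block.
def decode_into(shm_name, pixel_bytes):
    shm = shared_memory.SharedMemory(name=shm_name)
    shm.buf[:len(pixel_bytes)] = pixel_bytes
    shm.close()

# Main process: hand the shared bytes straight to the texture and kick
# off the background upload (set_ram_image copies the data).
def upload_from(shm_name, tex, gsg):
    shm = shared_memory.SharedMemory(name=shm_name)
    view = shm.buf[:tex.expected_ram_image_size]
    tex.set_ram_image(view)
    view.release()
    shm.close()
    return tex.prepare(gsg.prepared_objects)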

After all that… I have a texture that syncs beautifully with 24 fps image file playback.

Now I will move to using .mov files instead of file sequences (I will check whether there is any benefit to using the OIIO ffmpeg implementation or to simply using the Panda3D internal version), and then sound sync for frame sequences.

The other thing I am curious to try is the cull/draw multithreading. I don’t think there is any benefit to using it, and when I activate it there are some texture conflicts that I will look into later. Is that something that would work with gl-texture-transfer-num-threads?

As I have been working with Panda for so many years, I am so happy that this kind of implementation is still applicable… so thanks to all for the amazing work!

Glad you got it to work!

These changes should be compatible with cull/draw multithreading in theory, but I haven’t tested it extensively. Let me know if you run into issues with that.
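
For reference, cull/draw multithreading is enabled with the threading-model variable:

# In Config.prc (or via load_prc_file_data before ShowBase is created):
threading-model Cull/Draw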

gl-texture-transfer-num-threads is separate from that entirely; it is useful in the situation that the texture upload takes longer than a single rendering frame, so Panda can start a new upload while another one is still taking place (assuming the texture has enough buffers assigned with setup_async_transfer). If everything is silky-smooth as-is, there’s probably no reason to use this, but it doesn’t really hurt to increase it either if you have the cores and video RAM. You can use PStats to verify how close you are to hitting the limit of a single thread.

It can also be a conscious choice to limit the number of buffers/threads, causing the framerate to slow down if the conversion is too slow, so that you don’t end up using too many resources and/or introducing too much latency.

Thanks! I will get back when I do testing on cull/draw. As of now I get some flashing textures, but it might be because of the way I organize my Python multiprocessing memory reading or the way I swap textures, so more study is needed.

Thanks again!

Having amazing progress with the async upload of textures.

I am trying to find some information on the thread(s) used for the uploads.

Is there a way I can check how many uploads are happening? Something like an AsyncTaskManager view of the AsyncFutures that are running?

I can build my own tracker in Python using the AsyncFuture from the prepare() result and a callback, but maybe there is already a way I can get a list of what is being uploaded?

What I am trying to find is a way to avoid clogging the upload path with too many concurrent uploads, so knowing how many uploads are happening (and from which textures) before the framerate slows down will let the software dynamically calculate the right amount for the resolution/bit-depth combination. In the end, it is mainly about finding the bandwidth limit of the system.

Thanks!

It’s never more than what’s configured in Config.prc variable gl-texture-transfer-num-threads.

I suppose you could query the task manager for the “gl_texture_transfer” task chain, and you can check how many tasks have been scheduled with it. However, these tasks will only start after you’ve called prepare() and the next render_frame() has begun the async upload.

from panda3d.core import AsyncTaskManager

# The transfer threads run on a dedicated task chain named "gl_texture_transfer".
mgr = AsyncTaskManager.get_global_ptr()
chain = mgr.find_task_chain("gl_texture_transfer")
print(chain.get_num_threads())     # number of transfer threads
print(chain.get_num_tasks())       # tasks scheduled on this chain
print(chain.get_active_tasks())    # transfers currently running
print(chain.get_sleeping_tasks())  # transfers waiting to run
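
You could, for instance, use that to throttle your own prepare() calls; a rough, untested sketch:

def can_start_upload():
    # Only kick off another prepare() while a transfer thread is free.
    mgr = AsyncTaskManager.get_global_ptr()
    chain = mgr.find_task_chain("gl_texture_transfer")
    if chain is None:
        return True   # the chain only exists once the first upload has run
    return chain.get_active_tasks().get_num_tasks() < chain.get_num_threads()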

Got it! I think what I was missing was the “gl_texture_transfer” name for the task chain.

Regarding the Config.prc variable, I am always using 1 directly on the Texture(s) with .setup_async_transfer(1).

Then, I trigger multiple ones as needed.

Lastly, I am noticing that when I delete a texture, whether using “del Texture”, “Texture.release_all()”, or simply reassigning the variable (or a combination of them), the memory is not cleared.

I even tried “loader.unloadTexture()” with no luck. There is always a considerable footprint in memory (I am using 4 to 6 4K textures to get some caching). Once a playback of swapping textures is done, I want to clean up all the textures without requiring the user to close the application to clear RAM.

render.findAllTextures() confirms the texture is not there… Is there a need to remove the async buffer in a specific way? I also did a test with .setKeepRamImage(False) to make sure it was not the RAM “image” that was surviving the delete.

Thanks!

After further analysis, it seems that the issue is that I have a model with a reference that was still floating around. Using .clearTexture() on the model released the memory.

I will keep cleaning up the code, but it seems the issue was not related to the buffer for the texture upload :slight_smile:

gl-texture-transfer-num-threads is separate from .setup_async_transfer(1): the number of threads is application-wide (so no more than that number of transfers can be going on at the same instant), whereas the number passed to setup_async_transfer is per-texture.

Use tex.release_all() to release GPU memory associated with the texture (happens at the next render_frame). Normally when all references to a texture object are released this happens automatically (as you have discovered).

As I keep doing detective work to find memory leaks, I found a funny issue.

My app uses this amount of memory before loading a big 4k texture:

[screenshot]

If I report in the app the amount of texture memory (and the texture names) using .getPreparedTextures() from the GSG (and then .getResident(PGO) on each texture), everything looks clean and nice.

I have to do this because PStats (at least on macOS; I have yet to test on Windows) does not work with .setup_async_transfer() when the value is > 0… I will get back to this later.

If I load a texture with the following parameters:

textureVariable = Texture(layerName)
textureVariable.setup2dTexture(width, height, textureType, Texture.F_rgb)
textureVariable.setup_async_transfer(0)
textureVariable.setKeepRamImage(False)
textureVariable.set_ram_image(imageMemory.buf[:textureVariable.ram_page_size])
asyncVariable = textureVariable.prepare(GraphicsStateGuardianBase.getDefaultGsg().prepared_objects)

Everything looks nice and it loads great.

And memory looks and reacts as expected:

[screenshot]

If I remove the texture with:

textureVariable.release_all()
textureVariable.clear()

I get a reasonable cleanup size, but most importantly, it is consistent and does not increase over time:

[screenshot]

And inside the .step() it is correctly reported that the texture is not prepared.

(I mean, there are 3 bytes left somehow even if I set textureVariable = None, but I guess that is fine.)

If I repeat the process 10-20 times, the memory reported by the OS is, again, consistent, and at some point it even cleans back down.

Now… For the interesting part:

If I do the same but with:

textureVariable.setup_async_transfer(1)

There is no cleanup of memory when I erase. In fact, the memory increases by roughly the size reported by expected_ram_image_size:

First load-erase: [screenshot]
Second load-erase: [screenshot]
Third load-erase: [screenshot]
And so on until the app starts swapping like crazy.

And it is hard to analyze, because when I tag any texture with .setup_async_transfer(1) the PStats application freezes. I tried to find the code for 1.11.0 (using dev3702) to see if I could find something, but I can’t get to the source on GitHub (or at least don’t know where to look for it).

Am I missing something in the cleanup of the async buffers? I can’t find any way to garbage collect the async usage.

I have tried gc.garbage and cleaning up the pools, etc. Even if I remove the ShowBase instance, the memory is still there somehow… until I close the app.

Thanks as always!