Allowing Torch/Tensorflow to directly access rendered image in GPU

Hi,

We made a simulator MetaDrive with Panda3D: GitHub - metadriverse/metadrive: MetaDrive: Composing Diverse Scenarios for Generalizable Reinforcement Learning

Now we have some projects where retrieving rendered images is required to do path planning for a target object. We find moving rendered images from the graphics buffer in GPU to RAM costs a lot of latency. After that, we still have to move these images from RAM to CUDA for training our decision-making models. We would like to know if it is possible to modify the source code so that libraries compatible with CUDA like torch/tensorflow can directly access the data in Panda3D’s GraphicsBuffer. If so, we can copy the data and store it in GPU for training our model without causing any latency.

1 Like

Yes, you need to render to a texture (add_render_texture with argument RTM_bind_or_copy or RTM_copy_texture), obtain the OpenGL identifier for the texture, and pass that to cudaGraphicsGLRegisterImage.

You can get the OpenGL identifier for a texture by getting the TextureContext (which represents the Texture on the actual graphics back-end) and calling get_native_id() on it.

From Python code, the normal way to get access to the TextureContext is by calling tex.prepare(0), which will return a future that you can await, or register a callback with that will return a TextureContext as soon as it is created on the back-end.

But you probably need to make sure that the OpenGL context is bound to be able to use cudaGraphicsGLRegisterImage anyway, so you can also use prepare_now to get a TextureContext object right away, but you need to do this from a draw callback or some other moment when you can guarantee that the OpenGL context is bound.

If you don’t mind using C++ you can also manually bind the OpenGL context for a GraphicsOutput with code like this:

Thread *current_thread = Thread::get_current_thread();
if (window->begin_frame(GraphicsOutput::FM_refresh, current_thread)) {
  GraphicsStateGuardian *gsg = window->get_gsg();
  TextureContext *context = tex->prepare_now(0, gsg->get_prepared_objects(), gsg);

  cudaGraphicsGLRegisterImage(resource, context->get_native_id(), GL_TEXTURE_2D, cudaGraphicsRegisterFlagsReadOnly);

  window->end_frame(GraphicsOutput::FM_refresh, current_thread);
}
2 Likes

Hi, rdb

Many thanks for your reply! I also checked this post: How to get GPU memory pointer from `Texture` Object - #3 by Hao-Yan. I think I can roughly understand the workflow now. As I don’t want to touch any C++ stuff at this stage, I would like to use Nvidia/CUDA-Python to finish the OpenGL-CUDA interoperation: cudart - CUDA Python 12.0.0 documentation.

Do you think is it possible to finish all cuda-opengl actions with this? Actually, I tried this through the following code:

from cuda.cudart import cudaGraphicsGLRegisterImage, cudaGraphicsRegisterFlags, GLuint, GLenum
  # set texture
    my_texture = Texture()
    my_texture.setMinfilter(Texture.FTLinear)
    my_texture.setFormat(Texture.FRgba32)

    engine.win.add_render_texture(my_texture, GraphicsOutput.RTMCopyTexture)
    gsg = GraphicsStateGuardianBase.getDefaultGsg()
    texture_context = my_texture.prepareNow(0, gsg.prepared_objects, gsg)
    # texture_context = my_texture.prepare(gsg.prepared_objects)
    identifier = texture_context.getNativeId()
    flag, resource = cudaGraphicsGLRegisterImage(identifier, 1,
                                                 cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsNone)

But the returned flag is CudaError: InvalidValue, which indicates that the arguments are not in an acceptable range, and thus the resource handle is an invalid one as well. I am inexperienced at these CG stuff, actually, so could you share with me some intuition, like if this Cuda-Python lib works or not? If not, should I write a CUDA backend and provide python bind to my Panda3D program?

The second argument to cudaGraphicsGLRegisterImage should be GL_TEXTURE_2D, not 1.

As I mentioned you need to make sure the OpenGL context is bound at this time. In the single-threaded single-GSG rendering pipeline, after Panda has already rendered a frame, you may be lucky that the OpenGL context happens to still be bound. A better way to do this in Python is to create a draw callback and associate it with the display region.

Also, try calling base.graphicsEngine.renderFrame() twice before obtaining the native ID. Unlike prepare(), prepareNow() doesn’t fully construct the texture object, so it will be constructed upon first being rendered to.

1 Like

Thank you very much for your reply. As a noob, I still have some questions.

How can I get GL_TEXTURE_2D? I cannot find the definition of it anywhere. I checked the Panda3D source code and thought gsg->get_texture_target(Texture.TT_2d_texture) should return GL_TEXTURE_2D, while no such an API, get_texture_target, on Python side. Could you tell me how to get the texture_target, GL_TEXTURE_2D?

I modified my code as follows:


    # get gsg
    gsg = GraphicsStateGuardianBase.getDefaultGsg()

    # draw callback
    def _callback_func(cbdata):
        texture_context = my_texture.prepareNow(0, gsg.prepared_objects, gsg)
        identifier = texture_context.getNativeId()
        flag, resource = cudaGraphicsGLRegisterImage(identifier, GL_TEXTURE_2D,
                                                     cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsReadOnly)
        cbdata.upcall()

    engine.win.getDisplayRegion(1).setDrawCallback(_callback_func)

    for i in range(10000):
        engine.taskMgr.step()
        # do something with resource handle

Is this what you mean by adding a callback for making sure the OpenGL context is still bound? I also have another version as follows, I can not tell which one is better.

    # get context future
    gsg = GraphicsStateGuardianBase.getDefaultGsg()
    texture_context_future = my_texture.prepare(gsg.prepared_objects)

    for i in range(10000):
        engine.taskMgr.step()
        if texture_context_future.done():
            engine.graphicsEngine.renderFrame()
            engine.graphicsEngine.renderFrame()
            identifier = texture_context_future.result().getNativeId()
            flag, resource = cudaGraphicsGLRegisterImage(identifier, GL_TEXTURE_2D,
                                                         cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsReadOnly)
            # do something with resource

As I didn’t solve the GL_TEXTURE_2D problem, I can’t know which one would work from the results by running them respectively. Could you give some suggestions on these code snippets?

You need to get the definition of GL_TEXTURE_2D from some library that exposes these constants, such as PyOpenGL or some other binding. Or, you can just define it yourself using the known value from the Khronos Registry or the OpenGL headers. The value is 0x0DE1.

Your second snippet relies on Panda still leaving the OpenGL context bound after rendering, since (presumably) cudaGraphicsGLRegisterImage expects an OpenGL context to be bound.

Of course, you probably want to do this operation only once, rather than every frame.

1 Like

Perfect! The second way works by setting the correct GL_TEXTURE_2D from PyOpenGL. Now I manage to build a pipeline as follows:

  1. Panda3D render to texture
  2. Map resource with cudaGraphicsRegisterImage
  3. get the resource pointer and make a copy via CUDA-Python
  4. convert CUDA-Python to CuPy array and then to torch

Now everything happens only on GPU! But I am still debugging, as nothing is contained in the final torch tensor which is full with 0. So far, I am not sure what causes this problem, but I am really happy. At least the whole pipeline works. Thank you!

By the way, I find a pretty interesting example: CuPy/OpenGL interop example · GitHub it shows how to register VBO and EBO buffer created by glGenBuffers. And then modify the buffer content with CuPy to do the rendering. Do you think is it possible to do similar things in Panda3D with pure Python code? I think that register from buffer would be more efficient, right?

Not with pure Python code (or at least not with fragile ctypes trickery) at the moment because we don’t expose the native ID for buffers. However, this sounds like a simple feature request, please do file a feature request on the issue tracker on GitHub!

2 Likes

Good News! @rdb
It works! the image read from CUDA memory and the one rendered by Panda are as follows. Ignore th e wrong channel sequence, lol

Here, I would like to share my code. Hope it can help others.

import time

import torch
from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack

from cuda import cudart
import cv2

import cupy as cp
from panda3d.core import NodePath, GraphicsOutput, Texture, GraphicsStateGuardianBase
import cupy as cp
import numpy as np
from OpenGL.GL import *  # noqa F403
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags
from panda3d.core import loadPrcFileData
from metadrive.component.pgblock.curve import Curve
from metadrive.component.pgblock.first_block import FirstPGBlock
from metadrive.component.pgblock.intersection import InterSection
from metadrive.component.road_network.node_road_network import NodeRoadNetwork
from metadrive.tests.vis_block.vis_block_base import TestBlock, BKG_COLOR
from OpenGL.GL import glGenBuffers


# require:
# 1. pip install cupy-cuda12x
# 2. CUDA-Python
# 3. PyOpenGL
# 4. pyrr
# 5. glfw
#


def format_cudart_err(err):
    return (
        f"{cudart.cudaGetErrorName(err)[1].decode('utf-8')}({int(err)}): "
        f"{cudart.cudaGetErrorString(err)[1].decode('utf-8')}"
    )


def check_cudart_err(args):
    if isinstance(args, tuple):
        assert len(args) >= 1
        err = args[0]
        if len(args) == 1:
            ret = None
        elif len(args) == 2:
            ret = args[1]
        else:
            ret = args[1:]
    else:
        err = args
        ret = None

    assert isinstance(err, cudart.cudaError_t), type(err)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(format_cudart_err(err))

    return ret


class CUDATest:
    def __init__(self, window_type="onscreen", shape=None, test_ram_image=False):
        assert shape is not None
        self.engine=ShowBase(window_type)

        # create your scene

        # buffer property
        self._dtype = np.uint8
        self._shape = shape
        self._strides = None
        self._order = "C"

        self._gl_buffer = None
        self._flags = cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsNone

        self._graphics_resource = None
        self._cuda_buffer = None

        # # make buffer
        # self.texture = self.engine.loader.loadTexture("/home/shady/Desktop/test.jpg")
        self.texture = Texture()
        # self.texture.setMinfilter(Texture.FTLinear)
        # self.texture.setFormat(Texture.FRgba32)
        mode = GraphicsOutput.RTMCopyRam if test_ram_image else GraphicsOutput.RTMBindOrCopy
        self.engine.win.addRenderTexture(self.texture, mode)

        self.texture_identifier = None
        self.gsg = GraphicsStateGuardianBase.getDefaultGsg()
        self.texture_context_future = self.texture.prepare(self.gsg.prepared_objects)
        self.new_cuda_mem_ptr = None

    @property
    def cuda_array(self):
        assert self.mapped
        return cp.ndarray(
            shape=self._shape,
            dtype=self._dtype,
            strides=self._strides,
            order=self._order,
            memptr=self._cuda_buffer,
        )

    @property
    def gl_buffer(self):
        return self._gl_buffer

    @property
    def cuda_buffer(self):
        assert self.mapped
        return self._cuda_buffer

    @property
    def graphics_resource(self):
        assert self.registered
        return self._graphics_resource

    @property
    def registered(self):
        return self._graphics_resource is not None

    @property
    def mapped(self):
        return self._cuda_buffer is not None

    def __enter__(self):
        return self.map()

    def __exit__(self, exc_type, exc_value, trace):
        self.unmap()
        return False

    def __del__(self):
        self.unregister()

    def register(self):
        assert self.texture_identifier is not None
        if self.registered:
            return self._graphics_resource
        self._graphics_resource = check_cudart_err(cudart.cudaGraphicsGLRegisterImage(self.texture_identifier,
                                                                                      GL_TEXTURE_2D,
                                                                                      cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsReadOnly))
        return self._graphics_resource

    def unregister(self):
        if not self.registered:
            return self
        self.unmap()
        self._graphics_resource = check_cudart_err(cudart.cudaGraphicsUnregisterResource(self._graphics_resource))
        return self

    def map(self, stream=0):
        if not self.registered:
            raise RuntimeError("Cannot map an unregistered buffer.")
        if self.mapped:
            return self._cuda_buffer

        check_cudart_err(cudart.cudaGraphicsMapResources(1, self._graphics_resource, stream))
        array = check_cudart_err(cudart.cudaGraphicsSubResourceGetMappedArray(self.graphics_resource, 0, 0))
        channelformat, cudaextent, flag = check_cudart_err(cudart.cudaArrayGetInfo(array))

        depth = 1
        byte = 4  # four channel
        if self.new_cuda_mem_ptr is None:
            success, self.new_cuda_mem_ptr = cudart.cudaMalloc(
                cudaextent.height * cudaextent.width * byte * depth)
        check_cudart_err(
            cudart.cudaMemcpy2DFromArray(self.new_cuda_mem_ptr, cudaextent.width * byte * depth, array, 0,
                                         0,
                                         cudaextent.width * byte * depth, cudaextent.height,
                                         cudart.cudaMemcpyKind.cudaMemcpyDeviceToDevice))
        if self._cuda_buffer is None:
            self._cuda_buffer = cp.cuda.MemoryPointer(
                cp.cuda.UnownedMemory(self.new_cuda_mem_ptr,
                                      cudaextent.width * depth * byte * cudaextent.height,
                                      self), 0)
        return self.cuda_array

    def unmap(self, stream=None):
        if not self.registered:
            raise RuntimeError("Cannot unmap an unregistered buffer.")
        if not self.mapped:
            return self

        self._cuda_buffer = check_cudart_err(cudart.cudaGraphicsUnmapResources(1, self._graphics_resource, stream))

        return self

    def step(self):
        self.engine.taskMgr.step()
        if not self.registered and self.texture_context_future.done():
            self.texture_identifier = self.texture_context_future.result().getNativeId()
            self.register()


if __name__ == "__main__":
    win_size = (512, 512)
    loadPrcFileData("", "textures-power-2 none")
    loadPrcFileData("", "win-size {} {}".format(*win_size))
    test_ram_image = False
    render = True
    env = CUDATest(window_type="offscreen", shape=(*win_size, 4), test_ram_image=test_ram_image)
    env.step()
    env.step()
    start = time.time()
    for s in range(10000000):
        env.step()
        if test_ram_image:
            origin_img = env.texture
            img = np.frombuffer(origin_img.getRamImage().getData(), dtype=np.uint8)
            img = img.reshape((origin_img.getYSize(), origin_img.getXSize(), 4))
            img = img
            torch_img = torch.from_numpy(img)
            if render:
                cv2.imshow("win", img)
                cv2.waitKey(1)
        else:
            with env as array:
                ret = from_dlpack(array.toDlpack())
            if render:
                np_array = cp.asnumpy(ret)[::-1]
                cv2.imshow("win", np_array)
                cv2.waitKey(1)
        if s % 2000 == 0 and s != 0:
            print("FPS: {}".format(s / (time.time() - start)))


But the texture that the buffer renders into always has a shape (1024, 1024), even if I set the buffer size to (800, 600). I verified this by setting the window size to (800, 600) through loadPrcFileData("", "win-size {} {}".format(800, 600)), and create a texture associated with showbase.win. In RTMCopyRam mode, the texture.getRamImage() is in the same shape as the window size. However, when I change the mode to RTMBindorCopy and use CUDA api to read the texture, the shape is fixed to (1024, 1024), regardless the buffer size is (800, 600). Could you help me figure it out?

2 Likes

Set textures-power-2 none in the PRC data.

1 Like

Yes, you are right. Textures with shapes in pow-2 like (512,512), (256, 256) can work perfectly. But textures-power-2 none seems can not disable this feature.

The rendered content is messy. With the code provided above, I set:

    win_size = (256, 600)
    loadPrcFileData("", "textures-power-2 none")
    loadPrcFileData("", "win-size {} {}".format(*win_size))

and get the following result:

PS: By setting the Prc, the texture shape is indeed (256, 600) now in CUDA memory. The point is that the rendered content is messy.

My bad. I solved it. It is because the (width,height) is (height, width) in CUDA memory. Swap the axis and solve it finally. Thank you!