RTMCopyRam overhead

I discovered that copying a texture to RAM on my system via RTMCopyRam takes 8.2 ms (the "Copy texture" time reported by PStats), even for 1x1 textures. I understand there is quite a bit of overhead for various reasons, but it seems a little odd that I can render an entire complex scene, with animations and complex shaders, into around 6 buffers in the time it takes to copy a 1-pixel texture back to the CPU.

I’m trying to do automatic exposure adjustment, so I’m reducing my scene down to a 1x1 texture and then reading its brightness. I can do all the math GPU-side in a shader and read the result from the 1x1 texture rather than passing it back in as a shader input, so I can get around the issue, but I’d still like to know the cause. Perhaps Panda could use a bit of optimization in this area, or is it really this slow? Maybe there are some special flags I need to set?
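For concreteness, the kind of CPU readback I mean is roughly this (the names are placeholders, not my actual code, and the fullscreen-quad shader that reduces the scene to one pixel is omitted):

from pandac.PandaModules import Texture, PNMImage

# Hypothetical 1x1 buffer; to_ram=True asks Panda to copy it to RAM (RTMCopyRam) every frame.
expoTex = Texture()
expoBuffer = base.win.makeTextureBuffer('exposure', 1, 1, expoTex, True)
# ...a camera with the brightness-reduction shader would render into expoBuffer here...

def readBrightness(task):
	if expoTex.hasRamImage():
		img = PNMImage(1, 1)
		expoTex.store(img)
		brightness = img.getBright(0, 0)
		# adjust exposure from 'brightness' here
	return task.cont

taskMgr.add(readBrightness, 'readBrightness')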

Hi Craig, can you post a very reduced code sample showing your problem?

I’ve tried this test myself and found that transferring a 128x128 texture takes about half a millisecond. This is with an NVIDIA GTX 260.

My test was to render smiley to a 128x128 offscreen buffer and, using a task, call extractTextureData every frame to move the texture over to RAM. To check that it copied over correctly, I write the RAM image of the texture to a JPG.

from pandac.PandaModules import *
import direct.directbase.DirectStart
import math
import time

npModel = loader.loadModel('smiley')
npModel.reparentTo( render )

base.cam.setPos( -20, -20, 30 )
base.cam.lookAt( Point3( 0, 0, 0) )

tex = Texture()
winprops = WindowProperties()
winprops.setSize( 128, 128)
fbprops = FrameBufferProperties()
fbprops.setColorBits(1)
fbprops.setAlphaBits(1)
fbprops.setStencilBits(0)
fbprops.setMultisamples(0)
fbprops.setDepthBits( 0 )
objBuffer = base.graphicsEngine.makeOutput(base.pipe, 'hello', -10, fbprops, winprops, GraphicsPipe.BFRefuseWindow, base.win.getGsg(), base.win)
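# RTMBindOrCopy keeps the rendered image on the GPU; it only reaches RAM when
# extractTextureData() is called in the task below.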
objBuffer.addRenderTexture(tex, GraphicsOutput.RTMBindOrCopy, GraphicsOutput.RTPColor)

cam = base.makeCamera( objBuffer, sort = 0, lens = base.cam.node().getLens(), scene = render )
cam.setPos( base.cam.getPos() )
cam.lookAt( Point3(0,0,0) )

base.graphicsEngine.renderFrame()
booPic = False

def foo():
	global booPic
	booPic = True

def test(task):
	global booPic
	npModel.setPos( 5.0*math.sin( time.time()/3), 0, 0 )
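	# Pull the rendered texture from the GPU into RAM.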
	base.graphicsEngine.extractTextureData( tex, base.win.getGsg()) 
	if booPic:        
		tex.write('/c/test/testram2.jpg')
		booPic = False
		print 'pic taken'
	return task.cont

taskMgr.add( test, 'test' )

base.accept( 'a', foo )


run()

GraphicsOutput.RTMBindOrCopy plus base.graphicsEngine.extractTextureData(tex, base.win.getGsg()) is not an approach I was aware of; I just used RTMCopyRam. Performance (FPS) looks the same either way, though the reported "Copy texture" values are different.

With your code: “Copy texture = 2.97976 ms”

I changed it to use RTMCopyRam to do the same thing, and got "Copy texture = 16.3841 ms".
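For reference, the RTMCopyRam variant of the example above is a one-line change, after which the explicit extractTextureData() call in the task is no longer needed:

# Copy the rendered texture to RAM automatically every frame:
objBuffer.addRenderTexture(tex, GraphicsOutput.RTMCopyRam, GraphicsOutput.RTPColor)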

Turning off v-sync drops "Copy texture" to about 0.2 ms.

Both run at about 450 fps with v-sync off.
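In case anyone wants to reproduce this: v-sync can be disabled from Config.prc, or programmatically before the window is opened, for example:

from pandac.PandaModules import loadPrcFileData
# Must run before "import direct.directbase.DirectStart" opens the window.
loadPrcFileData('', 'sync-video #f')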

Conclusion: the "Copy texture" value reported by PStats is flaky, often wrong, and does not indicate the time actually spent copying textures. The overhead is mostly an illusion / a PStats reporting artifact.

I even saw a case where adjusting my shaders affected the time spent in "Copy texture". The reported time is simply wrong, and that's that: it was including large chunks of time from other parts of rendering.

Thanks for your simple example and testing.

PStats accurately reports the time that Panda waits for the glReadPixels() call to complete. But because of the nature of the OpenGL pipeline, some of that time may also include the time required to wait for the graphics card to finish drawing the scene.

This, of course, depends heavily on how your graphics driver chooses to implement glReadPixels().
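One way to cross-check the PStats figure is to time the call directly; a rough sketch, using the variables from the example above:

import time

def timedCopy(task):
	t0 = time.time()
	base.graphicsEngine.extractTextureData(tex, base.win.getGsg())
	# This measures the wait for glReadPixels() to return, which may include
	# time spent waiting for the card to finish drawing, as described above.
	print 'copy took %.3f ms' % ((time.time() - t0) * 1000.0)
	return task.cont

taskMgr.add(timedCopy, 'timedCopy')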

David