hitting a speed limit

Yeah, actually, I’m toggling the color on two tiles by keyboard control, which should get me going. I still need lines connecting certain tiles, different textures on certain tiles, possibly different angles on certain tiles (oof), a drag box for selecting an area to zoom in on, and mouse-hit testing on tiles so I can identify and possibly manipulate (scale, position, rotate) them. But the graphics are very flat, so this isn’t a front-burner issue anymore.

The thing I actually need help with is some modelling ideas; I’ve tried two different strategies, each with pitfalls, but this isn’t the group for that. I want some things to be specified in a text file, and I want to store some persistent binary information about other things between runs, as with a pickle or a Python shelf. There are also some issues with redundancy and nested layers in the data objects I’m describing in text that I’m having a hard time seeing my way through.

I’ve sort of stagnated on pushing this limit. I got up to 160,000 pairs of triangles at 10 fps.

This is a helper class for storing the GeomVertexData and NodePath objects.

class VerticesWrap:
	def __init__(self):
		# One GeomVertexData + GeomTriangles per instance: each wrap is one batch.
		geomformat = GeomVertexFormat.getV3c4t2()
		self.vdata = GeomVertexData('blank', geomformat, Geom.UHStatic)
		self.vertex = GeomVertexWriter(self.vdata, 'vertex')
		self.color = GeomVertexWriter(self.vdata, 'color')
		self.texcoord = GeomVertexWriter(self.vdata, 'texcoord')
		self.prim = GeomTriangles(Geom.UHStatic)
		self.count = 0
		geom = Geom(self.vdata)
		geom.addPrimitive(self.prim)

		node = GeomNode('gnode')
		node.addGeom(geom)
		self.gpath = NodePath(node)
		self.gpath.reparentTo(render)

	def add_square(self, tl, br=None):
		if br is None:
			br = tl[0] + 1, tl[1] + 1  # default size: 1 x 1
		vertex, color, texcoord = self.vertex, self.color, self.texcoord
		prim = self.prim
		# Four corners in the XZ plane: top-left, top-right,
		# bottom-right, bottom-left.
		vertex.addData3f(tl[0], 0, tl[1])
		color.addData4f(1, 1, 1, 1)
		texcoord.addData2f(0, 0)

		vertex.addData3f(br[0], 0, tl[1])
		color.addData4f(1, 1, 1, 1)
		texcoord.addData2f(1, 0)

		vertex.addData3f(br[0], 0, br[1])
		color.addData4f(1, 1, 1, 1)
		texcoord.addData2f(1, 1)

		vertex.addData3f(tl[0], 0, br[1])
		color.addData4f(1, 1, 1, 1)
		texcoord.addData2f(0, 1)

		# Two triangles per square; each square occupies four vertex rows.
		index = self.count * 4
		prim.addVertices(index + 0, index + 1, index + 3)
		prim.closePrimitive()

		prim.addVertices(index + 1, index + 2, index + 3)
		prim.closePrimitive()
		self.count += 1

I create arrays of 40 of each, then fill them. ‘NotG’ is ‘not’ in disguise, to avoid conflicting with the language’s keyword.

xorwraps = [VerticesWrap() for _ in range(40)]
for xorwrap in xorwraps:
	xorwrap.gpath.setTexture(xortext)
xorwrap = xorwraps[0]
nandwraps = [VerticesWrap() for _ in range(40)]
for nandwrap in nandwraps:
	nandwrap.gpath.setTexture(nandtext)
nandwrap = nandwraps[0]
notgwraps = [VerticesWrap() for _ in range(40)]
for notgwrap in notgwraps:
	notgwrap.gpath.setTexture(notgtext)
notgwrap = notgwraps[0]

wraps = [xorwraps, nandwraps, notgwraps]
for i in range(160000):
	x, y = i % 400, i // 400  # integer division, so the grid position stays integral
	wrap = wraps[i % len(wraps)]
	j = 0
	while wrap[j].count >= 1500:  # each wrap holds at most 1500 squares
		j += 1
	wrap[j].add_square((x, y), (x + .8, y + .8))

I’d like to know where these cutoffs occur. I understand that each VerticesWrap instance is a separate batch. I would really like to max out the GPU, so how big is a batch, and how many batches can I have? Or where do the CPU or the bus start to become the bottleneck? When should I use UHDynamic instead of UHStatic for my usage flags?

The writer attributes of the instances (vertex, color, texcoord) become invalid after the first frame; they should really be deleted once the geometry is filled.

Here’s the toggleFlood function, which toggles the color of the square at a given index in the first wrap of each of the three arrays.

	def toggleFlood(self, indx):
		self.col = 1 - self.col
		for vdata in xorwrap.vdata, nandwrap.vdata, notgwrap.vdata:
			color = GeomVertexWriter(vdata, 'color')
			color.setRow(4 * indx)  # four vertex rows per square
			for _ in range(4):
				if self.col == 0:
					color.setData4f(1, .776, 0, 1)
				else:
					color.setData4f(0, .620, 1, 1)

Each square begins at 4 × its index, since it takes up four vertices and thus four rows. The setData4f call advances the row pointer, so just call it four times after seeking to the index. Then do that for each of the lists of VerticesWraps, and each toggleFlood call will toggle the square at that index in each wrap.
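
The row arithmetic can be spelled out as a tiny helper (illustrative only; the real code above just seeks once and writes four times):

```python
# Each square owns four consecutive rows in the GeomVertexData,
# starting at 4 * its index.
def square_rows(indx):
    start = 4 * indx
    return [start, start + 1, start + 2, start + 3]

print(square_rows(5))  # [20, 21, 22, 23]
```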

The application is kind of specialized: many instances of a small number of rasters with live color changes. It would be nice to examine what existing encapsulations could accomplish this.

I’m surprised that you’re only getting 10 fps on 320,000 triangles, so some more investigation is certainly warranted here. What kind of graphics card do you have, and what is your CPU?

As a general rule of thumb, you should try to target no more than 300 batches in a frame in order to achieve 60fps. If you’re satisfied with 30fps, you can target 600 batches. There is considerable latitude in this rule of thumb.
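
Spelled out as arithmetic (a sketch; the 300-at-60 figure is only the rule of thumb above, not a hardware constant):

```python
# Tolerable batch count scales inversely with the target frame rate.
def batch_budget(target_fps, batches_at_60fps=300):
    return int(batches_at_60fps * 60 / target_fps)

print(batch_budget(60))  # 300
print(batch_budget(30))  # 600
```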

There are other possible bottlenecks, too. When you put more than 65,536 vertices into a single GeomVertexData, it has to expand the 16-bit index into a 32-bit index. Some graphics drivers don’t deal with this well, and slow down considerably. In fact, you can ask your driver how many vertices it prefers to have in a batch:

gsg = base.win.getGsg()
gsg.getMaxVerticesPerArray()      # max number of vertices in any one GeomVertexData
gsg.getMaxVerticesPerPrimitive()  # max number of vertex indices in any one GeomPrimitive

The above limits are as reported by your graphics hardware. If you exceed them, the frame will still render, but it may be at a performance penalty.

I haven’t found a lot of practical difference between UHStatic and UHDynamic, but it’s worth experimenting to see if you find a difference on your hardware.

David

My 10 fps are a bit mysterious. I wasn’t exceeding the batch count or batch size in the code; I only allocated 1500 × 120. And the color toggle was only occurring at most once per frame.

Now you have to promise not to laugh.

Renderer: Intel GMA 900
Renderer model: Intel 915GM
GPU: Alviso (915Gx)
(Graphics) Memory: 96MB
System CPU: Intel® Pentium® M processor 1.60GHz

etc. other embarrassing parts.

(Pdb) p base.win.getGsg().getMaxVerticesPerArray()
1024
(Pdb) p base.win.getGsg().getMaxVerticesPerPrimitive()
1024

I should be able to fit 256 triangle pairs into a batch (4 vertices per pair), for a maximum of 256*300 = 76800 triangle pairs. However, running:

for i in range(76800):
	x, y = i % 400, i // 400
	wrap = wraps[i % len(wraps)]
	j = 0
	while wrap[j].count >= 1024 // 4:  # 256 squares (1024 vertices) per batch
		j += 1
	wrap[j].add_square((x, y), (x + .8, y + .8))

I still only get 10-15 fps.

The Intel 915GM is a fine workhorse; I have one myself in an old Dell laptop that’s served me well for many years. But it’s not a very powerful card at all, and in particular it doesn’t handle a lot of vertices very well. (Actually, the card itself doesn’t do vertex transforms at all; it makes the CPU do all of the vertex processing, which explains why it is so slow.)

So it is possible you are actually reaching the hardware limits of your graphics card. You can test this by examining your performance in PStats: if it shows most of the time spent in “draw”, that’s a good indication that your graphics hardware is indeed the bottleneck. You can also try gradually reducing the number of vertices per batch and measuring the effect on performance. If your frame time (that is, 1.0/fps) follows the number of vertices linearly, that’s also a good indication that you’re vertex-bound.

You should make sure you set “sync-video 0” in your Config.prc before making measurements of your fps, to ensure that video sync isn’t playing a part.

You could also try using the tinydisplay software renderer, to see how it compares. Since your graphics card does all the vertex transforms on the CPU, and since you’re dominating the frame time with a large number of vertices, you should get comparable performance with tinydisplay. But you might get even better performance, since tinydisplay makes some compromises to achieve a better frame rate in certain circumstances. To use tinydisplay, replace “load-display pandagl” with “load-display tinydisplay” in your Config.prc file.

For that matter, you should also try “load-display pandadx8” to try DirectX8, to see if that has an appreciable effect on performance. Sometimes switching to DX makes a big difference, depending on your graphics driver.
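
For reference, the Config.prc lines from the last few posts gathered in one place (pick one load-display line; pandagl is the default):

```
sync-video 0
load-display pandagl
# or: load-display pandadx8
# or: load-display tinydisplay
```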

David

Wait, wait, wait. This doesn’t make any sense.

If I want any of the transformation functions that every triangle is supposed to have, I have to reimplement them all myself?

Huh? No, I only said that the card has no hardware transform features; it is a pixel-fill-only card. The driver handles the vertex transform on the CPU. But that’s just a detail: you don’t have to write that code; it’s all handled by the driver, and you don’t even know that it’s happening on the CPU rather than the GPU.

David

Oh, sorry for the confusion. No, I meant in the actual program, such as if I wanted to make a square that was rotated in place, or scaled, etc. Not that it’s impossible, but still. And that’s to say nothing of wanting to rotate one in place IRTUA later, per my wish list.

Actually, I was looking at some nested structures before, so I’m almost repining for glPopMatrix.

Sure, that has nothing to do with your graphics card. Panda handles all that stuff for you.

The Panda equivalent of glPopMatrix is the scene graph. You attach nested nodes to each other, and the transforms propagate down the graph. Think of a parent-child relationship the same way you think of a push-pop relationship.

David

That’s consistent with my first impression of Panda. But I was only creating a handful of VerticesWraps, and thus a handful of GeomNodes, which as I understand it are the leaf nodes in the tree where I would set these rotations. But I only want to rotate one tile, not the entire batch of vertices/triangles. How do I pick a few out and give them a node?

In order to rotate some geometry independent of other geometry, it has to be part of its own GeomNode. So if there’s a group that you want to control independently, you have to construct it separately.

You can stack up GeomNodes however you like in the scene graph, but you can’t subdivide a GeomPrimitive into two different transforms. The whole point of a GeomPrimitive is that it is sent to the graphics card all at once, which means all the triangles stored within it have to have the same state.

David

Well, now I have to do some contortions if I want to get what I want.

One possibility is to remove the primitive from the node, put it in its own node, transform that, then restore the original with the new coordinates. But I couldn’t treat the nested structures as a unit.

I can’t help feeling that the rigid body combiner “should” still be right for the job. Here’s what the docs say:

“after you called collect() you may freely transform all nodes below without having to call this again”

This is still mysterious; why wasn’t the RBC working for me?

The RBC only handles transform changes, it doesn’t handle other kinds of state changes like color. So if you want to be able to animate the colors of your primitives individually, you have to do it the hard way.

As we’ve demonstrated, you can animate color by fiddling directly with the GeomVertexData. You can also animate the transform the same way, but then you have to compute the new vertex positions yourself, which is clumsy.

Yet another answer is to construct your GeomVertexData with special animation data built into it. This is actually what the RBC does internally. Then you can associate a different matrix with each tile.

This requires defining a special GeomVertexFormat with a new column, called “transform_blend”, 1 component, NTUint16, CIndex. You will store an index number into this column for each vertex; that number will index into the TransformBlendTable that you create and store on the GeomVertexData with setTransformBlendTable().

To populate the TransformBlendTable, do something like this:

from panda3d.core import (NodePath, TransformBlend, TransformBlendTable,
                          UserVertexTransform)

nodePaths = []
transforms = []
tbTable = TransformBlendTable()
for i in range(numPieces):
  uvt = UserVertexTransform('piece-%i' % i)  # the constructor wants a name
  transforms.append(uvt)
  nodePaths.append(NodePath('dummy'))
  tb = TransformBlend(uvt, 1.0)  # weight 1.0: each vertex follows one transform
  tbTable.addBlend(tb)

And populate your GeomVertexData so that each vertex has the appropriate value for i in the transform_blend column, corresponding to the tile that vertex belongs to.

Now you can do this:

nodePaths[i].setPos(x, y, z)
# other operations on nodePaths[i]'s transform
transforms[i].setMatrix(nodePaths[i].getMat())

and the i’th tile will move to position (x, y, z).

What you’ve done is basically reproduced the work that the RBC is doing. You still have to animate the color changes by directly manipulating the vertex data, but you can do this now, because you have fully constructed the GeomVertexData yourself.

David

Boy that sounds hard.

This would (“would”, I say) be easy enough to find out for myself, but what about representing each tile with two tiles, one in each of the (only) two colors, and then hiding one?

Hiding could include putting one behind the other, shrinking it to size 0, calling ‘hide’, or disabling that triangle pair on the card by index. If nodes or node paths are secretly storing GPU-side indexes, I think it would be easier to dupe the structure in ‘ctypes’ and read the index than to go through all this work.

…Especially since UserVertexTransform still costs me my nested structures.

That’s an excellent idea. It would be pretty easy to “hide” the tiles by shrinking them to size 0.

Of course, this will double your vertex count; and you’re already limited by vertex count on your particular hardware.

I do apologize for the complexity of all this. I’ll have to say, in Panda’s defense, that the problem you’re proposing to render is pretty unusual in the space of 3-D graphics problems, which is why it is not conveniently handled by Panda (nor is it conveniently handled by your graphics hardware).

David

By the way, what is the hard way?

I’m getting some performance issues with this route too, though the scale 0/1 toggle gives the correct behavior. At 10,000 pairs of cards, I only get ~15 fps; I was getting 60-80 with hand-made primitives. Do I need to divide up the batch size by hand? Are there two triangles to a card? Then with a batch size of 1024, I get 256 cards per batch.

Well, I really start to notice at 10 fps anyway. With 1×3 batches × 10,000 cards, I get 10-15 fps, which is tolerable. With 10×3×1000, I get ~20, which isn’t bad. Maximizing the window seems to cost a few fps, but going fullscreen gains a few. Then at 100×3×256, I still get ~20. By then my system memory is gone, and the start-up time is horrendous. I’m interested in fixing both of those.

I’m sure I’m missing something neat, but here’s the code currently.

cm = CardMaker('card')

def pathgen(tex):
	# Each tile is a pair of overlapping cards; toggling scales one to 0.
	card1 = cm.generate()
	card2 = cm.generate()
	card1.setName('high')
	card2.setName('low')
	path0 = NodePath('card')
	path0.setScale(.8)
	path1 = NodePath(card1)
	path2 = NodePath(card2)
	path1.reparentTo(path0)
	path2.reparentTo(path0)
	path0.setTexture(tex)
	path0.setPos(0, 0, 0)
	path1.setPos(0, 0, 0)
	path2.setPos(0, 0, 0)
	path1.setScale(0)
	path1.setColor(1, .776, 0)
	path2.setColor(0, .620, 1)
	return path0

xorpath = pathgen(xortext)
nandpath = pathgen(nandtext)
notgpath = pathgen(notgtext)

paths = [xorpath, nandpath, notgpath]
count = 0
for j in range(100):
	rbcs = [RigidBodyCombiner('rbc %i-%i' % (j, k)) for k in range(3)]
	rbcnps = [NodePath(rbc) for rbc in rbcs]
	for rbcnp in rbcnps:
		rbcnp.reparentTo(render)
	for k in range(100):
		for rbcnp in rbcnps:
			ph = rbcnp.attachNewNode("Tex-Placeholder %i" % k)
			ph.setPos(count % 100, 0, count // 100)
			paths[count % 3].copyTo(ph)
			count = count + 1
	for rbc in rbcs:
		rbc.collect()

Then to toggle the flood:

		going = render.find("rbc 0-0/Tex-Placeholder %i/card/%s" % (indx, ['high', 'low'][self.col]))
		self.col = 1 - self.col
		coming = render.find("rbc 0-0/Tex-Placeholder %i/card/%s" % (indx, ['high', 'low'][self.col]))
		coming.setScale(0)
		going.setScale(1)

Now the funny thing is, ‘pathgen’ needs to have:

	path1.setScale( 0 )

If it doesn’t, every card gets the orange color and toggling doesn’t work. That’s very mysterious.

The line strips I’ve mentioned will need rotating and scaling too, ideally including at run-time. If Geoms can participate in the RBC, I can just give each strip its own node. The same thing could work for the tiles, if the Geoms are smaller or faster than the CardMaker’s cards. Or, if they’re not, then where’s the LinestripMaker?

I guess you could iterate through the GeomVertexData produced by the RBC and look for the particular vertex value you expect to find. Sounds dicey.

For the record, it sounds like you’re still running with video sync enabled, which is Panda’s default. For performance measurements, it really does help to turn this off, as I suggested above.

“Both of those” meaning the memory limit and the runtime performance?

Note that using the RBC does cost more than directly manipulating vertex values yourself. It’s convenient, but you do pay for the convenience in additional CPU cost. It’s not a big cost, relatively; but you’re fighting for every microsecond on your little CPU.

I wonder if this isn’t the wrong approach for solving your actual goal. For instance, maybe we could do the color changes by fiddling with texture pixels instead of by changing vertex color. And then we could save some vertices by sharing them at the corners, which would save you both memory and performance. This would require that your mesh always remains connected. Is that the case, or do the tiles move completely discontinuously from each other?

Also, since your hardware is so limited, and since you want to do so much on it, you might be a good candidate for writing your application in C++, instead of Python. You’ll still have to solve the same problems in C++, but the resulting program will consume less memory and run slightly faster (especially if the ultimate solution involves computing the needed vertex positions yourself every frame).

This is a detail of RBC: it assumes that any node with an identity transform at the time you call collect() will always have an identity transform, as an optimization. So if you intend to animate a node later, you have to be sure to pre-load a non-identity transform on that node.

David

No, actually. I was complaining about the initialization time: 100,000 cards take half a minute to a minute to create.

Well, I’m at a loss to explain my stubbornness. I’ll spend X hours fiddling with flags and one class or another, but won’t spend X hours on a rendering routine in C. I’ll tweak every microsecond in one place, but drop many milliseconds somewhere else. Go figure. It wouldn’t hurt if I could off-load some of this to a C routine, but it couldn’t just be a reproduction of the Python routine, unless it’s the interpreter that’s the bottleneck.

Well, what is actually going in to memory on the graphics card? Is it thousands of copies of the texture? If so, we could just flood or override those individually. Or, could we create two textures each from the rasters, then toggle between those on the cards?

My vertices are predominantly discontiguous, unfortunately.

I see. Then the node that was originally scale-1 must be remaining full size behind the one I’m bringing up. So can I just use, say, .9999, or should I clear them all to 0 and then go back for the ones that start at 1?