I’m currently using PyOpenGL for occlusion querying with DisplayRegion.setDrawCallback.
The current problem I face is some intense stalling in the Panda3D program, probably because of some sort of driver flush issue. Note that the display region only holds an offscreen quad that is rendered via an FBO.
The goal is that a shader attached to the quad does some calculations that lead to some pixels being discarded; I then use an occlusion query to count the pixels that were actually rendered, as a sort of atomic counter.
I am trying to stay within OpenGL 3.30 / 4.1 for support reasons, so compute shaders aren’t an option for me. Note that the fragment counting does indeed work, but there is a lot of stalling going on. I’ve tried using bigger occlusion query pools, but nothing worked. I have also tried using the CallbackNode, but I couldn’t even get a fragment count working from it.
How can I get rid of this massive stall overhead? It’s killing performance.
from OpenGL.GL import *

# Setup (runs once); self.dr is the DisplayRegion holding the offscreen quad.
self.fragment_count = 0
self.query = glGenQueries(1)
self.query_id = int(self.query)
self.available = False
self.dr.setDrawCallback(self.draw_callback_dr)
taskMgr.add(self.check, "checking", sort=20)

def check(self, task):
    # Poll from a task and only read the result once the GPU reports it ready.
    if self.available:
        if glGetQueryObjectiv(self.query_id, GL_QUERY_RESULT_AVAILABLE):
            count = glGetQueryObjectuiv(self.query_id, GL_QUERY_RESULT)
            self.fragment_count = int(count)
            print(f"Fragments rendered: {count}")
            self.available = False
    return task.cont

def draw_callback_dr(self, cbdata):
    #glEnable(GL_DEPTH_TEST)
    #glDepthFunc(GL_LEQUAL)
    if not self.available:
        query_id = int(self.query)
        glBeginQuery(GL_SAMPLES_PASSED, query_id)
    cbdata.upcall()
    if not self.available:
        glEndQuery(GL_SAMPLES_PASSED)
        self.query_id = query_id
        self.available = True
This is a fundamental limitation of occlusion queries. Normally, all the draw work gets put in a big queue that the GPU processes sequentially, but when you ask for the result of an occlusion query, OpenGL has to wait for the GPU to finish all of the currently queued work.
There is no good way around this. You can try to do the queries early on in the frame, do some other drawing work (rendering shadow passes, etc.), and then get the results of the queries, but you’ll likely still see a stall. Or you can decide to use the results of the previous frame’s queries in the current frame, but they will not exactly match the camera position and you might see artifacts.
The only way this kind of approach will work well is if the CPU never needs to see the results of the queries, by doing all culling on the GPU, but this is difficult to achieve.
I could also do what I wanted by counting the fragments through a RAM image. I actually did this first, but found it also stalls very heavily. I was using an FBO and a C extension I made to count the passing pixels from the buffer. Maybe I could use something like the post describes, where I make an async RAM request for the FBO and wait until the RAM data is available without stalling. Could this be done? From the post it only looks to be a screenshot at the moment.
You’d need to build a time machine to get the results of occlusion queries for work that hasn’t been (or has only just been) submitted to the GPU yet. A stall is inherent in what you’re trying to do: if you used an async query you’d just move the wait to your own code, assuming that you need the results to continue rendering that frame.
Actually, OpenGL queries are asynchronous, in that you can check whether results are available yet and only wait when you know that they are. That only helps if you don’t need the results right away, i.e. you’re fine with only using the results 1-2 frames later. That’s the only way to avoid a stall.
What async screenshots are useful for is when you don’t care about the results only being available a couple of frames later; the moment you do a wait() on an async screenshot request you also introduce a stall.
It occurred to me, there is actually an OpenGL feature called “conditional rendering” where you’re telling the GPU, only render these next objects if the query passed. That’s a way you can use occlusion culling without a stall, though the savings will be modest since you still have to send the rendering commands for those objects; it’s only going to help if you’re fill-rate bound, like if you have an expensive fragment shader.
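Roughly, it looks like this at the GL level (a sketch only, assuming you issue the raw calls in a draw callback like in your snippet; the draw function is a placeholder):

# Conditional rendering (core since OpenGL 3.0): the GPU skips the gated draw
# calls if the earlier GL_SAMPLES_PASSED query returned zero samples, and the
# CPU never reads the query result, so there is no stall.
glBeginConditionalRender(query_id, GL_QUERY_WAIT)  # GL_QUERY_NO_WAIT renders anyway if the result isn't in yet
draw_possibly_occluded_object()                    # placeholder for the draw calls you want to gate
glEndConditionalRender()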
What are you actually trying to do? I’ve been assuming you’re trying to use this to do occlusion culling.
Actually, I’m making a hardware instance culler that works within OpenGL 3.30. I have a 1D offscreen quad and a fragment shader where each fragment index points to a single instance’s transform data. If the instance passes the culling test, its fragment gets a 1.0 or a color; if it fails, its fragment gets a black color or is discarded, depending on what I’m using (occlusion queries or a RAM FBO). The only problem I face is how to count how many instances passed the test, so I can call setInstanceCount(amount) on the node without creating a massive stall.
To be honest, it would actually be OK if I got the data something like 2 frames later; I could just increase the bounds radius per instance so instances pop in sooner, as a sort of prediction. In my research I’ve found these two solutions so far. Maybe there is a better way of doing this (ignoring compute shaders and OpenGL 4.2+ features).
Maybe even going back to CPU culling could be better, but there are thousands of instances and a few extra cameras from lights to consider, so keeping it on the GPU sounds better at the moment.
Oof, that is a challenging constraint. In OpenGL 3.3, the instance count must come from the CPU. Are you okay with using OpenGL 4.0 features? That would allow you to use indirect draw buffers. Not sure how you’d fill them, though, without image load/store. Maybe transform feedback? And you’d need some way to transfer the result of the query to your culling shader - conditional rendering is the only way I can imagine that would work, like conditionally rendering a single pixel in your instance buffer, but that can’t be very efficient since it’d have to be done per instance, which kind of defeats the point of instancing.
Otherwise, the only solutions I see: oversize the instance count and discard the culled instances somehow (have the vertex shader transform them into oblivion), or live with a delayed (and inaccurate) count.
OpenGL 4.3 makes all this a lot easier by letting you read/write arbitrarily from/to buffer objects, reading query objects and constructing draw lists in shaders.
When you say OpenGL 4.0 features, does that include OpenGL 4.1 or higher? My main priority is getting things working on Mac; that’s kind of the reason I’ve stuck to these lower versions (and also to support those old laptops on life support). Anything higher will probably be a no-go for me, but maybe I could make a main option and a fallback option to support old systems. Still, it would be nice to get just one version of this working on all systems.
What is this OpenGL 4.0 feature you mentioned?
Edit:
Could we do something with geometry shaders? Maybe count how many triangles were produced by the shader? I’m not familiar with how geometry shaders work, so this might not even be a good idea.
Honestly, maybe it would be good to just get the RAM image delayed through the async method. I think this will be good enough; it won’t matter much at a high frame rate like 60. I just don’t know how to do this. How do you request the texture and check its availability?
The main challenge is writing that counter to a buffer that you can use as a source for draw calls without round-tripping through the CPU. It requires something like image load/store, SSBOs, atomic counters, query buffers, or what have you - features that generally require something like OpenGL 4.2-4.4.
But I don’t think you should dismiss the idea of just rendering “too many instances” and discarding the ones that are culled by outputting degenerate geometry in your vertex shader. I think it may offer better performance than you expect (you still skip much of the shading pipeline for the culled instances), it is certainly the least work to implement, it works with OpenGL 3, and you get no glitches from data being a frame behind.
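To sketch what I mean (GLSL in a Python string, the way you’d feed it to a shader loader; the visibility sampler and its lookup are placeholders for however your culling pass actually stores its result):

# Hypothetical vertex shader for the "render too many instances" approach:
# every instance is submitted, but culled ones are collapsed so they rasterize
# to nothing. Uses Panda3D's p3d_* shader inputs and gl_InstanceID.
CULL_VSHADER = """
#version 330
uniform mat4 p3d_ModelViewProjectionMatrix;
uniform sampler2D visibility;   // one texel per instance, written by the culling pass (placeholder name)
in vec4 p3d_Vertex;

void main() {
    float visible = texelFetch(visibility, ivec2(gl_InstanceID, 0), 0).r;
    if (visible < 0.5) {
        // Push the whole instance outside the clip volume; its triangles are
        // clipped away and the fragment shader never runs for them.
        gl_Position = vec4(0.0, 0.0, -2.0, 1.0);
    } else {
        gl_Position = p3d_ModelViewProjectionMatrix * p3d_Vertex;
    }
}
"""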
I looked a bit more into transform feedback, which is available in OpenGL 3 (with some more features offered by 4.0, which is also supported on macOS). It feels like glDrawTransformFeedback almost solves your problem, but it takes the primitive count from a GPU-rendered buffer, not the instance count. You’d still have to write the primitives for all instances to that buffer, at which point it just moves your problem and incurs a big extra overhead for no gain.
Yes, you can use a geometry shader and count the number of primitives drawn using a query, but it doesn’t solve the issue of how to use that as an instance count; you’d still incur a stall getting the result of that. You’d just be stalling on a different query. It’s arguably cleaner than rendering to a texture, but doesn’t solve the fundamental problem.
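For reference, that query looks just like your samples query, only with a different target, and reading the result blocks in exactly the same way (a sketch, reusing the names from your snippet):

# GL_PRIMITIVES_GENERATED counts primitives emitted by the geometry shader.
glBeginQuery(GL_PRIMITIVES_GENERATED, query_id)
cbdata.upcall()                                           # the pass that runs the geometry shader
glEndQuery(GL_PRIMITIVES_GENERATED)
count = glGetQueryObjectuiv(query_id, GL_QUERY_RESULT)    # still stalls until the GPU gets there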
It’s frustrating because OpenGL 4.0 almost provides the tools for you to be able to do this; glDrawArraysIndirect/glDrawElementsIndirect exist, but there’s no way for you to write the instance count to the buffer - you’d need at the earliest atomic counters / image load/store (OpenGL 4.2) or more ideally query buffer objects (OpenGL 4.4).
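To make that concrete, the GL 4.0 indirect path looks roughly like this from the CPU side (vertex_count and visible_instances are placeholders); the second field of the command struct is precisely the number you’d want the culling pass to fill in, and that’s the part 3.3/4.1 gives you no way to do on the GPU:

import ctypes, struct

# DrawArraysIndirectCommand layout: count, instanceCount, first, baseInstance
# (the last field is reserved and must be 0 before GL 4.2).
cmd = struct.pack("4I", vertex_count, visible_instances, 0, 0)
indirect_buf = glGenBuffers(1)
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buf)
glBufferData(GL_DRAW_INDIRECT_BUFFER, len(cmd), cmd, GL_DYNAMIC_DRAW)
glDrawArraysIndirect(GL_TRIANGLES, ctypes.c_void_p(0))  # reads its parameters from the bound buffer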
If you want to use delayed (async) querying, then you’re already (mostly) set up to do that, as queries are already async objects; you just have to buffer the query objects yourself (i.e. store a kind of ring buffer of 3 query objects per rendered object that you cycle between every frame) so that there’s a natural delay of 2 frames between issuing a query and getting its result.
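Roughly like this (a sketch only, reusing the draw-callback structure from your snippet; it rotates through three query objects so the one you read back is always two frames old):

NUM_QUERIES = 3
self.queries = glGenQueries(NUM_QUERIES)
self.frame = 0

def draw_callback_dr(self, cbdata):
    # Issue this frame's query on the next slot in the ring.
    current = int(self.queries[self.frame % NUM_QUERIES])
    glBeginQuery(GL_SAMPLES_PASSED, current)
    cbdata.upcall()
    glEndQuery(GL_SAMPLES_PASSED)
    # Read back the query issued two frames ago; by now it should be available,
    # so fetching the result shouldn't have to wait.
    if self.frame >= NUM_QUERIES - 1:
        oldest = int(self.queries[(self.frame - NUM_QUERIES + 1) % NUM_QUERIES])
        if glGetQueryObjectiv(oldest, GL_QUERY_RESULT_AVAILABLE):
            self.fragment_count = int(glGetQueryObjectuiv(oldest, GL_QUERY_RESULT))
    self.frame += 1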
I have tested the ‘vertex discard’ method, and in my tests (well, on my hardware) it made a negligible difference to performance. The current (stalling) FBO approach performed much better, even with the stall, than the vertex-discard / large-instance-count method.
I tested both FBO readback and fragment queries (both with stalls) and found that copying the texture to RAM was significantly faster than using occlusion queries. I was thinking it might be best to stay with FBOs and just use the delayed-frame method.
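One raw-GL way to do that delayed readback (instead of going through an async screenshot) might be double-buffered pixel buffer objects, so glReadPixels doesn’t block; something like this sketch, where the names and sizes are placeholders and it assumes it runs in the draw callback while the culling FBO is bound:

import ctypes

W, H = 1024, 1   # size of the 1D culling target (placeholder)
self.pbos = glGenBuffers(2)
for pbo in self.pbos:
    glBindBuffer(GL_PIXEL_PACK_BUFFER, int(pbo))
    glBufferData(GL_PIXEL_PACK_BUFFER, W * H * 4, None, GL_STREAM_READ)
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0)
self.pbo_index = 0

def read_culling_result(self):
    write_pbo = int(self.pbos[self.pbo_index])
    read_pbo = int(self.pbos[1 - self.pbo_index])
    # Start copying this frame's pixels into one PBO; with a PBO bound this returns immediately.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, write_pbo)
    glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, ctypes.c_void_p(0))
    # Map the other PBO, which holds the previous frame's pixels and should be done by now.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, read_pbo)
    ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)
    if ptr:
        pixels = ctypes.string_at(ptr, W * H * 4)   # count the passing pixels here
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER)
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0)
    self.pbo_index = 1 - self.pbo_index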
I did also try to make an asynchronous occlusion-query chain of 3 queries, but still got stalls somehow. I don’t know why, but for some reason the ring buffer still caused stalls: not in receiving the query data, but in the glBeginQuery(GL_SAMPLES_PASSED, query_id) call, which caused a pipeline flush every time. This probably has something to do with the driver, but it still confuses me why it didn’t work. Maybe injecting raw GL calls into Panda3D disrupts its pipeline?
Anyway, I don’t know which one is faster anymore if I were to do it asynchronously. In theory, occlusion queries should be faster, but from these tests, textures are faster.
I was also thinking of pooling the culling textures into one big texture atlas, where each row in the texture belongs to one instance root parent. This could reduce the number of fetches to just one global fetch from the GPU, instead of having a unique fragment shader for each instance root node. It’s kind of hard to explain, but the system could be summed up as a global shader that culls every instance root node in one pass/call. For example, I could have a tree model node assigned to the first row of the texture, and a stone model node assigned to the second row. However, if I do the texture fetch from the GPU asynchronously, I’m not sure there is still a benefit to pooling the culling into one big texture.