I’m currently using PyOpenGL for occlusion querying via DisplayRegion.setDrawCallback.
The problem I’m facing is intense stalling in the Panda3D program, probably because of some sort of driver flush issue. Note that the display region only holds an offscreen quad that is rendered via an FBO.
The goal is that a shader attached to the quad does some calculations that lead to some pixels being discarded. I then use an occlusion query to count the pixels that were rendered, as a sort of atomic counter.
I’m trying to stay within OpenGL 3.3 / 4.1 for support reasons, so compute shaders aren’t an option for me. The fragment counting does indeed work, but there is a lot of stalling going on. I’ve tried using bigger occlusion query pools, but nothing worked. I also tried using CallbackNode, but I couldn’t even get a fragment count working from it.
How can I get rid of this massive stall overhead? It’s killing performance.
self.fragment_count = 0
self.query = glGenQueries(1)  # PyOpenGL may return this as a one-element array
self.query_id = 0             # filled in by the draw callback once a query is issued
self.available = False
self.dr.setDrawCallback(self.draw_callback_dr)
taskMgr.add(self.check, "checking", sort=20)

def check(self, task):
    if self.available:
        # Only read the result once the GPU reports it is ready.
        if glGetQueryObjectiv(self.query_id, GL_QUERY_RESULT_AVAILABLE):
            count = glGetQueryObjectuiv(self.query_id, GL_QUERY_RESULT)
            self.fragment_count = int(count)
            print(f"Fragments rendered: {count}")
            self.available = False
    return task.cont

def draw_callback_dr(self, cbdata):
    #glEnable(GL_DEPTH_TEST)
    #glDepthFunc(GL_LEQUAL)
    if not self.available:
        query_id = int(self.query)
        glBeginQuery(GL_SAMPLES_PASSED, query_id)
    cbdata.upcall()
    if not self.available:
        glEndQuery(GL_SAMPLES_PASSED)
        self.query_id = query_id
        self.available = True
This is a fundamental limitation of occlusion queries. Normally, all the draw work gets put in a big queue that the GPU processes sequentially, but when you ask for the result of an occlusion query, OpenGL has to wait for the GPU to finish all of the currently queued work.
There is no good way around this. You can try to do the queries early in the frame, do some other drawing work (rendering shadow passes, etc.), and then get the results of the queries, but you’ll likely still see a stall. Or you can decide to use the results of the previous frame’s queries in the current frame, but they won’t exactly match the camera position and you might see artifacts.
The only way this kind of approach will work well is if the CPU never needs to see the results of the queries, by doing all culling on the GPU, but this is difficult to achieve.
I saw that I could do the thing I wanted by counting the fragments through a RAM image. I actually tried this first, but found it to also stall very heavily. I was using an FBO and a C extension I made to count the passing pixels from a buffer. Maybe I could do something like that post describes, where I make an async request for the FBO’s RAM data and wait until it’s available without stalling. Could this be done? From the post, it only looks to be a screenshot at the moment.
You’d need to build a time machine to get the results of occlusion queries for work that hasn’t been (or has only just been) submitted to the GPU yet. A stall is inherent in what you’re trying to do: if you used an async query you’d just move the wait to your own code, assuming that you need the results to continue rendering that frame.
Actually, OpenGL queries are asynchronous, in that you can check whether results are available yet and only wait when you know that they are. That only helps if you don’t need the results right away; i.e. you’re fine with only using the results 1-2 frames later. That’s the only way to avoid a stall.
What async screenshots are useful for is when you don’t care about the results only being available a couple of frames later; the moment you do a wait() on an async screenshot request you also introduce a stall.
It occurred to me that there is actually an OpenGL feature called “conditional rendering”, where you tell the GPU: only render these next objects if the query passed. That’s a way you can use occlusion culling without a stall, though the savings will be modest, since you still have to send the rendering commands for those objects; it’s only going to help if you’re fill-rate bound, e.g. if you have an expensive fragment shader.
What are you actually trying to do? I’ve been assuming you’re trying to use this to do occlusion culling.
Actually, I’m making a hardware instance culler supported by OpenGL 3.3. I have a 1D offscreen quad and a fragment shader where each fragment index is a pointer to a single instance’s transform data. If the instance passes the culling test, its fragment gets a 1.0 or a color. If the instance fails the culling test, its fragment gets a black color or is discarded, depending on what I’m using (occlusion queries or a RAM FBO). The only problem I face is how to count how many instances passed the test, so I can tell the node to setInstanceCount(amount), without creating a massive stall.
To be honest, it would actually be okay if I got the data like 2 frames later; I could just increase the bounds radius per instance so they pop in sooner, as a sort of prediction. In my research I’ve found these two solutions so far. Maybe there’s a better way of doing this (ignoring compute shaders and OpenGL 4.2+ features).
Maybe even going back to CPU culling could be better, but this involves thousands of instances plus a few cameras from lights that have to be considered, so keeping it on the GPU sounds better at the moment.
Oof, that is a challenging constraint. In OpenGL 3.3, the instance count must come from the CPU. Are you okay with using OpenGL 4.0 features? That would allow you to use indirect draw buffers. Not sure how you’d fill them, though, without image load/store. Maybe transform feedback? And you’d need some way to transfer the result of the query to your culling shader - conditional rendering is the only way I can imagine that would work, like conditionally rendering a single pixel in your instance buffer, but that can’t be very efficient since it’d have to be done per instance, which kind of defeats the point of instancing.
Otherwise, the only solutions I see: oversize the instance count and discard the culled instances somehow (have the vertex shader transform them into oblivion), or live with a delayed (and inaccurate) count.
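The “transform them into oblivion” idea could look roughly like this in a GLSL 330 vertex shader — a sketch under the assumption that the culling pass’s results are readable as a texture indexed by gl_InstanceID (the `cull_results` and `tex_width` names are illustrative, not from the thread):

```glsl
#version 330
// Per-instance cull flags written by the culling pass (red >= 0.5 = visible).
uniform sampler2D cull_results;   // the 1D offscreen quad's color buffer
uniform int tex_width;            // width of that buffer in texels

uniform mat4 p3d_ModelViewProjectionMatrix;
in vec4 p3d_Vertex;

void main() {
    float visible = texelFetch(cull_results,
                               ivec2(gl_InstanceID % tex_width,
                                     gl_InstanceID / tex_width), 0).r;
    if (visible < 0.5) {
        // Culled: emit a degenerate position so the GPU clips the
        // whole instance with no fragment work.
        gl_Position = vec4(0.0, 0.0, 0.0, 0.0);
    } else {
        gl_Position = p3d_ModelViewProjectionMatrix * p3d_Vertex;
    }
}
```

You still pay the vertex-shader cost for every culled instance, but no fragments are shaded, and no instance count ever has to come back to the CPU.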
OpenGL 4.3 makes all this a lot easier by letting you read/write arbitrarily from/to buffer objects, reading query objects and constructing draw lists in shaders.
When you say OpenGL 4.0 features, does that include OpenGL 4.1 or higher? My main priority is getting things working on Mac; that’s kind of the reason I’ve stuck to these lower versions (and also to support those old laptops on life support). Anything higher will probably be a no-go for me, but maybe I could make a main option and a fallback option to support old systems. Still, it would be nice to get just one version of this working for all systems.
What is this OpenGL 4.0 feature you mentioned?
Edit:
Could we do something with geometry shaders? Maybe count how many triangles were produced by the shader? I’m not familiar with how geometry shaders work, so this might not even be a good idea.
Honestly, maybe it would be good to just get the RAM image delayed through the async method. I think this will be good enough; it won’t affect much at a high frame rate like 60. I just don’t know how to do this. How do you check for and request the texture’s availability?