Texture formats and compute shaders

Hi all,

As you may know, I’ve been working on a new postprocessing filter framework for Panda 1.9. (Forum thread: "CommonFilters - some new filters, and the future")

I’m thinking that maybe some time after 1.9, Panda could use a depth-of-field filter. It seems that the best current technique for semi-fast, realistic depth of field is the one by Pixar, originally meant for film editing previews: http://graphics.pixar.com/library/DepthOfField/paper.pdf

The paper does not mention a publication date, but the latest references it makes to literature are from 2005. Thus it seems likely that the paper was published some time in 2006 or maybe 2007. The computational power of GPUs has significantly increased since then, so I think it would be worth investigating whether the algorithm is fast enough for realtime applications on current hardware.

The algorithm is based on the diffusion equation. This makes sense - blur can be thought of as a diffusion process (usually an isotropic one). Using a diffusion equation with variable coefficients makes variable width blur easy, if one knows how to numerically solve partial differential equations (PDEs).
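To illustrate the blur-as-diffusion idea, here is a minimal pure-Python sketch of one explicit time step of a 1D diffusion equation with a spatially varying coefficient. The names `diffuse_step`, `beta` and `dt` are made up for this illustration and are not taken from the paper; `beta` plays the role of a per-pixel blur strength (zero where the image should stay sharp).

```python
def diffuse_step(u, beta, dt=0.2):
    """One explicit step of u_t = (beta * u_x)_x on a 1D row, zero-flux ends."""
    n = len(u)
    # Flux between neighbours i and i+1, using the averaged coefficient.
    flux = [0.5 * (beta[i] + beta[i + 1]) * (u[i + 1] - u[i])
            for i in range(n - 1)]
    out = list(u)
    for i in range(n - 1):
        out[i] += dt * flux[i]       # what leaves one pixel...
        out[i + 1] -= dt * flux[i]   # ...arrives at its neighbour
    return out


# Two bright pixels (indices 4 and 9) on a dark row of 14 pixels.
row = [0.0] * 4 + [1.0] + [0.0] * 4 + [1.0] + [0.0] * 4
# Hypothetical blur strength: left half in focus, right half blurred.
beta = [0.0] * 7 + [1.0] * 7
for _ in range(10):
    row = diffuse_step(row, beta)
```

The bright pixel in the in-focus half keeps its value, the one in the blurred half spreads out, and the total brightness of the row is conserved. An explicit step like this is only stable for small `dt`, which is one reason to prefer an implicit solve.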

Of course, to be fast, the PDE solver must run on the GPU, and utilize the available massive parallelization. Also, keeping the data in the GPU memory for the whole calculation is essential in order not to unnecessarily consume PCI-E bandwidth.

GPU parallelization of diffusion solvers, as of early 2015, mainly relies on a technique known in the numerics community as ADI (Alternating Direction Implicit, a classic operator splitting method from the 1950s; see e.g. Wikipedia: http://en.wikipedia.org/wiki/Alternating_direction_implicit_method ), which makes the linear equation system tridiagonal. This is usually combined with some variant of cyclic reduction ( http://people.mpi-inf.mpg.de/~strzodka/papers/public/GoSt11CR.pdf ), which is a parallel algorithm for the solution of tridiagonal linear equation systems. This is indeed the approach proposed in the paper by Pixar.
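As a rough sketch of why the system becomes tridiagonal: when only one axis is treated implicitly, each row (or column) couples a pixel only to its two neighbours along that axis. The code below builds such a system for one row using a backward-Euler step and solves it with the serial Thomas algorithm as a CPU reference. The function names, the zero-flux boundary treatment and the averaging of `beta` are assumptions made for this illustration, not details from the paper.

```python
def implicit_row_system(u, beta, dt):
    """Backward-Euler system (I - dt*L) u_new = u for one row, zero-flux ends.

    Returns the sub-, main and super-diagonal (a, b, c) and right-hand side d.
    """
    n = len(u)
    a, b, c = [0.0] * n, [1.0] * n, [0.0] * n
    for i in range(n - 1):
        k = dt * 0.5 * (beta[i] + beta[i + 1])  # coupling between i and i+1
        b[i] += k
        c[i] -= k
        b[i + 1] += k
        a[i + 1] -= k
    return a, b, c, list(u)


def thomas_solve(a, b, c, d):
    """Serial tridiagonal solve (Thomas algorithm)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


u = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
beta = [1.0] * 6
a, b, c, d = implicit_row_system(u, beta, dt=50.0)  # large step, still stable
blurred = thomas_solve(a, b, c, d)
```

The Thomas algorithm is inherently sequential along the row, which is why a parallel alternative such as cyclic reduction is attractive on the GPU.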

So that’s the background - now to the point. I would like to try to implement this, but there are some practical issues I could use some help with.

First, partial differential equations tend to require a lot of precision in the intermediate computations. Is it possible to use float textures in Panda? (Even if it turns out they are not needed for this particular application, I’m also planning something else for later that will definitely need them.)

Also related to texture formats, are 3D textures supported? (Not needed in this particular filter, but this would be useful for volumetric applications, especially 3D PDEs.)

Secondly, is it possible to tell Panda to render only a part of a viewport using a particular shader? PDE solvers in OpenGL have been commonly implemented by rendering the fragments residing on the domain boundaries using a different shader (that implements the boundary conditions), while rendering the interior of the domain using a shader that operates on the interior (typically, it is assumed that interior points have neighbors in all directions). See e.g. section 38.3.2 in http://http.developer.nvidia.com/GPUGems/gpugems_ch38.html

And finally, compute shaders seem the ideal choice for implementing the cyclic reduction for the tridiagonal solver. Fortunately, I happened to upgrade my GPU over the holidays (now running a Radeon R9 290), and it supports compute shaders, so now I should be able to play around with them.

What the algorithm needs to be able to do is, at each step, to write to a target that is half size (compared to the step’s input) along one axis only, while retaining the original resolution along the other axis. The number of steps required depends on the dimensions of the viewport that is being rendered.
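To get a feeling for how many such targets are needed, here is a small sketch that lists the target sizes for a given viewport. It assumes that sizes are rounded up when halving (so odd sizes are effectively padded) and that the reduction stops once at most two unknowns remain along the reduced axis; `reduction_targets` is a hypothetical helper, not part of Panda.

```python
def reduction_targets(width, height, axis=0):
    """Sizes (w, h) of the successive render targets, halving along one axis.

    axis=0 halves the width, axis=1 halves the height. Halving rounds up.
    """
    sizes = []
    w, h = width, height
    while (w if axis == 0 else h) > 2:
        if axis == 0:
            w = (w + 1) // 2
        else:
            h = (h + 1) // 2
        sizes.append((w, h))
    return sizes


horizontal = reduction_targets(1920, 1080, axis=0)
vertical = reduction_targets(1920, 1080, axis=1)
```

For a 1920x1080 viewport this gives ten targets along either axis, so the number of passes grows only logarithmically with the resolution.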

Cyclic reduction is based on eliminating the odd-numbered unknowns - each step halves the number of remaining unknowns. (It is explained better in the papers linked above.) The reduction continues until there are only one or two unknowns left. If one, its value can then be read directly, and if two, the remaining 2x2 system can be solved explicitly. Then, a similar backward process computes the final answer, at each step doubling the number of knowns until the original size is reached. (Some practical complications arise for non-power-of-two sizes, but I think those are solvable - if nothing else, these particular textures could be padded when they are allocated.)
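The reduction and back-substitution described above can be sketched on the CPU as follows. This is a plain recursive reference version, not GPU code: it uses 0-based indexing, eliminates the odd-indexed unknowns at each level, and handles arbitrary sizes by letting the reduced system have ceil(n/2) unknowns. It assumes `a[0]` and `c[-1]` are zero and that no pivot vanishes (true for diagonally dominant systems). All names are made up for this illustration.

```python
def tridiag_matvec(a, b, c, x):
    """Multiply the tridiagonal matrix (sub a, main b, super c) by x."""
    n = len(x)
    return [(a[i] * x[i - 1] if i > 0 else 0.0) + b[i] * x[i]
            + (c[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system; a[0] and c[-1] must be 0."""
    n = len(d)
    if n == 1:
        return [d[0] / b[0]]
    if n == 2:  # solve the remaining 2x2 system explicitly
        det = b[0] * b[1] - c[0] * a[1]
        return [(d[0] * b[1] - c[0] * d[1]) / det,
                (b[0] * d[1] - a[1] * d[0]) / det]
    ra, rb, rc, rd = [], [], [], []
    for i in range(0, n, 2):  # forward step: keep even-indexed unknowns
        na, nb, nc, nd = 0.0, b[i], 0.0, d[i]
        if i > 0:  # eliminate x[i-1] using equation i-1
            alpha = a[i] / b[i - 1]
            na = -alpha * a[i - 1]
            nb -= alpha * c[i - 1]
            nd -= alpha * d[i - 1]
        if i + 1 < n:  # eliminate x[i+1] using equation i+1
            gamma = c[i] / b[i + 1]
            nc = -gamma * c[i + 1]
            nb -= gamma * a[i + 1]
            nd -= gamma * d[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)
    x = [0.0] * n
    x[0::2] = cyclic_reduction(ra, rb, rc, rd)
    for i in range(1, n, 2):  # backward step: recover odd-indexed unknowns
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - a[i] * x[i - 1] - right) / b[i]
    return x


n = 7
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
expected = [float(i + 1) for i in range(n)]
d = tridiag_matvec(a, b, c, expected)
solution = cyclic_reduction(a, b, c, d)
```

In a GPU version, each loop over `i` would become one render pass (or dispatch), since all iterations within a level are independent of each other.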

So, the question is: any suggestions as to what documentation, code examples or similar I should read, in order to figure out how to insert the rendering of this “shrinking” texture sequence at an arbitrary point in the overall render sequence?

I’m asking because at the moment I can’t wrap my head around how compute shaders would interact with Panda’s FilterManager, which already controls the render order of any created postprocessing buffers. I already know how to render postprocessing buffers in the desired order using regular shaders - the new postprocessing framework does exactly that - so the question is, how to mix in some compute shaders.

I.e. I would like a setup where some filters first render using the regular kind of shaders, then the depth-of-field filter renders using some compute shaders and some regular shaders, and then other filters (using regular shaders) continue from there. The important thing is to be able to invoke the compute shaders at an arbitrary step of the overall render sequence.

If anyone knows, input would be appreciated :)

These features are all supported by Panda.

As for float textures, just set a texture component type to float, and set an appropriate format from the list.
As for 3-D textures, this is described in the manual:
2-D array textures are also supported in a similar fashion.
As for compute shaders, this is described in detail in the manual:
panda3d.org/manual/index.ph … te_Shaders

I don’t really understand your question about using a shader on a part of the viewport. That all depends on the shape and position of the mesh you’re rendering. You can also discard fragments from a fragment shader.

Keep in mind that compute shaders are slower than fragment shaders for most image processing purposes, and may be overkill for something that fits very well within the scope of a fragment shader. I haven’t read the paper, but it might just suffice to write a fragment shader to do each reduction step. The whole thing reminds me a bit of mipmapping, in fact, which is done along similar reduction principles.

As for adding a compute shader step in the FilterManager: the simplest way would be to replace the full-screen quad in a particular pass with a ComputeNode configured appropriately, although you don’t technically need a bound buffer for compute shaders to work.

Ok. Thanks!

Thanks for the keywords :)

So, based on the forum thread "Setting TFloat Texture from Numpy Array" and https://www.panda3d.org/reference/1.8.1/python/classpanda3d.core.Texture.php, maybe something like:

from panda3d.core import Texture
mytex = Texture("my_procedural_texture")

choosing TFloat for ComponentType, and format… I suppose for arbitrary scalars, any one of FRed, FAlpha or FLuminance should be ok, but it seems there’s no format for float3 or float4 vectors? Or am I missing something? What happens with the combination TFloat and FRgba32?

(From a data storage and memory access perspective, it would be ideal to store the three values on each row of the tridiagonal matrix in the .xyz components of a float3. Or even better, using a float4, store the row and the corresponding element of the load vector (right-hand side in the linear equation system) into .xyzw.)
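As a sketch of the float4 layout described above, the following packs each matrix row and its right-hand side entry into four consecutive 32-bit floats, which is the memory layout one would expect for a four-channel float texture. The helper names are hypothetical, the upload to an actual texture is omitted, and the channel order a given engine expects in its raw image data is an assumption that would need to be checked.

```python
from array import array


def pack_rows_float4(a, b, c, d):
    """Interleave (a[i], b[i], c[i], d[i]) as one float4 texel per matrix row."""
    texels = array("f")  # 32-bit floats
    for row in zip(a, b, c, d):
        texels.extend(row)
    return texels


def unpack_row(texels, i):
    """Read back texel i as an (a, b, c, d) tuple."""
    return tuple(texels[4 * i:4 * i + 4])


sub = [0.0, -1.0, -1.0]
main = [4.0, 4.0, 4.0]
sup = [-1.0, -1.0, 0.0]
rhs = [0.5, 2.0, 8.0]
texels = pack_rows_float4(sub, main, sup, rhs)
raw = texels.tobytes()  # 16 bytes per texel
```

With this layout a shader can fetch a whole equation with a single texel read, instead of sampling four separate scalar textures.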

Thanks for the specific links! I’d previously read the page on compute shaders, but had missed the one on 3D textures.

Ah, that’s true. That is indeed how it was done in the GPU Gems link that I posted. The geometry was set up so that each computation kernel (implemented as a fragment shader) operates on the appropriate fragments of the buffer.

How about 3D textures, i.e. how to update a selected subset of cells? For a 3D PDE application, it seems inefficient to run the boundary shaders (for a cube domain, all 6 of them) on the whole domain and discard almost all fragments.

(For a time-dependent PDE simulation in 3D space, at each frame (time step), all voxels in the 3D texture need to be updated. The choice of the computational kernel to use for each particular cell depends on whether the cell belongs to the interior of the domain, or to one of its boundaries. The branch can be resolved statically in the code that invokes the shaders, but to do that, one needs to be able to invoke each shader on a subset of cells.)
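The static branching described above amounts to splitting the grid into disjoint sub-boxes, one per kernel. Here is a minimal sketch of such a split for a box-shaped domain; it only computes index ranges and does not touch any rendering API. The region names and the convention for assigning edges and corners (x faces take them first, then y faces, then z faces) are arbitrary choices made for this illustration.

```python
def partition_box(nx, ny, nz):
    """Split an nx*ny*nz grid into six boundary slabs and the interior.

    Each region is ((x0, x1), (y0, y1), (z0, z1)) with half-open ranges.
    Every cell belongs to exactly one region.
    """
    return {
        "x_min": ((0, 1), (0, ny), (0, nz)),
        "x_max": ((nx - 1, nx), (0, ny), (0, nz)),
        "y_min": ((1, nx - 1), (0, 1), (0, nz)),
        "y_max": ((1, nx - 1), (ny - 1, ny), (0, nz)),
        "z_min": ((1, nx - 1), (1, ny - 1), (0, 1)),
        "z_max": ((1, nx - 1), (1, ny - 1), (nz - 1, nz)),
        "interior": ((1, nx - 1), (1, ny - 1), (1, nz - 1)),
    }


def cell_count(region):
    """Number of cells in a region given as half-open ranges."""
    count = 1
    for lo, hi in region:
        count *= hi - lo
    return count


regions = partition_box(8, 6, 4)
total = sum(cell_count(r) for r in regions.values())
```

Each region could then be processed by its own shader invocation, with the range offsets passed in as shader inputs, so that no invocation has to discard cells belonging to another kernel.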

Hmm, good point. I’ll investigate if the tridiagonal solve can be done using regular fragment shaders.

If I understood correctly, it should be enough to configure the component format of the intermediate buffers as float (if that is even necessary in this very special application), create a fullscreen quad for each (imitating FilterManager), and then render in the usual fashion.

The number of steps required depends on the pixel dimensions of the original viewport, so resizing the window while this filter is running may be problematic, but I suppose that can wait for later. The postprocessing framework could be later extended to provide an onResize() event, to which the depth-of-field filter would react by requesting a pipeline reconfigure (so that the internal texture buffers are re-generated before the next frame is rendered).

Ok. Thanks for the tip!

Well, you could just run your shader on a smaller viewport/work group size, and adjust the coordinates of the inputs you’re sampling accordingly. I don’t really see the problem so it could just be that I’m not understanding it correctly.

You can use F_rgb16, F_rgba16, F_rgb32, F_rgba32 for those.

To use floating-point render-to-texture, simply call setFloatColor(True) on the FrameBufferProperties. For aux attachments, you use setAuxFloat(n).

Panda also supports rendering to 3-D textures in 1.9. You would have to combine this with layered render-to-texture and hardware instancing; use a geometry shader to replicate a quad across each layer.

I think I’ll have to get back to this later when I actually start building some 3D computation kernels. It may be that I’m misunderstanding something, too.

Ok, thanks.

Ah, ok. Sounds like it might be useful for me to extend FilterManager a bit when I get to this point. (In the 2D postprocessing setting, it’s done such a good job that I haven’t had to touch FrameBufferProperties manually.)

This sounds nice. I think I’ll have to dig up some examples on geometry shaders and GLSL in the somewhat near-ish future…

One last question related to this. In iterative methods, it is necessary to recycle the same physical buffers, because the algorithm may run for dozens or even hundreds of iterations (and each intermediate texture is needed only once). Using FilterManager (or an approach similar to that used in FilterManager), is it possible to “ping-pong” the render between two textures, e.g. by creating several buffer objects but pointing to the same texture objects?

Also, I suppose that using traditional shaders, it is not possible to vary the number of such iterations dynamically (i.e. based on something that is computed from the resulting texture)?

Hmm, you should be able to bind a texture to two buffers, and then cycle setActive() between them, as long as the buffers have the exact same size and settings. I haven’t tried it, though.

Varying the number of iterations based on a computed result sounds tricky indeed, hmm.

Ah, is the intention that during the rendering of one frame, only one of the buffers gets updated?

(Based on my reading of the API doc https://www.panda3d.org/reference/1.8.1/python/classpanda3d.core.DisplayRegion.php , it seems that this strategy would do that.)

I was actually thinking several iterations during the rendering of one displayed frame. For this case, would it work to create as many buffer objects as one wants iterations per frame (let’s call this number n), and assign the texture objects as their render textures in a round-robin fashion? I assume the buffers and quads won’t use much memory (compared to the texture data)?

For example, in 2D, using FilterManager, something like:

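from panda3d.core import Shader, Texture

n = 20     # number of iterations (render passes) per displayed frame
ntex = 2   # number of physical textures to cycle through

textures = [ Texture() for k in range(ntex) ]
tidx = 0

# ...more setup code goes here (creating the FilterManager instance "manager"
# and defining MY_ITERATIVE_SHADER)...

quads = []
for k in range(n):
    # Each call creates a new buffer and quad; the render target is the next
    # texture in the round-robin cycle.
    myquad = manager.renderQuadInto(colortex=textures[tidx])
    myquad.setShader( Shader.make(MY_ITERATIVE_SHADER) )
    quads.append( myquad )
    tidx = (tidx + 1) % ntex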

Based on reading the source of FilterManager, the supplied Texture objects (which are cycled in the above loop) get passed to buffer.addRenderTexture(), where buffer is a return value from base.graphicsEngine.makeOutput(). With this setup, the code should create n quads and corresponding buffers, but only ntex textures.

Is this supported, or is there some reason it won’t work?

(As for the maximum value of n that would be needed, maybe around 50.)
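To make the intended data flow of the round-robin scheme explicit, here is a pure-Python simulation that does not use Panda at all. It lists which texture slot each pass would read and write, and runs a simple Jacobi-style smoothing iteration through that schedule. The helper names and the choice of iteration are made up for this illustration; whether the engine accepts this buffer setup is exactly the open question above.

```python
def pass_schedule(n, ntex=2):
    """(read_slot, write_slot) for each pass; None means the external input."""
    schedule = []
    prev = None
    for k in range(n):
        write = k % ntex
        schedule.append((prev, write))
        prev = write
    return schedule


def jacobi_smooth(u):
    """One Jacobi sweep for u'' = 0 with fixed end values."""
    inner = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_passes(initial, n, ntex=2):
    """Run n passes, each reading the previous pass's slot and writing the next."""
    slots = [None] * ntex
    for read, write in pass_schedule(n, ntex):
        source = initial if read is None else slots[read]
        slots[write] = jacobi_smooth(source)
    return slots[(n - 1) % ntex]


schedule = pass_schedule(4)
result = run_passes([0.0, 0.0, 0.0, 0.0, 1.0], n=200)
```

With two or more slots, no pass reads from the slot it writes to, which is the property the ping-pong arrangement is meant to guarantee. With a single slot every pass after the first would read and write the same texture.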

In scientific computing, terminating iteration dynamically based on some metric (e.g. some norm of the residual or that of the difference of two successive iterates) is rather common.

As I’ve understood, in GPU computing setups this is usually done in two steps. First, the result data is reduced into successively smaller textures (using an appropriate shader) until the render pass overhead would become too large (and the GPU is no longer even nearly saturated). At that point it becomes cheaper to send the last texture to main memory and let the CPU perform the final reduction (and apply control logic based on the result).
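A CPU-only sketch of this two-step reduction is given below, assuming the quantity of interest is the maximum norm of the residual. Each call to `reduce_max_pass` stands in for one GPU render pass that halves the data; once the data is small enough, the remaining values are reduced directly. The function names and the threshold of 4 values are arbitrary choices for this illustration.

```python
def reduce_max_pass(values):
    """One GPU-style pass: each output is the max of two adjacent inputs."""
    return [max(values[i:i + 2]) for i in range(0, len(values), 2)]


def residual_max_norm(values, cpu_threshold=4):
    """Reduce in halving passes, then finish on the 'CPU' once data is small.

    Returns (norm, number_of_halving_passes).
    """
    passes = 0
    while len(values) > cpu_threshold:
        values = reduce_max_pass(values)
        passes += 1
    return max(values), passes


# 37 made-up residual magnitudes (a permutation of 0/64 .. 36/64).
residuals = [((7 * i) % 37) / 64.0 for i in range(37)]
norm, passes = residual_max_norm(residuals)
converged = norm < 1e-3  # control logic applied on the CPU
```

Here 37 values take four halving passes (37, 19, 10, 5, 3) before the final three values are handled directly.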

This probably doesn’t affect my plans, however - in cases where traditional shaders are used, I’ll just need to keep in mind to choose numerical methods that do not require dynamic termination. (Whether this is possible, depends on the problem, but for many practically useful problems this should not be an issue.)