Nice!
And indeed a lot of Ss 
Thanks for the code!
I agree. In my opinion, too, that particular blur implementation does not work very well when the input has sharp edges. Offset copies of the edges will show in the output, because it uses so few samples. In LensFlare it looks decent, but for most other applications I think more samples are needed.
You could use a two-pass blur like the BlurSharpen filter does. At least I got good results in the LocalReflection filter when I replaced the hardcoded single-pass fast gaussian blur with the two-pass approach from BlurSharpen. (LocalReflection actually includes both blur modes so you can compare - single pass uses the fast hardcoded gaussian, while two-pass uses the algorithm from BlurSharpen.)
Granted, the current BlurSharpen filter uses a rectangular kernel (as in convolution kernel in signal processing), which may or may not do what you want. If you specifically want a gaussian, it wouldn’t be too hard to modify BlurSharpen into an efficient two-pass gaussian. This basically only needs rewriting the blur kernels (as in computational kernel i.e. shader) - the control logic can be kept as it is.
The basic idea of the two-pass blur is to notice that the 2D blur kernel is separable: it is a product of two 1D blur kernels (along the x and y directions), or in other words K(x,y) = K1(x)K2(y). This holds for both the rectangular and the gaussian kernel. So you get the same result if you apply two 1D blurs in succession (first blur along x, then blur the result along y, or the other way around - the ordering doesn’t matter). This reduces the number of taps required from N**2 (sample every pixel in stencil) to 2N (sample along two lines), where N is the blur radius* in pixels. By further utilizing the “free” (hardware-based) linear interpolation of GPUs, by sampling at cleverly chosen points between pixels and weighting the results appropriately, you can reduce this further to around N taps.
(* Strictly speaking, the side-length of the square-shaped stencil.)
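To make the separability argument concrete, here is a minimal CPU-side sketch in plain Python (not shader code, and not taken from BlurSharpen). All function names are made up for illustration, and clamp-to-edge addressing is an assumption on my part. It blurs a small grayscale image once with the full 2D kernel K(x,y) = K1(x)K2(y) and once with two 1D passes, so the results can be compared.

```python
def clamp(i, lo, hi):
    """Clamp index i into [lo, hi] (mimics clamp-to-edge texture addressing)."""
    return min(max(i, lo), hi)

def blur_rows(img, k):
    """Convolve every row of img with the 1D kernel k (odd length)."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    return [[sum(k[i + r] * img[y][clamp(x + i, 0, w - 1)]
                 for i in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]

def transpose(img):
    """Swap rows and columns, so that blur_rows can be reused for the y pass."""
    return [list(col) for col in zip(*img)]

def blur_two_pass(img, k):
    """Blur along x, then along y: 2N taps per pixel."""
    horizontal = blur_rows(img, k)
    return transpose(blur_rows(transpose(horizontal), k))

def blur_single_pass(img, k):
    """Blur with the full 2D kernel K(x,y) = k[x]*k[y]: N**2 taps per pixel."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    return [[sum(k[j + r] * k[i + r]
                 * img[clamp(y + j, 0, h - 1)][clamp(x + i, 0, w - 1)]
                 for j in range(-r, r + 1)
                 for i in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]
```

For a kernel such as `[0.25, 0.5, 0.25]`, `blur_two_pass` and `blur_single_pass` agree up to floating-point rounding, while the two-pass version reads 2*3 = 6 samples per pixel instead of 3**2 = 9.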
On top of this, depending on the application, you may be able to save a further ~75% of run time in the blur passes by first downsampling the blur input to quarter resolution (half the width and half the height, so a quarter of the pixels). As a further bonus, this doubles your blur radius (as measured on screen) at the same kernel size (as measured in pixels).
Since we are applying a blur, and thus there shouldn’t be any sharp edges in the output, the lower resolution might not be noticeable in the result. When you do the compositing onto the final image, the quarter-resolution blur texture is then automatically bilinearly interpolated by the GPU back to full resolution.
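As a rough CPU-side illustration of the downsample and upsample steps, here is a small Python sketch. It assumes a 2x2 box average for the downsampling, texel centers at half-integer coordinates and clamp-to-edge addressing. The function names are hypothetical, and on a real GPU the bilinear part is done by the texture unit rather than by your own code.

```python
import math

def downsample_2x2(img):
    """Average each 2x2 block. Assumes even width and height."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1]
              + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def sample_bilinear(img, u, v):
    """Sample img at continuous texel coordinates (u, v).

    Texel (x, y) has its center at (x + 0.5, y + 0.5); reads outside the
    image are clamped to the nearest edge texel.
    """
    h, w = len(img), len(img[0])
    fx, fy = u - 0.5, v - 0.5
    x0, y0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - x0, fy - y0

    def texel(x, y):
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    top = texel(x0, y0) * (1 - tx) + texel(x0 + 1, y0) * tx
    bottom = texel(x0, y0 + 1) * (1 - tx) + texel(x0 + 1, y0 + 1) * tx
    return top * (1 - ty) + bottom * ty

def upsample_2x(img):
    """Stretch img back to double width and height via bilinear sampling."""
    h, w = len(img), len(img[0])
    return [[sample_bilinear(img, (x + 0.5) / 2, (y + 0.5) / 2)
             for x in range(2 * w)]
            for y in range(2 * h)]
```

The blur passes would run on the output of `downsample_2x2`, which has a quarter of the pixels, and something like `upsample_2x` happens implicitly when the small texture is sampled during compositing.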
If you choose the downsampling approach, it greatly simplifies the blur shaders to first apply a separate downsampling pass (making this a three-pass algorithm), because this lets you assume a 1:1 mapping between input and output pixels in the actual blur passes. This extra pass is almost required if you want to use the linear interpolation technique while keeping the shader code as simple as possible.
The two-pass technique with GPU-based interpolation is explained in detail in this article http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/, which I think ninth linked way back in another thread. You can get some more ideas from this comparison of fast gaussian blur algorithms http://blog.ivank.net/fastest-gaussian-blur.html, although this second article focuses on a CPU-based implementation.
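To show the arithmetic behind the linear sampling trick, here is a minimal 1D sketch in Python. It is my own illustration of the idea, not code from either article, and the function names are made up. Two neighboring taps with weights w1 and w2 at offsets o1 and o2 are replaced by one linearly filtered fetch with weight w1 + w2 at offset (o1*w1 + o2*w2) / (w1 + w2). The sketch works in pixel-index coordinates with clamped edges and assumes an even radius; a real shader would use texture coordinates instead.

```python
import math

def gaussian_weights(radius, sigma):
    """One-sided weights w[0..radius], normalized so the full kernel sums to 1."""
    raw = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(radius + 1)]
    total = raw[0] + 2.0 * sum(raw[1:])
    return [v / total for v in raw]

def merge_taps(weights):
    """Pair offsets (1,2), (3,4), ... into single (offset, weight) taps.

    Assumes an even radius, i.e. len(weights) is odd.
    """
    taps = [(0.0, weights[0])]
    for i in range(1, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        offset = (i * weights[i] + (i + 1) * weights[i + 1]) / w
        taps.append((offset, w))
    return taps

def fetch_linear(row, pos):
    """Mimic a linearly filtered fetch at fractional index pos (clamped)."""
    pos = min(max(pos, 0.0), len(row) - 1.0)
    i0 = int(math.floor(pos))
    i1 = min(i0 + 1, len(row) - 1)
    f = pos - i0
    return row[i0] * (1 - f) + row[i1] * f

def blur_at_discrete(row, x, weights):
    """Reference blur at pixel x: one fetch per pixel in the stencil."""
    n = len(row)
    total = weights[0] * row[x]
    for i in range(1, len(weights)):
        total += weights[i] * (row[min(x + i, n - 1)] + row[max(x - i, 0)])
    return total

def blur_at_linear(row, x, taps):
    """Same blur at pixel x using the merged, linearly filtered taps."""
    total = taps[0][1] * row[x]
    for offset, w in taps[1:]:
        total += w * (fetch_linear(row, x + offset) + fetch_linear(row, x - offset))
    return total
```

With radius 4 the reference version does 9 fetches per pixel per pass, while the merged version does 5 (the center plus two taps on each side), and both give the same value up to rounding.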
I’ve been meaning to do this (I’d like to have the gaussian blur kernel as an option), but I thought I needed a break from postprocessing, so for a change I’m currently looking into making a faster hair physics simulator. 
(On which note, I think you have been experimenting with something Bullet-based along the same lines? Maybe we could combine forces, before I go reinventing the square wheel. I have in mind one way of making a fast(er) hair simulator that I’ve been discussing with rdb, but it still involves custom physics code.
Using Bullet would have the advantage of having a pre-made collision system available, so that I wouldn’t have to duplicate one of the messiest parts of Newtonian physics, namely contact mechanics. It involves a lot of tricky geometric considerations and not much actual physics - at least when considering the simplest variants with Coulomb friction and the perfect slip model often used in games.)