Some thoughts/observations:
- Disable vsync when profiling to avoid weird “stair-steps” in fps
- Frame time is better for profiling since it is linear (going from 30fps to 40fps is a much bigger improvement than going from 300fps to 310fps, so a difference of “10fps” means nothing without context)
- Multi-threading in Python isn’t going to use more CPU cores because of Python’s Global Interpreter Lock (Gil); multiprocessing can work around this, but it read its own issues
- NumPy will help get an lot more out of a single thread
- Computer shaders are well suited to this kind of thing, but the learning curve can be steep