As promised, here is what I’ve pieced together (none of it is original, so I can take no credit). It is based on a simplified version of a real-world problem: ordering, or ‘making flat’, a group of objects on a conveyor belt so that none are on top of each other. In my setup, this is accomplished by a computer-controlled ‘finger’ that sits just above one object height.
Every time an object is flattened, the AI gets a reward. That reward is the only thing the AI learns from (nothing else!). As input, the AI ‘sees’ what we see in this clip:
This was after training from scratch for 55 hours of simulation play.
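To make the reward idea concrete, here is a tiny sketch of what ‘a reward every time an object is flattened’ could look like. To be clear, this is not the code from my setup: the real thing works from the pixels in the clip, while here I’m assuming the belt can be described as a simple list of pile heights, and the function names (`count_stacked`, `flatten_reward`) are made up for illustration.

```python
from typing import List


def count_stacked(heights: List[int]) -> int:
    """Count objects that are resting on top of another object.

    Assumption: heights[i] is the number of objects piled at belt
    position i, so everything above the first object in a pile
    counts as stacked.
    """
    return sum(max(h - 1, 0) for h in heights)


def flatten_reward(before: List[int], after: List[int]) -> int:
    """Give +1 for each object that came off a pile during one step.

    Hypothetical reward rule: compare the stacked count before and
    after the finger acts; never return a negative reward.
    """
    return max(count_stacked(before) - count_stacked(after), 0)


# One object is pushed off the pile at position 0 onto the empty position 1.
before = [2, 0, 1, 3]
after = [1, 1, 1, 3]
print(flatten_reward(before, after))
```

The point is just that the learner is never told how to move the finger. It only ever gets this kind of number back after each step.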
Here is the code I’ve assembled (again, credit goes elsewhere).
It is heavily based on this guy’s work.