I’ve managed to complete a new lower bound construction for Fearnley’s non-oblivious policy iteration method. See Fearnley’s paper here. See my preprint here.
The basic idea of Fearnley’s rule is to remember such intermediate cycle structures (it is a bit more general than that, but you get the idea) as I’ve used in my previous constructions whenever they occur. Then, whenever it is profitable, they are reapplied again. As it turns out, the binary counting behaviour of the original constructions breaks down, resulting in a quadratic number (educated guess, might be a little higher) of iterations.
My new subexponential lower bound for Fearnley’s rule works by introducing an exponential number of different intermediate cycles (rather than just a polynomial number). The idea is to have
different cycle configurations for the first bit,
configurations for the second bit and so on, where there is only one applicable configuration per global binary counter state. More precisely, the applicable cycle configuration of bit
depends on the bit settings of all higher bits
. Hence, no matter which cycles are learned by the rule, they will never be applicable again, and therefore learning is of no use here.
To give you a rough idea of how this looks like, I’ve made some animations where you can see what the algorithm is actually doing on the game:
Let me know what you think.