Iterative policy evaluation on FrozenLake-v0 (Python)

In this example, we use the iterative policy iteration algorithm to train an agent on the FrozenLake-v0 environment. In PyCubeAI there is a tabular implementation of the algorithm implemented in the IterativePolicyEvaluator class.

Imports needed to run the example.

import gym
import numpy as np
import matplotlib.pyplot as plt
from src.algorithms.dp.iterative_policy_evaluation import IterativePolicyEvaluator, DPAlgoConfig
from src.policies.uniform_policy import UniformPolicy
from src.algorithms.rl_serial_agent_trainer import RLSerialTrainerConfig, RLSerialAgentTrainer
from src.worlds.world_helpers import n_states, n_actions

Next we implement a helper function for plotting the value function

def plot_values(v: np.array) -> None:

    # reshape value function
    V_sq = np.reshape(v, (4, 4))

    # plot the state-value function
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    im = ax.imshow(V_sq, cmap='cool')
    for (j, i), label in np.ndenumerate(V_sq):
        ax.text(i, j, np.round(label, 5), ha='center', va='center', fontsize=14)
    plt.tick_params(bottom=False, left=False, labelbottom=False, labelleft=False)
    plt.title('State-Value Function')

Finally, we put everything together. The driver starts with a UniformPolicy instance which we initalize with the number of states and actions of the environment.

if __name__ == '__main__':

    env = gym.make("FrozenLake-v0")

    policy_init = UniformPolicy(n_actions=n_actions(env),  n_states=n_states(env))

    agent_config = DPAlgoConfig()
    agent_config.gamma = 1.0
    agent_config.tolerance = 1.0e-8
    agent_config.policy = policy_init

    agent = IterativePolicyEvaluator(agent_config)

    config = RLSerialTrainerConfig()
    config.n_episodes = 100

    trainer = RLSerialAgentTrainer(agent=agent, config=config)

    ctrl_res = trainer.train(env)

    print(f"Converged {ctrl_res.converged}")
    print(f"Number of iterations {ctrl_res.n_itrs}")
    print(f"Residual {ctrl_res.residual}")


Running the driver code above produces the following output

INFO: Episode 0 of 100, (1.0% done)
INFO: Episode 10 of 100, (11.0% done)
INFO: Episode 20 of 100, (21.0% done)
INFO: Episode 30 of 100, (31.0% done)
INFO: Episode 40 of 100, (41.0% done)
INFO: Episode 50 of 100, (51.0% done)
INFO: On Episode 56 the break training flag was set. Stop training
INFO Done. Execution time 0.013857874000677839 secs
Converged True
Number of iterations 57
Residual 1e-10

The image below shown the value function produced