This week, our team gave our second Qualification Review Board (QRB) presentation! Unlike the first QRB presentation, this one focused on our testing plan, in which we run our model against player bots of increasing difficulty to see how well it performs.
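For context, the sketch below shows the general shape of that kind of evaluation loop: play a fixed number of games against each bot and record the win rate at each difficulty level. The play_game stub, the bot names, and the difficulty ordering are hypothetical placeholders, not our actual harness.

import random

def play_game(model, opponent):
    # Placeholder for one full game between our model and an opponent
    # bot; returns True if the model wins. Our real harness plays an
    # actual game here.
    return random.random() < 0.5

def evaluate(model, opponents, num_games=100):
    # Run the model against each bot, easiest to hardest, and report
    # the win rate at each difficulty level.
    for name, opponent in opponents:
        wins = sum(play_game(model, opponent) for _ in range(num_games))
        print(f"vs {name}: {wins / num_games:.0%} win rate")

# Hypothetical difficulty ladder; the names are stand-ins for our bots.
evaluate(model=None, opponents=[("random", None), ("weighted", None), ("search", None)])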

In addition to our testing plan, we explained our Graph Neural Network (GNN) approach to building a Reinforcement Learning (RL) model and laid out our long-term goals for the project.
Overall, we were happy with how the presentation went. We worked hard to improve on our mistakes from the first presentation: we presented our team roles, explained our diagrams more clearly, and used our time more wisely. We also received excellent feedback from the review board. Among their suggestions was building a more interactive demo to better showcase our project, which will prove helpful for our next Prototype Inspection Day (PID).
As for our current progress, we found some interesting behaviors in our models. As mentioned in a previous post, RL models attempt to maximize the reward given by a reward function. However, if the reward function is too simple, the model may learn unintended behaviors.
Consider the following example:
def my_reward_function(game, p0_color):
    # Called at every timestep, not just at the end of the game.
    winning_color = game.winning_color()
    if p0_color == winning_color:
        return 100   # our player is winning
    elif winning_color is None:
        return 0     # no winner yet
    else:
        return -100  # another player is winning
The above reward function gives a positive reward when the player is winning and a negative reward when the player is losing. Although this makes sense at first glance, the reward is calculated at every timestep, not just at the end of the game. Thus, if the model is already winning, it can maximize its total reward by prolonging the game and collecting that positive reward over and over.
Our models end up collecting 8 or 9 victory points (10 are needed to win), then stop playing until another player wins the game. Ironically, our models lose by trying to stay in a winning position. This isn’t the behavior we want. What we need is a reward function that incorporates both victory points and time.
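As a starting point, here is a minimal sketch of what such a reward function might look like: it rewards victory-point progress rather than a winning position, and charges a small penalty every timestep, so stalling near a win steadily loses reward. The vps and prev_vps parameters and the specific constants (10 and 0.1) are hypothetical; we are still experimenting with the exact shape.

def time_aware_reward(game, p0_color, vps, prev_vps):
    # vps / prev_vps: our player's victory points now and at the
    # previous timestep (hypothetical inputs; our real code would
    # read them from the game state).
    winning_color = game.winning_color()
    if winning_color == p0_color:
        return 100    # won the game
    elif winning_color is not None:
        return -100   # another player won
    # Reward victory-point *progress* and subtract a small time
    # penalty, so idling in a winning position loses reward each
    # step while gaining points (and finishing the game) earns it.
    return 10 * (vps - prev_vps) - 0.1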
Next week, our team will continue working toward improving our models. See you then!