Problems making a Tic-Tac-Toe bot

This page summarizes the projects mentioned and recommended in the original post on /r/reinforcementlearning

  • gym

    A toolkit for developing and comparing reinforcement learning algorithms.

  • First, you should restructure your code around something more standard. Specifically, reinforcement learning is typically modeled as a Markov decision process (MDP). Writing your code to follow this model will make debugging easier (for you and for anyone helping you) and will keep your implementation compatible with other frameworks. Check out OpenAI Gym to see how they structure things (a minimal sketch of a Gym-style environment appears after this list). Obviously, if you're just experimenting/learning this is less critical, but you'll save yourself a lot of grief if you get into the habit sooner rather than later.

  • open_spiel

    OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

  • I'm not sure if this is helpful, but there is a reference implementation in OpenSpiel. If you look at python/examples/breakthrough_dqn.py and change game = "breakthrough" to game = "tic_tac_toe", and then env_configs = {} (the exact two-line change is shown after this list), this runs DQN and evaluates against random. After 200k episodes, I'm getting an average reward of 0.995 for player 1 and around 0.9 for player 2. If you average over seats that's 0.95, but this is in [-1,1], so that's 0.975 in [0,1]. Keep in mind that's not really the percentage of wins, since it's not separating wins, draws, and losses -- though you can easily modify that if you want -- but the win percentage won't be any higher than that. The code base also has Connect Four, so you can try that too. And if you really want the true upper bound, you can solve the game with value iteration to get the optimal policy and then do expectimax over a random policy versus the optimal greedy policy (or estimate it from many Monte Carlo samples). The win percentage may still rise with more training, but no matter how you slice it, you're not going to get to 100% wins :)
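
For reference, the two-line change described in the comment above, applied to open_spiel's python/examples/breakthrough_dqn.py (only these two assignments change; the rest of the script stays as shipped):

    # python/examples/breakthrough_dqn.py -- switch the trained game to Tic-Tac-Toe
    game = "tic_tac_toe"  # was: game = "breakthrough"
    env_configs = {}      # Tic-Tac-Toe takes no extra environment configuration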
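
To make the MDP advice from the first comment concrete, here is a minimal sketch of what a Gym-style Tic-Tac-Toe environment might look like. It assumes the classic gym.Env reset()/step() API with 4-tuple step returns; the TicTacToeEnv class and its sign-flipping self-play perspective are illustrative choices, not code from the thread:

    import numpy as np
    import gym
    from gym import spaces

    class TicTacToeEnv(gym.Env):
        """Minimal Tic-Tac-Toe MDP following the classic gym.Env API (sketch).

        Board cells: 0 = empty, 1 = the player to move, -1 = the opponent.
        The board is negated after every move, so the acting player always
        sees itself as +1 and one agent can play both seats in self-play.
        """
        WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                     (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                     (0, 4, 8), (2, 4, 6)]              # diagonals

        def __init__(self):
            self.action_space = spaces.Discrete(9)  # one action per cell
            self.observation_space = spaces.Box(-1, 1, (9,), dtype=np.int8)
            self.board = np.zeros(9, dtype=np.int8)

        def reset(self):
            self.board[:] = 0
            return self.board.copy()

        def step(self, action):
            if self.board[action] != 0:  # illegal move: end episode with a loss
                return self.board.copy(), -1.0, True, {"illegal": True}
            self.board[action] = 1
            if any(self.board[list(line)].sum() == 3 for line in self.WIN_LINES):
                return self.board.copy(), 1.0, True, {}   # the mover just won
            if not (self.board == 0).any():
                return self.board.copy(), 0.0, True, {}   # board full: draw
            self.board *= -1                              # switch perspective
            return self.board.copy(), 0.0, False, {}

A random-vs-random rollout then reads:

    env = TicTacToeEnv()
    obs, done = env.reset(), False
    while not done:
        legal = np.flatnonzero(obs == 0)  # indices of empty cells
        obs, reward, done, info = env.step(np.random.choice(legal))

One subtlety this sketch glosses over: each step's reward is from the perspective of the player who just moved, so the seat that made the second-to-last move never directly sees a -1 for losing; a real training loop has to propagate the terminal reward to both seats.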
