Introduction
Reinforcement learning is one of the most powerful forms of machine learning, and when it is paired with a deep neural network, it can produce an agent that plays entirely on its own.
Deep reinforcement learning is the AI technique that combines reinforcement learning (RL) with deep learning, which lets us hand even very large problems to a machine and have it learn a solution from experience.
Neural Networks
Designing a good heuristic by hand is very difficult. To improve one, we usually have to play the game many times, checking whether a different move would have been better, which mistakes are being made, and what can be learned from each mistake.
In this article, we will therefore replace the heuristic with a neural network that learns good moves directly from gameplay.
The network accepts the current game board as input and produces an output for each possible move.
![Output]()
The agent then selects a move by treating these outputs as probabilities. For example, for the board in the image above, the agent selects one particular column with 50% probability. To encode a good game-playing strategy, we just need to adjust the weights of the network so that, for every game board, it assigns higher probabilities to better moves.
At a minimum, our goal is for the agent to keep playing good moves on its own, without a human choosing an option for it. This is what a reinforcement learning model accomplishes.
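To make the input and output shapes concrete, here is a minimal sketch of such a network in plain NumPy. The 6x7 board, the cell encoding, and the single linear layer with a softmax are assumptions made for illustration, not the article's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: the 6x7 board is flattened into 42 inputs,
# and the network outputs one probability per column (7 moves).
W = rng.normal(scale=0.1, size=(42, 7))  # weights: 42 inputs -> 7 moves
b = np.zeros(7)

def move_probabilities(board):
    """board: 6x7 array with 0 = empty, 1 = agent, 2 = opponent."""
    logits = board.reshape(-1) @ W + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

board = np.zeros((6, 7))                  # an empty board
probs = move_probabilities(board)
column = rng.choice(7, p=probs)           # sample a move from the output
```

Training then amounts to adjusting `W` and `b` so that better moves receive higher probabilities.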
Setup
Next, we will look at how the network's weights are actually learned.
- After every move, we give the agent a reward (a score) that tells it how well it did:
- If the agent wins the game on that move, we give it a reward of +10.
- If the agent makes an invalid move in the game, we give it a reward of -5, and the game ends.
- If the opponent wins the game on its next move (meaning the agent failed to block its opponent's win), we give the agent a reward of -10.
- Otherwise, the agent gets a reward of 1/42 (one over the 6*7 = 42 squares on the board).
- At the end of every game, the agent adds up all of the rewards it received; we refer to this sum as the agent's cumulative reward. (The sketch after this list reproduces the examples below in code.)
- For example, if the game lasted 7 moves (the agent moved first and played four times) and the agent won on its final move, its cumulative reward is 3*(1/42) + 10.
- If the game lasted 11 moves (the opponent moved first, so the agent played five times) and the opponent won on its last move, the agent's cumulative reward is 4*(1/42) - 10.
- If the game ends in a tie, the agent played exactly 21 moves and receives a cumulative reward of 21*(1/42).
- And if the agent moved first and the game ended after 7 moves because the agent's fourth move was invalid, its cumulative reward is 3*(1/42) - 5.
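Here is a small Python sketch that reproduces the four worked examples above; the `cumulative_reward` helper and its argument names are made up for illustration:

```python
PER_MOVE = 1 / 42  # reward for a move that doesn't end the game

def cumulative_reward(agent_moves, outcome):
    """agent_moves: how many moves the agent made in the game.
    outcome: 'win' (+10), 'invalid' (-5), 'loss' (-10), or 'tie'."""
    if outcome == "tie":
        return agent_moves * PER_MOVE          # every move earns 1/42
    final = {"win": 10, "invalid": -5, "loss": -10}[outcome]
    return (agent_moves - 1) * PER_MOVE + final

print(cumulative_reward(4, "win"))      # 3*(1/42) + 10
print(cumulative_reward(5, "loss"))     # 4*(1/42) - 10
print(cumulative_reward(21, "tie"))     # 21*(1/42)
print(cumulative_reward(4, "invalid"))  # 3*(1/42) - 5
```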
Our goal is to find the weights of the network that (on average) maximize the agent's cumulative reward; this number also tells us how well our model is working. Framing the problem in terms of rewards is how we communicate to the agent what we want it to do, and once the problem is stated this way, there are many algorithms we can use to solve it.
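Stated as a formula (the notation here is introduced for clarity and is not from the original), the objective is to find the weights $\theta$ that maximize the expected cumulative reward:

$$\theta^{*} = \arg\max_{\theta}\, \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, \theta\right],$$

where $r_t$ is the reward the agent receives for its $t$-th move and $T$ is the number of moves the agent makes in the game.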
Reinforcement Learning
There are several major reinforcement learning algorithms, such as DQN, A2C, and PPO. All of these algorithms follow the same basic recipe for producing agents:
- Initially, all of the network's weights are set to random values.
- As the agent plays games, the algorithm continually tries new values for the weights to see how the average cumulative reward is affected. Over many games, it gradually shifts the weights toward values that perform best (a toy version of this search is sketched after this list).
- This description glosses over many details: the algorithms differ in exactly how they update the weights, and networks with many hidden layers add further complexity, but it captures the main idea.
- In this way, we end up with an agent that tries to win the game (to earn the final reward of +10 and avoid the -5 and -10 penalties) and tries to make the game last as long as possible (to collect the 1/42 bonus as many times as possible).
- You might object that rewarding long games is counterproductive: there is nothing new to learn from dragging a game out, and it discourages the agent from playing the obvious winning moves early in the gameplay. And your intuition would be correct: the 1/42 bonus does make the agent take longer to play the winning moves! We include it anyway because it helps the algorithms we are using to converge; you can learn more by reading about the "temporal credit assignment problem" and "reward shaping".
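As a miniature illustration of this recipe, here is a toy hill-climbing loop in Python. It is not DQN, A2C, or PPO (those update the weights with gradients rather than random perturbations), and `play_game` is a made-up stand-in for a full game simulation, but it shows the shared idea: try new weight values and keep the ones that raise the average cumulative reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_game(weights):
    """Made-up stand-in for one full game: returns a noisy cumulative
    reward, with a single best setting of the weights at 0.5."""
    noise = rng.normal(scale=0.1)
    return -np.sum((weights - 0.5) ** 2) + noise

def average_reward(weights, games=20):
    """Average cumulative reward over several games."""
    return np.mean([play_game(weights) for _ in range(games)])

weights = rng.normal(size=8)        # step 1: start from random weights
for step in range(200):             # step 2: repeatedly try new values
    candidate = weights + rng.normal(scale=0.05, size=weights.shape)
    if average_reward(candidate) > average_reward(weights):
        weights = candidate         # keep the weights that work best
```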