Understanding the Bellman Equation
I’m going through the reinforcement learning book by Sutton and Barto to get a better understanding of how RL works and to build the intuition behind the math that powers it.
In a previous post I wrote about how I started brushing up my math fundamentals, and that gave me the confidence to pick up this book, tackle the theory and math, and push myself even further.
When I came across the Bellman equation, I understood what it was doing at a high level but couldn’t build up the reasoning and intuition behind some of the operations that were happening along the way.
It took me roughly two weeks before it finally clicked. I had to do multiple readings, go back to the fundamentals of each concept the author used implicitly, and write it all down numerous times.
This post is my attempt to derive this beautiful equation, and what better way to internalize it than to write about it?
What is a Bellman Equation?
The Bellman equation transforms an intractable “look at all possible futures” problem into a tractable “look one step ahead and just trust your value estimates” problem. This is what makes RL computationally feasible. For the state value function, it reads:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Yes, it looks weird at first, and that’s exactly what I felt. But we’re going to unpack this slowly, and I hope you’ll have a better understanding by the end of it. Buckle up!
Basic Setup and Definitions
Markov Decision Process
An MDP consists of:
- States $\mathcal{S}$: All possible situations the agent can be in (e.g. positions on a grid)
- Actions $\mathcal{A}$: All possible moves the agent can make (e.g. up, down, left or right)
- Transition Dynamics $p(s', r \mid s, a)$: Probability of landing in state $s'$ and getting reward $r$ when taking action $a$ from state $s$
- Discount Factor $\gamma \in [0, 1]$: How much we value future rewards vs. immediate rewards
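To make the setup concrete, here is a minimal sketch of how such an MDP could be represented in plain Python. The two states, two actions, rewards, and transition probabilities below are made up purely for illustration:

```python
# A tiny hypothetical MDP, represented with plain Python dicts.
states = ["s0", "s1"]
actions = ["left", "right"]

# p[(s, a)] -> list of (next_state, reward, probability) outcomes
p = {
    ("s0", "left"):  [("s0", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

gamma = 0.9  # discount factor

# Sanity check: for every (state, action) pair, outcome probabilities sum to 1
for outcomes in p.values():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```

Note that the dynamics are a distribution over (next state, reward) pairs, not just next states, which is exactly the $p(s', r \mid s, a)$ form used in the rest of this post.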
Policy
A policy $\pi$ is a function that tells the agent what to do in each state. The probability of taking action $a$ in state $s$ is defined as $\pi(a \mid s)$.
Return
The return $G_t$ is the total discounted reward from time $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Let’s unpack this notation:
- $R_{t+1}$: Reward received at the next time step
- $\gamma R_{t+2}$: Reward two steps ahead, discounted by $\gamma$
- and so on, forever…
We apply the discount factor $\gamma$ here to define how much we care about immediate vs. future rewards. If $\gamma$ is 0, we only care about the immediate reward. If it is 1, all future rewards matter equally. If it’s between 0 and 1, future rewards matter, but less than immediate ones.
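This definition translates directly into code. A small sketch, using a made-up reward sequence, that shows how $\gamma$ changes the return:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... for illustration
rewards = [1.0, 0.0, 2.0, 1.0]

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
print(discounted_return(rewards, 0.9))  # in between: later rewards shrink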
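This definition translates directly into code. A small sketch, using a made-up reward sequence, that shows how $\gamma$ changes the return:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... for illustration
rewards = [1.0, 0.0, 2.0, 1.0]

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
print(discounted_return(rewards, 0.9))  # in between: later rewards shrink
```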
State Value Function
This function tells us the value of a state $s$ if we follow a policy $\pi$. It’s defined as:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Let’s also unpack this:
- $v_\pi(s)$: Value of state $s$ under policy $\pi$
- $\mathbb{E}_\pi$: Expected value when following policy $\pi$
- $G_t$: Return (total discounted reward)
- $S_t = s$: We’re in state $s$ at time $t$
State value functions are useful because if we know $v_\pi(s)$ for all states, we can compare which states are better to be in.
Recursive Property of Returns
This is one of the most important insights that makes the Bellman equation possible. We know that the total discounted return is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots$$

Let’s factor out $\gamma$ from everything except the first term:

$$G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots \right)$$

If you look carefully at what’s inside the parentheses, it’s the total discounted return from time $t+1$, which is $G_{t+1}$. Therefore, we can write $G_t$ as:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

This is important because it reduces the discounted return from time $t$ to the immediate reward plus $\gamma$ times the discounted return from time $t+1$.
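We can sanity-check this recursion numerically. A short sketch with a made-up reward sequence:

```python
# Hypothetical rewards and discount factor for illustration
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

g_t  = discounted_return(rewards, gamma)      # G_t: return from time t
g_t1 = discounted_return(rewards[1:], gamma)  # G_{t+1}: return from time t+1

# The recursion: G_t == R_{t+1} + gamma * G_{t+1}
assert abs(g_t - (rewards[0] + gamma * g_t1)) < 1e-12
```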
Derivation
Now that the basics are out of the way, we’ll derive the Bellman equation step by step.
Step 0: Write the definition
Let’s start with the definition of the value function. We’re in state $s$ at time $t$, and we want the expected total return from this point onward:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
Step 1: Substitute the recursive formula
We just showed that $G_t = R_{t+1} + \gamma G_{t+1}$, so substitute this into the function above:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$
Step 2: Use Linearity of Expectation
Now, let’s expand the above equation using linearity of expectation so we can deal with the individual parts:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

We did not do anything fancy here; we just separated $R_{t+1}$ and $G_{t+1}$ and pulled out the constant $\gamma$.
This is still the state value function, which now reads as (expected immediate reward) + (discount factor × expected future return).
Step 3: Condition on Action
Now, we need to compute $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ and $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$. But the problem right now is, we’re in state $s$, and what happens next depends on what action we take. To solve this, we can condition on the action.
When we condition on a random variable, what we’re essentially saying is: I don’t know the value yet, so let me consider all possible values it could take, weighted by how likely each value is.
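This is the law of total expectation in action. A tiny numeric check, with made-up probabilities and conditional expectations:

```python
# Law of total expectation: E[X] = sum_a P(A = a) * E[X | A = a]
# The probabilities and conditional expectations below are made up.
p_action = {"up": 0.3, "down": 0.7}   # P(A = a)
e_given_a = {"up": 5.0, "down": 1.0}  # E[X | A = a]

e_x = sum(p_action[a] * e_given_a[a] for a in p_action)
print(e_x)  # 0.3 * 5.0 + 0.7 * 1.0 = 2.2
```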
So, let’s condition on the action $A_t$ by applying the law of total expectation:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a P(A_t = a \mid S_t = s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a]$$

and, as we saw above during setup, $P(A_t = a \mid S_t = s)$ is just the definition of the policy $\pi(a \mid s)$.
Hence, we can simplify the equation to:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a]$$
Similarly, the expected future return can be written as:

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$
Let’s substitute both of these back into our value function:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$

Which we can combine and simplify to:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$
Step 4: Condition on Next State and Reward
Now comes another challenge. We know the state $s$ and action $a$, but we still don’t know what reward we’ll get and what next state we’ll land in.
However, we can condition on the next state and reward, just like we did in the step above. These are determined by the environment dynamics $p(s', r \mid s, a)$, which gives the probability of ending up in next state $s'$ with reward $r$ when we take action $a$ in state $s$.
We’ll apply the law of total expectation again and condition on $S_{t+1}$ and $R_{t+1}$:

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s'} \sum_{r} P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

We can simplify the notation and write the above equation as:

$$\sum_{s', r} p(s', r \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

This means we sum over all possible next states $s'$ and rewards $r$, weighted by the probability of getting that pair when taking action $a$ from state $s$.
Now, we’re conditioning on $R_{t+1} = r$, which means $R_{t+1}$ is no longer a random variable; it has the specific value $r$. If we know a random variable equals a specific value, its expected value is just that value.
So, we can write

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

as

$$r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$
Step 5: Apply Markov Property
The Markov property states that the future depends only on the current state, not on the history. What this means for us is: given $S_{t+1} = s'$, the future return $G_{t+1}$ is independent of $S_t$, $A_t$, and $R_{t+1}$. This leads us to

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$

The expected future return starting from time $t+1$, given we’re in state $s'$, doesn’t depend on how we got to $s'$. So substitute this back:

$$r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$
Step 6: Recursive Nature
Look at this term:

$$\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$

This is asking: what’s the expected return starting from state $s'$, following policy $\pi$? And if you recall, that’s exactly the definition of $v_\pi(s')$.
The question arises: why can we do this? It’s because the value function doesn’t depend on time, only on the state. Whether we’re computing the expected return from state $s'$ at time $t$ or at time $t+1$, we get the same answer, $v_\pi(s')$.
Therefore, it’s fair to write

$$\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$$
This is a key recursive step. We’ve expressed the value of the current state in terms of the value of the next state.
Step 7: Collect All the Pieces
From steps 4-6, we saw:

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Now, let’s put that back into our equation from step 3:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$

which becomes

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

and this is the Bellman equation.
This equation averages at three levels: over actions, over next states, and over rewards, with each level weighted by its appropriate probability, the policy $\pi(a \mid s)$ for actions and the dynamics $p(s', r \mid s, a)$ for next states and rewards.
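To see the equation doing real work, here is a minimal policy-evaluation sketch: it applies the Bellman equation as an update rule on a made-up two-state MDP with a uniform random policy, sweeping until the values stop changing. All names and numbers here are illustrative, not from the book:

```python
# Hypothetical 2-state MDP: p[(s, a)] -> list of (next_state, reward, probability)
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform random policy
gamma = 0.9

# Iterative policy evaluation: repeatedly replace V(s) with the
# right-hand side of the Bellman equation until it stops changing.
V = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        # v_pi(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * v_pi(s'))
        new_v = sum(
            pi[s][a] * sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-10:
        break

print(V)  # values that satisfy the Bellman equation for this MDP
```

At the fixed point, each state’s value equals the Bellman right-hand side exactly, which is the defining property of $v_\pi$: the one-step lookahead plus trusted value estimates replaces summing over all possible futures.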
Final Thoughts
I’ve tried my best to explain this equation while being explicit about the math fundamentals. I’d love to hear from you if you enjoy RL and the math behind it.
Please reach out to me if I’ve made any wrong conclusions in the explanation. I’d love to learn more.