Understanding the Bellman Equation
I’m going through the reinforcement learning book by Sutton and Barto to get a better understanding of how RL works and to build the intuition behind the math that powers it.
In a previous post I wrote about how I started brushing up my math fundamentals, and that gave me the confidence to pick up this book, tackle the theory and math, and push myself even further.
When I came across the Bellman equation, I understood what it was doing at a high level but couldn’t build up the reasoning and intuition behind some of the operations that were happening along the way.
It took me roughly two weeks before it finally clicked. I had to do multiple readings, go back to the fundamentals of each concept the author used implicitly, and write it all down numerous times.
This post is my attempt to derive this beautiful equation, and what better way to internalize it than to write about it?
What is a Bellman Equation?
The Bellman equation transforms an intractable “look at all possible futures” problem into a tractable “look one step ahead and just trust your value estimates” problem. This is what makes RL computationally feasible. For the state value function, it reads:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Yes, it looks weird at first, and that’s exactly what I felt. But we’re going to unpack this slowly, and I hope you’ll have a better understanding by the end of it. Buckle up!
Basic Setup and Definitions
Markov Decision Process
An MDP consists of:
- States $\mathcal{S}$: All possible situations the agent can be in (e.g. positions on a grid)
- Actions $\mathcal{A}$: All possible moves the agent can make (e.g. up, down, left or right)
- Transition Dynamics $p(s', r \mid s, a)$: Probability of landing in state $s'$ and getting reward $r$ when taking action $a$ from state $s$
- Discount Factor $\gamma \in [0, 1]$: How much we value future rewards vs. immediate rewards
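To make the setup concrete, here is a minimal sketch of how such an MDP could be represented in plain Python. The two states, two actions, rewards, and transition probabilities below are made up purely for illustration:

```python
# A tiny hypothetical MDP, represented with plain Python dicts.
states = ["s0", "s1"]
actions = ["left", "right"]

# p[(s, a)] -> list of (next_state, reward, probability) outcomes
p = {
    ("s0", "left"):  [("s0", 0.0, 1.0)],
    ("s0", "right"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

gamma = 0.9  # discount factor

# Sanity check: for every (state, action) pair, outcome probabilities sum to 1
for outcomes in p.values():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```

Note that the dynamics are a distribution over (next state, reward) pairs, not just next states, which is exactly the $p(s', r \mid s, a)$ form used in the rest of this post.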
Policy
A policy $\pi$ is a function that tells the agent what to do in each state. The probability of taking action $a$ in state $s$ is defined as $\pi(a \mid s)$.
Return
The return $G_t$ is the total discounted reward from time $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Let’s unpack this notation:
- $R_{t+1}$: Reward received at the next time step
- $\gamma R_{t+2}$: Reward two steps ahead, discounted by $\gamma$
- and so on, forever…
We apply the discount factor $\gamma$ here to define how much we care about immediate vs. future rewards. If $\gamma$ is 0, we only care about the immediate reward. If it is 1, all future rewards matter equally. If it’s between 0 and 1, future rewards matter, but less than immediate ones.
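This definition translates directly into code. A small sketch, using a made-up reward sequence, that shows how $\gamma$ changes the return:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... for illustration
rewards = [1.0, 0.0, 2.0, 1.0]

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
print(discounted_return(rewards, 0.9))  # in between: later rewards shrink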
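This definition translates directly into code. A small sketch, using a made-up reward sequence, that shows how $\gamma$ changes the return:

```python
# Hypothetical rewards R_{t+1}, R_{t+2}, ... for illustration
rewards = [1.0, 0.0, 2.0, 1.0]

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 4.0: all rewards count equally
print(discounted_return(rewards, 0.9))  # in between: later rewards shrink
```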
State Value Function
This function tells us the value of a state $s$ if we follow a policy $\pi$. It’s defined as:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Let’s also unpack this:
- $v_\pi(s)$: Value of state $s$ under policy $\pi$
- $\mathbb{E}_\pi$: Expected value when following policy $\pi$
- $G_t$: Return (total discounted reward)
- $S_t = s$: We’re in state $s$ at time $t$
State value functions are useful because if we know $v_\pi(s)$ for all states, we can compare which states are better to be in.
Recursive Property of Returns
This is one of the most important insights that makes the Bellman equation possible. We know that the total discounted return is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots$$

Let’s factor out $\gamma$ from everything except the first term:

$$G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots \right)$$

If you look carefully at what’s inside the parentheses, it’s the total discounted return from time $t+1$, which is $G_{t+1}$. Therefore, we can write $G_t$ as:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

This is important because it reduces the discounted return from time $t$ to the immediate reward plus $\gamma$ times the discounted return from time $t+1$.
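We can sanity-check this recursion numerically. A short sketch with a made-up reward sequence:

```python
# Hypothetical rewards and discount factor for illustration
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

g_t  = discounted_return(rewards, gamma)      # G_t: return from time t
g_t1 = discounted_return(rewards[1:], gamma)  # G_{t+1}: return from time t+1

# The recursion: G_t == R_{t+1} + gamma * G_{t+1}
assert abs(g_t - (rewards[0] + gamma * g_t1)) < 1e-12
```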
Derivation
Now that the basics are out of the way, we’ll derive the Bellman equation step by step.
Step 0: Write the definition
Let’s start with the definition of the value function. We’re in state $s$ at time $t$, and we want the expected total return from this point onward:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
Step 1: Substitute the recursive formula
We just showed that $G_t = R_{t+1} + \gamma G_{t+1}$, so substitute this into the function above:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$
Step 2: Use Linearity of Expectation
Now, let’s expand the above equation using linearity of expectation so we can deal with the individual parts:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

We did not do anything fancy here; we just separated $R_{t+1}$ and $G_{t+1}$ and pulled out the constant $\gamma$.
This is still the state value function, which now reads as (expected immediate reward) + (discount factor × expected future return).
Step 3: Condition on Action
Now, we need to compute $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ and $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$. But the problem right now is, we’re in state $s$, and what happens next depends on what action we take. To solve this, we can condition on the action.
When we condition on a random variable, what we’re essentially saying is: I don’t know the value yet, so let me consider all possible values it could take, weighted by how likely each value is.
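This is the law of total expectation in action. A tiny numeric check, with made-up probabilities and conditional expectations:

```python
# Law of total expectation: E[X] = sum_a P(A = a) * E[X | A = a]
# The probabilities and conditional expectations below are made up.
p_action = {"up": 0.3, "down": 0.7}   # P(A = a)
e_given_a = {"up": 5.0, "down": 1.0}  # E[X | A = a]

e_x = sum(p_action[a] * e_given_a[a] for a in p_action)
print(e_x)  # 0.3 * 5.0 + 0.7 * 1.0 = 2.2
```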
So, let’s condition on the action $A_t$ by applying the law of total expectation:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a P(A_t = a \mid S_t = s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a]$$

and, as we saw above during setup, $P(A_t = a \mid S_t = s)$ is just the definition of the policy $\pi(a \mid s)$.
Hence, we can simplify the equation to:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a]$$
Similarly, the expected future return can be written as:

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$
Let’s substitute both of these back into our value function:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$

Which we can combine and simplify to:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$
Step 4: Condition on Next State and Reward
Now comes another challenge. We know the state $s$ and action $a$, but we still don’t know what reward we’ll get and what next state we’ll land in.
However, we can condition on the next state and reward, just like we did in the step above. These are determined by the environment dynamics $p(s', r \mid s, a)$, which gives the probability of ending up in next state $s'$ with reward $r$ when we take action $a$ in state $s$.
We’ll apply the law of total expectation again and condition on $S_{t+1}$ and $R_{t+1}$:

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s'} \sum_{r} P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

We can simplify the notation and write the above equation as:

$$\sum_{s', r} p(s', r \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

This means we sum over all possible next states $s'$ and rewards $r$, weighted by the probability of getting that pair when taking action $a$ from state $s$.
Now, we’re conditioning on $R_{t+1} = r$, which means $R_{t+1}$ is no longer a random variable; it has the specific value $r$. If we know a random variable equals a specific value, its expected value is just that value.
So, we can write

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

as

$$r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$
Step 5: Apply Markov Property
The Markov property states that the future depends only on the current state, not on the history. What this means for us is: given $S_{t+1} = s'$, the future return $G_{t+1}$ is independent of $S_t$, $A_t$, and $R_{t+1}$. This leads us to

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$

The expected future return starting from time $t+1$, given we’re in state $s'$, doesn’t depend on how we got to $s'$. So substitute this back:

$$r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$
Step 6: Recursive Nature
Look at this term:

$$\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$

This is asking: what’s the expected return starting from state $s'$, following policy $\pi$? And if you recall, that’s exactly the definition of $v_\pi(s')$.
The question arises: why can we do this? It’s because the value function doesn’t depend on time, only on the state. Whether we’re computing the expected return from state $s'$ at time $t$ or at time $t+1$, we get the same answer, $v_\pi(s')$.
Therefore, it’s fair to write

$$\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$$
This is a key recursive step. We’ve expressed the value of the current state in terms of the value of the next state.
Step 7: Collect All the Pieces
From steps 4-6, we saw:

$$\mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

Now, let’s put that back into our equation from step 3:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$

which becomes

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

and this is the Bellman equation.
This equation averages at three levels: over actions, over next states, and over rewards, with each level weighted by its appropriate probability, the policy $\pi(a \mid s)$ for actions and the dynamics $p(s', r \mid s, a)$ for next states and rewards.
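To see the equation doing real work, here is a minimal policy-evaluation sketch: it applies the Bellman equation as an update rule on a made-up two-state MDP with a uniform random policy, sweeping until the values stop changing. All names and numbers here are illustrative, not from the book:

```python
# Hypothetical 2-state MDP: p[(s, a)] -> list of (next_state, reward, probability)
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform random policy
gamma = 0.9

# Iterative policy evaluation: repeatedly replace V(s) with the
# right-hand side of the Bellman equation until it stops changing.
V = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        # v_pi(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * v_pi(s'))
        new_v = sum(
            pi[s][a] * sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-10:
        break

print(V)  # values that satisfy the Bellman equation for this MDP
```

At the fixed point, each state’s value equals the Bellman right-hand side exactly, which is the defining property of $v_\pi$: the one-step lookahead plus trusted value estimates replaces summing over all possible futures.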
Final Thoughts
I’ve tried my best to explain this equation while being explicit about the math fundamentals. I’d love to hear from you if you enjoy RL and the math behind it.
Please reach out to me if I’ve made any wrong conclusions in the explanation. I’d love to learn more.