Epsilon: The Underrated Adam Parameter
By Ahmed M. Adly (@RealAhmedAdly)
When we talk about the Adam optimizer, most of the attention goes to its well-known features: momentum, adaptive learning rates, and its efficiency in training deep models. But there is one tiny parameter in the update rule that rarely gets the spotlight: epsilon (ε).
Let’s take a closer look at what epsilon does, and why it might deserve more respect than it usually gets.
Where Does Epsilon Show Up?
If you look at the original Adam paper, epsilon appears in the update equation like this:
$$
\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

This equation is the heart of Adam. Here:
- $\alpha$ is the learning rate
- $\hat{m}_t$ is the bias-corrected first moment estimate (momentum)
- $\hat{v}_t$ is the bias-corrected second moment estimate (variance)
- and $\epsilon$ is the small constant in the denominator

At a glance, epsilon seems like just a safety net. It prevents division by zero when $\hat{v}_t$ is very small. But this little constant actually plays a much bigger role than just guarding against numerical issues.
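To see where $\epsilon$ enters in code, here is a minimal NumPy sketch of a single Adam step. The function name, defaults, and shapes are illustrative, not taken from any particular framework:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector `theta` given its gradient `grad`."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # eps sits in the denominator
    return theta, m, v
```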
Epsilon is a Bias Term
In practice, epsilon acts as a bias. It controls how much the optimizer trusts the per-parameter variance when computing the step size.
Think of it this way:
- A larger epsilon reduces the effect of the adaptive part. It smooths out differences between parameter learning rates, making Adam behave more like SGD with momentum.
- A smaller epsilon allows learning rates to vary more between parameters. This is where Adam gets its adaptivity.
So epsilon acts like a regulator. If it is too large, you lose adaptivity. If it is too small, training may become unstable, especially early on.
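A tiny numeric sketch (with made-up moment estimates) shows the regulator effect: two parameters with the same momentum but very different variance estimates take very different steps under a small epsilon, and nearly uniform steps under a large one.

```python
import numpy as np

lr = 1e-3
m_hat = np.array([0.01, 0.01])    # same momentum for both parameters
v_hat = np.array([1e-8, 1e-4])    # one "quiet" parameter, one "noisy" one

for eps in (1e-8, 1e-2):
    steps = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(f"eps={eps:g}: steps={steps}, ratio={steps[0] / steps[1]:.1f}")
# eps=1e-8: the quiet parameter steps ~100x further than the noisy one (full adaptivity)
# eps=1e-2: the two step sizes differ by only ~2x (Adam behaves more uniformly)
```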
Why Use a Larger Epsilon?
In some situations, adding bias is actually helpful.
Take Reinforcement Learning (RL) for example. The learning target is constantly changing. The value function or policy is updated continuously as the agent explores and learns. That means the best learning rate for a parameter today might be completely wrong tomorrow.
If epsilon is too small, Adam may overfit to each parameter's recent history, which can become misleading quickly.
Using a larger epsilon smooths out learning rates across parameters. This tells the optimizer not to trust recent variance estimates too much. It helps prevent erratic behavior and stabilizes learning when the environment is unpredictable.
So in non-stationary tasks like RL, a larger epsilon can actually lead to better and more stable optimization.
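In practice this often just means passing a larger eps when constructing the optimizer. Here is a hypothetical example with PyTorch's Adam; the network, learning rate, and the eps=1e-5 value are illustrative choices, not a universal recommendation:

```python
import torch

# Hypothetical policy network for an RL agent.
policy = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)

# A larger eps than the 1e-8 default damps the adaptive term, which tends to
# give smoother, more SGD-like updates when the objective keeps shifting.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4, eps=1e-5)
```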
Why Use a Smaller Epsilon?
On the other hand, if you are working with a more stable or predictable objective, you may want to reduce epsilon.
A smaller epsilon allows each parameter to get a learning rate that is more closely tuned to its own behavior. This can lead to faster convergence and more efficient learning.
For example, in the MiniMax-M1 paper [1], the authors observed that gradient magnitudes during training could range from roughly $10^{-18}$ to $10^{-5}$, with most gradients smaller than $10^{-14}$, and showed weak correlation across steps. The default epsilon of $10^{-8}$ in this setting drowned out the fine detail in small gradients. As a result, they set $\beta_1 = 0.9$, $\beta_2 = 0.95$, and lowered epsilon to $10^{-15}$ to better reflect the scale of the gradients and improve optimization.
This highlights a broader point: if you are working with extremely small gradients or sensitive dynamics, a smaller epsilon can help make the adaptive learning rate mechanism actually respond to what matters.
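As a sketch, an AdamW configuration along the lines described above might look like this (the model and learning rate are placeholders; the betas and eps mirror the values reported in the paper):

```python
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder for a large model

# A much smaller eps keeps the adaptive denominator responsive to gradients
# far below the default 1e-8, and a lower beta2 shortens the variance memory
# when gradients correlate weakly across steps.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-5, betas=(0.9, 0.95), eps=1e-15
)
```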
Epsilon Controls Learning Rate Variance
Here is another way to look at it:
- A small $\epsilon$ leads to high variance in learning rates between parameters
- A large $\epsilon$ leads to low variance and more uniform learning rates
This can be important in settings where the learning trajectory is noisy or unstable. A well-tuned epsilon can help reduce learning rate volatility and make training more robust.
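You can make this concrete by measuring the spread of effective step sizes over many parameters for different epsilon values (all moment estimates below are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
lr, m_hat = 1e-3, 0.01
# Per-parameter variance estimates spread over several orders of magnitude.
v_hat = 10.0 ** rng.uniform(-10, -2, size=10_000)

for eps in (1e-8, 1e-6, 1e-4, 1e-2):
    steps = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(f"eps={eps:g}: max/min effective step ratio = {steps.max() / steps.min():,.0f}")
# The ratio shrinks as eps grows: a larger eps gives more uniform per-parameter steps.
```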
What Experiments Show
If you want evidence that epsilon actually matters, check out the RAdam paper [2], especially Section 3.1.
The authors define a baseline called Adam-eps. They simply increase the value of epsilon to give it more weight in the denominator. This small change reduces variance during the warm-up phase of training. It leads to more stable updates before Adam has a good estimate of gradient statistics.
In Figure 3 of the paper, you can see that this approach reduces variance and helps early training. However, just increasing epsilon is not always the answer. A large epsilon adds bias, which can slow down optimization later on.
So as always, there is a tradeoff.
Final Thoughts
Epsilon might be the smallest hyperparameter in Adam, but it has a big impact.
It is not just a trick to avoid division by zero. It is a control parameter that helps balance stability and adaptivity. In stable environments, a small epsilon can help you converge faster. In dynamic or noisy environments, a larger epsilon can prevent instability.
Next time you tune Adam, don’t ignore epsilon. It might be the small tweak that makes your optimizer behave smarter.