Policy Learning without overlap, is it possible?

Gerben Oostra
Feb 1, 2023

In offline policy learning, one typically requires ‘overlap’ (also known as positivity or common support) for both policy learning and evaluation. But is it possible to relax this constraint a little? A recent article, Policy Learning “without” overlap, provides some ideas.

I’ll start with some background on positivity and overlap, followed by the contributions of the paper. Feel free to jump to the latter section if you’re already familiar with overlap and the related aspect of positivity.

Background on positivity and overlap

Suppose you have collected data from an existing base policy. You’d like to learn a new policy that provides better results. The data is often referred to as ‘logged bandit feedback’: a collection of independent <context, action, outcome> samples (aka units). A unit has some characteristics (the context), the action taken (aka treatment), and an observed result (the outcome). As the data is collected by observing an existing process, instead of deliberately interfering with it, it is an observational study instead of an experimental one.

For simplicity, I use a binary action that is either “to treat” or “not to treat”. I also assume that all units are completely independent if conditioned on the context. We thus have a Contextual Bandit setting instead of a sequential Reinforcement Learning setting.

Both training a new policy and evaluating a policy offline use the logged data. However, not all data may be suitable for training and evaluation. Let’s elaborate on that:

Overlap in policy learning.

One approach to learning a policy is to first learn the treatment effect conditional on the context. A policy can then be derived that takes the best action, or that selects each action with a likelihood proportional to its effect.

The training data needs sufficient treated and non-treated units to estimate the treatment effect accurately, not only in total but also per region of the feature space.

This means that the units the estimator uses should have a sufficient likelihood of being treated and not treated. If all similar units are always treated, or conversely never treated, the effect estimator can become biased. One thus only trains on units with a sufficient positive propensity, aka positivity.

As an example, this manifests itself in Inverse Propensity Weighting, which one can use to obtain an unbiased model during training. This technique weights each training sample by the inverse of the propensity of its observed action. If propensities are too close to 0, the weights explode, forcing the model to over-fit on these samples. This results in high variance (in the bias-variance tradeoff of model learning). To prevent this, one can train on the subset of samples with sufficiently high propensities. This is known as propensity trimming: you filter your training set based on a threshold on the propensities.

One side note: with inverse propensity weighting, one can also clip (winsorize) the propensities to prevent the weights from becoming too big.
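To make the difference between trimming and clipping concrete, here is a minimal sketch in plain Python. The propensity values and the 0.05 threshold are made up for illustration; choosing a threshold in practice is a judgment call.

```python
# Propensity of the *observed* action for five logged samples (illustrative values).
propensities = [0.50, 0.20, 0.05, 0.01, 0.001]
threshold = 0.05  # hypothetical minimum propensity

# Plain inverse propensity weights: tiny propensities give huge weights.
raw_weights = [1.0 / p for p in propensities]

# Trimming: drop the samples whose propensity is below the threshold.
trimmed_weights = [1.0 / p for p in propensities if p >= threshold]

# Clipping (winsorizing): keep every sample, but cap its weight at 1 / threshold.
clipped_weights = [1.0 / max(p, threshold) for p in propensities]

print(max(raw_weights), len(trimmed_weights), max(clipped_weights))
```

Trimming changes which samples the model sees, while clipping keeps all samples and trades a little bias for lower variance.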

This effect also occurs for other methods, like propensity matching, that try to obtain unbiased estimates. If you have a rare unit (=low propensity, a unit with an action that is very unlikely), most similar units will have the other action. When you match those other units, all of them will match to your rare unit. This will skew your data set to a single rare unit, resulting in a high variance model.

Thus you should only include the regions of your feature space where you have both treated and non-treated samples. In other words, the feature spaces should overlap. Instead of counting the frequencies, one can use the propensity score to estimate the likelihood of observing the action. If you observe an action, but its propensity is too low (like 0.001), or too high (like 0.995), then you should exclude it from your training data. You thus require positivity (positive propensities) for both actions (to treat and not to treat), resulting in an overlap of the feature spaces of treated and non-treated samples.
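The frequency-counting variant of this check can be sketched as follows. The regions, the logged samples, and the `min_per_action` threshold are all hypothetical; with continuous features one would use a propensity model instead of discrete regions.

```python
from collections import Counter

# Logged (region, action) pairs; action 1 = treated, 0 = not treated. Illustrative data.
logged = [
    ("young", 1), ("young", 0), ("young", 1), ("young", 0),
    ("middle", 1), ("middle", 1), ("middle", 1),
    ("senior", 0), ("senior", 0), ("senior", 1),
]
min_per_action = 1  # hypothetical minimum count per action and region

counts = Counter(logged)  # maps (region, action) -> number of samples
regions = sorted({region for region, _ in logged})

# A region has overlap when both actions were observed often enough.
overlap_regions = [
    r for r in regions
    if counts[(r, 0)] >= min_per_action and counts[(r, 1)] >= min_per_action
]

# Only samples from overlapping regions are kept for training.
training_set = [(r, a) for r, a in logged if r in overlap_regions]
print(overlap_regions, len(training_set))
```

Here the "middle" region is dropped because it contains no untreated samples, so no treatment effect can be estimated there.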

Overlap in policy application.

The reason to build a policy is to apply it to new units of a target population. The covariates of the units in that target population should overlap with the source population we trained the policy on. If you want to accurately predict the treatment effect of a specific action on a particular unit, your model should be able to generalize to that sample. There should have been some sufficiently similar samples in the training data of your model.

We should have some similarly treated samples for a model that predicts the outcome of being treated. Conversely, to predict the outcome of not being treated, we need to have similar non-treated samples. Combined, if we want to predict the treatment effect (the difference between being treated vs. not-treated), the inference unit should be (sufficiently) similar to both treated and non-treated samples.

In other words, we can only apply effect estimators on regions in the feature space where we have observed both treated and non-treated samples. Where the training and inference feature spaces overlap.

This relates to “positivity”: there needs to be a positive likelihood of being treated and of not being treated. The propensity score represents this: it is the probability of observing a specific action in the training set (source population), given the context. If a propensity model is available, one can apply it to the target population to see how likely each action was for units with those covariates. The propensity thus serves as a proxy for whether an action was observed for similar covariates in the training set: if an action is very likely to occur, we assume it also occurred in the training set.

Note how we are comparing training set and inference set sample distributions. This thus relates to Out-of-Distribution issues. Given a trained model, you will apply it to some target population. If your target population is (very) different from your training set, the predictions might be inaccurate.

An easy way to prevent Out-of-Distribution issues in the policy learning domain is to apply Propensity Trimming on the target population. Only allow the policy to recommend actions when the propensity of being treated is not too close to 0 or 1.
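A minimal sketch of such a guard is shown below. The function name, the `eps` margin, and returning `None` to signal "no recommendation" are assumptions made for this example, not something prescribed by the paper.

```python
def guarded_policy(predicted_effect, propensity_treated, eps=0.05):
    """Recommend an action only where the logged data covers both actions.

    Returns "treat", "no treat", or None when the propensity of being
    treated is within eps of 0 or 1 (hypothetical margin).
    """
    if propensity_treated < eps or propensity_treated > 1.0 - eps:
        # Out of support: no recommendation. One option (an assumption here)
        # is to fall back to whatever the base policy would do.
        return None
    return "treat" if predicted_effect > 0 else "no treat"


print(guarded_policy(0.3, 0.50))   # covered region, positive effect
print(guarded_policy(-0.1, 0.50))  # covered region, negative effect
print(guarded_policy(0.3, 0.01))   # treatment almost never observed here
```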

Overlap in policy evaluation.

Given a trained policy, one also needs an accurate and unbiased evaluation of that policy. Therefore, one applies the policy to some labeled target population, resulting in a collection of <observed action, recommended action, observed outcome>, together with possibly the predicted outcome and the treatment propensity. Different offline policy evaluation methods (OPEs) exist to provide an estimated Policy Value.

As an example, you can simply average the predicted outcomes for the recommended actions. This, however, will be biased: if your model consistently overestimates the outcomes, one gets an overestimated policy value. To correct for this, one can again apply inverse propensity weighting (aka Propensity Scaling) or methods like the Doubly Robust or Hájek (self-normalized) estimators.
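The sketch below contrasts three of these estimators on four made-up logged samples. The row layout and all numbers are illustrative assumptions; the predictions are averaged so that all three estimates live on the same scale.

```python
# Each row: observed action, recommended action, observed outcome,
# propensity of the observed action, predicted outcome for the recommended action.
rows = [
    (1, 1, 1.0, 0.50, 0.8),
    (0, 1, 0.0, 0.50, 0.6),
    (1, 1, 0.0, 0.25, 0.7),
    (0, 0, 1.0, 0.75, 0.9),
]
n = len(rows)

# Direct method: average the model's predictions for the recommended actions.
direct = sum(pred for _, _, _, _, pred in rows) / n

# IPW: only rows where the policy agrees with the log count, weighted by 1 / propensity.
weights = [(1.0 / p if a == rec else 0.0) for a, rec, _, p, _ in rows]
outcomes = [y for _, _, y, _, _ in rows]
ipw = sum(w * y for w, y in zip(weights, outcomes)) / n

# Hajek (self-normalized IPW): divide by the sum of the weights instead of n.
hajek = sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)

print(direct, ipw, hajek)
```

The direct method depends entirely on the outcome model, whereas the two weighted estimators only use observed outcomes from rows where the recommended and logged actions agree.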

And this is where the article Policy Learning without overlap has something to say.

Contributions by the article.

The article “Policy Learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality” by Jin et al. introduces some interesting contributions.

The first, not very unexpected, is that to evaluate a policy correctly, positivity is only required in the policy output. This is because the policy creates recommended actions for each sample in the target population (test set). For an accurate evaluation, only the recommended actions must have a sufficiently high likelihood. Thus there only needs to be sufficient positivity, aka overlap, between your final policy and the base policy (which generated the data).

Secondly, instead of trimming low propensities from the evaluation set to enforce that positivity, one can keep these units and adjust the policy evaluation metric to account for their effect on the metric’s variance.

To be able to do this, they introduce a mathematical estimate of the uncertainty of the policy evaluation. They use this uncertainty to calculate a lower bound of the policy value. The paper shows how to do this for a Doubly Robust policy value estimator.

A doubly robust estimator consists of two parts. The first part is a direct estimate of the outcome. The second part is a propensity-scaled correction that adjusts the value using the prediction error on samples with known outcomes (the agreement set between the evaluated policy and the base policy):

The policy value estimator. Formula 5 in “Policy Learning ‘without’ overlap” by Y. Jin, Z. Ren, Z. Yang et al.

Q is the policy value, ^ means its estimator, T the number of samples, µ the outcome given context and action, X the context, π the policy function that defines action given context, A the observed action, e the likelihood of a certain action for a certain context, H the logged data, Y the observed outcome.
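Written out with the symbols from the legend above, a doubly robust estimator of this kind takes roughly the following shape. This is the standard doubly robust form reconstructed from the legend, so the paper's exact notation (for example, how the fitted outcome model is indexed) may differ.

```latex
\hat{Q}(\pi) = \frac{1}{T} \sum_{t=1}^{T} \left[
  \hat{\mu}\bigl(X_t, \pi(X_t)\bigr)
  + \frac{\mathbb{1}\{A_t = \pi(X_t)\}}{e\bigl(X_t, \pi(X_t) \mid H_t\bigr)}
    \Bigl( Y_t - \hat{\mu}\bigl(X_t, \pi(X_t)\bigr) \Bigr)
\right]
```

The first term is the direct estimate; the second term is nonzero only when the observed action equals the recommended one, and it is divided by the propensity of that recommended action.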

This estimator gets a lot of variance if the propensity (the denominator in the second part) is too low. The estimate is uncertain for low overlap (little positivity, small propensity values). It doesn’t have to be incorrect, but with small perturbations of the model output or the sample selection, one can obtain very different estimates. The value estimate you get can be far from the actual value, even though its expected value is close to the actual value. Thus if you greedily compare policies based on the point estimate of their value, you are likely to select a worse policy with high variance. As shown in the following picture:

Two policies with different overlap with the base policy, so their evaluations have different variances. Your point estimate is a random draw from these distributions. If you take the point estimate (Q-hat) of many policies, the ones with high variance are more likely to have a high point estimate. Figure 3 in “Policy Learning ‘without’ overlap” by Y. Jin, Z. Ren, Z. Yang et al.
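A small simulation illustrates the effect. It assumes, purely for illustration, that each policy's value estimate is normally distributed around its true value; the true values, the standard deviations, and the factor of two in the lower bound are hypothetical.

```python
import random

rng = random.Random(0)
n_runs = 10_000

# Hypothetical policies: A is truly better and well covered (low variance),
# B is truly worse but poorly covered (high variance).
true_a, std_a = 1.0, 0.05
true_b, std_b = 0.9, 0.5

b_wins_point = 0
b_wins_lower = 0
for _ in range(n_runs):
    est_a = rng.gauss(true_a, std_a)  # one noisy evaluation of each policy
    est_b = rng.gauss(true_b, std_b)
    if est_b > est_a:
        b_wins_point += 1
    # Pessimistic comparison: subtract two standard deviations from each estimate.
    if est_b - 2 * std_b > est_a - 2 * std_a:
        b_wins_lower += 1

frac_point = b_wins_point / n_runs
frac_lower = b_wins_lower / n_runs
print(frac_point, frac_lower)
```

Under these assumptions the worse policy B wins the point-estimate comparison with probability of roughly 0.42, but wins the lower-bound comparison with probability of only about 0.02.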

Thirdly, following the above explanation, their main idea is to compare policies based on a lower bound estimate of the policy value instead of the point estimate. Given the Doubly Robust evaluation method, they show that the deviation consists of three parts: intrinsic uncertainty, an optimization error, and a greedy uncertainty:

Formula 2 in “Policy Learning ‘without’ overlap” by Jin et al.

This calculates the difference in actual policy value (Q) between the best policy π* and the estimated best policy π^, which consists of the difference between policy value and policy value estimator, the difference between the best and estimated best policy, and the difference caused by selecting the policy based on the policy value estimator.

Each of these scales linearly (with an additional factor related to the policy class complexity) with:

Formula 7 in “Policy Learning ‘without’ overlap” by Y. Jin, Z. Ren, Z. Yang et al.

The maximum of these deviations is then subtracted from the policy value as a regularization term. Details on determining a reasonable scale based on the policy class complexity are in the paper.

Note that the denominator is the treatment likelihood of the base policy for the recommended treatment. Only the sample deviation is limited to the agreement set (where the recommended treatment is equal to the observed treatment).
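The selection rule can be sketched as below. The penalty used here is a simplified stand-in built only from the two properties mentioned in this note (the propensity of the recommended action in the denominator, and a sum restricted to the agreement set); it is not the paper's exact Formula 7, and `beta` is a hypothetical scale.

```python
import math


def pessimistic_value(point_estimate, rows, beta=1.0):
    """Point estimate minus a penalty that grows when propensities are small.

    rows: (observed_action, recommended_action, propensity_of_recommended_action).
    The penalty is a simplified stand-in, not the paper's exact Formula 7.
    """
    agreement = [p for a, rec, p in rows if a == rec]
    penalty = math.sqrt(sum(1.0 / p**2 for p in agreement)) / len(rows)
    return point_estimate - beta * penalty


# Two hypothetical candidate policies evaluated on four logged samples each:
# (point estimate of the policy value, rows as described above).
candidates = {
    "well covered": (1.00, [(1, 1, 0.5), (0, 0, 0.5), (1, 1, 0.5), (0, 1, 0.5)]),
    "poorly covered": (1.10, [(1, 1, 0.05), (0, 1, 0.05), (0, 0, 0.9), (1, 0, 0.9)]),
}

greedy_choice = max(candidates, key=lambda name: candidates[name][0])
pessimistic_choice = max(
    candidates, key=lambda name: pessimistic_value(*candidates[name])
)
print(greedy_choice, pessimistic_choice)
```

The greedy rule picks the policy with the higher point estimate, while the pessimistic rule prefers the policy whose recommended actions were actually likely under the base policy.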

Main takeaways from the paper:

Thus in short:

  • Because of the inherent variance in your policy evaluation, high-variance (and possibly worse) policies are likely to seem best. Use a lower bound of the metric to compare and select policies.
  • There is a mathematical definition for Doubly Robust’s variance (but it assumes exactly known values of the propensity).
  • Instead of a complete overlap of treated and non-treated propensities, there only needs to be an overlap between candidate policy decisions and logged decisions.
  • Instead of trimming low propensities from the evaluation set, one can include these while accounting for their effect on the evaluation metric’s variance.

My take on the approach

Regularization perspective

The uncertainty offset used to obtain the lower bound can be regarded as a regularization term. In this case, a regularization term that punishes uncertainty.

To obtain the right scale of regularization, the paper shows how to determine the multiplier based on the model complexity.

In other machine learning settings, the regularization scale is typically selected using a cross-validation approach that evaluates the metric on the holdout folds. In those cases the regularization term is part of the model objective, but not part of the final evaluation metric. In the approach from the paper, the regularization term is part of the final metric, preventing us from optimizing its scale. The (uncertainty) scale thus needs to be derived differently, which is explained in the paper.

Uncertainty scale

As determining the scale is not trivial, an alternative approach would be to use a numerical approximation of the metric’s lower bound. Such a bootstrap approach would give various point estimates of the policy value. You can then take a certain lower quantile from these sampled evaluation values.
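A minimal sketch of this bootstrap alternative follows. The synthetic log, the deliberately imperfect outcome prediction of 0.8, the 5% quantile, and the number of resamples are all assumptions chosen for illustration.

```python
import random


def doubly_robust_value(rows):
    """rows: (observed_action, recommended_action, outcome,
    propensity_of_recommended_action, predicted_outcome_for_recommended_action)."""
    total = 0.0
    for action, recommended, outcome, propensity, predicted in rows:
        score = predicted  # direct estimate
        if action == recommended:  # correction is only possible on the agreement set
            score += (outcome - predicted) / propensity
        total += score
    return total / len(rows)


def bootstrap_lower_bound(rows, n_boot=1000, quantile=0.05, seed=0):
    """Lower quantile of the policy value over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    values = sorted(
        doubly_robust_value(rng.choices(rows, k=len(rows))) for _ in range(n_boot)
    )
    return values[int(quantile * n_boot)]


# Synthetic log: the base policy treats with probability 0.2, the candidate always treats.
rng = random.Random(1)
rows = []
for _ in range(200):
    action = 1 if rng.random() < 0.2 else 0
    outcome = rng.gauss(1.0, 0.5) if action == 1 else 0.0
    rows.append((action, 1, outcome, 0.2, 0.8))  # 0.8 is an imperfect model prediction

point = doubly_robust_value(rows)
lower = bootstrap_lower_bound(rows)
print(point, lower)
```

Each call to `bootstrap_lower_bound` re-evaluates the policy `n_boot` times, which is what makes this approach expensive when repeated for many candidate policies.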

The problem with such an approach is that one ideally searches through a range of candidate policies, or perhaps even optimizes the policy directly using the regularized loss function. Unfortunately, bootstrapping in each step is computationally too expensive, and it also prevents direct optimization of the policy with gradient descent.

Propensity estimation

One side note is that the paper assumes the base policy is known. Thus the likelihood of both the observed and the recommended action are always known exactly.

If you don’t know the base policy, you need an estimator for the treatment propensity. Because this estimator has some uncertainty of its own, it introduces an additional source of variance.

Deterministic policy

Also note that the deviation formulas assume a deterministic candidate policy. The recommended action is either to treat or not to treat, not a probability of treatment. Of course, one could simulate what a stochastic policy would actually choose by sampling from the policy’s output distribution, and evaluate with that. However, this requires some bootstrapping to determine the range of possible policy values.


  • Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality, by Y. Jin, Z. Ren, Z. Yang et al.: http://arxiv.org/abs/2212.09900