Missing At Random (MAR)

After considering MCAR, a second question naturally arises. That is, what are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism, Pr(r | yo, ym)?

The answer to this is when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically,


Pr(r | yo, ym) = Pr(r | yo).


This is termed Missing At Random, abbreviated MAR.

This is equivalent to saying that two units that share the same observed values have the same statistical behaviour on the other observations, whether observed or not.

For example:

As units 1 and 2 have the same values where both are observed, given these observed values, under MAR, variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6 from unit 1.

Note that under MAR the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of 'random'. The important idea is that the missing value mechanism can be expressed solely in terms of values that are observed.

Unfortunately, whether data are MAR can rarely be definitively determined from the data at hand!

Examples of MAR mechanisms

  • A subject may be removed from a trial if his/her condition is not controlled sufficiently well (according to pre-defined criteria on the response).
  • Two measurements of the same variable are made at the same time. If they differ by more than a given amount a third is taken. This third measurement is missing for those that do not differ by the given amount.
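
The second mechanism above can be sketched in a few lines of Python. This is only an illustration under assumed values: the threshold, the noise level and the function names are hypothetical and do not come from any particular study. The point it shows is that the rule deciding whether the third measurement is missing takes only the two observed measurements as input, so the missingness is MAR.

```python
import random

THRESHOLD = 1.5  # hypothetical cut-off for "differ by more than a given amount"


def third_is_missing(m1, m2, threshold=THRESHOLD):
    """Missingness rule: a function of the two observed measurements only."""
    return abs(m1 - m2) <= threshold


def measure(true_value, rng, noise_sd=1.0):
    """Return (m1, m2, m3), with m3 set to None when it is not taken."""
    m1 = rng.gauss(true_value, noise_sd)
    m2 = rng.gauss(true_value, noise_sd)
    # The value the third measurement would have had; it never enters the rule.
    m3 = rng.gauss(true_value, noise_sd)
    if third_is_missing(m1, m2):
        m3 = None
    return m1, m2, m3


rng = random.Random(0)
for m1, m2, m3 in (measure(10.0, rng) for _ in range(5)):
    print(round(m1, 2), round(m2, 2), None if m3 is None else round(m3, 2))
```

Because `third_is_missing` never looks at the third value, knowing the first two measurements tells us everything about why the third is absent.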

A special case of MAR is uniform non-response within classes. For example, suppose we seek to collect data on income and property tax band. Typically, those with higher incomes may be less willing to reveal them. Thus, a simple average of incomes from respondents will be downwardly biased.

However, now suppose we have everyone's property tax band and that, given property tax band, non-response to the income question is random. Then the income data are missing at random; the reason, or mechanism, for their being missing depends on property band. Given property band, missingness does not depend on income itself.

Therefore, to get an unbiased estimate of income, we first average the observed income within each property band. As data are missing at random given property band, these estimates will be valid. To get an estimate of the overall income, we simply combine these estimates, weighting by the proportion in each property band.
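
The calculation just described can be tried on simulated data. The sketch below is a minimal illustration, not a recommended analysis: the three bands, their mean incomes, population shares and response probabilities are all invented for the example. Response depends on band only, so the incomes are MAR given band.

```python
import random

# Hypothetical population: all numbers below are assumptions for illustration.
BAND_SHARE = {"A": 0.5, "B": 0.3, "C": 0.2}               # proportion in each band
BAND_MEAN = {"A": 20000.0, "B": 35000.0, "C": 60000.0}    # mean income by band
BAND_RESPONSE = {"A": 0.9, "B": 0.7, "C": 0.4}            # Pr(respond | band)


def simulate(n=20000, seed=0):
    """Return a list of (band, income, responded) tuples."""
    rng = random.Random(seed)
    bands = sorted(BAND_SHARE)
    weights = [BAND_SHARE[b] for b in bands]
    units = []
    for _ in range(n):
        band = rng.choices(bands, weights=weights)[0]
        income = rng.gauss(BAND_MEAN[band], 5000.0)
        # Response depends on the band only, never on income itself (MAR).
        responded = rng.random() < BAND_RESPONSE[band]
        units.append((band, income, responded))
    return units


def naive_mean(units):
    """Simple average of the observed incomes."""
    observed = [income for _, income, responded in units if responded]
    return sum(observed) / len(observed)


def band_weighted_mean(units):
    """Average observed income within each band, weighted by band proportion."""
    total = 0.0
    for band in sorted({b for b, _, _ in units}):
        in_band = [u for u in units if u[0] == band]
        observed = [income for _, income, responded in in_band if responded]
        total += (len(in_band) / len(units)) * (sum(observed) / len(observed))
    return total


units = simulate()
full_mean = sum(income for _, income, _ in units) / len(units)
print("full-data mean:    ", round(full_mean))
print("naive mean:        ", round(naive_mean(units)))
print("band-weighted mean:", round(band_weighted_mean(units)))
```

The full-data mean is available here only because the data are simulated; in practice it is the unknown quantity being estimated. The weights use the band proportions among all units, respondents or not, which is why everyone's property band must be known.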

In this example, a simple summary statistic (average of observed incomes) was biased. Conversely, a simple model (estimate of income conditional on property band), where we condition on the variable that makes the data MAR, led to a valid result.

This is an example of a more general result. Methods based on the likelihood are valid under MAR. However, in general, non-likelihood methods (e.g. those based on completers, moments, or estimating equations, including generalised estimating equations) are not valid under MAR, although some can be 'fixed up'. In particular, ordinary means, and other simple summary statistics from the observed data, will generally be biased.
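
A short derivation indicates why likelihood methods remain valid. The notation here is an assumption added for illustration: f is the density of the data with parameter θ, and φ is the parameter of the missingness mechanism, taken to be distinct from θ. The likelihood of what is actually observed, (yo, r), integrates over the missing values, and the MAR condition allows the mechanism to be taken outside the integral:

```latex
\begin{aligned}
L(\theta, \phi;\, y_o, r)
  &= \int f(y_o, y_m;\, \theta)\, \Pr(r \mid y_o, y_m;\, \phi)\, dy_m \\
  &= \Pr(r \mid y_o;\, \phi) \int f(y_o, y_m;\, \theta)\, dy_m
     \qquad \text{(by MAR)} \\
  &= \Pr(r \mid y_o;\, \phi)\, f(y_o;\, \theta).
\end{aligned}
```

The first factor does not involve θ, so maximising the likelihood over θ gives the same answer as maximising f(yo; θ) alone, without ever specifying the mechanism.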

Finally, note that in a likelihood setting the term ignorable is often used to refer to an MAR mechanism. It is the mechanism (i.e. the model for Pr(r | yo)) which is ignorable - not the missing data!