Unbiased Estimators for Mean and Variance
Unbiased estimators for mean and variance, along with proofs
When we are presented with data, we often want to get a grasp on how it is shaped. To this end, we would like to be able to estimate the mean and variance (or standard deviation). Let’s dive into these estimators, and look at proofs that they are unbiased!
Estimators
When we are examining a population, we are often interested in parameters - that is, things that are true about the entire population. For instance, we might care about the average length of a population of snakes or the variance in the lifespan of a population of mice. At least from a frequentist perspective, both of these are actual quantities which we could, with unlimited time and energy, determine. However, instead of the laborious process of finding population parameters, we use an estimator to get close. An estimator is something which we can calculate from a sample as a proxy for the actual parameter of interest.
Of course, we would like our estimators to eventually get close to the actual parameter. Ideally, we would like an unbiased estimator. As the name suggests, an estimator $\hat{\theta}$ of the parameter $\theta$ is unbiased if $E[\hat{\theta}] = \theta$. Remember also that the expected value of a random variable $X$ is $E[X] = \sum_i x_i \, p(x_i)$, where the $x_i$ are the outcomes (values which $X$ could take on) and $p(x_i)$ is the probability that the outcome $x_i$ occurs. Of course, if the distribution from which $X$ is drawn is continuous, then it would be an integral instead, but for the purpose of today’s proofs we’ll focus on discrete distributions.
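To make the expected value of a discrete random variable concrete, here is a minimal Python sketch that computes it as a probability-weighted sum of outcomes. The function name `expected_value` and the fair-die example are illustrative assumptions, not part of the original discussion; exact fractions are used so that rounding does not obscure the result.

```python
from fractions import Fraction


def expected_value(outcomes, probabilities):
    """Probability-weighted sum of the outcomes of a discrete random variable."""
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))


# Hypothetical example: a fair six-sided die.
die_outcomes = [1, 2, 3, 4, 5, 6]
die_probabilities = [Fraction(1, 6)] * 6

print(expected_value(die_outcomes, die_probabilities))  # prints 7/2
```

For a continuous distribution the sum would become an integral, which this sketch does not attempt to handle.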
An Estimator for the Mean
Assume that you have drawn a sample $X_1$, $X_2$, $\ldots$, $X_n$, all independent, from a distribution with mean $\mu$ and variance $\sigma^2$. Then an unbiased estimator of the mean is the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
Proof:
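The computation at this step follows from linearity of expectation. As a sketch, writing $X_1, \ldots, X_n$ for the sample, $\bar{X}$ for the sample mean, and $\mu$ for the mean of the distribution (notation assumed here):

$$
\begin{aligned}
E[\bar{X}] &= E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] \\
&= \frac{1}{n} \sum_{i=1}^{n} E[X_i] \\
&= \frac{1}{n} \cdot n \mu \\
&= \mu
\end{aligned}
$$

Note that this argument only uses the fact that every $X_i$ has mean $\mu$; independence is not needed for this step.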
And so we have it.
An Estimator for the Variance
Finding an unbiased estimator for the variance is substantially more complicated than for the mean. First, a reminder: the variance of a random variable $X$ with mean $\mu$ is $\mathrm{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i)$, where again the $x_i$ are the possible values which $X$ can take on. Naively, we would imagine that with our same sample drawn from a distribution with mean $\mu$ and variance $\sigma^2$, an unbiased estimator would simply be $\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$, where $\bar{X}$ is the sample mean. However, it turns out that this is not unbiased! The correct (or at least a correct) unbiased estimator is called the sample variance, $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$. The proof that this is unbiased relies on a few other facts - let’s derive these!
Proof:
We will also need the following:
Proof:
Now that we have those two facts settled, we can prove that the sample variance is an unbiased estimator of the variance! That means that the expected value of the estimator should be the target parameter, $\sigma^2$.
So, assume that you have some distribution with mean $\mu$ and variance $\sigma^2$, and that we draw $n$ observations $X_1, X_2, \ldots, X_n$, all independent, from this distribution, and from them we calculate the sample variance $S^2$. Then:
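One standard route through this computation is sketched below. It assumes that the two supporting facts are $E[X_i^2] = \sigma^2 + \mu^2$ (from $\mathrm{Var}(X) = E[X^2] - (E[X])^2$) and $E[\bar{X}^2] = \frac{\sigma^2}{n} + \mu^2$ (from $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$, which is where independence is used); this choice of facts and the notation are assumptions made for this sketch.

$$
\begin{aligned}
E[S^2] &= E\left[\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\right] \\
&= \frac{1}{n-1} E\left[\sum_{i=1}^{n} X_i^2 - 2\bar{X} \sum_{i=1}^{n} X_i + n\bar{X}^2\right] \\
&= \frac{1}{n-1} E\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right] \\
&= \frac{1}{n-1} \left(\sum_{i=1}^{n} E[X_i^2] - n E[\bar{X}^2]\right) \\
&= \frac{1}{n-1} \left(n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right) \\
&= \frac{1}{n-1} (n - 1)\sigma^2 \\
&= \sigma^2
\end{aligned}
$$

The third line uses $\sum_{i=1}^{n} X_i = n\bar{X}$, so the middle term equals $-2n\bar{X}^2$.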
And there we have it! The sample variance is an unbiased estimator for the variance.
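A quick simulation can serve as a sanity check on this result. The sketch below is an illustration rather than a proof, and its setup is assumed: normally distributed data with standard deviation 2 (so a true variance of 4), samples of size 5, and a fixed seed. It averages the sample variance and the naive divide-by-n estimator over many independent samples.

```python
import random


def sample_variance(xs):
    """Sample variance with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def naive_variance(xs):
    """Naive estimator with the n denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n


def average_estimates(n=5, trials=20_000, mu=0.0, sigma=2.0, seed=0):
    """Average both estimators over many independent samples of size n."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_naive = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total_unbiased += sample_variance(xs)
        total_naive += naive_variance(xs)
    return total_unbiased / trials, total_naive / trials


unbiased_avg, naive_avg = average_estimates()
print(f"True variance:           {2.0 ** 2}")
print(f"Average sample variance: {unbiased_avg:.3f}")
print(f"Average naive estimate:  {naive_avg:.3f}")
```

With these settings the average sample variance should land close to 4, while the naive estimate should land close to $\frac{n-1}{n} \sigma^2 = 3.2$, which is the bias that the $n - 1$ denominator corrects.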
Implementations for the Sample Variance: JavaScript, Python, R
Now that we’ve looked at the mathematics behind this, let’s look at how we can write a function for the sample variance in JavaScript, Python, and R:
In JavaScript
function calculateSampleVariance(numbers) {
  const n = numbers.length;
  // Sample mean
  const xBar = numbers.reduce((sum, current) => sum + current, 0) / n;
  // Sum of squared deviations from the mean, divided by n - 1 (not n)
  const sampleVariance =
    numbers.reduce((sum, current) => sum + (current - xBar) ** 2, 0) /
    (n - 1);
  return sampleVariance;
}

console.log(
  `Sample variance of [1, 1, 1, 1, 1]: ${calculateSampleVariance([
    1, 1, 1, 1, 1,
  ])}`
);
console.log(
  `Sample variance of [1, 2, 3, 4, 5]: ${calculateSampleVariance([
    1, 2, 3, 4, 5,
  ])}`
);
Output:
Sample variance of [1, 1, 1, 1, 1]: 0
Sample variance of [1, 2, 3, 4, 5]: 2.5
In Python
def calculate_sample_variance(numbers: list) -> float:
    """ Calculate the sample variance """
    n = len(numbers)
    # Sample mean
    x_bar = sum(numbers) / n
    # Sum of squared deviations from the mean, divided by n - 1 (not n)
    sample_variance = sum(map(lambda x: (x - x_bar) ** 2, numbers)) / (n - 1)
    return sample_variance


print(
    f"Sample variance of the set [1, 1, 1, 1, 1]: {calculate_sample_variance([1, 1, 1, 1, 1])}"
)
print(
    f"Sample variance of the set [1, 2, 3, 4, 5]: {calculate_sample_variance([1, 2, 3, 4, 5])}"
)
Output:
Sample variance of the set [1, 1, 1, 1, 1]: 0.0
Sample variance of the set [1, 2, 3, 4, 5]: 2.5
In R
calculateSampleVariance <- function(numbers) {
  n <- length(numbers)
  # Sample mean
  xBar <- sum(numbers) / n
  # Sum of squared deviations from the mean, divided by n - 1 (not n)
  sampleVariance <- sum((numbers - xBar)^2) / (n - 1)
  return(sampleVariance)
}

sprintf("Sample variance of [1, 1, 1, 1, 1]: %.1f", calculateSampleVariance(c(1, 1, 1, 1, 1)))
sprintf("Sample variance of [1, 2, 3, 4, 5]: %.1f", calculateSampleVariance(c(1, 2, 3, 4, 5)))
Output:
[1] "Sample variance of [1, 1, 1, 1, 1]: 0.0"
[1] "Sample variance of [1, 2, 3, 4, 5]: 2.5"
Conclusions
Above, I’ve presented unbiased estimators for the mean and variance of a distribution. With these, given independent samples from a distribution, we can provide point estimates for these parameters. However, there are some questions left to ask and things to note. For instance, we might assume that since the sample variance $S^2$ is an unbiased estimator of the variance, its square root $S = \sqrt{S^2}$ would be an unbiased estimator of the standard deviation. However, this is not the case.
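The claim about the standard deviation can be checked empirically. Because the square root is a concave function, Jensen's inequality implies that the expected value of the square root of the sample variance is at most the true standard deviation. The sketch below illustrates this under an assumed setup: normal data with a true standard deviation of 2, samples of size 5, and a fixed seed.

```python
import math
import random


def sample_std(xs):
    """Square root of the sample variance (n - 1 denominator)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


def average_sample_std(n=5, trials=20_000, sigma=2.0, seed=0):
    """Average the sample standard deviation over many independent samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += sample_std(xs)
    return total / trials


avg_s = average_sample_std()
print(f"True standard deviation:           {2.0}")
print(f"Average sample standard deviation: {avg_s:.3f}")
```

With these settings the average should come out noticeably below 2, and the gap shrinks as the sample size `n` grows.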
A more pressing issue is how much we should believe the values that we get. That is, imagine that we take a sample of size 10 and compute its sample mean and sample variance. Then we take a much larger sample of size 100 and get the exact same results! Obviously we expect that the larger sample will be more likely to be close to the truth, but the question is: how much more should we believe the larger sample?
Perhaps these will be ideas for later. For now, have a great day!
Sources
- This StackOverflow thread was particularly helpful. In particular, the proof that I’ve presented is largely a combination of the two most voted-for answers, expressed in a way that made sense to me.
- This site confirmed my foggy recollection of the definition of an unbiased estimator.