For a sample observation we have already defined the predicted value: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$. Often we also want to use the OLS estimates to predict the value of Y for a value of X that is not in the sample. Furthermore, we will want to know how precise our prediction of Y is, whether in sample or out of sample.
Let us then consider the problem of predicting the value of Y for an arbitrary value of X, denoted $X_0$. This value of X may or may not be in the sample used to estimate the regression equation. We can actually think of two numbers we would want to predict for $X_0$:
$$Y_0 = \beta_0 + \beta_1 X_0 + u_0$$
$$E[Y \mid X_0] = \beta_0 + \beta_1 X_0$$
Notice the difference between the two. The first is the actual value of Y for a particular observation, including that observation's error term $u_0$. The second is the expected value of Y conditional on knowing the value $X_0$. For example, suppose you are a criminologist who has estimated the following regression:
$$\text{CrimeRate}_i = \beta_0 + \beta_1 \text{Population}_i + u_i$$
using data on city sizes and crime rates. You obtain estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ and then wish to make predictions about crime rates for cities whose crime rates you do not know. You might want to predict the crime rate for a particular city, say Toronto, whose population would give it a value $X_0 = 3.2$ (in millions). That would be an individual prediction. On the other hand, you may want to know what your model predicts the crime rate to be, on average, in cities of size 3.2 million. That would be a mean prediction. In effect, you want to average out the effect of the disturbance terms $u_i$.
The OLS prediction is the same for both mean and individual predictions:
$$\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 X_0$$
The predictions are the same because the expected value of $u_0$ (that is, the disturbance term for a particular observation such as Toronto) is 0. So one would use the OLS regression line to predict out of sample as well as in sample. We can think of the difference in the following way:
$$Y_0 = E[Y \mid X_0] + u_0$$
This equation suggests that the difference between the predictions lies in their variance. The precision of an individual prediction is lower because the variance of the disturbance must be taken into account.
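The claim can be checked with a small Monte Carlo sketch. Everything in it is an assumption made for illustration: the true coefficients, the disturbance standard deviation, the five city sizes, and the helper names `simulate_prediction_errors` and `sample_variance` are hypothetical and do not come from any estimated model. In each replication the same point prediction is compared against both targets, so only the spread of the prediction error differs.

```python
import random


def simulate_prediction_errors(x0, xs, beta0, beta1, sigma, reps, seed=0):
    """Return prediction errors for the mean target and the individual target."""
    rng = random.Random(seed)
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    mean_errors, indiv_errors = [], []
    for _ in range(reps):
        # Draw a fresh sample from the assumed population regression line.
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        y_bar = sum(ys) / n
        b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
        b0 = y_bar - b1 * x_bar
        y0_hat = b0 + b1 * x0                 # one prediction for both targets
        mean_target = beta0 + beta1 * x0      # E[Y | X0]
        indiv_target = mean_target + rng.gauss(0.0, sigma)  # Y0 includes u0
        mean_errors.append(y0_hat - mean_target)
        indiv_errors.append(y0_hat - indiv_target)
    return mean_errors, indiv_errors


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Hypothetical inputs: city sizes in millions and made-up true parameters.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
mean_err, indiv_err = simulate_prediction_errors(
    x0=3.2, xs=xs, beta0=1.0, beta1=2.0, sigma=2.0, reps=20000
)
var_mean = sample_variance(mean_err)
var_indiv = sample_variance(indiv_err)
print("variance of mean-prediction error:      ", round(var_mean, 3))
print("variance of individual-prediction error:", round(var_indiv, 3))
```

Under these assumptions the second variance should exceed the first by roughly the disturbance variance, here 4.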
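The variance of the mean prediction is
$$\mathrm{Var}(\hat{Y}_0) = \sigma^2 \left[ \frac{1}{N} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \right]$$
Notice that this variance increases the farther the value of $X_0$ is from the sample mean of X. It is at the sample mean of X that we have the most information about the relationship between X and Y. The farther we move from that point, the less information we have and the more unsure we are of the location of the population regression line.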
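To see the effect of distance from the sample mean numerically, here is a minimal sketch. It assumes the standard simple-regression expression for the variance of the mean prediction, $\sigma^2 [1/N + (X_0 - \bar{X})^2 / \sum (X_i - \bar{X})^2]$. The function name `mean_prediction_variance`, the sample of city sizes, and the value of $\sigma^2$ are hypothetical.

```python
def mean_prediction_variance(x0, xs, sigma2):
    """Variance of the OLS mean prediction at x0 in a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sigma2 * (1.0 / n + (x0 - x_bar) ** 2 / sxx)


# Hypothetical sample of city sizes (millions) and an assumed sigma^2.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
sigma2 = 4.0

# Evaluate at the sample mean, at the edge of the sample, and out of sample.
for x0 in (1.5, 2.5, 3.2):
    print(x0, mean_prediction_variance(x0, xs, sigma2))
```

The variance is smallest at the sample mean (1.5 in this made-up sample) and grows with the squared distance from it, so the out-of-sample value 3.2 is predicted least precisely.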
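Since we assume that the disturbance term $u_0$ is unrelated to the value of $X_0$ and to the sample disturbances that determine $\hat{Y}_0$, we can see that
$$\mathrm{Var}(Y_0 - \hat{Y}_0) = \mathrm{Var}(\hat{Y}_0) + \sigma^2 = \sigma^2 \left[ 1 + \frac{1}{N} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \right]$$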
Of course we can't directly compute the variance of our predictions, because it depends upon the value of $\sigma^2$. As usual, we can only compute the estimated variance and standard error of the prediction, replacing $\sigma^2$ with its estimate $\hat{\sigma}^2$.
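A minimal sketch of the estimated versions follows. It assumes a simple regression with one regressor, estimates the disturbance variance as the sum of squared residuals divided by N - 2, and uses made-up data. The function names `ols_fit` and `prediction_standard_errors` and all numbers are hypothetical.

```python
import math


def ols_fit(xs, ys):
    """Simple-regression OLS; returns b0, b1, sigma2_hat, x_bar, sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = ssr / (n - 2)  # two estimated coefficients
    return b0, b1, sigma2_hat, x_bar, sxx


def prediction_standard_errors(x0, xs, ys):
    """Return the prediction and its estimated standard errors at x0."""
    b0, b1, sigma2_hat, x_bar, sxx = ols_fit(xs, ys)
    n = len(xs)
    y0_hat = b0 + b1 * x0
    var_mean = sigma2_hat * (1.0 / n + (x0 - x_bar) ** 2 / sxx)
    var_indiv = var_mean + sigma2_hat  # add the disturbance variance
    return y0_hat, math.sqrt(var_mean), math.sqrt(var_indiv)


# Made-up data: city size in millions and a crime-rate index.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [3.0, 4.5, 5.0, 7.5, 8.0]
y0_hat, se_mean, se_indiv = prediction_standard_errors(3.2, xs, ys)
print("prediction:", y0_hat)
print("se (mean prediction):", se_mean)
print("se (individual prediction):", se_indiv)
```

The individual standard error is always the larger of the two, because its variance adds the estimated disturbance variance to that of the mean prediction.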
Note that under A7 both predictions are normally distributed, so that we can compute confidence intervals and perform hypothesis tests on actual values of Y and on E[Y].
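As a hedged illustration of such an interval, the sketch below forms the prediction plus or minus a t critical value times the estimated standard error. The function name `prediction_interval` and the input numbers are hypothetical. The critical value is passed in by hand, as if read from a t table; 3.182 is the two-sided 95% value for 3 degrees of freedom, which would correspond to N - 2 with five observations.

```python
def prediction_interval(y0_hat, se, t_crit):
    """Interval of the form y0_hat +/- t_crit * se."""
    half_width = t_crit * se
    return y0_hat - half_width, y0_hat + half_width


# Hypothetical inputs: a prediction and its two estimated standard errors.
y0_hat = 10.02
se_mean = 0.60    # standard error of the mean prediction
se_indiv = 0.79   # standard error of the individual prediction
t_crit = 3.182    # 95% two-sided, 3 degrees of freedom

mean_ci = prediction_interval(y0_hat, se_mean, t_crit)
indiv_ci = prediction_interval(y0_hat, se_indiv, t_crit)
print("95% interval for E[Y | X0]:", mean_ci)
print("95% interval for Y0:       ", indiv_ci)
```

Both intervals are centred on the same prediction; the one for the individual value is wider because its standard error is larger.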
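Once we know the formula for the variance of the prediction we are making, the formula for a confidence interval for the prediction takes the same form in both cases:
$$\hat{Y}_0 \pm t_{\alpha/2,\,N-2} \cdot \widehat{se}$$
where $\widehat{se}$ is the estimated standard error of the prediction being made (mean or individual).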
To compute the confidence interval by hand using only the regression output requires five numbers:
5 Pieces of Information to Compute Confidence Intervals for Predictions