stats File 4

HyperMetricsNotes


D. Why are opinion polls accurate to 3.1 percentage points?

Concepts

[confidence intervals] [confidence level]

Source: Reading the newspaper

Setup

A common statement in the media is,

The poll is based on a sample of 1000 Canadians and is considered accurate to plus or minus 3.1 percentage points.

In fact, you will see that same 3.1 percentage point accuracy reported for all polls based on samples of 1000.

Question

Where does this number 3.1 come from?
What does it mean?
Why is it the same for all polls of size 1000?

Answer: Preliminaries

Opinion polls are usually based on yes/no questions. For example: Q: Do you think Jerry Garcia was a good role model?
Define a random variable: Z = 0 if the person says "no, Jerry Garcia was not a good role model"; Z = 1 if the person says "yes, ..."
The distribution of Z is determined completely by the proportion of people for whom Z=1.
Define: P(Z=1) = p
Note: E[Z] = 0*P(Z=0) + 1*P(Z=1) = 0+p = p
So p is the mean of the random variable Z .
Var[Z] = (0-p)² * P(Z=0) + (1-p)²*P(Z=1)=p²(1-p)+(1-p)² p = p(1-p)
The proportion of people in the survey that say "yes" is an estimate of p: $\hat p = {1\over 1000}\sum_{i=1}^{1000} Z_i$
We know that in a sample of 1000, $Var(\hat p) = {p(1-p)\over 1000}$, since the variance of a sample mean is Var[Z]/N.
This variance is an unknown population parameter, since it depends upon p, and p is an unknown population parameter. At best we can only estimate the variance of $\hat p$. A reasonable estimate is $\hat{Var}(\hat p) = {\hat p(1-\hat p) \over 999}$, which is the usual sample variance of Z (computed with divisor N-1 = 999) divided by N = 1000.
We define the standard deviation of p hat as its standard error. The estimated standard error is therefore $\hat{se}(\hat p) = \sqrt{\hat{Var}(\hat p)}$
We know from the Central Limit Theorem and the definition of the t distribution that for large samples, $t = {\hat p - p \over \hat{se}(\hat p)}$
follows the $t_{999}$ distribution.
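The preliminaries above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names are made up for illustration, and the poll result (420 "yes" answers out of 1000) is hypothetical. It computes the mean and variance of Z straight from the definitions, then the sample proportion and its estimated standard error with the divisor N-1 used in the notes.

```python
import math


def bernoulli_moments(p):
    """Mean and variance of Z, computed directly from the definitions."""
    mean = 0 * (1 - p) + 1 * p
    var = (0 - p) ** 2 * (1 - p) + (1 - p) ** 2 * p
    return mean, var


def estimate_from_counts(yes, n):
    """Sample proportion and its estimated standard error (divisor n - 1)."""
    p_hat = yes / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se_hat


# Check Var[Z] = p(1-p) for an arbitrary p.
mean, var = bernoulli_moments(0.3)
print("E[Z] =", mean, " Var[Z] =", var, " p(1-p) =", 0.3 * 0.7)

# Hypothetical poll: 420 of 1000 respondents say "yes".
p_hat, se_hat = estimate_from_counts(yes=420, n=1000)
print("p_hat =", p_hat, " estimated se =", se_hat)
```

The estimated standard error depends on the number of yeses only through the product of p hat and (1 - p hat).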

Answer: Where does 3.1 come from?

"19 out of 20" = 19/20 = 0.95
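This is the confidence level that poll reports usually attach to the margin of error ("19 times out of 20"). It follows that the relevant critical value is $t^\star_{0.95, 999} \approx 1.96$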
3.1 percentage points equals 0.031
So 0.031 must be the radius of the 95 percent confidence interval for the population parameter p. The newspaper seems to be saying: $\hat{se}(\hat p) \times t^\star_{0.95,999} = \hat{se}(\hat p) \times 1.96 = 0.031$
But clearly the estimated standard error of p hat depends upon the value of p hat (from the definition above). Why doesn't it differ from poll to poll based on the number of yeses and nos?
If newspapers don't want to overstate the accuracy of the results, then they might be reporting the confidence interval for the worst case (half say yes, half say no), where $\hat p(1-\hat p)$ reaches its maximum of 0.25. If so, $\hat{se}(\hat p) = \sqrt{0.25/999} \approx 0.0158193$ and 0.0158193 * 1.96 = 0.03100583
or almost exactly 3.1 percentage points.
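The worst-case argument can be illustrated with a short Python sketch. It is an illustration only: the function name `margin` and the grid of p hat values are assumptions, and the critical value 1.96 is taken from the notes. It computes the radius of the 95 percent interval for several values of p hat with N = 1000 and picks out the largest.

```python
import math

T_STAR = 1.96  # critical value used in the notes


def margin(p_hat, n=1000):
    """Radius of the confidence interval: critical value times estimated se."""
    return T_STAR * math.sqrt(p_hat * (1 - p_hat) / (n - 1))


# Illustrative grid of sample proportions (not from the notes).
margins = {p: margin(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
for p_hat, m in margins.items():
    print(f"p_hat = {p_hat:.1f}  margin = {100 * m:.2f} percentage points")

# The largest margin is the one a cautious newspaper would report.
worst = max(margins, key=margins.get)
print("worst case at p_hat =", worst)
```

Every other value of p hat gives a smaller radius than the even split, so quoting the figure for p hat = 0.5 never overstates the accuracy of a poll of this size.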

Answer to: What does 3.1 mean?

Discussion

Sometimes answers to polling questions are not yes/no, but things like drinks per week or pounds overweight. In such cases, the distribution of the random variable is not summarized by one parameter like p. That means that the statement "within 3.1 percentage points" makes no sense for these kinds of answers. One would have to be told the variance of the answers as well to determine the confidence interval for the mean answer.
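A small Python sketch may make this point concrete. The two samples below are invented for illustration (they are not survey data), and the helper name `mean_and_margin` is an assumption. Both samples have 1000 answers and the same mean, but different spreads, so the radius of the interval for the mean differs.

```python
import math
import statistics


def mean_and_margin(sample, crit=1.96):
    """Sample mean and interval radius: crit * s / sqrt(N)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # stdev uses divisor N - 1
    return mean, crit * se


# Two made-up samples of 1000 answers with the same mean of 5.
tight = [4, 6] * 500    # answers close to the mean
spread = [0, 10] * 500  # answers far from the mean

mean_tight, margin_tight = mean_and_margin(tight)
mean_spread, margin_spread = mean_and_margin(spread)
print("tight :", mean_tight, "+/-", round(margin_tight, 4))
print("spread:", mean_spread, "+/-", round(margin_spread, 4))
```

With the same sample size and the same mean, the second sample has five times the standard deviation and therefore five times the interval radius, which is why no single "accuracy" figure applies to all such questions.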

Example from the cigarette data


* "ci" computes confidence interval for the MEAN of a variable

. ci cigs, by(dvsex) level(90)

->  dvsex=men   
Variable |     Obs         Mean    Std. Err.       [90 Conf. Interval]
---------+-------------------------------------------------------------
    cigs |     300       6.5333      0.6121          5.5234      7.5433

->  dvsex=women   
Variable |     Obs         Mean    Std. Err.       [90 Conf. Interval]
---------+-------------------------------------------------------------
    cigs |     300       4.8933      0.4959          4.0750      5.7116


*    Notice that the standard error (or standard deviation) of 
*    the sample mean is not the same as the std. dev. of the variable
*    itself (i.e. 0.6121 not equal to 10.6017).  We saw in class that 
*    the st.dev. of the sample mean of X is sqrt(Var(X)/N), where N is the
*    sample size.  

*    Does Stata use the same formula that we do to calculate confidence
*    intervals?  Use display (or "di" for short) to find out:

. di 6.5333+invt(299,.90)*0.6121
7.543244

*    Notice 7.543244 is (to 3 digits) the upper bound of the 90 percent interval
*    reported by ci.
*    Also notice the use of the invt function to look up values of the 
*     t distribution.  Get help with invt in the tutorial for week 2
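The figures in the Stata output above can also be cross-checked outside Stata. The following Python sketch is an illustration under stated assumptions: it uses only the numbers printed by ci, and it uses the standard normal quantile from Python's `statistics.NormalDist` as a comparison point, because the standard library has no t distribution. It backs out the standard deviation of cigs from the standard error, and the multiplier implied by each reported interval.

```python
import math
from statistics import NormalDist

# Figures copied from the ci output: (obs, mean, std. err., lower, upper).
rows = {
    "men": (300, 6.5333, 0.6121, 5.5234, 7.5433),
    "women": (300, 4.8933, 0.4959, 4.0750, 5.7116),
}

# Normal counterpart of the 90 percent two-sided critical value.
z90 = NormalDist().inv_cdf(0.95)

implied = {}
for group, (n, mean, se, lower, upper) in rows.items():
    sd = se * math.sqrt(n)                    # std. dev. of the variable itself
    multiplier = (upper - lower) / (2 * se)   # critical value implied by the interval
    implied[group] = (sd, multiplier)
    print(f"{group}: sd = {sd:.4f}  implied multiplier = {multiplier:.4f}")

print(f"normal quantile for comparison = {z90:.4f}")
```

The implied multipliers come out slightly above the normal quantile, which is what one would expect if the intervals are built from the t distribution with 299 degrees of freedom, as the di command suggests.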

This document was created using HTX, a (HTML/TeX) interlacing program written by Chris Ferrall. Document Last revised: 1997/1/5
