Queen's University at Kingston

HyperMetricsNotes

regress File 2


B. Definition of the Simple Linear Regression Model (LRM)


  • The LRM consists of five components:
1. The nature of the data to be studied with the LRM.
    2. The Population Regression Equation (or PRE)
    3. Specification of the PRE
    4. Assumptions about the observed variables in the PRE
    5. Assumptions about the unobserved disturbance term in the PRE

    1. The nature of the data to be studied with the LRM

      The simple LRM is designed to study the relationship between a pair of variables that appear in a data set. The multiple LRM, which we will study later, is designed to study the relationship between one variable and several other variables. In both cases, the sample is considered a random sample from some population. The two variables, X and Y, are two measured outcomes for each observation in the data set. For example, in our minimum wage example, one variable would be the employment rate in a particular location (say a province) at a particular point in time (say January 1995) within a particular group of people (say people between 16 and 20 years of age). The other variable would be the minimum wage at the same location and time.

      Assumption A0 about the LRM

      The data consist of N pairs of two related variables:
      $$\hbox{Sample} \equiv \biggl[ (X_1,Y_1), (X_2,Y_2), \dots, (X_N,Y_N) \biggr].$$
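      As a concrete illustration (a minimal Python sketch, not software used in these notes; the numbers are simulated, not real wage data), a sample satisfying A0 is just a collection of N pairs:

          import numpy as np

          rng = np.random.default_rng(0)
          N = 25
          wage = rng.uniform(4.0, 6.0, N)   # hypothetical minimum wages, the X_i
          emp = rng.uniform(0.3, 0.5, N)    # hypothetical employment rates, the Y_i
          sample = list(zip(wage, emp))     # [(X_1, Y_1), ..., (X_N, Y_N)]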

    2. The Population Regression Equation (PRE)

      The PRE is sometimes called the Population Regression Function (PRF). The PRE is the model specified by the researcher for studying the relationship between X and Y.

      Assumption A1

      The variables X and Y are related by:
      $$Y = \beta_1 + \beta_2 X + u \eqno{(*)}$$
      where
      1. Y is an observed random variable (also called the endogenous or left-hand-side variable).
      2. X is an observed non-random or conditioning variable (also called the exogenous or right-hand-side variable).
      3. $\beta_1$ is an unknown population parameter, known as the constant or intercept term.
      4. $\beta_2$ is an unknown population parameter, known as the coefficient or slope parameter.
      5. u is an unobserved random variable, known as the disturbance or error term.

      An equivalent way to write down the PRE is observation by observation:
      $$\eqalign{ Y_1 &= \beta_1 + \beta_2 X_1 + u_1\cr Y_2 &= \beta_1 + \beta_2 X_2 + u_2\cr &\vdots\cr Y_N &= \beta_1 + \beta_2 X_N + u_N\cr}$$
      Since A0 assumes that each observation is drawn from the same population, this way of writing the PRE is equivalent to (*). We will often refer to an arbitrary observation with the index $i$.
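      To make this concrete, here is a minimal simulation sketch in Python (not software used in these notes; the parameter values and the normal disturbance are illustrative assumptions, not part of A1) that generates data obeying the PRE observation by observation:

          import numpy as np

          rng = np.random.default_rng(1)
          N = 100
          beta1, beta2 = 2.0, 0.5        # population parameters (unknown in practice)
          X = rng.uniform(0.5, 10.0, N)  # observed conditioning variable
          u = rng.normal(0.0, 1.0, N)    # unobserved disturbance term
          Y = beta1 + beta2 * X + u      # observed endogenous variable, one equation per i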

    3. Specification of the PRE

      The linear equation (*) might at first appear to be very restrictive. That is, we start our analysis by assuming that X and Y are linearly related. Of course, many relationships are non-linear. Does that mean the LRM cannot deal with them? Not necessarily.

      Here is what appears to be a non-linear model:
      $$Y = \beta_1 + \beta_2 (1/X) + u \eqno{(E1)}$$
      But, we could re-define the exogenous variable:
      $$Z \equiv 1/X$$
      Now (E1) can be written
      $$Y = \beta_1 + \beta_2 Z + u \eqno{(E2)}$$

      In other words, E1 can be re-mapped into the PRE (E2) through a transformation of the exogenous variable X. The key is that we can do the re-mapping $Z = 1/X$ without knowing the values of the population parameters $\beta_1$ and $\beta_2$.
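      As a sketch (in Python, assuming NumPy arrays X and Y that follow (E1)), the re-mapping is a single transformation applied before any estimation:

          import numpy as np

          Z = 1.0 / X                             # needs no knowledge of beta1 or beta2
          # A straight-line fit of Y on Z then estimates (E2), and hence (E1):
          slope, intercept = np.polyfit(Z, Y, 1)  # np.polyfit returns the slope first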

      Here is another model that can be re-mapped into the LRM:
      $$ Y = \beta_1 X^{\beta_2} e^u\eqno{(E3)}$$
      Why would we start out with such an equation? There are many reasons. For one thing, we may have an economic theory that tells us that X and Y should be exponentially related. For another thing, we may be dealing with variables that can only take on positive values. For example, the minimum wage and the unemployment rate are never negative numbers. A direct linear relationship between them allows the possibility that one could generate negative predicted values from the statistical analysis. In the case of the unemployment rate, this would be nonsense. By starting with E3 we guarantee that our model generates positive values.

      We can take logs of both sides of E3:
      $$\ln Y = \ln \beta_1 + \beta_2 \ln X + u\eqno{(E4)}$$
      We have to keep in mind that the intercept in E4 is the natural log of the original coefficient $\beta_1$. We could re-define our variables and parameters:
      $$Y^\star = \beta_1^\star + \beta_2 X^\star + u \eqno{(E5)}$$

      where
      $$Y^\star \equiv \ln Y, \qquad \beta_1^\star \equiv \ln\beta_1, \qquad X^\star \equiv \ln X.$$
      For the obvious reason, Equation (E5) is called the double-log specification. Given our original data, we create the new variables $X^\star$ and $Y^\star$, and then our original model maps into an LRM on the new variables.

      The semi-log specification is:
      $$\ln Y = \beta_1 + \beta_2 X + u$$
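      As a sketch, both log specifications likewise reduce to straight-line fits on transformed data (Python again, assuming NumPy arrays X and Y with strictly positive values):

          import numpy as np

          # Double-log (E5): regress ln Y on ln X
          b2_hat, b1star_hat = np.polyfit(np.log(X), np.log(Y), 1)
          b1_hat = np.exp(b1star_hat)    # undo the re-definition beta1* = ln(beta1)

          # Semi-log: regress ln Y on X itself
          b2_semi, b1_semi = np.polyfit(X, np.log(Y), 1)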

      Here is a model that can't be re-mapped into the LRM:
      $$ Y = \beta_1 + X^{\beta_2}+ u \eqno{(E6)}$$
      There is no way to define functions of X and Y that fit into the LRM without knowing the value of $\beta_2$. But the whole idea of regression is to estimate $\beta_1$ and $\beta_2$ from the data without knowing their values beforehand. E6 is a model that would have to be estimated using non-linear regression techniques. We will see examples of particular non-linear models later in the term. How to choose the specification of the PRE is an important topic. In this class, however, we do not focus on that question. We will instead focus on the simpler question: given a specification of the PRE, what do we do?
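      For contrast, here is a sketch of what estimating (E6) directly involves: a numerical search over $\beta_1$ and $\beta_2$ rather than a re-mapping. (Python with SciPy's general non-linear least squares routine; the starting values p0 are an arbitrary assumption, and X is assumed positive.)

          import numpy as np
          from scipy.optimize import curve_fit

          def e6(x, b1, b2):
              return b1 + x ** b2        # the systematic part of (E6)

          # Non-linear least squares iterates from the starting values p0
          (b1_hat, b2_hat), cov = curve_fit(e6, X, Y, p0=[1.0, 1.0])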

      Once a researcher has specified the PRE, there are two types of statistical procedures that can be performed on the PRE:

      1. Estimation:

        How do we get "good" estimates of $\beta_1$ and $\beta_2$? What assumptions about the PRE make a given estimator a good one?

      2. Inference:

        What can we infer about $\beta_1$ and $\beta_2$ from sample information? That is, how do we form confidence intervals for $\beta_1$ and $\beta_2$ and/or test hypotheses about them?

      The answers to these questions depend crucially upon the other elements of the LRM: the assumptions about the variables.
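      As a preview, here is a sketch of both procedures on the simulated data above (Python with the statsmodels package, which is one common implementation, not the method these notes develop):

          import statsmodels.api as sm

          Xmat = sm.add_constant(X)            # column of ones (beta1) plus X (beta2)
          results = sm.OLS(Y, Xmat).fit()      # estimation: ordinary least squares

          print(results.params)                # point estimates of beta1 and beta2
          print(results.conf_int(alpha=0.05))  # inference: 95% confidence intervals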
