USING REGRESSION ANALYSIS

by Doug Korty

                  The greatest disorder of the mind is to let will direct belief.     Louis Pasteur


CONFIGURAL THINKING      

Herbert Simon won a Nobel Prize in Economics for his research on decision-making in organizations.  One of his accomplishments was to show that many decisions require the analysis of complex, simultaneous relationships but that most human thinking is linear and, therefore, often inadequate.  Only very rare people, such as chess masters, are good at the "configural thinking" that such decisions require.  When Simon gave 'experts' - doctors, psychologists, engineers, etc. - complex problems to solve, they generally did worse than amateurs.  Giving these experts more information about a problem improved their confidence but not their performance.

Because of their cognitive limitations, most people, including experts, are forced to simplify complex problems.  Unfortunately, preconception or prejudice often directs this simplification.  We tend, for example, to look only for evidence that supports our conclusions.  Prejudice, rather than being a minor problem, turns out to be the major obstacle to problem solving.  Experts do worse than amateurs as problem solvers because their preconceptions have become more firmly established.  One of the great Russian chess masters said that he learned many of his best tactics from watching small children play chess because they were free from preconceptions about how chess "should be played."  Karl Popper, the great philosopher of science, wrote that to be scientific we must always state clearly what would constitute evidence against our thesis and honestly search for that evidence.  Few people do this.

Regression analysis provides a useful and understandable tool for analyzing complex, simultaneous relationships.  Regression is an extension of correlation analysis.  Although correlation does not prove causation, it does provide evidence for the possibility of causation.  Correlation is one important test of causality, but there are many examples of spurious correlation (such as the famous correlation between the stork population and the human birth rate) and many ways in which true correlation can be masked.  (For example, vitamin use might improve health but appear to be inversely correlated with health status because people with health problems take more vitamins than healthy people do.)

Regression provides a method for both simplification and description.  Regression allows us to specify and quantify relationships.  It is also a tool for exclusion, the important 'process of elimination': finding out what relationships do not exist.  If a correlation involves a time lag, this is possible evidence for causation, especially if the lag makes sense.  It also argues against the possibility of causation in the opposite direction.  (If monetary cycles reliably lead economic cycles by six months, it is hard to argue that economic events cause monetary events; the reverse is more likely.)
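
One way to look for such a lead is to compute the correlation at several lags and see where it peaks.  The following is a minimal sketch in Python using only the standard library; the data and the function names are hypothetical, and the second series is constructed to repeat the first one two periods later.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / sqrt(s_aa * s_bb)

def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag]; a positive lag means x leads y."""
    if lag == 0:
        return correlation(x, y)
    return correlation(x[:-lag], y[lag:])

# Hypothetical data: y repeats x two periods later.
raw = [2, 5, 1, 4, 3, 6, 2, 7, 1, 3, 6, 4, 8, 5]
x = raw[2:]    # x[t] = raw[t + 2]
y = raw[:-2]   # y[t] = raw[t] = x[t - 2]

by_lag = {lag: lagged_correlation(x, y, lag) for lag in range(5)}
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, by_lag)
```

Here the correlation peaks at a lag of two periods because the data were built that way.  With real series the peak is rarely so clean, and a lag found this way still has to make sense.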

Parsimony is an important principle in science.  The law of parsimony or "Occam's razor" states that when there are two or more competing explanations, you should favor the simplest.  With regression, adding another explanatory variable never lowers, and usually raises, the "raw" R squared, the measure of explanatory power.  However, the additional variable may also lower the adjusted R squared (adjusted for the number of variables).  If it does, it will lower our confidence in the model by reducing the significance of the estimates and widening the confidence intervals.  Even when adding a variable does improve the fit of the model, the variable is often not justified and may not work on sub-periods or sub-samples or on other periods or samples.  Using more than three or four explanatory variables in multivariate regression is rarely justified.
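
The standard adjustment is 1 - (1 - R squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables.  A short sketch in Python shows how a small gain in raw R squared can become a loss after adjustment; the R squared values and sample size are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_vars):
    """Adjust R squared for the number of explanatory variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_vars - 1)

# Hypothetical fits on 20 observations: a third variable nudges
# the raw R squared up from 0.700 to 0.705.
two_vars = adjusted_r_squared(0.700, n_obs=20, n_vars=2)
three_vars = adjusted_r_squared(0.705, n_obs=20, n_vars=3)

print(round(two_vars, 4), round(three_vars, 4))
```

In this example the adjusted R squared falls from roughly 0.665 to roughly 0.650, so the third variable costs more than it contributes.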

Covariability, where the influence of one factor depends on other factors, is very common and is one important reason why we need "configural thinking."  Multiple regression and partial correlation analysis can be used to examine covariability by comparing the effects of using different combinations of explanatory variables.  Combined terms (e.g. a*b, a/b, a+b, a-b, a^b) may be useful to model covariability; the product a*b is the usual interaction term.


SELECTING EXPLANATORY VARIABLES

The criteria for selecting explanatory variables are that they be justifiable factors and that they be independent of each other (not correlated).  We often think that we know what the likely explanatory variables are, but it is important to go beyond our preconceptions and search for possible variables with an open mind.  We need to look critically at variable definitions and reconsider our habits of mind.  For example, GNP and GDP are now commonly used economic terms; however, few people understand their complicated construction and many have challenged their accuracy or value as measures of economic output.  A simpler variable, such as industrial production, might prove to be superior.

The exact variable used is sometimes unimportant; particular variables are often proxies for other variables and many variables may be highly correlated with others.  In time series regressions, for example, time is often a proxy for other causal variables, such as population or income, which may not be available for the analysis.  Using time as the proxy explanatory variable has the advantage that it is straightforward to project into any forecast period.  However, the implicit assumption is that the actual causal variables will have both linear trends and stable relationships to the dependent variable.  This assumption is seldom made clear and often not true.
 
If two or more independent variables are highly correlated with each other, we can construct a composite of the correlated variables and use this as a single variable.  Variables can be combined using factor analysis or simpler methods.  Regressing individual variables against the composite and examining the residuals may reveal which variables supply the most useful independent information. 

We can use binary or dummy variables (0, 1) in regression when a factor, e.g., a Bachelor's degree, may influence the dependent variable, e.g., income.  This may be superior to using a continuous quantitative variable such as years of education.  These "qualitative" variables may affect both intercept and slope and may require separate terms.  If there are additional categories, such as Master's and Doctoral degrees, it is better to use separate dummy variables.  Combining them into one variable (0,1,2,3) would assume equal impacts going from lower to higher categories.  We should use dummy variables sparingly, however, since few causal factors simplify easily to yes or no choices and influences.
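
A minimal sketch of separate dummy variables in Python follows.  The income figures are hypothetical, and the coding assumes each person is classified by highest degree attained, with "none" as the omitted baseline category.

```python
# Highest degree attained; "none" is the omitted baseline category.
DEGREES = ("bachelors", "masters", "doctorate")

def degree_dummies(highest_degree):
    """Return one 0/1 indicator per degree, all zero for the baseline."""
    return [1 if highest_degree == d else 0 for d in DEGREES]

def mean(values):
    return sum(values) / len(values)

# Hypothetical incomes (in thousands) grouped by highest degree.
incomes = {
    "none": [30, 34, 32],
    "bachelors": [50, 54, 52],
    "masters": [58, 62, 60],
    "doctorate": [63, 67, 65],
}

# With only the dummies in the model, least squares reproduces the group
# means: the intercept is the baseline mean and each dummy coefficient is
# that group's mean minus the baseline mean.
intercept = mean(incomes["none"])
coefficients = {d: mean(incomes[d]) - intercept for d in DEGREES}
print(degree_dummies("masters"), intercept, coefficients)
```

In these made-up figures the steps between categories are 20, 8 and 5, which a single (0,1,2,3) variable could not represent because it forces every step to be the same size.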

We can use interaction terms, e.g., Y times Z, when the effect of one variable depends directly upon another.  One or both can be dummy terms.  We may use "distributed lags" when we suspect the variable's influence is distributed over more than one period.  There are techniques for calculating how to distribute these lags.  We can transform, disaggregate, or combine variables to create new calculated variables.  Ratios or differences between variables are often useful.  The coefficients of some variables can be set or adjusted based on previous knowledge.
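
When one of the interacting variables is a dummy, the interaction term lets the slope differ between the two groups, which is equivalent to fitting a separate line to each group.  The sketch below illustrates this in Python with hypothetical data; the names fit_line, dummy_shift and interaction are illustrative only.

```python
def fit_line(x, y):
    """Simple least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# Hypothetical data: the same x values observed in two groups (d = 0, d = 1).
x = [1, 2, 3, 4]
y_group0 = [3, 5, 7, 9]       # follows 1 + 2x
y_group1 = [8, 13, 18, 23]    # follows 3 + 5x

a0, b0 = fit_line(x, y_group0)
a1, b1 = fit_line(x, y_group1)

# Coefficients of the equivalent single model
#   y = const + slope*x + dummy_shift*d + interaction*(x*d)
const, slope = a0, b0
dummy_shift = a1 - a0    # change in intercept when d = 1
interaction = b1 - b0    # change in slope when d = 1

def predict(x_value, d):
    return const + slope * x_value + dummy_shift * d + interaction * x_value * d

print(const, slope, dummy_shift, interaction, predict(2, 1))
```

The dummy's own coefficient shifts the intercept and the interaction coefficient shifts the slope, so a factor that affects both needs both terms.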

All variables should be logical as well as useful for improving the fit of the model.  It is usually possible to find variables that "work"; however, the variable also needs to make sense.  We should test all variables in combination with the other possible variables to assess their particular contributions.  Interpreting relationships is more difficult than establishing correlation.  Regression coefficients are often hard to interpret or meaningless, especially if there are many possible combinations of explanatory variables.  However, in some cases coefficients can be extremely meaningful and useful in describing and quantifying actual relationships.


TESTING PROCEDURES

Testing explanatory variables begins with looking at coefficients, standard errors and t ratios; but these statistics alone may not be enough for the final selection of variables.  One way to test for significance is to watch the coefficient of one variable as other uncorrelated variables and combinations of variables are added to the model.  If the coefficient is stable, this is evidence for the importance of the variable.  This is a particularly useful method when testing simultaneously for seasonal terms with other variables in time series regressions.  Both the meaning and significance of coefficients depend on their stability.
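
The stability check can be sketched in Python with the closed-form solution for a regression on two explanatory variables.  The data are hypothetical: the dependent variable is built as 3 times x1 plus a second variable, which is uncorrelated with x1 in one case and strongly correlated with it in the other.

```python
def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def slope_alone(x1, y):
    """Coefficient of x1 when it is the only explanatory variable."""
    x1, y = centered(x1), centered(y)
    return dot(x1, y) / dot(x1, x1)

def slope_with_control(x1, x2, y):
    """Coefficient of x1 when x2 is also in the model (two-variable OLS)."""
    x1, x2, y = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    return (s22 * dot(x1, y) - s12 * dot(x2, y)) / (s11 * s22 - s12 ** 2)

# Hypothetical data.
x1 = [-2, -1, 0, 1, 2]
x2_uncorrelated = [2, -1, -2, -1, 2]   # zero correlation with x1
x2_correlated = [-3, 0, 0, 0, 3]       # strongly correlated with x1

y_a = [3 * a + b for a, b in zip(x1, x2_uncorrelated)]
y_b = [3 * a + b for a, b in zip(x1, x2_correlated)]

stable = (slope_alone(x1, y_a), slope_with_control(x1, x2_uncorrelated, y_a))
unstable = (slope_alone(x1, y_b), slope_with_control(x1, x2_correlated, y_b))
print(stable, unstable)
```

In the first case the coefficient of x1 stays at 3 when the second variable is added.  In the second it moves from 4.2 to 3, because x1 alone had been absorbing part of the correlated variable's effect.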

Generally it makes sense to drop variables with very small coefficients.  We can standardize variables to more easily compare coefficients.  However, an explanatory variable with a small coefficient is sometimes important to a model; analysis of residuals is useful in identifying such cases.  Stepwise regression techniques, which include the option of throwing out previous selections at each step, are important tools in variable selection.  We can examine all possible subsets when there are not too many variables.  It is important to combine these testing processes with other relevant information and to carefully think through the logic of possible relationships.

Any explanatory variable with a linear trend can explain the linear trend in any dependent variable, whether or not the two are actually related.  Therefore, there is no necessary significance to this correlation.  Explaining the deviations from linear trends is the more difficult and more important objective.  Variables can have their trends removed (e.g., by using first differences) or a linear explanatory variable can be added to account for the trend in the dependent variable (e.g., time in time series regressions).
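
A small Python sketch illustrates the point with two hypothetical series that share nothing but an upward trend.  The R squared of a simple regression is computed in levels and then again on first differences.

```python
def r_squared(x, y):
    """R squared of a simple regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy ** 2 / (s_xx * s_yy)

def first_differences(series):
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical series: both trend upward, but their wiggles are unrelated.
wiggle_x = [0, 1, 0, -1] * 3
wiggle_y = [1, 0, -1, 0] * 3
x = [t + w for t, w in enumerate(wiggle_x)]
y = [2 * t + w for t, w in enumerate(wiggle_y)]

r2_levels = r_squared(x, y)
r2_differences = r_squared(first_differences(x), first_differences(y))
print(round(r2_levels, 3), round(r2_differences, 3))
```

In levels the R squared is about 0.95; after differencing it drops to about 0.01.  Nearly all of the apparent fit came from the shared trend.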


EVALUATING REGRESSION MODELS

Any errors in the data represent a source of error in the model and will complicate the selection of variables and the evaluation of models.  We should attempt to improve data accuracy or at least to estimate the possible level of error.  In comparing the relative merits of different regression models, statistics of fit can be useful.  However, there are many caveats when evaluating model statistics.  Trends in series should be removed.  The R squared of a model measures the explained variation as a percentage of the total variation.  If trends are not removed, the steeper the slope of the dependent variable, the higher the R squared is likely to be, because any variable with a linear trend can account for the linear trend of any dependent variable.  If the trend is not removed, the coefficient of variation is more meaningful than R squared since it compares unexplained variation to the mean of the series rather than to the total variation.
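
The contrast between the two statistics can be sketched in Python.  The series are hypothetical: both have the same mean and identical residuals around a fitted trend line, and only the steepness of the trend differs.  The coefficient of variation is taken here as the root-mean-square residual divided by the mean of the series, which is one common definition and an assumption of this sketch.

```python
from math import sqrt

def fit_statistics(actual, fitted):
    """Return (R squared, coefficient of variation) for a fitted series."""
    n = len(actual)
    mean_actual = sum(actual) / n
    unexplained = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - unexplained / total
    # Root-mean-square residual relative to the mean of the series
    # (dividing by n, not by degrees of freedom, to keep the sketch short).
    coefficient_of_variation = sqrt(unexplained / n) / mean_actual
    return r_squared, coefficient_of_variation

# Hypothetical series with the same mean and identical residuals, but
# trends of different steepness.
residuals = [0.5, -1.0, 1.0, -1.0, 0.5]
gentle_trend = [98, 99, 100, 101, 102]
steep_trend = [90, 95, 100, 105, 110]

gentle = fit_statistics([t + e for t, e in zip(gentle_trend, residuals)], gentle_trend)
steep = fit_statistics([t + e for t, e in zip(steep_trend, residuals)], steep_trend)
print(gentle, steep)
```

The steeper series has an R squared of about 0.99 against about 0.74 for the gentler one, yet the unexplained variation is the same in both and the coefficient of variation is identical.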

The most important possible specification errors are: omission of relevant variables; inclusion of irrelevant variables; qualitative change in an explanatory variable; and nonlinearity.  Sub-sampling and residual analysis are good methods for finding these errors and evaluating models.  Sub-sampling refers to testing the model on sections of a data series or sample to see if the fit is stable.  If the coefficients and fit are stable, this confirms the explanatory power of the model. Similarly, testing the model on other samples of similar data or increasing the size of the original sample can be useful.  If a model fits one part of the data better than another, this may be due to some previously unseen difference in the two sub-samples.  This difference could be modeled by adding an explanatory variable to the model.  A dummy variable might be used but only if it corresponds to some legitimate causal difference.  (One-time changes in relationships could be caused by events such as inventions or institutional changes.)
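
Sub-sampling can be as simple as fitting the same model to each half of the data and comparing the results.  The Python sketch below uses a hypothetical series whose slope changes halfway through; slope_of is an illustrative helper, not a library function.

```python
def slope_of(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

# Hypothetical series whose relationship changes halfway through.
x = list(range(1, 11))
y = [2 * v + 1 for v in x[:5]] + [3 * v - 4 for v in x[5:]]

half = len(x) // 2
slope_full = slope_of(x, y)
slope_first = slope_of(x[:half], y[:half])
slope_second = slope_of(x[half:], y[half:])
print(round(slope_full, 3), slope_first, slope_second)
```

The full-sample slope of about 2.58 describes neither half: the first half has a slope of 2 and the second a slope of 3.  A gap like this is the signal to look for a legitimate causal difference between the sub-samples.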

Residual analysis is critical in checking for errors and evaluating models.  Residuals should be random and independent and should have constant variance.  Violations of these requirements often provide clues to the type and source of error in the model.  Autocorrelation in residuals suggests an unexplained systematic influence; this usually is a sign of an omitted variable or nonlinearity.
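
One standard diagnostic for first-order autocorrelation in residuals is the Durbin-Watson statistic.  It is not named in the discussion above and is offered here only as one common way to make the check concrete.  Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.  The residual series in this Python sketch are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    changes = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return changes / sum(e ** 2 for e in residuals)

# Hypothetical residuals in time order.
persistent = [1, 1, 1, -1, -1, -1] * 2   # runs of the same sign
alternating = [1, -1] * 6                # sign flips every period

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print(dw_persistent, round(dw_alternating, 3))
```

The persistent series gives a statistic of 1.0 and the alternating series gives about 3.67.  Either pattern indicates that the residuals still contain systematic information the model has not captured.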

FORECASTING

Regression is a tool for the analysis of relationships among known variables.  Using regression to forecast the dependent variable assumes the stability of all of the estimated relationships through the forecast period.  Confidence in the forecast is justified only to the degree that this assumption is valid.  Forecasting the dependent variable also requires forecasting the explanatory variables, unless the lags involved are longer than the forecast period.  Using forecasted explanatory variables to forecast the dependent variable may be so unreliable as to make the whole exercise futile.  If both regressions have an R squared of 0.7, the combination can be expected to explain only about 49% of the variation in the dependent variable (0.7 x 0.7).  However, the more important an explanatory variable is, the more incentive we have to attempt to forecast it.  Time and seasonal terms are mechanical and pose no problem, but the difficulty of forecasting other explanatory variables may argue strongly against their use.  Time and seasonal terms are usually proxies for other variables.  We cannot assume the relationships that made them useful proxies for these factors will continue.  We can run alternative forecasts with and without any particular explanatory variable to test the impact and importance of that variable on the forecast.

When we use regression to forecast, we should be even more aware of the limitations and difficulties involved in any quantitative analysis.  Regression is a valuable tool for analyzing complex simultaneous relationships; but forecasting creates great challenges for our limited powers of understanding.



