You are working with the text-only light edition of "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8".

Correlation and Causality

Observing a correlation between two variables may tempt one to assume a causal relationship between them. However, this is often not the case. The following summary presents the most important ways in which a correlation can arise without any causal link.

Correlation by formal means: If two independent variables X and Y are each divided by a variable Z which is correlated to either X or Y, the resulting variables X' = X/Z and Y' = Y/Z are correlated, because both now share the common divisor Z.

The same is true for variables which are normalized to a sum of 100 percent (as is often the case with tables of nutritive values). Such variables tend to show a negative correlation, since an increase in one share must be offset by a decrease in the others.
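Both effects can be reproduced with simulated data. The following is a minimal sketch using only the Python standard library. The uniform distributions, the sample size, the seed and the helper name `pearson` are assumptions made for illustration; in particular, Z is drawn independently of X and Y here, which is the simplest case in which the ratio effect appears.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # arbitrary fixed seed, for reproducibility only
n = 5000

# Three independent variables (hypothetical data).
x = [rng.uniform(1, 2) for _ in range(n)]
y = [rng.uniform(1, 2) for _ in range(n)]
z = [rng.uniform(1, 2) for _ in range(n)]

# Correlation of the raw variables: close to zero.
r_raw = pearson(x, y)

# Dividing both by the common variable Z induces a positive correlation.
x_ratio = [a / c for a, c in zip(x, z)]
y_ratio = [b / c for b, c in zip(y, z)]
r_ratio = pearson(x_ratio, y_ratio)

# Normalizing the three variables to a sum of 100 percent
# induces a negative correlation between the shares.
totals = [a + b + c for a, b, c in zip(x, y, z)]
x_pct = [100 * a / t for a, t in zip(x, totals)]
y_pct = [100 * b / t for b, t in zip(y, totals)]
r_pct = pearson(x_pct, y_pct)

print(f"raw X vs. Y:        {r_raw:+.2f}")
print(f"X/Z vs. Y/Z:        {r_ratio:+.2f}")
print(f"percent X vs. Y:    {r_pct:+.2f}")
```

The raw variables are unrelated by construction, so any correlation seen in the ratios or in the percentages stems from the transformation alone.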

Correlation by inhomogeneity: If the distribution of the data is inhomogeneous, for instance because the data consist of several distinct groups, a correlation is likely to occur. It is therefore advisable to plot the variables against each other (scatter plot of X vs. Y).

Shoe size is correlated to income: the larger the shoe size, the higher the income. (Solution: women earned less money than men and, on average, have smaller shoe sizes. Neither group shows an internal correlation, but if both groups are pooled a "correlation" occurs.)

The longer students need to finish their studies, the higher their income afterwards. (Solution: the time required to get a degree depends on the field of study, e.g. the average time to graduate in philosophy is shorter than the time to get a degree in chemistry. Within the group of chemists, the income increases with decreasing time of study, but again, pooling the data creates inhomogeneity and leads to the described correlation.)
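The pooling effect of the shoe size example can be sketched with two simulated groups. All numbers below (group means, spreads, group sizes, seed) are invented for illustration and are not taken from any real survey; `pearson` is a small helper defined in the block itself.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)  # arbitrary fixed seed
n = 2000                # observations per group (hypothetical)

# Group A: smaller shoe sizes, lower incomes. Within the group,
# shoe size and income are generated independently of each other.
shoe_a = [rng.gauss(38, 1.5) for _ in range(n)]
income_a = [rng.gauss(2000, 300) for _ in range(n)]

# Group B: larger shoe sizes, higher incomes, again independent.
shoe_b = [rng.gauss(43, 1.5) for _ in range(n)]
income_b = [rng.gauss(2600, 300) for _ in range(n)]

r_a = pearson(shoe_a, income_a)
r_b = pearson(shoe_b, income_b)

# Pooling both groups creates an inhomogeneous data set.
r_pooled = pearson(shoe_a + shoe_b, income_a + income_b)

print(f"group A: {r_a:+.2f}")
print(f"group B: {r_b:+.2f}")
print(f"pooled:  {r_pooled:+.2f}")
```

Computing the correlation separately per group, as done here, is one simple check whenever a scatter plot suggests that the data fall into clusters.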

Additional (hidden) variables: Variables X and Y are correlated, but in fact a third parameter Z, which is not included in the data set, is correlated to both X and Y. This is particularly hard to discover, since the parameter Z may well be unknown. An important subclass of this type of correlation occurs in time series, where time is the common variable. If both X and Y show a trend in time, a correlation will be observed.

Shoe size is correlated to the calcium content of bones. (Solution: children have less calcium in their bones than adults, and naturally the shoe size of children is also smaller than that of adults.)
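When the third variable Z is known and has been measured, its influence can be removed with the first-order partial correlation, r(XY.Z) = (r(XY) - r(XZ) r(YZ)) / sqrt((1 - r(XZ)^2)(1 - r(YZ)^2)). The sketch below applies this to two simulated time series that share only a linear trend; the slopes, noise level, series length, seed and function names are assumptions chosen for illustration. It does not help in the case described above where Z is unknown.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(xs, ys, zs):
    """First-order partial correlation of xs and ys, controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(1)  # arbitrary fixed seed
n = 2000

# Time is the common variable Z.
t = list(range(n))

# Two series with independent noise; each merely follows a trend in time.
x = [0.01 * ti + rng.gauss(0, 1) for ti in t]
y = [0.02 * ti + rng.gauss(0, 1) for ti in t]

r_xy = pearson(x, y)                 # high, caused by the shared trend
r_xy_given_t = partial_corr(x, y, t) # near zero once time is accounted for

print(f"X vs. Y:              {r_xy:+.2f}")
print(f"X vs. Y, given time:  {r_xy_given_t:+.2f}")
```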

Outliers in the data: A single outlier can cause a high correlation if it lies far enough away from the rest of the data.

A common spike in the signals of an analytical instrument may result in a high correlation between these signals (note: spikes are a common problem in laboratories; they are caused, for example, by refrigerators switching on or off).
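The effect of a single common spike can be sketched with two simulated noise signals. The signal length, the spike height and position, and the seed are hypothetical values chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)  # arbitrary fixed seed
n = 200

# Two signals consisting of independent noise only.
signal_1 = [rng.gauss(0, 1) for _ in range(n)]
signal_2 = [rng.gauss(0, 1) for _ in range(n)]
r_clean = pearson(signal_1, signal_2)

# The same signals with one common spike at the same position.
spiked_1 = list(signal_1)
spiked_2 = list(signal_2)
spike_index = 100       # hypothetical position of the disturbance
spiked_1[spike_index] += 50.0
spiked_2[spike_index] += 50.0
r_spiked = pearson(spiked_1, spiked_2)

print(f"without spike: {r_clean:+.2f}")
print(f"with spike:    {r_spiked:+.2f}")
```

A single point out of 200 dominates the result, which is why inspecting a scatter plot before interpreting a correlation coefficient is advisable.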

As an important consequence, we have to state that a mathematical correlation is no proof of causality.

Last Update: 2005-Jul-16