You are working with the text-only light edition of "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8". |
Correlation and Causality |
Observing a correlation between two variables may tempt one to see a causal relationship between them. However, this is often not the case. The following summary presents the most important ways in which a correlation can arise without a causal link. |
Correlation by formal means | If two independent variables X and Y are divided by a variable Z which
is correlated to either X or Y, the resulting variables X' = X/Z and Y' = Y/Z are correlated.
The same is true for variables which are normalized to a sum of 100 percent (as is often the case with tables of nutritive values). Such variables are forced toward negative correlation, because a larger share of one component leaves less for the others. |
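The effect of normalizing to 100 percent can be illustrated with a small simulation. The following is a minimal sketch, assuming NumPy is available; the function name `closure_correlations`, the number of components, and the uniform raw amounts are hypothetical choices made only for illustration.

```python
import numpy as np


def closure_correlations(n_samples=5000, n_components=3, seed=0):
    """Compare correlations before and after normalizing rows to 100 percent."""
    rng = np.random.default_rng(seed)
    # Independent, strictly positive raw amounts (synthetic, invented data).
    raw = rng.uniform(1.0, 10.0, size=(n_samples, n_components))
    # Normalize each row so that its components sum to 100 percent.
    percent = 100.0 * raw / raw.sum(axis=1, keepdims=True)
    # Columns are variables, rows are observations.
    raw_corr = np.corrcoef(raw, rowvar=False)
    closed_corr = np.corrcoef(percent, rowvar=False)
    return raw_corr, closed_corr


raw_corr, closed_corr = closure_correlations()
print("raw amounts:\n", np.round(raw_corr, 2))
print("percentages:\n", np.round(closed_corr, 2))
```

Under these assumptions the raw amounts should be nearly uncorrelated, while the percentages should show clearly negative pairwise correlations, although nothing links the underlying quantities.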
Correlation by inhomogeneity | If the distribution of the data is inhomogeneous, for example because the data consist of several distinct groups, a correlation is
likely to occur. It is therefore advisable to plot the variables against
each other (scatter plot of X vs. Y).
Shoe size is correlated to income. The larger the shoe size, the higher the income. (Solution: women have on average smaller shoe sizes and earned less money than men. Neither group shows an internal correlation, but if both groups are pooled a "correlation" occurs.)
The longer students need to finish their studies, the higher their income afterwards. (Solution: the time required to get a degree depends on the field of study, e.g. the average time to graduate in philosophy is shorter than the time to get a degree in chemistry, and the fields also differ in average income. Within the group of chemists, the income increases with decreasing time of study, but again: pooling the data creates inhomogeneity and leads to the described correlation.) |
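The pooling effect described above can be reproduced with synthetic data. This is a minimal sketch, assuming NumPy; the group means, standard deviations, and the function name `pooled_vs_within` are invented for illustration and are not taken from any real survey.

```python
import numpy as np


def pooled_vs_within(n_per_group=2000, seed=1):
    """Correlation of shoe size and income within two groups and after pooling."""
    rng = np.random.default_rng(seed)
    # Group A: smaller mean shoe size, lower mean income (invented numbers).
    shoe_a = rng.normal(38.0, 1.5, n_per_group)
    income_a = rng.normal(30.0, 5.0, n_per_group)
    # Group B: larger mean shoe size, higher mean income (invented numbers).
    shoe_b = rng.normal(43.0, 1.5, n_per_group)
    income_b = rng.normal(40.0, 5.0, n_per_group)
    # Within each group, shoe size and income are drawn independently.
    r_a = np.corrcoef(shoe_a, income_a)[0, 1]
    r_b = np.corrcoef(shoe_b, income_b)[0, 1]
    # Pooling the two groups creates an inhomogeneous data set.
    shoe = np.concatenate([shoe_a, shoe_b])
    income = np.concatenate([income_a, income_b])
    r_pooled = np.corrcoef(shoe, income)[0, 1]
    return r_a, r_b, r_pooled


r_a, r_b, r_pooled = pooled_vs_within()
print(f"group A: {r_a:.2f}, group B: {r_b:.2f}, pooled: {r_pooled:.2f}")
```

With these settings the within-group coefficients should stay close to zero while the pooled coefficient is clearly positive. A scatter plot colored by group would show two separate clouds rather than one elongated one.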
Additional (hidden) variables | Variables X and Y are correlated, but in fact a third parameter Z,
which is not included in the data set, is correlated to both X and Y. This
is particularly hard to discover, since the parameter Z may well be unknown.
An important subclass of this type of correlation is time series, where
time is the common variable. If both X and Y show a trend in time, correlation
will be observed.
Shoe size is correlated to the calcium content of bones. (Solution: age is the hidden variable. Children have less calcium in their bones than adults, and the shoe size of children is naturally also smaller than that of adults.) |
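The time series case can be sketched as follows: two series that share nothing but a trend in time appear strongly correlated, and the correlation disappears once the trend is removed. This is a minimal sketch, assuming NumPy; the slopes, the noise level, the linear detrending via `np.polyfit`, and the function name `trend_correlation` are illustrative assumptions.

```python
import numpy as np


def trend_correlation(n_points=500, seed=2):
    """Correlation of two trending series before and after linear detrending."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_points, dtype=float)  # time, the common hidden variable
    # Two series with independent noise; both drift upward in time.
    x = 0.05 * t + rng.normal(0.0, 1.0, n_points)
    y = 0.03 * t + rng.normal(0.0, 1.0, n_points)
    r_raw = np.corrcoef(x, y)[0, 1]
    # Subtract a fitted straight line from each series to remove the trend.
    x_res = x - np.polyval(np.polyfit(t, x, 1), t)
    y_res = y - np.polyval(np.polyfit(t, y, 1), t)
    r_detrended = np.corrcoef(x_res, y_res)[0, 1]
    return r_raw, r_detrended


r_raw, r_detrended = trend_correlation()
print(f"raw: {r_raw:.2f}, detrended: {r_detrended:.2f}")
```

Here the hidden variable is known (time), so it can be removed explicitly. When Z is unknown, no such correction is available, which is what makes this case hard to detect.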
Outliers in the data | Outliers can cause high correlations if they lie far enough away
from the rest of the data.
A common spike in the signals of an analytical instrument may result in a high correlation between these signals (note: spikes are a common problem in laboratories; they are caused, for example, by refrigerators switching on or off). |
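The influence of a single common spike can be checked with two otherwise independent noise signals. This is a minimal sketch, assuming NumPy; the signal length, the spike height, and the function name `spike_correlation` are hypothetical values chosen for illustration.

```python
import numpy as np


def spike_correlation(n_points=200, spike_height=50.0, seed=3):
    """Correlation of two noise signals without and with a common spike."""
    rng = np.random.default_rng(seed)
    # Two independent noise signals (synthetic data).
    a = rng.normal(0.0, 1.0, n_points)
    b = rng.normal(0.0, 1.0, n_points)
    r_clean = np.corrcoef(a, b)[0, 1]
    # Add one spike at the same position in both signals.
    a_spiked = a.copy()
    b_spiked = b.copy()
    idx = n_points // 2
    a_spiked[idx] += spike_height
    b_spiked[idx] += spike_height
    r_spiked = np.corrcoef(a_spiked, b_spiked)[0, 1]
    return r_clean, r_spiked


r_clean, r_spiked = spike_correlation()
print(f"without spike: {r_clean:.2f}, with spike: {r_spiked:.2f}")
```

A single point out of 200 is enough to dominate the coefficient in this setting, which is one more reason to inspect a scatter plot before interpreting a correlation.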
As an important consequence, we have to state that mathematical
correlation is no proof of causality.
Last Update: 2005-Jul-16