You are working with the text-only light edition of "H.Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8".

Histograms

Histograms are an efficient and common way to describe the distribution of a continuous variable. In general, a histogram plots the frequency with which observations occur within given fixed-width intervals. Histograms can be regarded as a type of classification of the data: each sample is sorted into one of several "bins" according to some property. For example, with intervals of width 10 starting at 0, an observation of 23 is counted in the bin that covers 20 to 30.
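The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the book: the function name `histogram_counts`, the half-open interval convention (lower edge included, upper edge excluded), and the sample data are all assumptions made for the example.

```python
import math

def histogram_counts(data, start, width, nbins):
    """Count how many observations fall into each fixed-width interval.

    Bin i covers [start + i*width, start + (i+1)*width); this half-open
    convention is an assumption. Values outside the range are ignored.
    """
    counts = [0] * nbins
    for x in data:
        i = math.floor((x - start) / width)  # index of the bin containing x
        if 0 <= i < nbins:
            counts[i] += 1
    return counts

# Hypothetical sample: 10 observations, 4 bins of width 10 starting at 0
sample = [3, 7, 12, 15, 18, 21, 22, 29, 35, 38]
counts = histogram_counts(sample, start=0, width=10, nbins=4)
print(counts)  # [2, 3, 3, 2]
```

Plotting one bar per bin, with the bar height equal to the count, gives the histogram.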

An important question is how many intervals (classes) to use for the histogram: if the number of classes is too low or too high, the histogram may hide the information in the data. There are several rules of thumb for the number of classes:

nclass
nclass ~ 2
nclass ~ 10 · log10(n)

Here n is the number of observations. The last equation is unsuitable for a low number of observations (n < 50), where it leaves only a few observations per class.
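To see why the last rule of thumb breaks down for small samples, it can be evaluated for a few sample sizes. The sketch below assumes that n is the number of observations and that the result is rounded to the nearest integer. The helper name `nclass_log10` is hypothetical.

```python
import math

def nclass_log10(n):
    """Number of classes suggested by the rule nclass ~ 10 * log10(n)."""
    return round(10 * math.log10(n))

for n in (10, 20, 50, 100, 1000):
    k = nclass_log10(n)
    # Average number of observations that end up in each class
    print(f"n={n:5d}  nclass={k:3d}  observations per class={n / k:.1f}")
```

For n = 10 the rule suggests 10 classes, one observation per class on average, and for n = 50 it suggests 17 classes. Only for larger samples does each class hold enough observations to give a stable picture.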

When constructing a histogram, take care that the areas of the bars are strictly proportional to the underlying frequencies. Readers tend to misinterpret diagrams that lack this proportionality. In addition, avoid unequal bar widths: with equal widths, the frequencies can be read directly from the heights of the bars.
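If unequal widths cannot be avoided, the proportionality between area and frequency can be kept by dividing each frequency by its bar width. The following sketch assumes this normalization (bar area equals relative frequency). The function name `bar_heights` and the data are hypothetical.

```python
def bar_heights(counts, edges):
    """Return bar heights such that each bar's area equals its relative frequency."""
    total = sum(counts)
    heights = []
    for i, c in enumerate(counts):
        width = edges[i + 1] - edges[i]
        heights.append(c / (total * width))  # area = height * width = c / total
    return heights

# Hypothetical data: same count in both bins, but the second bin is twice as wide
edges = [0, 10, 30]
counts = [20, 20]
heights = bar_heights(counts, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
print(heights)
print(areas)
```

Both bins hold the same number of observations, so both bars have the same area, but the wider bar is only half as high (0.025 instead of 0.05). Drawing both bars with the same height would make the wider bin look twice as frequent.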

Histograms are, by definition, staircase functions. Frequency polygons offer a smoother alternative.
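A frequency polygon is commonly drawn by connecting the midpoints of the tops of the histogram bars with straight lines. The sketch below computes the vertices for equal-width bins. It assumes the usual convention of adding one empty bin at each end so that the polygon returns to zero. The function name `frequency_polygon` and the counts are hypothetical.

```python
def frequency_polygon(counts, start, width):
    """Return (x, y) vertices of a frequency polygon for equal-width bins.

    Each vertex sits at a bin midpoint at the height of that bin's count;
    one empty bin is added at each end so the polygon returns to zero.
    """
    padded = [0] + list(counts) + [0]
    points = []
    for i, c in enumerate(padded):
        midpoint = start + (i - 0.5) * width  # i = 0 is the extra bin left of start
        points.append((midpoint, c))
    return points

# Counts for 4 bins of width 10 starting at 0
points = frequency_polygon([2, 3, 3, 2], start=0, width=10)
print(points)
```

Connecting consecutive points with line segments gives the polygon. The vertices here lie at x = -5, 5, 15, 25, 35 and 45.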


Last Update: 2006-Jan-17