The Shape of Data
There are many ways to visualize the shape of data. Arguably, the best way to do so is by means of the frequency distribution.
Consider a categorical sample representing the Color Preferences mentioned on the Data Scope page:
{white,green,red,red,blue,yellow,blue,red,yellow,green,yellow,red,white,green,yellow,yellow,
yellow,white,red,yellow,white,blue,yellow,blue,white,yellow,blue,blue,white,yellow}
Just by looking at this sample, it is not easy to see how this variable is shaped:
- Which color (more generally, which category) is preferred most?
- Which color is preferred least?
- Are the preferences distributed evenly?
- Is there a dominant color or group of colors?
- Etc.
Sorting the sample may provide a richer view:
{blue,blue,blue,blue,blue,blue,green,green,green,red,red,red,red,red,white,white,white,white,white,white,
yellow,yellow,yellow,yellow,yellow,yellow,yellow,yellow,yellow,yellow}
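As a minimal sketch, the sorted view can be reproduced in Python. The variable names `sample` and `sorted_sample` are illustrative and not part of the original text; the data is the 30-value sample listed earlier.

```python
# The color-preference sample from the text, one string per observation.
sample = (
    "white,green,red,red,blue,yellow,blue,red,yellow,green,yellow,red,"
    "white,green,yellow,yellow,yellow,white,red,yellow,white,blue,yellow,"
    "blue,white,yellow,blue,blue,white,yellow"
).split(",")

# sorted() orders strings alphabetically, so equal categories become adjacent.
sorted_sample = sorted(sample)

print("{" + ",".join(sorted_sample) + "}")
```

Sorting only groups equal values together; the reader still has to count each run by eye.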
Even a primitive aggregate view, showing the frequencies of the categories, does a better job:
| red | 5 | ***** |
| blue | 6 | ****** |
| green | 3 | *** |
| white | 6 | ****** |
| yellow | 10 | ********** |
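One possible way to build such a frequency view is with `collections.Counter` from the Python standard library. This is a sketch under the assumption that the sample is held as a list of strings; the names `sample` and `frequencies` are illustrative. The rows are printed alphabetically, so their order differs from the table above, but the counts are the same.

```python
from collections import Counter

# The color-preference sample from the text, one string per observation.
sample = (
    "white,green,red,red,blue,yellow,blue,red,yellow,green,yellow,red,"
    "white,green,yellow,yellow,yellow,white,red,yellow,white,blue,yellow,"
    "blue,white,yellow,blue,blue,white,yellow"
).split(",")

# Counter maps each distinct category to the number of times it occurs.
frequencies = Counter(sample)

# One row per category: name, count, and a bar of asterisks of that length.
for category in sorted(frequencies):
    count = frequencies[category]
    print(f"{category:<8}{count:>3}  {'*' * count}")
```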
Right away, one can see which category dominates and which ones are less popular or neglected.
Arranging the frequency distribution in descending order of frequency provides even better insight into the shape of the data. A diagram based on such a distribution is referred to as a Pareto diagram:
| yellow | 10 | ********** |
| blue | 6 | ****** |
| white | 6 | ****** |
| red | 5 | ***** |
| green | 3 | *** |
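The descending arrangement can be sketched in Python by sorting the category counts. Breaking ties alphabetically (blue before white) is an assumption made here to match the row order shown above; the text does not state a tie-breaking rule. The names `sample`, `frequencies`, and `pareto` are illustrative.

```python
from collections import Counter

# The color-preference sample from the text, one string per observation.
sample = (
    "white,green,red,red,blue,yellow,blue,red,yellow,green,yellow,red,"
    "white,green,yellow,yellow,yellow,white,red,yellow,white,blue,yellow,"
    "blue,white,yellow,blue,blue,white,yellow"
).split(",")

frequencies = Counter(sample)

# Sort by count, highest first; ties fall back to alphabetical order
# (an assumed rule, chosen so the result matches the table in the text).
pareto = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))

for category, count in pareto:
    print(f"{category:<8}{count:>3}  {'*' * count}")
```

`Counter.most_common()` would also sort by descending count, but it orders ties by first appearance in the sample, which would place white ahead of blue here.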
