Dylan | Aug 26, 2019
Histograms are an invaluable tool for visualizing the distribution of data. First introduced in 1891 by the English mathematician Karl Pearson, they are now a staple of statistics and are often one of the very first things taught in introductory statistics courses.
Let's dive right into learning about histograms by using them to create some visualizations of a dataset consisting of the weights of 400 mice. This dataset has been made available by the University of Luxembourg and the .csv is available for download here.
With some quick exploration of our data using pandas, we find three columns: strain, sex, and weight.
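As a rough sketch of what that exploration might look like, the snippet below reads a few rows and inspects the columns. The rows are synthetic stand-ins, not values from the real dataset, and the strain labels, the "f"/"m" coding for sex, and the filename mentioned in the comment are all assumptions made for illustration.

```python
import io

import pandas as pd

# Synthetic stand-in rows, NOT the real measurements. With the real file you
# would call pd.read_csv("mice.csv") instead ("mice.csv" is a hypothetical name).
csv_text = """strain,sex,weight
strain_a,f,20.8
strain_a,m,29.1
strain_b,f,21.4
strain_b,m,30.2
"""
mice = pd.read_csv(io.StringIO(csv_text))

# List the column names, then summarize the numeric weight column.
print(mice.columns.tolist())
print(mice["weight"].describe())
```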
For this example, we'll only consider sex and weight. Since we're interested in visualizing the distribution of our data, a histogram is a fantastic tool for the job. Using matplotlib, we're able to generate the following histogram.
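A histogram like this can be drawn with a few lines of matplotlib. The sketch below is not the exact code behind the figure: it uses randomly generated weights as a stand-in for the real column (two groups centered near 21 g and 29.5 g, an assumption chosen to resemble the data described in this post), and the axis labels and title are my own.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the weight column (NOT the real measurements).
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(21.0, 1.5, 200),
    rng.normal(29.5, 1.5, 200),
])

fig, ax = plt.subplots()
# ax.hist returns the count in each bin, the bin edges, and the drawn bars.
counts, edges, bars = ax.hist(weights, bins=5, edgecolor="black")
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Number of mice")
ax.set_title("Distribution of mouse weights")
# Call plt.show() to display the figure in an interactive session.
```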
The x-axis spans the range of recorded weights in our dataset, from 17.41 grams to 32.85 grams, while the y-axis displays the number of mice that fall within a given weight range.
Each bar in the histogram represents a bin, a range of weights into which measurements are counted. Matplotlib allows us to specify the number of bins we would like our histogram to have. The previous example used 5 bins, which means the range of the x-axis was divided into 5 equal-width intervals.
By changing the number of bins in our histogram to 20, the newly generated histogram looks quite different.
Now we can see our x-axis has been split into 20 equal-width intervals. Increasing the number of bins lets us see finer detail in the distribution, at least up to a point.
This quick visualization has revealed something interesting about our dataset. There appear to be two peaks in our data, the first around 21 grams and the second around 29.5 grams. In statistics, this is known as a bimodal distribution. Bimodal distributions often indicate the existence of two distinct groups in our data.
In this particular example, some underlying understanding of biology may lead us to hypothesize that the two distinct weight groups are likely caused by the two different sexes. Thankfully, our dataset contains information on sex too!
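To make the idea of a bin concrete, here is a small sketch that computes the bin edges and counts numerically with NumPy's histogram function rather than drawing them. The weights are again a synthetic stand-in, so the printed ranges will not match the real dataset.

```python
import numpy as np

# Synthetic stand-in for the weight column (NOT the real measurements).
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(21.0, 1.5, 200),
    rng.normal(29.5, 1.5, 200),
])

# Split the range from the lightest to the heaviest mouse into 5 equal bins.
counts, edges = np.histogram(weights, bins=5)
widths = np.diff(edges)  # width of each bin; all 5 should be the same

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} g to {right:6.2f} g: {count} mice")
```

Five bins need six edges, and every mouse lands in exactly one bin, so the counts add up to the number of rows.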
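One way to see the effect of the bin count is to draw both versions side by side. This is a sketch on synthetic stand-in weights, and the side-by-side layout is my own choice rather than how the figures in this post were produced.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the weight column (NOT the real measurements).
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(21.0, 1.5, 200),
    rng.normal(29.5, 1.5, 200),
])

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
counts_by_bins = {}
for ax, n_bins in zip(axes, (5, 20)):
    # Same data in both panels; only the number of bins changes.
    counts, edges, _ = ax.hist(weights, bins=n_bins, edgecolor="black")
    counts_by_bins[n_bins] = counts
    ax.set_title(f"{n_bins} bins")
    ax.set_xlabel("Weight (g)")
    ax.set_ylabel("Number of mice")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```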
To test our hypothesis, let's plot two histograms, one for males in blue and another for females in pink. Then we'll have matplotlib superimpose (lay on top of one another) the two histograms onto the same graph.
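Here is a sketch of how the two histograms could be overlaid. It assumes the sex column is coded as "f" and "m" (the real coding may differ) and uses synthetic weights. Both groups share one set of bin edges so their bars line up, and partial transparency keeps the bars behind visible. These are design choices of this sketch, not necessarily what the original figure used.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real measurements); "f"/"m" coding is assumed.
rng = np.random.default_rng(0)
mice = pd.DataFrame({
    "sex": ["f"] * 200 + ["m"] * 200,
    "weight": np.concatenate([
        rng.normal(21.0, 1.5, 200),
        rng.normal(29.5, 1.5, 200),
    ]),
})

# Compute one set of 20 bins over all weights so both groups use the same edges.
edges = np.histogram_bin_edges(mice["weight"], bins=20)
male = mice.loc[mice["sex"] == "m", "weight"]
female = mice.loc[mice["sex"] == "f", "weight"]

fig, ax = plt.subplots()
m_counts, _, _ = ax.hist(male, bins=edges, color="blue", alpha=0.5, label="Male")
f_counts, _, _ = ax.hist(female, bins=edges, color="pink", alpha=0.5, label="Female")
ax.set_xlabel("Weight (g)")
ax.set_ylabel("Number of mice")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```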
Great! Our histogram appears to support our hypothesis. There is certainly a clear grouping of weight based on sex.
Choosing the Number of Bins
By having too few bins, we fail to accurately capture the trends in our data. By having too many bins, our histogram becomes too noisy to reveal any meaningful pattern.
As a general rule of thumb, the more examples or rows in our data, the more bins we'll need to accurately capture nuances in our data. We can achieve this sort of scaling by simply setting our bins to the square root of the number of rows in our dataset. In the example above, we had records for 400 mice and the square root of 400 is 20, so I set the number of bins equal to 20.
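The square-root rule is easy to express in code. In this minimal sketch, the helper name `sqrt_bins` is hypothetical, and rounding up when the row count is not a perfect square is my assumption, since the rule as stated above does not say how to round.

```python
import math


def sqrt_bins(n_rows: int) -> int:
    """Return a bin count equal to the square root of the row count.

    Rounds up for non-perfect squares (an assumption) and never
    returns fewer than one bin.
    """
    return max(1, math.ceil(math.sqrt(n_rows)))


print(sqrt_bins(400))  # 20, matching the 400-mouse example
```

The result can be passed straight to matplotlib as the bins argument.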
There are many methods for selecting an optimal number of bins, and you can read more about them here. Still, it's very important to experiment with different bin counts, because doing so often allows us to uncover new things about our data.
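For quick experimentation, NumPy ships several named bin-selection rules that can be compared on the same data. The sketch below tries four of them on synthetic stand-in weights. Which rules the linked page covers is not something I can confirm, so treat this selection as an assumption.

```python
import numpy as np

# Synthetic stand-in for the weight column (NOT the real measurements).
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(21.0, 1.5, 200),
    rng.normal(29.5, 1.5, 200),
])

# Names of bin-selection rules accepted by NumPy's histogram functions.
rules = ("sqrt", "sturges", "rice", "fd")

bin_counts = {}
for rule in rules:
    edges = np.histogram_bin_edges(weights, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges define n - 1 bins
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

The same rule names can also be passed as the bins argument to matplotlib's hist, which makes it cheap to redraw the plot under each rule and compare.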
As always, thanks for reading! Please leave a comment below if you found anything confusing and I'll do my best to clear things up for you. This post was meant to be a general introduction to histograms; however, keep an eye out for an upcoming post about creating and manipulating histograms using Python. We'll take a deeper dive into what matplotlib can do with histograms!