Histogram
Source: adapted from here
Introduction
A histogram is a graphical display of numerical data that uses bars of different heights to approximate the distribution of the data. The height of each bar shows how many data points fall into the corresponding range (bin), and you choose the ranges. This makes it possible to inspect the data for its underlying distribution (e.g. a normal distribution), outliers, skewness, etc.
There are several guidelines for choosing the number of bins in a histogram; for a summary, see the Histogram article on Wikipedia. Let's take a look at Sturges' formula.
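For data x, the number of bins k can be calculated from a suggested bin width w as:
$$k = \left\lceil \frac{\max x - \min x}{w} \right\rceil$$
By Sturges' formula, k can be calculated as:
$$k = \lceil \log_2 n \rceil + 1$$
where n is the total number of data points used to calculate the histogram.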
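As a quick worked example, the two bin-count rules can be sketched in q as follows. The helper names sturges and binsFromWidth are hypothetical and not part of the original exercise; the sample sizes avoid exact powers of two so that floating-point rounding in xlog does not affect the result.

```q
// Hypothetical helpers; the names are illustrative only
sturges:{[n] 1+ceiling xlog[2;n]}                 / k from the sample size n
binsFromWidth:{[x;w] ceiling (max[x]-min x)%w}    / k from a suggested bin width w

sturges 10 100 1000 10000       / 5 8 11 15
binsFromWidth[0 2.5 10f;3]      / 4
```

For the 10000 data points used in this exercise, Sturges' formula therefore suggests 15 bins.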
Question
First, let's generate some random numbers:
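genNormalNumber:{
  // Box-Muller transform: each pair of uniform samples yields two standard normal samples
  pi:acos -1;
  // Even x: generate n=x div 2 pairs; odd x: generate x+1 numbers and drop the last one
  $[x=2*n:x div 2;
    raze sqrt[-2*log n?1f]*/:(sin;cos)@\:(2*pi)*n?1f;
    -1_.z.s 1+x
  ]
 };
data:asc genNormalNumber[10000];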
The data generated above is a sorted list of random numbers.
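Before binning, it can help to check that the sample looks roughly as expected. The sketch below uses a hypothetical helper named summary (not part of the original exercise) that returns basic statistics for any numeric list; it is demonstrated on a small literal list so that it runs on its own.

```q
// Hypothetical helper: basic summary statistics of a numeric list
summary:{[x] `n`mean`sdev`minVal`maxVal!(count x;avg x;dev x;min x;max x)}

summary 1 2 3 4 5f      / n is 5, mean is 3, sdev is sqrt 2
// With the list generated above, one would call: summary data
```

For a standard normal sample, mean should come out close to 0 and sdev close to 1, although the exact values vary from run to run.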
Create a histogram using Sturges' formula. The output table should have three columns: the first column binIdx is the bin index, the second column binVal is the median value of all data points in each bin and the third column binCnt is the number of data points falling into each bin.
Answer
The suggested answer is as follows.
histogramSturges:{[data]
  // Calculate the total number of bins using Sturges' formula
  k:1+ceiling xlog[2;count data];
  // Find the min/max value of the list
  minVal:min data;
  maxVal:max data;
  // Find the lower bound of each of the k equal-width bins
  bins:minVal+((maxVal-minVal)%k)*til k;
  // Find the bin index (0 to k-1) of each data item; the maximum falls into the last bin
  binData:([] binIdx:bins bin data;data);
  // 1) Calculate the number of data points in each bin, and
  // 2) Compute the median value of all data points in each bin
  0!select binVal:med data,binCnt:count binIdx by binIdx from binData
 };
histogramSturges[data]
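The lookup from a value to its bin index relies on q's binary search keywords. As a small illustration with made-up bin edges (the name edges is hypothetical), bin returns the index of the last edge that is less than or equal to the value, while binr returns the index of the first edge that is greater than or equal to it:

```q
edges:0 10 20
edges bin 0 5 10 25     / 0 0 1 2
edges binr 0 5 10 25    / 0 1 1 3
edges bin -1            / -1, the value lies below the first edge
```

Which keyword is used decides whether a value sitting exactly on an edge belongs to the bin on its left or on its right, so it is worth checking how the minimum and maximum of the data are assigned.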
If you plot binCnt against binVal from the above output, you will get a bell-shaped curve. This suggests that our normal random number generator works as expected.
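q has no built-in plotting, so one rough way to eyeball the shape is a text bar per bin. The sketch below is only an illustration on a made-up table t; the scale factor of 10 counts per star is an arbitrary assumption and would need adjusting for real bin counts.

```q
// Toy histogram table standing in for the output of histogramSturges
t:([] binIdx:til 5; binCnt:20 50 90 50 20)
// One star per 10 data points in each bin
update bar:(binCnt div 10)#'"*" from t
```

Applied to the output of histogramSturges with a suitable scale factor, the bar column should rise towards the middle bins and fall off on both sides.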