Scott's rule explained
Scott's rule is a method for selecting the bin width, and hence the number of bins, of a histogram.[1] Scott's rule is widely employed in data analysis software including R,[2] Python[3] and Microsoft Excel; in Excel it is the default bin selection method.[4]
For a set of $n$ observations $x_i$, let $\hat{f}(x)$ be the histogram approximation of some function $f(x)$. The integrated mean squared error (IMSE) is
$$\mathrm{IMSE} = E\left[\int_{-\infty}^{\infty} dx\, (\hat{f}(x) - f(x))^2\right]$$
where $E[\cdot]$ denotes the expectation across many independent draws of $n$ data points. By Taylor expanding to first order in $h$, the bin width, Scott showed that the optimal width is
$$h^* = \left(6 \Big/ \int_{-\infty}^{\infty} f'(x)^2\, dx\right)^{1/3} n^{-1/3}$$
This formula is also the basis for the Freedman–Diaconis rule.
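As a sketch of where this width comes from: to leading order, the IMSE of a histogram with bin width $h$ splits into a variance term that shrinks as bins widen and a squared-bias term that grows. The shorthand $R(f')$ for the roughness integral is notation assumed here, and higher-order terms are dropped.

```latex
\begin{align*}
\mathrm{IMSE}(h) &\approx \frac{1}{nh} + \frac{h^2}{12} R(f'),
  \qquad R(f') = \int_{-\infty}^{\infty} f'(x)^2 \, dx \\
\frac{d}{dh}\,\mathrm{IMSE}(h) &= -\frac{1}{nh^2} + \frac{h}{6} R(f') = 0 \\
\Rightarrow \quad h^* &= \left(\frac{6}{R(f')}\right)^{1/3} n^{-1/3} \\
\mathrm{IMSE}(h^*) &\approx \left(\frac{3}{4}\right)^{2/3} R(f')^{1/3}\, n^{-2/3}
\end{align*}
```

The last line follows from substituting $h^*$ back in: at the optimum the variance term is twice the bias term, so the total is $h^{*2} R(f')/4$.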
By taking a normal reference, i.e. assuming that $f(x)$ is a normal distribution, the equation for $h^*$ becomes
$$h^* = \left(24\sqrt{\pi}\right)^{1/3} \sigma n^{-1/3} \approx 3.5\, \sigma n^{-1/3}$$
where $\sigma$ is the standard deviation of the normal distribution and is estimated from the data. With this value of bin width Scott demonstrates that[5]
$$\mathrm{IMSE} \propto n^{-2/3}$$
showing how quickly the histogram approximation approaches the true distribution as the number of samples increases.
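The normal reference rule is simple enough to compute directly. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the use of the sample standard deviation as the estimate of $\sigma$ is an assumption (software packages may estimate it differently). Turning the width into a bin count by covering the data range with equal-width bins is likewise one possible convention.

```python
import math
import statistics

# (24 * sqrt(pi))^(1/3), the constant in the normal reference rule (about 3.49)
SCOTT_CONSTANT = (24.0 * math.sqrt(math.pi)) ** (1.0 / 3.0)


def scott_bin_width(data):
    """Normal reference bin width: SCOTT_CONSTANT * sigma * n^(-1/3)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations to estimate sigma")
    # Assumption: sigma is estimated with the sample standard deviation.
    sigma = statistics.stdev(data)
    return SCOTT_CONSTANT * sigma * n ** (-1.0 / 3.0)


def scott_bin_count(data):
    """Number of equal-width bins of that width needed to cover the data range."""
    width = scott_bin_width(data)
    if width == 0:
        return 1  # all observations identical, a single bin suffices
    return max(1, math.ceil((max(data) - min(data)) / width))


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    print("bin width:", scott_bin_width(sample))
    print("bin count:", scott_bin_count(sample))
```

Because the width scales as $n^{-1/3}$, multiplying the sample size by eight only halves the bin width.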
Terrell–Scott rule
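Another approach developed by Terrell and Scott[6] is based on the observation that, among all densities $g(x)$ defined on a compact interval, say $|x| \leq 1/2$, with derivatives which are absolutely continuous, the density which minimises $\int_{-\infty}^{\infty} (g^{(k)}(x))^2\, dx$ is
$$f_k(x) = \begin{cases}
\frac{(2k+1)!}{2^{2k}(k!)^2}(1-4x^2)^k, & |x| \leq 1/2 \\
0, & |x| > 1/2
\end{cases}$$
Using this with $k=1$ in the expression for $h^*$ gives an upper bound on the value of bin width which, on this unit-length interval, is
$$h^*_{TS} = \left(\frac{1}{2n}\right)^{1/3}$$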
So, for functions satisfying the continuity conditions, at least
$$k_{TS} = (2n)^{1/3}$$
bins should be used.[7] This rule is also called the oversmoothed rule or the Rice rule,[8] so called because both authors worked at Rice University. The Rice rule is often reported with the factor of 2 outside the cube root, $2n^{1/3}$, and may be considered a different rule. The key difference from Scott's rule is that this rule does not assume the data is normally distributed, and the number of bins depends only on the number of samples, not on any other properties of the data.
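To make the bound concrete, the $k=1$ case can be worked by hand. This is a sketch that assumes $f_1$ is normalised to integrate to one on $|x| \leq 1/2$ and uses Scott's optimal-width formula $h^* = (6/\int f'(x)^2\, dx)^{1/3} n^{-1/3}$; the interval endpoints $a$ and $b$ in the last line are notation introduced here.

```latex
\begin{align*}
f_1(x) &= \tfrac{3}{2}\,(1 - 4x^2), \qquad f_1'(x) = -12x, \qquad |x| \leq 1/2 \\
\int_{-1/2}^{1/2} f_1'(x)^2 \, dx &= 144 \int_{-1/2}^{1/2} x^2 \, dx = 12 \\
h^* &\leq \left(\frac{6}{12}\right)^{1/3} n^{-1/3} = (2n)^{-1/3} \\
\text{number of bins on } [a, b] &\geq \frac{b - a}{(b - a)\,(2n)^{-1/3}} = (2n)^{1/3}
\end{align*}
```

Since $f_1$ makes the integral of the squared derivative as small as possible, any other admissible density gives a smaller optimal width. Rescaling the unit interval to $[a, b]$ stretches the width bound by $b - a$, so the interval length cancels in the bin count.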
In general $(2n)^{1/3}$ is not an integer so $\lceil (2n)^{1/3} \rceil$ is used, where $\lceil \cdot \rceil$ denotes the ceiling function.
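The two bin counts discussed in this section can be compared with a short Python sketch. The function names are illustrative. Integer arithmetic is used for the ceiling of the cube root, an implementation choice made here so that floating-point rounding cannot shift the result when $2n$ is a perfect cube.

```python
def _ceil_cube_root(m):
    """Smallest integer k with k**3 >= m, for a positive integer m."""
    k = max(1, round(m ** (1.0 / 3.0)))  # floating-point first guess
    while k ** 3 < m:  # guess too small: step up
        k += 1
    while k > 1 and (k - 1) ** 3 >= m:  # guess too large: step down
        k -= 1
    return k


def terrell_scott_bins(n):
    """Terrell-Scott (oversmoothed) bin count: ceil((2n)^(1/3))."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return _ceil_cube_root(2 * n)


def rice_bins(n):
    """Variant with the 2 outside the cube root: ceil(2 * n^(1/3))."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # 2 * n^(1/3) equals (8n)^(1/3), so the same helper applies.
    return _ceil_cube_root(8 * n)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, terrell_scott_bins(n), rice_bins(n))
```

For $n = 100$ the first form gives 6 bins and the second gives 10, which illustrates why the two are sometimes treated as different rules.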
Notes and References
- Scott, David W. (1979). "On optimal and data-based histograms". Biometrika. 66 (3): 605–610. doi:10.1093/biomet/66.3.605.
- Web site: "Hist function - RDocumentation".
- Web site: "Numpy.histogram_bin_edges — NumPy v2.1 Manual".
- Web site: "Excel:Create a histogram".
- Scott, D. W. (2010). "Scott's rule". Wiley Interdisciplinary Reviews: Computational Statistics. 2 (4): 497–502.
- Terrell, G. R.; Scott, D. W. (1985). "Oversmoothed nonparametric density estimates". Journal of the American Statistical Association. 80 (389): 209–214.
- Scott, D. W. (2009). "Sturges' rule". WIREs Computational Statistics. 1 (3): 303–306. doi:10.1002/wics.35. 197483064.
- Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University (chapter 2 "Graphing Distributions", section "Histograms")