This study aims to establish a rationale for the
Rice University rule for determining the number of bins in a histogram. The
rationale is grounded in the Scott and Freedman-Diaconis rules. Additionally, the accuracy
of the empirical histogram in reproducing the shape of the distribution is
assessed with respect to three factors: the rule for determining the number of
bins (square root, Sturges, Doane, Scott, Freedman-Diaconis, and Rice
University), sample size, and distribution type. Three measures are utilized:
the average distance between empirical and theoretical histograms, the level of
recognition by an expert judge, and the accuracy index, which is composed of the
two aforementioned measures. Mean comparisons are conducted with aligned rank
transformation analysis of variance for three fixed-effects factors: sample
size (20, 35, 50, 100, 200, 500, and 1000), distribution type (10 types), and
empirical rule to determine the number of bins (6 rules). According to the
accuracy index, the accuracy of Rice’s rule improves with
increasing sample size and is independent of distribution type. It outperforms
the Freedman-Diaconis rule but falls short of Scott’s rule, except with the
arcsine distribution. Its profile of means resembles that of the square root rule
across distributions and that of Doane’s rule across sample sizes. These
profiles differ from those of the Scott and Freedman-Diaconis rules, which
resemble each other. Among the six rules, Scott’s rule stands out in terms of
accuracy, except for the arcsine distribution, and the square root rule is the
least accurate.
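
For orientation, the six rules compared above can be sketched in a few lines of Python. This is an illustrative sketch using the commonly cited textbook forms of each rule, not necessarily the exact variants or rounding conventions used in the study. The function name `bin_counts`, the choice to round up, the use of the sample standard deviation in Scott's rule, and NumPy's default quantile method for the interquartile range are all assumptions made for illustration.

```python
import math
import numpy as np

def bin_counts(x):
    """Number of bins suggested by six common rules (illustrative sketch).

    Assumes at least 3 observations and non-constant data, so that the
    range, standard deviation, and interquartile range are all positive.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    data_range = x.max() - x.min()
    cbrt_n = np.cbrt(n)

    # Sample skewness g1 from biased central moments (used by Doane's rule),
    # together with its standard error under normality.
    d = x - x.mean()
    g1 = np.mean(d**3) / np.mean(d**2) ** 1.5
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))

    # Width-based rules: compute the bin width h, then k = range / h.
    h_scott = 3.49 * x.std(ddof=1) / cbrt_n       # Scott: 3.49 * s * n^(-1/3)
    q75, q25 = np.percentile(x, [75, 25])
    h_fd = 2.0 * (q75 - q25) / cbrt_n             # Freedman-Diaconis: 2 * IQR * n^(-1/3)

    return {
        "square_root": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(math.log2(n)) + 1,
        "doane": math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)),
        "scott": math.ceil(data_range / h_scott),
        "freedman_diaconis": math.ceil(data_range / h_fd),
        "rice": math.ceil(2 * cbrt_n),            # Rice University: 2 * n^(1/3)
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)                # fixed seed, arbitrary choice
    for n in (20, 100, 1000):
        print(n, bin_counts(rng.normal(size=n)))
```

The square root, Sturges, and Rice rules depend only on the sample size, whereas the Doane, Scott, and Freedman-Diaconis rules also depend on the data (skewness, standard deviation, and interquartile range, respectively), which is why their behavior can vary with distribution type.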