Rice University Rule to Determine the Number of Bins

DOI: 10.4236/ojs.2024.141006, PP. 119-149

Keywords: Histogram, Class Intervals, Accuracy, Distributions, Descriptive Statistics


Abstract:

This study aims to establish a rationale for the Rice University rule for determining the number of bins in a histogram, grounding it in the Scott and Freedman-Diaconis rules. It also assesses how accurately the empirical histogram reproduces the shape of the distribution with respect to three factors: the rule for determining the number of bins (square root, Sturges, Doane, Scott, Freedman-Diaconis, and Rice University), sample size, and distribution type. Three measures are used: the average distance between the empirical and theoretical histograms, the level of recognition by an expert judge, and an accuracy index composed of the two preceding measures. Means are compared using aligned rank transform analysis of variance with three fixed-effects factors: sample size (20, 35, 50, 100, 200, 500, and 1000), distribution type (10 types), and rule for determining the number of bins (6 rules). According to the accuracy index, Rice’s rule improves with increasing sample size and is independent of distribution type. It outperforms the Freedman-Diaconis rule but falls short of Scott’s rule, except with the arcsine distribution. Its profile of means resembles that of the square root rule across distributions and that of Doane’s rule across sample sizes. These profiles differ from those of the Scott and Freedman-Diaconis rules, which resemble each other. Among the six rules, Scott’s rule stands out in terms of accuracy, except for the arcsine distribution, and the square root rule is the least accurate.
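
For orientation, the six rules named in the abstract can be sketched in their common textbook forms. This is a minimal illustration, not the paper's own implementation: the function names, the constants (3.49 for Scott, 2 for Freedman-Diaconis), rounding up to the next integer, and the "inclusive" quantile method used for the interquartile range are all assumptions, and the article may define some rules differently.

```python
import math
import statistics


def sqrt_bins(n):
    """Square root rule: k = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' rule: k = ceil(1 + log2(n))."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n):
    """Rice University rule: k = ceil(2 * n^(1/3))."""
    return math.ceil(2 * n ** (1 / 3))


def _bins_from_width(data, width):
    """Turn a bin width into a bin count over the sample range."""
    if width <= 0:
        return 1  # degenerate sample (no spread): fall back to one bin
    return max(1, math.ceil((max(data) - min(data)) / width))


def scott_bins(data):
    """Scott's rule: width h = 3.49 * s * n^(-1/3), s = sample std. dev."""
    n = len(data)
    return _bins_from_width(data, 3.49 * statistics.stdev(data) * n ** (-1 / 3))


def fd_bins(data):
    """Freedman-Diaconis rule: width h = 2 * IQR * n^(-1/3).

    The quantile method is an assumption; other methods shift the IQR slightly.
    """
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return _bins_from_width(data, 2 * (q3 - q1) * n ** (-1 / 3))


def doane_bins(data):
    """Doane's rule: Sturges plus a skewness correction term.

    k = 1 + log2(n) + log2(1 + |g1| / sigma_g1), where g1 is the
    moment-based sample skewness and
    sigma_g1 = sqrt(6 (n - 2) / ((n + 1) (n + 3))).
    """
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    g1 = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))


if __name__ == "__main__":
    sample = list(range(1, 101))  # illustrative data only, not from the study
    n = len(sample)
    print("sqrt:", sqrt_bins(n), "Sturges:", sturges_bins(n), "Rice:", rice_bins(n))
    print("Doane:", doane_bins(sample), "Scott:", scott_bins(sample),
          "Freedman-Diaconis:", fd_bins(sample))
```

The square root, Sturges, and Rice rules depend only on the sample size, whereas Doane, Scott, and Freedman-Diaconis also use the data (skewness, standard deviation, and interquartile range, respectively), which is one reason their behavior can differ across distribution types.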

