Weighted kappa is a widely used statistic for summarizing inter-rater agreement on a categorical scale. For rating scales with three categories, there are seven versions of weighted kappa. It is shown analytically how these weighted kappas are related. Several conditional equalities and inequalities between the weighted kappas are derived. The analysis indicates that the weighted kappas measure the same thing, but to a different extent. One cannot, therefore, use the same magnitude guidelines for all weighted kappas.

1. Introduction

In biomedical, behavioral, and engineering research, it is frequently required that a group of objects is rated on a categorical scale by two observers. Examples are the following: clinicians who classify the extent of disease in patients; pathologists who rate the severity of lesions from scans; and experts who classify production faults. Analysis of the agreement between the two observers can be used to assess the reliability of the rating system. High agreement would indicate consensus in the diagnosis and interchangeability of the observers.

Various authors have proposed statistical methodology for analyzing agreement. For example, for modeling patterns of agreement, the loglinear models proposed in Tanner and Young [1] and Agresti [2, 3] can be used. In practice, however, researchers are frequently interested only in a single number that quantifies the degree of agreement between the raters [4, 5]. Various statistics have been proposed in the literature [6, 7], but the most popular statistic for summarizing rater agreement is the weighted kappa introduced by Cohen [8]. Weighted kappa allows the use of weighting schemes to describe the closeness of agreement between categories. Each weighting scheme defines a different version, or special case, of weighted kappa.

Different weighting schemes have been proposed for the various scale types. In this paper, we consider only scales with three categories. This is the smallest number of categories for which we can distinguish three types of categorical scales, namely, nominal scales, continuous-ordinal scales, and dichotomous-ordinal scales [9]. A dichotomous-ordinal scale contains a point of “absence” and two points of “presence”, for example, no disability, moderate disability, or severe disability. A continuous-ordinal scale does not have a point of “absence”; it can be described by three categories of “presence”, for example, low, moderate, or high. Identity weights are used when the categories are nominal [10]. In this case, weighted kappa becomes the unweighted kappa, better known as Cohen's kappa [11].
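To make the role of the weighting scheme concrete, the following is a minimal computational sketch in Python with NumPy. It assumes the standard definition of Cohen's weighted kappa with agreement weights, kappa_w = (p_o - p_e)/(1 - p_e), where p_o is the weighted observed agreement and p_e is the weighted agreement expected under independent raters. The function name weighted_kappa, the 3 x 3 table, and the particular weight matrices are illustrative assumptions, not taken from this paper.

import numpy as np

def weighted_kappa(table, weights):
    """Cohen's weighted kappa for a square agreement table.

    table   : k x k counts (rows = observer A, columns = observer B)
    weights : k x k agreement weights, 1 on the diagonal and values
              in [0, 1] off the diagonal
    """
    p = np.array(table, dtype=float)
    p /= p.sum()                             # joint proportions p_ij
    w = np.asarray(weights, dtype=float)
    row, col = p.sum(axis=1), p.sum(axis=0)  # marginal proportions
    p_o = (w * p).sum()                      # weighted observed agreement
    p_e = (w * np.outer(row, col)).sum()     # weighted expected agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 3 x 3 agreement table (illustrative numbers only)
table = [[20, 5, 1],
         [4, 15, 6],
         [2, 3, 24]]

# Identity weights: weighted kappa reduces to the unweighted (Cohen's) kappa
identity = np.eye(3)

# Linear weights for an ordinal three-category scale: w_ij = 1 - |i - j| / 2
linear = 1.0 - np.abs(np.subtract.outer(np.arange(3), np.arange(3))) / 2.0

print(weighted_kappa(table, identity))  # unweighted kappa
print(weighted_kappa(table, linear))    # linearly weighted kappa

For most tables the two calls return different values, which illustrates the point made above: each weighting scheme defines its own special case of weighted kappa, and the resulting numbers are generally not interchangeable.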
References
[1] M. A. Tanner and M. A. Young, “Modeling ordinal scale disagreement,” Psychological Bulletin, vol. 98, no. 2, pp. 408–415, 1985.
[2] A. Agresti, “A model for agreement between ratings on an ordinal scale,” Biometrics, vol. 44, no. 2, pp. 539–548, 1988.
[3] A. Agresti, Analysis of Ordinal Categorical Data, John Wiley & Sons, Hoboken, NJ, USA, 2nd edition, 2010.
[4] P. Graham and R. Jackson, “The analysis of ordinal agreement data: beyond weighted kappa,” Journal of Clinical Epidemiology, vol. 46, no. 9, pp. 1055–1062, 1993.
[5] M. Maclure and W. C. Willett, “Misinterpretation and misuse of the kappa statistic,” American Journal of Epidemiology, vol. 126, no. 2, pp. 161–169, 1987.
[6] J. de Mast and W. N. van Wieringen, “Measurement system analysis for categorical measurements: agreement and kappa-type indices,” Journal of Quality Technology, vol. 39, no. 3, pp. 191–202, 2007.
[7] M. J. Warrens, “Inequalities between kappa and kappa-like statistics for k × k tables,” Psychometrika, vol. 75, no. 1, pp. 176–185, 2010.
[8] J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[9] D. V. Cicchetti, “Assessing inter-rater reliability for rating scales: resolving some basic issues,” British Journal of Psychiatry, vol. 129, no. 11, pp. 452–456, 1976.
[10] M. J. Warrens, “Cohen's kappa is a weighted average,” Statistical Methodology, vol. 8, no. 6, pp. 473–484, 2011.
[11] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
[12] S. Vanbelle and A. Albert, “A note on the linearly weighted kappa coefficient for ordinal scales,” Statistical Methodology, vol. 6, no. 2, pp. 157–163, 2009.
[13] M. J. Warrens, “Cohen's linearly weighted kappa is a weighted average of 2 × 2 kappas,” Psychometrika, vol. 76, no. 3, pp. 471–486, 2011.
[14] J. L. Fleiss and J. Cohen, “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability,” Educational and Psychological Measurement, vol. 33, pp. 613–619, 1973.
[15] M. J. Warrens, “Some paradoxical results for the quadratically weighted kappa,” Psychometrika, vol. 77, no. 2, pp. 315–323, 2012.
[16] L. M. Hsu and R. Field, “Interrater agreement measures: comments on kappa_n, Cohen's kappa, Scott's π, and Aickin's α,” Understanding Statistics, vol. 2, pp. 205–219, 2003.
[17] E. Bashkansky, T. Gadrich, and D. Knani, “Some metrological aspects of the comparison between two ordinal measuring systems,” Accreditation and Quality Assurance, vol. 16, no. 2, pp. 63–72, 2011.
[18] J. de Mast, “Agreement and kappa-type indices,” The American Statistician, vol. 61, no. 2, pp. 148–153, 2007.
[19] J. S. Uebersax, “Diversity of decision-making models and the measurement of interrater agreement,” Psychological Bulletin, vol. 101, no. 1, pp. 140–146, 1987.
[20] W. D. Perreault and L. E. Leigh, “Reliability of nominal data based on qualitative judgments,” Journal of Marketing Research, vol. 26, pp. 135–148, 1989.
[21] M. J. Warrens, “Conditional inequalities between Cohen's kappa and weighted kappas,” Statistical Methodology, vol. 10, pp. 14–22, 2013.
[22] M. J. Warrens, “Weighted kappa is higher than Cohen's kappa for tridiagonal agreement tables,” Statistical Methodology, vol. 8, no. 2, pp. 268–272, 2011.
[23] M. J. Warrens, “Cohen's quadratically weighted kappa is higher than linearly weighted kappa for tridiagonal agreement tables,” Statistical Methodology, vol. 9, no. 3, pp. 440–444, 2012.
[24] S. I. Anderson, A. M. Housley, P. A. Jones, J. Slattery, and J. D. Miller, “Glasgow outcome scale: an inter-rater reliability study,” Brain Injury, vol. 7, no. 4, pp. 309–317, 1993.
[25] C. S. Martin, N. K. Pollock, O. G. Bukstein, and K. G. Lynch, “Inter-rater reliability of the SCID alcohol and substance use disorders section among adolescents,” Drug and Alcohol Dependence, vol. 59, no. 2, pp. 173–176, 2000.
[26] R. L. Spitzer, J. Cohen, J. L. Fleiss, and J. Endicott, “Quantification of agreement in psychiatric diagnosis. A new approach,” Archives of General Psychiatry, vol. 17, no. 1, pp. 83–87, 1967.
[27] J. S. Simonoff, Analyzing Categorical Data, Springer, New York, NY, USA, 2003.
[28] P. E. Castle, A. T. Lorincz, I. Mielzynska-Lohnas et al., “Results of human papillomavirus DNA testing with the hybrid capture 2 assay are reproducible,” Journal of Clinical Microbiology, vol. 40, no. 3, pp. 1088–1090, 2002.
[29] D. Cicchetti and T. Allison, “A new procedure for assessing reliability of scoring EEG sleep recordings,” The American Journal of EEG Technology, vol. 11, pp. 101–110, 1971.
[30] D. V. Cicchetti, “A new measure of agreement between rank ordered variables,” in Proceedings of the Annual Convention of the American Psychological Association, vol. 7, pp. 17–18, 1972.
[31] M. J. Warrens, “Cohen's weighted kappa with additive weights,” Advances in Data Analysis and Classification, vol. 7, pp. 41–55, 2013.
[32] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice, The MIT Press, Cambridge, Mass, USA, 1975.
[33] J. L. Fleiss, J. Cohen, and B. S. Everitt, “Large sample standard errors of kappa and weighted kappa,” Psychological Bulletin, vol. 72, no. 5, pp. 323–327, 1969.
[34] M. J. Warrens, “Cohen's kappa can always be increased and decreased by combining categories,” Statistical Methodology, vol. 7, no. 6, pp. 673–677, 2010.
[35] M. J. Warrens, “Cohen's linearly weighted kappa is a weighted average,” Advances in Data Analysis and Classification, vol. 6, no. 1, pp. 67–79, 2012.
[36] D. V. Cicchetti and S. A. Sparrow, “Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior,” American Journal of Mental Deficiency, vol. 86, no. 2, pp. 127–137, 1981.
[37] P. E. Crewson, “Reader agreement studies,” American Journal of Roentgenology, vol. 184, no. 5, pp. 1391–1397, 2005.
[38] J. R. Landis and G. G. Koch, “A one-way components of variance model for categorical data,” Biometrics, vol. 33, no. 4, pp. 671–679, 1977.