A weighted version of Bennett, Alpert, and Goldstein's S is studied. It is shown that the special cases of this weighted coefficient are often ordered in the same way. It is also shown that many of its special cases tend to produce values close to unity, especially when the number of categories of the rating scale is large. It is argued that the application of the coefficient as an agreement measure is not without difficulties.

1. Introduction

In behavioral and biomedical science it is frequently required to measure the intensity of a behavior or a disease. Examples are the degree of arousal of a speech-anxious participant while giving a presentation, the severity of lesions from scans, or the severity of sedation during opioid administration for pain management. The intensity of these phenomena is usually classified by a single observer using a rating scale with ordered categories, for example, mild, moderate, or severe. To ensure that the observer fully understands what he or she is asked to rate, the categories must be clearly defined. To measure the reliability of the rating scale, researchers typically ask two observers to independently rate the same set of subjects. Analysis of the agreement between the observers can then be used to assess the reliability of the scale. High agreement between the ratings of the observers usually indicates consensus in the diagnosis and interchangeability of the classifications of the observers.

Various statistical methodologies have been developed for assessing agreement on an ordinal scale. For example, the loglinear models presented in Tanner and Young [1] and Agresti [2, 3] can be used to analyze the patterns of agreement and potential sources of disagreement. Applications of these models can be found in Becker [4] and Graham and Jackson [5]. In practice, however, researchers are usually interested only in a coefficient that (roughly) summarizes the agreement in a single number. The most commonly used coefficient for summarizing agreement on an ordinal scale is the weighted kappa proposed in Cohen [6] (see also [5, 7]). Cohen [8] proposed the coefficient kappa as an index of agreement for rating scales with nominal (unordered) categories [9]. The coefficient corrects for agreement due to chance. Weighted kappa extends Cohen's original kappa to rating scales with ordered categories. In the latter case there is usually more disagreement between the observers on adjacent categories than on categories that are further apart. With weighted kappa it is possible to describe this closeness between categories using weights. Both kappa and
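For concreteness, the standard definitions behind these coefficients can be sketched in a few lines of code. The following Python fragment is a minimal illustration, not taken from the paper: it computes Cohen's kappa, a weighted kappa with linear or quadratic agreement weights, and Bennett, Alpert, and Goldstein's S from a square cross-classification of two observers. The function name agreement_coefficients and the 3x3 table of counts are hypothetical.

import numpy as np

def agreement_coefficients(table, weights="linear"):
    # Cohen's kappa, weighted kappa, and Bennett, Alpert, and Goldstein's S
    # for a square cross-classification (counts) of two observers.
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                              # cell proportions
    k = p.shape[0]                               # number of categories
    rows, cols = p.sum(axis=1), p.sum(axis=0)    # marginal proportions

    # Agreement weights: 1 on the diagonal, decreasing with |i - j|.
    i, j = np.indices((k, k))
    if weights == "linear":
        w = 1.0 - np.abs(i - j) / (k - 1)
    else:                                        # quadratic weights
        w = 1.0 - (i - j) ** 2 / (k - 1) ** 2

    p_o = np.trace(p)                            # observed agreement
    p_e = rows @ cols                            # agreement expected by chance
    kappa = (p_o - p_e) / (1.0 - p_e)            # Cohen's kappa [8]
    s = (k * p_o - 1.0) / (k - 1.0)              # Bennett, Alpert, and Goldstein's S [23]

    w_o = (w * p).sum()                          # weighted observed agreement
    w_e = (w * np.outer(rows, cols)).sum()       # weighted chance agreement
    kappa_w = (w_o - w_e) / (1.0 - w_e)          # weighted kappa [6]
    return kappa, kappa_w, s

# Hypothetical 3x3 table of counts for two observers.
counts = [[20, 5, 1],
          [4, 15, 6],
          [2, 3, 14]]
print(agreement_coefficients(counts))

With weights="quadratic" the same function yields the quadratically weighted kappa; the choice of weight function is what distinguishes the linearly and quadratically weighted variants.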
References
[1]
M. A. Tanner and M. A. Young, “Modeling ordinal scale disagreement,” Psychological Bulletin, vol. 98, no. 2, pp. 408–415, 1985.
[2]
A. Agresti, “A model for agreement between ratings on an ordinal scale,” Biometrics, vol. 44, no. 2, pp. 539–548, 1988.
[3]
A. Agresti, Analysis of Ordinal Categorical Data, Wiley, Hoboken, NJ, USA, 2nd edition, 2010.
[4]
M. P. Becker, “Using association models to analyse agreement data: two examples,” Statistics in Medicine, vol. 8, no. 10, pp. 1199–1207, 1989.
[5]
P. Graham and R. Jackson, “The analysis of ordinal agreement data: beyond weighted kappa,” Journal of Clinical Epidemiology, vol. 46, no. 9, pp. 1055–1062, 1993.
[6]
J. Cohen, “Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[7]
M. Maclure and W. C. Willett, “Misinterpretation and misuse of the Kappa statistic,” The American Journal of Epidemiology, vol. 126, no. 2, pp. 161–169, 1987.
[8]
J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
[9]
M. J. Warrens, “Cohen's kappa can always be increased and decreased by combining categories,” Statistical Methodology, vol. 7, no. 6, pp. 673–677, 2010.
[10]
L. M. Hsu and R. Field, “Interrater agreement measures: comments on κn, Cohen's kappa, Scott's π and Aickin's α,” Understanding Statistics, vol. 2, pp. 205–219, 2003.
[11]
J. Sim and C. C. Wright, “The kappa statistic in reliability studies: use, interpretation, and sample size requirements,” Physical Therapy, vol. 85, no. 3, pp. 257–268, 2005.
[12]
R. L. Brennan and D. J. Prediger, “Coefficient kappa: some uses, misuses, and alternatives,” Educational and Psychological Measurement, vol. 41, pp. 687–699, 1981.
[13]
J. S. Uebersax, “Diversity of decision-making models and the measurement of interrater agreement,” Psychological Bulletin, vol. 101, no. 1, pp. 140–146, 1987.
[14]
A. R. Feinstein and D. V. Cicchetti, “High agreement but low kappa: I. the problems of two paradoxes,” Journal of Clinical Epidemiology, vol. 43, no. 6, pp. 543–549, 1990.
[15]
C. A. Lantz and E. Nebenzahl, “Behavior and interpretation of the κ statistic: resolution of the two paradoxes,” Journal of Clinical Epidemiology, vol. 49, no. 4, pp. 431–434, 1996.
[16]
J. de Mast and W. N. van Wieringen, “Measurement system analysis for categorical measurements: agreement and kappa-type indices,” Journal of Quality Technology, vol. 39, pp. 191–202, 2007.
[17]
J. de Mast, “Agreement and kappa-type indices,” The American Statistician, vol. 61, no. 2, pp. 148–153, 2007.
[18]
W. D. Thompson and S. D. Walter, “A reappraisal of the kappa coefficient,” Journal of Clinical Epidemiology, vol. 41, no. 10, pp. 949–958, 1988.
[19]
W. Vach, “The dependence of Cohen's kappa on the prevalence does not matter,” Journal of Clinical Epidemiology, vol. 58, no. 7, pp. 655–661, 2005.
[20]
A. von Eye and M. von Eye, “On the marginal dependency of Cohen's κ,” European Psychologist, vol. 13, no. 4, pp. 305–315, 2008.
[21]
H. Brenner and U. Kliebsch, “Dependence of weighted kappa coefficients on the number of categories,” Epidemiology, vol. 7, no. 2, pp. 199–202, 1996.
[22]
M. J. Warrens, “Some paradoxical results for the quadratically weighted kappa,” Psychometrika, vol. 77, no. 2, pp. 315–323, 2012.
[23]
E. M. Bennett, R. Alpert, and A. C. Goldstein, “Communications through limited-response questioning,” Public Opinion Quarterly, vol. 18, no. 3, pp. 303–308, 1954.
[24]
U. N. Umesh, R. A. Peterson, and M. T. Sauber, “Interjudge agreement and the maximum value of kappa,” Educational and Psychological Measurement, vol. 49, pp. 835–850, 1989.
[25]
G. J. Meyer, “Assessing reliability: critical corrections for a critical examination of the Rorschach comprehensive system,” Psychological Assessment, vol. 9, no. 4, pp. 480–489, 1997.
[26]
M. J. Warrens, “The effect of combining categories on Bennett, Alpert and Goldstein's S,” Statistical Methodology, vol. 9, no. 3, pp. 341–352, 2012.
[27]
J. J. Randolph, “Free-marginal multirater kappa (multirater κ free): an alternative to Fleiss' fixed-marginal multirater kappa,” in Proceedings of the Joensuu Learning and Instruction Symposium, Joensuu, Finland, 2005.
[28]
S. Janson and J. Vegelius, “On generalizations of the G index and the Phi coefficient to nominal scales,” Multivariate Behavioral Research, vol. 14, no. 2, pp. 255–269, 1979.
[29]
C. L. Janes, “An extension of the random error coefficient of agreement to N×N tables,” The British Journal of Psychiatry, vol. 134, no. 6, pp. 617–619, 1979.
[30]
J. W. Holley and J. P. Guilford, “A note on the G index of agreement,” Educational and Psychological Measurement, vol. 24, no. 4, pp. 749–753, 1964.
[31]
A. E. Maxwell, “Coefficients of agreement between observers and their interpretation,” British Journal of Psychiatry, vol. 116, pp. 651–655, 1977.
[32]
K. Krippendorff, “Association, agreement, and equity,” Quality and Quantity, vol. 21, no. 2, pp. 109–123, 1987.
[33]
K. L. Gwet, Handbook of Inter-Rater Reliability, Advanced Analytics, Gaithersburg, Md, USA, 2012.
[34]
D. Cicchetti and T. Allison, “A new procedure for assessing reliability of scoring EEG sleep recordings,” The American Journal of EEG Technology, vol. 11, pp. 101–110, 1971.
[35]
S. Vanbelle and A. Albert, “A note on the linearly weighted kappa coefficient for ordinal scales,” Statistical Methodology, vol. 6, no. 2, pp. 157–163, 2009.
[36]
M. J. Warrens, “Cohen's linearly weighted kappa is a weighted average,” Advances in Data Analysis and Classification, vol. 6, no. 1, pp. 67–79, 2012.
[37]
J. L. Fleiss and J. Cohen, “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability,” Educational and Psychological Measurement, vol. 33, pp. 613–619, 1973.
[38]
C. Schuster, “A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales,” Educational and Psychological Measurement, vol. 64, no. 2, pp. 243–253, 2004.
[39]
A. Agresti, Categorical Data Analysis, John Wiley & Sons, 1990.
[40]
Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multi-Variate Analysis: Theory and Practice, MIT Press, Cambridge, Mass, USA, 1975.
[41]
N. D. Holmquist, C. A. McMahan, and E. O. Williams, “Variability in classification of carcinoma in situ of the uterine cervix,” Obstetrical & Gynecological Survey, vol. 23, pp. 580–585, 1967.
[42]
J. R. Landis and G. G. Koch, “An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers,” Biometrics, vol. 33, pp. 363–374, 1977.
[43]
A. F. Beardon, “Sums of powers of integers,” The American Mathematical Monthly, vol. 103, no. 3, pp. 201–213, 1996.
[44]
J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, pp. 159–174, 1977.
[45]
D. V. Cicchetti and S. S. Sparrow, “Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior,” American Journal of Mental Deficiency, vol. 86, no. 2, pp. 127–137, 1981.
[46]
P. E. Crewson, “Reader agreement studies,” American Journal of Roentgenology, vol. 184, no. 5, pp. 1391–1397, 2005.
[47]
J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, Wiley-Interscience, New York, NY, USA, 3rd edition, 2003.
[48]
M. J. Warrens, “Conditional inequalities between Cohen's kappa and weighted kappas,” Statistical Methodology, vol. 10, pp. 14–22, 2013.
[49]
C. S. Martin, N. K. Pollock, O. G. Bukstein, and K. G. Lynch, “Inter-rater reliability of the SCID alcohol and substance use disorders section among adolescents,” Drug and Alcohol Dependence, vol. 59, no. 2, pp. 173–176, 2000.
[50]
J. S. Simonoff, Analyzing Categorical Data, Springer, New York, NY, USA, 2003.
[51]
S. I. Anderson, A. M. Housley, P. A. Jones, J. Slattery, and J. D. Miller, “Glasgow outcome scale: an inter-rater reliability study,” Brain Injury, vol. 7, no. 4, pp. 309–317, 1993.
[52]
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, A Handbook of Small Data Sets, Chapman & Hall, London, UK, 1994.
[53]
M. Némethy, L. Paroli, P. G. Williams-Russo, and T. J. J. Blanck, “Assessing sedation with regional anesthesia: inter-rater agreement on a modified Wilson sedation scale,” Anesthesia and Analgesia, vol. 94, no. 3, pp. 723–728, 2002.
[54]
J. M. Seddon, C. R. Sahagian, R. J. Glynn et al., “Evaluation of an iris color classification system,” Investigative Ophthalmology and Visual Science, vol. 31, no. 8, pp. 1592–1598, 1990.
[55]
R. W. Bohannon and M. B. Smith, “Interrater reliability of a modified Ashworth scale of muscle spasticity,” Physical Therapy, vol. 67, no. 2, pp. 206–207, 1987.
[56]
V. A. J. Maria and R. M. M. Victorino, “Development and validation of a clinical scale for the diagnosis of drug-induced hepatitis,” Hepatology, vol. 26, no. 3, pp. 664–669, 1997.