%0 Journal Article
%T Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning
%A Gregor Wiedemann
%J Social Science Computer Review
%@ 1552-8286
%D 2019
%R 10.1177/0894439318758389
%X Supervised machine learning is a promising methodological innovation for content analysis (CA) to approach the challenge of ever-growing amounts of text in the digital era. Social scientists have pointed to accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification, for example, allows for time-series analysis of diachronic data sets or correlation of categories with text-external covariates. We evaluate the performance of two common approaches for this goal: a method based on regression analysis with feature profiles from entire collections and a method aggregating classifier decisions for individual documents. For both, we observed a significant negative effect on classification performance due to the uneven distribution of characteristic language structures within the text collection. For proportional classification, this poses considerable problems. To fix this problem, we propose a workflow of active learning, which alternates between machine learning and human coding. Results from experiments with empirical data (political manifestos) demonstrate that active learning enables researchers to create training sets for automatic CA efficiently, reliably, and with high accuracy for the desired goal while retaining control over the automatic process.
%K content analysis
%K active learning
%K proportional classification
%K text classification
%K text as data
%K supervised machine learning
%K computer-assisted content analysis
%K computational social science
%K big data
%U https://journals.sagepub.com/doi/full/10.1177/0894439318758389