|
Principal Component Analysis for Authorship Attribution

Keywords: principal components, authorship attribution, stylometry, text categorization, function words, classification task, stylistic features, syntactic characteristics

Abstract: A common problem in statistical pattern recognition is that of feature selection or feature extraction. Feature selection refers to a process whereby a data space is transformed into a feature space that, in theory, has exactly the same dimension as the original data space. However, the transformation is designed in such a way that the data set may be represented by a reduced number of "effective" features and yet retain most of the intrinsic information content of the data; in other words, the data set undergoes a dimensionality reduction. In this paper, the data collected by counting words and characters in around a thousand paragraphs of each sample book undergo a principal component analysis performed using neural networks. The first principal component is then used to distinguish the books written by a given author.
|
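As an illustration of how a first principal component might be extracted with a neural network, the following minimal sketch trains a single linear neuron with Oja's Hebbian learning rule. The abstract does not name the network or learning rule actually used, so the choice of Oja's rule is an assumption, as are the function names `first_pc_oja` and `project`, the learning rate, and the number of epochs. The input `X` stands for a hypothetical matrix with one row per paragraph and one column per word or character count.

```python
import numpy as np


def first_pc_oja(X, lr=0.002, epochs=20, seed=0):
    """Estimate the first principal component of X with Oja's rule.

    X is an (n_samples, n_features) array. The learning rate and epoch
    count are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)              # centre each feature at zero
    w = rng.normal(size=Xc.shape[1])     # random initial weight vector
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        # Present the samples one at a time in a shuffled order.
        for x in Xc[rng.permutation(len(Xc))]:
            y = w @ x                    # output of the linear neuron
            w += lr * y * (x - y * w)    # Hebbian term with Oja's decay
    return w / np.linalg.norm(w)         # unit-length direction


def project(X, w):
    """Score each row of X along the direction w (after centring)."""
    return (X - X.mean(axis=0)) @ w
```

Under these assumptions, `project(X, first_pc_oja(X))` yields one score per paragraph, and the scores of paragraphs from different books can then be compared. The sign of the returned direction is arbitrary, so only the relative positions of the scores are meaningful.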