%0 Journal Article
%T Interpol: An R package for preprocessing of protein sequences
%A Dominik Heider
%A Daniel Hoffmann
%J BioData Mining
%D 2011
%I BioMed Central
%R 10.1186/1756-0381-4-16
%X The software "Interpol" encodes amino acid sequences as numerical descriptor vectors using a database of currently 532 descriptors (mainly from AAindex), and normalizes sequences to uniform length with one of five linear or non-linear interpolation algorithms. Interpol is distributed as an open-source, platform-independent R package. It is typically used for preprocessing of amino acid sequences for classification or regression. The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and it will in many cases improve their performance in classification and regression. Machine learning techniques have been widely applied to biological sequences to gain insights into biological function, for instance by Rost and Sander [1], Dubchak et al. [2], Karchin et al. [3] and Nielsen et al. [4]. Nanni and Lumini [5] found that classifiers based on numerically encoded amino acid sequences perform better than classifiers based on the commonly used standard orthonormal representation, i.e. a vector containing twenty indicator variables (one for each amino acid) for each sequence position, resulting in a matrix containing the amino acid distributions for each position in the input sequence. For numerical encoding, each amino acid (or nucleotide) of a sequence is mapped to a numerical descriptor value, such as hydropathy [6], molecular weight, or isoelectric point. One major limitation of almost all machine learning algorithms is their fixed input dimension, which makes them incapable of handling data that vary in dimension. This restriction is unsuitable for many biological applications, in which sequence deletions and insertions are common. We have developed a preprocessing approach for machine learning that combines the use of numerical descriptor values with a normalization of sequences to a fixed length by numerical interpolation. 
This procedure has already been applied to coreceptor usage prediction
%U http://www.biodatamining.org/content/4/1/16