%0 Journal Article
%T Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
%A Djork-Arné Clevert
%A Thomas Unterthiner
%A Sepp Hochreiter
%J Computer Science
%D 2015
%I arXiv
%X We introduce the "exponential linear unit" (ELU), which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs avoid a vanishing gradient via the identity for positive values. However, ELUs have improved learning characteristics compared to units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero. Mean activations close to zero speed up learning because they bring the gradient closer to the unit natural gradient. We show that the unit natural gradient differs from the normal gradient by a bias shift term, which is proportional to the mean activation of incoming units. Like batch normalization, ELUs push the mean towards zero, but with a significantly smaller computational footprint. While other activation functions like LReLUs and PReLUs also have negative values, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value for strongly negative inputs and thereby decrease the propagated variation and information. Therefore ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. Consequently, dependencies between ELUs are much easier to model and distinct concepts are less likely to interfere. We found that ELUs lead not only to faster learning, but also to better generalization performance once networks have many layers (>4). ELU networks were among the top 10 reported CIFAR-10 results and yielded the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELUs considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single-crop, single-model network.
%U http://arxiv.org/abs/1511.07289v2
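
A minimal sketch of the activation summarized in the abstract, assuming the standard ELU definition f(x) = x for x > 0 and f(x) = α(exp(x) − 1) otherwise, with α a hyperparameter (commonly 1); this is an illustrative implementation, not code from the paper:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, saturating to -alpha for large negative x."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Negative inputs give bounded negative outputs (> -alpha), which is what lets
# mean unit activations move closer to zero, as the abstract describes.
print(elu(np.array([-3.0, -1.0, 0.0, 2.0])))
```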