- Published: 15 June 2017

**authors:** Eldad Haber, Lars Ruthotto**journal:** (submitted)**publication year:** (2017)**links:** arxiv (preprint)

**abstract:** Deep neural networks have become invaluable tools for supervised machine learning, e.g., in classification of text or images. While offering superior flexibility to find and express complicated patterns in data, deep architectures are known to be challenging to design and train so that they generalize well to new data. An important issue are numerical instabilities in derivative-based learning algorithms commonly called exploding or vanishing gradients. In this paper we propose new forward propagation techniques inspired by systems of Ordinary Differential Equations (ODE) that overcome this challenge and lead to well-posed learning problems for arbitrarily deep networks. The backbone of our approach is interpreting deep learning as a parameter estimation problem of a nonlinear dynamical system. Given this formulation we analyze stability and well-posedness of deep learning and motivated by our findings develop new architectures. We relate the exploding and vanishing gradient phenomenon to the stability of the discrete ODE and present several strategies for stabilizing deep learning for very deep networks. While our new architectures restrict the solution space, several numerical experiments show their competitiveness to state-of-the-art networks.

- Published: 11 June 2017

**authors:** Stéphane Mallat**journal:** Philisophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences**publication year:** 2016**links:** arxiv (preprint), Royal Society Publishing

**abstract:** Deep convolutional networks provide state of the art classifications and regressions results over many high-dimensional problems. We review their architecture, which scatters data with a cascade of linear filter weights and non-linearities. A mathematical framework is introduced to analyze their properties. Computations of invariants involve multiscale contractions, the linearization of hierarchical symmetries, and sparse separations. Applications are discussed.

- Published: 11 June 2017

**authors:** Kurt Hornik, Maxwell Stinchcombe, Halbert White**journal:** Neural Networks**publication year:** 1989**links:** Semantic Scholar, ScienceDirect

**abstract:** This paper rigorously establishes that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

- Published: 11 June 2017

**authors:** Stéphane Mallat, Xiuyuan Cheng, Xu Chen**journal:** Information and Inference: A Journal of the IMA**publication year:** 2016**links:** arxiv (preprint), journal

**abstract:** An orthogonal Haar scattering transform is a deep network, computed with a hierarchy of additions, subtractions and absolute values, over pairs of coefficients. It provides a simple mathematical model for unsupervised deep network learning. It implements non-linear contractions, which are optimized for classification, with an unsupervised pair matching algorithm, of polynomial complexity. A structured Haar scattering over graph data computes permutation invariant representations of groups of connected points in the graph. If the graph connectivity is unknown, unsupervised Haar pair learning can provide a consistent estimation of connected dyadic groups of points. Classification results are given on image data bases, defined on regular grids or graphs, with a connectivity which may be known or unknown.

- Published: 11 June 2017

**authors:** G. Cybenko**journal:** Mathematics of Control, Signals and Systems**publication year:** 1989**links:** Carnegie Mellon University, SpringerLink

**abstract:** In this paper we demonstrate that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function ofn real variables with support in the unit hypercube; only mild conditions are imposed on the univariate function. Our results settle an open question about representability in the class of single hidden layer neural networks. In particular, we show that arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity. The paper discusses approximation properties of other possible types of nonlinearities that might be implemented by artificial neural networks.

- Published: 11 June 2017

**authors:** Thomas Wiatowski, Helmut Bölcskei**journal:** Proceedings of IEEE International Symposium on Information Theory (ISIT)**publication year:** 2015**links:** arxiv (preprint), IEEE

**abstract:** Deep convolutional neural networks have led to breakthrough results in practical feature extraction applications. The mathematical analysis of these networks was pioneered by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on identical semi-discrete wavelet frames in each network layer, and proved translation-invariance as well as deformation stability of the resulting feature extractor. The purpose of this paper is to develop Mallat's theory further by allowing for different and, most importantly, general semi-discrete frames (such as, e.g., Gabor frames, wavelets, curvelets, shearlets, ridgelets) in distinct network layers. This allows to extract wider classes of features than point singularities resolved by the wavelet transform. Our generalized feature extractor is proven to be translation-invariant, and we develop deformation stability results for a larger class of deformations than those considered by Mallat. For Mallat's wavelet-based feature extractor, we get rid of a number of technical conditions. The mathematical engine behind our results is continuous frame theory, which allows us to completely detach the invariance and deformation stability proofs from the particular algebraic structure of the underlying frames.

- Published: 11 June 2017

**authors:** Guido Montufar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio**journal:** Advances in Neural Information Processing Systems**publication year:** 2014**links:** arxiv (preprint), NIPS proceedings

**abstract:** We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.

- Published: 11 June 2017

**authors:** Ashish Bora, Ajil Jalal, Eric Price, Alexandros G. Dimakis**journal:** International Conference on Machine Learning (ICML)**publication year:** 2017**links:** arxiv (preprint)

**abstract:** The goal of compressed sensing is to estimate a vector from an underdetermined system of noisy linear measurements, by making use of prior knowledge on the structure of vectors in the relevant domain. For almost all results in this literature, the structure is represented by sparsity in a well-chosen basis. We show how to achieve guarantees similar to standard compressed sensing but without employing sparsity at all. Instead, we suppose that vectors lie near the range of a generative model $G: \mathbb{R}^k \to \mathbb{R}^n$. Our main theorem is that, if $G$ is $L$-Lipschitz, then roughly $O(k \log L)$ random Gaussian measurements suffice for an $\ell_2/\ell_2$ recovery guarantee. We demonstrate our results using generative models from published variational autoencoder and generative adversarial networks. Our method can use $5$-$10$x fewer measurements than Lasso for the same accuracy.

**note:** The source code as well as the *trained* model are made publicly available here.