- Published: 20 June 2017

**authors:** Uri Shaham, Alexander Cloninger, Ronald R. Coifman
**journal:** Applied and Computational Harmonic Analysis
**publication year:** 2016
**links:** arxiv (preprint), ScienceDirect

**abstract:** We discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\Gamma$, the complexity of $f$ in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).
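
The ReLU-to-wavelet step is easy to see in one dimension: a weighted sum of three ReLUs gives a piecewise-linear "hat" supported on an interval, and the paper assembles such units (trapezoid-shaped, and extended to $d$ dimensions) into wavelet frames. A minimal numpy sketch, with the interval endpoints as illustrative parameters rather than the paper's exact construction:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x, a=0.0, b=1.0):
    # piecewise-linear bump supported on [a, b], built from three ReLUs;
    # it rises linearly to 1 at the midpoint and is zero outside [a, b]
    m = (a + b) / 2.0
    w = 2.0 / (b - a)
    return w * (relu(x - a) - 2.0 * relu(x - m) + relu(x - b))

x = np.linspace(-1.0, 2.0, 7)
print(hat(x))  # nonzero only inside (0, 1), with peak value 1 at 0.5
```

Dilating and translating `hat` (varying `a` and `b`) and taking differences yields the wavelet-like functions the network computes, all from sparse ReLU circuitry.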

- Published: 20 June 2017

**authors:** Allan Pinkus
**journal:** Acta Numerica
**publication year:** 1999
**links:** Cambridge University Press, Israel Institute of Technology

**abstract:** In this survey we discuss various approximation-theoretic problems that arise in the multilayer feedforward perceptron (MLP) model in neural networks. Mathematically it is one of the simpler models. Nonetheless the mathematics of this model is not well understood, and many of these problems are approximation-theoretic in character. Most of the research we will discuss is of very recent vintage. We will report on what has been done and on various unanswered questions. We will not be presenting practical (algorithmic) methods. We will, however, be exploring the capabilities and limitations of this model.

- Published: 19 June 2017

**authors:** H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen
**journal:** (submitted)
**publication year:** (2017)
**links:** arxiv (preprint)

**abstract:** We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in $L_2(\mathbb{R}^D)$. In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems, so-called affine systems, can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules. This result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of 1/α-cartoon-like functions, which is approximated optimally by α-shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates at minimal connectivity. Moreover, these results show that stochastic gradient descent actually learns approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.

- Published: 20 June 2017

**authors:** Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
**journal:** Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
**publication year:** 2016
**links:** arxiv (preprint), IEEE

**abstract:** Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
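
The core reformulation is small enough to sketch: instead of learning a mapping H(x) directly, each block learns the residual F(x) = H(x) - x and outputs x + F(x), so the identity mapping is trivially representable. A toy fully-connected version (the paper's actual blocks are convolutional with batch normalization; the weights here are illustrative):

```python
import numpy as np

def residual_block(x, W1, W2):
    # the block learns only the residual F(x) = W2 @ relu(W1 @ x)
    # and outputs x + F(x), a skip connection around the learned part
    h = np.maximum(W1 @ x, 0.0)
    return x + W2 @ h

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
zero = np.zeros((8, 8))

# with all-zero weights the block is exactly the identity mapping,
# which is why stacking many such blocks does not degrade the signal
print(np.allclose(residual_block(x, zero, zero), x))  # True
```

Because the identity is the zero point of the residual parameterization, very deep stacks start close to a benign mapping, which is the paper's intuition for why optimization gets easier.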

- Published: 15 June 2017

**authors:** Behnam Neyshabur, Ryota Tomioka, Nathan Srebro
**journal:** JMLR: Workshop and Conference Proceedings
**publication year:** 2015
**links:** JMLR

**abstract:** We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.

- Published: 20 June 2017

**authors:** Charles K. Chui, Xin Li, Hrushikesh N. Mhaskar
**journal:** Mathematics of Computation
**publication year:** 1994
**links:** jstor

**abstract:** We prove that feedforward artificial neural networks with a single hidden layer and an ideal sigmoidal response function cannot provide localized approximation in a Euclidean space of dimension higher than one. We also show that networks with two hidden layers can be designed to provide localized approximation. Since wavelet bases are most effective for local approximation, we give a discussion of the implementation of spline wavelets using multilayered networks where the response function is a sigmoidal function of order at least two.
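
The positive half of the result, localized approximation with two hidden layers, has a simple sketch: the first layer builds steep-sigmoid indicators of four half-planes, and the second layer ANDs them into a bump on the unit square. The steepness k and the threshold 3.5 below are illustrative choices, not taken from the paper:

```python
import numpy as np

def sig(z, k=50.0):
    # steep sigmoid, standing in for an ideal sigmoidal response
    return 1.0 / (1.0 + np.exp(-k * z))

def bump2d(x, y):
    # hidden layer 1: soft indicators of x > 0, x < 1, y > 0, y < 1
    s = sig(x) + sig(1 - x) + sig(y) + sig(1 - y)
    # hidden layer 2: fires only when all four indicators are on,
    # yielding a function localized on the unit square
    return sig(s - 3.5)

print(bump2d(0.5, 0.5))  # close to 1: inside the square
print(bump2d(2.0, 0.5))  # close to 0: outside the square
```

The paper's negative result is that no single-hidden-layer network with an ideal sigmoidal response can produce such a localized bump in dimension two or higher, so the second layer is essential.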

- Published: 15 June 2017

**authors:** Adam S. Charles, Dong Yin, Christopher J. Rozell
**journal:** Journal of Machine Learning Research
**publication year:** 2017
**links:** arxiv (preprint), JMLR

**abstract:** Recurrent neural networks (RNNs) have drawn interest from machine learning researchers because of their effectiveness at preserving past inputs for time-varying data processing tasks. To understand the success and limitations of RNNs, it is critical that we advance our analysis of their fundamental memory properties. We focus on echo state networks (ESNs), which are RNNs with simple memoryless nodes and random connectivity. Most existing short-term memory (STM) capacity analyses conclude that the ESN network size must scale linearly with the input size for unstructured inputs. The main contribution of this paper is to provide general results characterizing the STM capacity for linear ESNs with multidimensional input streams when the inputs have common low-dimensional structure: sparsity in a basis or significant statistical dependence between inputs. In both cases, we show that the number of nodes in the network must scale linearly with the information rate and poly-logarithmically with the input dimension. The analysis relies on advanced applications of random matrix theory and results in explicit non-asymptotic bounds on the recovery error. Taken together, this analysis provides a significant step forward in our understanding of the STM properties in RNNs.
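
The linear-scaling baseline that the paper improves on shows up already in the simplest linear ESN, a delay line: with n memoryless nodes, the state after n steps stores exactly the last n scalar inputs, so unstructured inputs need the network size to grow linearly with the memory window. A sketch (the delay-line wiring is an illustrative special case, not one of the random networks the paper analyzes):

```python
import numpy as np

n = 20                        # number of nodes = STM window for unstructured input
A = np.eye(n, k=-1)           # delay line: node i hands its value to node i+1
b = np.zeros(n); b[0] = 1.0   # the scalar input enters at node 0

rng = np.random.default_rng(0)
u = rng.standard_normal(n)    # an unstructured scalar input stream
x = np.zeros(n)
for t in range(n):
    x = A @ x + b * u[t]      # linear ESN state update

# the state now holds the last n inputs, most recent first
print(np.allclose(x, u[::-1]))  # True
```

The paper's point is that when the input stream is sparse or statistically dependent, random networks can recover it from far fewer nodes, poly-logarithmic in the input dimension, via compressed-sensing-style arguments.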

- Published: 20 June 2017

**authors:** Emmanuel J. Candès
**journal:** (Stanford University Dept. of Statistics: Technical report)
**publication year:** 1998
**links:** google books, Stanford Statistics Department

**abstract:** Single hidden-layer feedforward neural networks have been proposed as an approach to bypass the curse of dimensionality and are now becoming widely applied to approximation or prediction in applied sciences. In that approach, one approximates a multivariate target function by a sum of ridge functions; this is similar to projection pursuit in the literature of statistics. This approach poses new and challenging questions both at a practical and theoretical level, ranging from the construction of neural networks to their efficiency and capability. The topic of this thesis is to show that ridgelets, a new set of functions, provide an elegant tool to answer some of these fundamental questions.

- Published: 15 June 2017

**authors:** Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, Sanjeev Arora
**journal:** (submitted)
**publication year:** (2017)
**links:** arxiv (preprint)

**abstract:** Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks. We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with n hidden layers. A key ingredient is Barron's Theorem (Barron, 1993), which gives a Fourier criterion for approximability of a function by a neural network with 1 hidden layer. We show that a composition of n functions which satisfy certain Fourier conditions ("Barron functions") can be approximated by an (n+1)-layer neural network. For probability distributions, this translates into a criterion for a probability distribution to be approximable in Wasserstein distance---a natural metric on probability distributions---by a neural network applied to a fixed base distribution (e.g., multivariate gaussian). Building on recent lower bound work, we also give an example function that shows that composition of Barron functions is more expressive than Barron functions alone.

- Published: 20 June 2017

**authors:** Andrew R. Barron
**journal:** IEEE Transactions on Information Theory
**publication year:** 1993
**links:** IEEE, Yale Statistics

**abstract:** Approximation properties of a class of artificial neural networks are established. It is shown that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes. The approximated function is assumed to have a bound on the first moment of the magnitude distribution of the Fourier transform. The nonlinear parameters associated with the sigmoidal nodes, as well as the parameters of linear combination, are adjusted in the approximation. In contrast, it is shown that for series expansions with n terms, in which only the parameters of linear combination are adjusted, the integrated squared approximation error cannot be made smaller than order $(1/n)^{2/d}$ uniformly for functions satisfying the same smoothness assumption, where d is the dimension of the input to the function. For the class of functions examined, the approximation rate and the parsimony of the parameterization of the networks are shown to be advantageous in high-dimensional settings.
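
The O(1/n) rate has a probabilistic reading, which is the Maurey-type sampling argument behind the proof: if f is an expectation of random sigmoidal nodes, the empirical average of n sampled nodes misses f in mean squared error by roughly variance/n, independent of the input dimension. An illustrative sketch, with the node distribution chosen for convenience rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.linspace(-1.0, 1.0, 100)          # evaluation grid

# target: a close numerical proxy for an expectation over random nodes,
# i.e. a function in Barron's class by construction
M = 20_000
W, B = rng.normal(size=M), rng.normal(size=M)
target = sigmoid(np.outer(x, W) + B).mean(axis=1)

def mean_sq_err(n, trials=30):
    # approximate the target by averaging n freshly sampled nodes
    errs = []
    for _ in range(trials):
        w, b = rng.normal(size=n), rng.normal(size=n)
        approx = sigmoid(np.outer(x, w) + b).mean(axis=1)
        errs.append(np.mean((approx - target) ** 2))
    return float(np.mean(errs))

# growing n from 10 to 1000 shrinks the squared error by roughly 100x,
# as the 1/n rate predicts
print(mean_sq_err(10) / mean_sq_err(1000))
```

No training happens here: random sampling alone achieves the rate for functions in the class, which is why the bound carries no explicit dependence on the input dimension.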

- Published: 15 June 2017

**authors:** Andrew R. Barron
**journal:** Machine Learning
**publication year:** 1994
**links:** SpringerLink

**abstract:** For a common class of artificial neural networks, the mean integrated squared error between the estimated network and a target function $f$ is shown to be bounded by $O(C_f^2/n) + O((nd/N)\log N)$, where $n$ is the number of nodes, $d$ is the input dimension of the function, $N$ is the number of training observations, and $C_f$ is the first absolute moment of the Fourier magnitude distribution of $f$. The two contributions to this total risk are the approximation error and the estimation error. Approximation error refers to the distance between the target function and the closest neural network function of a given architecture, and estimation error refers to the distance between this ideal network function and an estimated network function. With $n \sim C_f(N/(d \log N))^{1/2}$ nodes, the order of the bound on the mean integrated squared error is optimized to be $O(C_f((d/N)\log N)^{1/2})$. The bound demonstrates surprisingly favorable properties of network estimation compared to traditional series and nonparametric curve estimation techniques in the case that $d$ is moderately large. Similar bounds are obtained when the number of nodes $n$ is not preselected as a function of $C_f$ (which is generally not known a priori), but rather the number of nodes is optimized from the observed data by the use of a complexity regularization or minimum description length criterion. The analysis involves Fourier techniques for the approximation error, metric entropy considerations for the estimation error, and a calculation of the index of resolvability of minimum complexity estimation of the family of networks.
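
The bias-variance balancing behind the optimized bound is a one-line computation: the approximation term C_f^2/n falls in n while the estimation term (nd/N) log N grows, and equating the two gives n ~ C_f(N/(d log N))^{1/2} with total risk O(C_f((d/N) log N)^{1/2}). A numerical check, with illustrative values of C_f, d and N (natural log; the base does not affect the order):

```python
import numpy as np

def risk_bound(n, C=1.0, d=10, N=100_000):
    # approximation error + estimation error from Barron's bound
    return C**2 / n + n * d * np.log(N) / N

C, d, N = 1.0, 10, 100_000
n_star = C * np.sqrt(N / (d * np.log(N)))       # balancing choice of n
optimal = 2 * C * np.sqrt(d * np.log(N) / N)    # resulting total risk

print(np.isclose(risk_bound(n_star), optimal))  # True
# the balanced node count beats nearby alternatives
print(risk_bound(n_star) < risk_bound(n_star / 2))  # True
print(risk_bound(n_star) < risk_bound(2 * n_star))  # True
```

At the balancing point both error terms equal C(d log N / N)^{1/2}, so the total is twice that, matching the stated optimized order.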