Note-to-self: collections of disorganized thoughts for my own future re-consumption, updated when I feel like it.

1. Basic ideas:

Old-timey interpretation of machine learning is about processing: implementing a deterministic function from input to output (e.g. labels), and learning parameters in that function by optimization from the data, typically via gradient descent (see e.g. backpropagation).
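To make the "deterministic function plus optimization" picture concrete, here is a minimal sketch: a one-dimensional linear model fitted by plain gradient descent on mean squared error. The toy data, the learning rate, the step count and the name `fit_linear` are all assumptions chosen for illustration, not anything from a particular paper.

```python
import numpy as np

def fit_linear(x, y, lr=0.1, steps=500):
    """Fit y ~ w*x + b by plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y               # residuals of the current function
        grad_w = 2.0 * np.mean(err * x)   # d(MSE)/dw
        grad_b = 2.0 * np.mean(err)       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (assumed): a noisy line with slope 3 and intercept -1.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x - 1.0 + 0.01 * rng.normal(size=200)

w, b = fit_linear(x, y)
print(w, b)
```

Backpropagation is the same idea with the gradients computed layer by layer through the chain rule instead of by hand.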

Of course, optimization can be viewed as maximum likelihood under some distribution, and you could do things in a Bayesian way. Then, supervised learning is about p(y|x), where x is the data and y is a given layer representing labels, whereas unsupervised learning is about p(x,y), the joint distribution over inputs and internal variables.
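One standard instance of the optimization/likelihood correspondence, written out as a reminder: if the model output f_θ(x) is taken as the mean of a Gaussian with fixed variance σ² (an assumption made here for illustration), maximizing the likelihood is the same as minimizing squared error.

```latex
\hat\theta_{\mathrm{ML}}
  = \arg\max_\theta \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta),
\qquad
p(y \mid x, \theta) = \mathcal{N}\!\left(y;\, f_\theta(x),\, \sigma^2\right)
\;\Longrightarrow\;
\hat\theta_{\mathrm{ML}}
  = \arg\min_\theta \sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2 .
```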

Boltzmann machines are a way to do so, by "storing" the joint probability distribution as the equilibrium distribution of a stat mech model. Then, we can clamp down either the inputs or the hidden variables to get the conditional probability of the others. Concretely, a given network only samples from that distribution: for instance, getting the probability of an input given an internal state typically requires Metropolis-ing around the space of possible input variables.
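A minimal sketch of the clamping idea, assuming a tiny fully connected Boltzmann machine with ±1 units and energy E(s) = -1/2 s·W·s - b·s. The weights are random rather than learned, the function names are made up for this note, and the split into "input" and "hidden" units is arbitrary. Single-flip Metropolis over the free units estimates their conditional mean, and brute-force enumeration (feasible only because there are four units) checks it.

```python
import itertools
import numpy as np

def energy(s, W, b):
    """E(s) = -1/2 s.W.s - b.s for +/-1 units (W symmetric, zero diagonal)."""
    return -0.5 * s @ W @ s - b @ s

def metropolis_clamped(W, b, clamped, n_sweeps=20000, burn_in=1000, seed=0):
    """Single-flip Metropolis over free units; `clamped` maps index -> fixed value."""
    rng = np.random.default_rng(seed)
    n = len(b)
    s = rng.choice([-1.0, 1.0], size=n)
    for i, v in clamped.items():
        s[i] = v
    free = [i for i in range(n) if i not in clamped]
    total = np.zeros(n)
    for sweep in range(burn_in + n_sweeps):
        for i in free:
            dE = 2.0 * s[i] * (W[i] @ s + b[i])  # energy change if unit i flips
            if dE <= 0.0 or rng.random() < np.exp(-dE):
                s[i] = -s[i]
        if sweep >= burn_in:
            total += s
    return total / n_sweeps  # estimated conditional mean of every unit

def exact_clamped_mean(W, b, clamped):
    """Brute-force conditional mean, feasible only for a handful of units."""
    n = len(b)
    free = [i for i in range(n) if i not in clamped]
    Z, mean = 0.0, np.zeros(n)
    for vals in itertools.product([-1.0, 1.0], repeat=len(free)):
        s = np.zeros(n)
        for i, v in clamped.items():
            s[i] = v
        s[free] = vals
        weight = np.exp(-energy(s, W, b))
        Z += weight
        mean += weight * s
    return mean / Z

# Toy model (assumed): units 0-1 act as inputs, units 2-3 as hidden variables.
rng = np.random.default_rng(1)
A = rng.normal(scale=0.5, size=(4, 4))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.3, size=4)

clamp_inputs = {0: 1.0, 1: -1.0}
sampled = metropolis_clamped(W, b, clamp_inputs)
exact = exact_clamped_mean(W, b, clamp_inputs)
print(sampled, exact)
```

Clamping the hidden units instead (e.g. `{2: 1.0, 3: 1.0}`) gives the conditional over the inputs with the same code. The enumeration is exponential in the number of free units, which is why sampling is the only practical route for a real network.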

2. Thoughts:

3. Biblio:


Mhaskar, Liao and Poggio show that, when dealing with compositional functions, deep nets require exponentially fewer parameters than shallow nets, although both have the same expressivity.

Lin and Tegmark connect the success of shallow neural nets with their ability to represent low-order polynomial Hamiltonians (i.e. shallow nets work because field theory works), and that of deep nets to the fact that many real processes are compositional and that flattening is often - but not always - costly; in particular, flattening polynomials has exponential costs. They claim that RG applies, but only as a form of supervised learning, i.e. we must know which macroscopic observables we want.
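As a reminder of why low-order polynomials are cheap for a shallow net, here is a sketch of the usual Taylor-expansion trick for approximating a product with four hidden units. It assumes a smooth activation with non-zero second derivative at the origin (softplus here, where that derivative is 1/4). The scale `lam` and the function names are illustrative choices of mine, not taken from the paper.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)); its second derivative at 0 is 1/4."""
    return np.logaddexp(0.0, x)

def approx_product(u, v, lam=0.01):
    """Approximate u*v with four softplus units fed small, rescaled inputs."""
    a, b = lam * (u + v), lam * (u - v)
    hidden = softplus(np.array([a, -a, b, -b]))  # the four hidden units
    # Even part of softplus is log(2) + x^2/8 + O(x^4), so the combination
    # below equals lam^2 * u * v up to O(lam^4) corrections.
    second_derivative_at_zero = 0.25
    combo = hidden[0] + hidden[1] - hidden[2] - hidden[3]
    return combo / (4.0 * second_derivative_at_zero * lam**2)

print(approx_product(2.0, 3.0), approx_product(-1.5, 0.5))
```

Chaining such pairwise gates multiplies n inputs with a number of units linear in n, whereas a single hidden layer has to do it all at once. That is the flattening cost referred to above.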

Tubiana and Monasson show conditions under which compositional representation appears in Restricted Boltzmann Machines, i.e. hidden units specialize to represent distinct traits (e.g. localized features of the characters you are trying to recognize) and only a handful of hidden units are activated for each character. This is very different from the completely distributed memories of a Hopfield net.


VC dimension and Rademacher complexity may well have nothing to do with the ability of a network to generalize: in real state-of-the-art situations, you can randomize labels in the training set and still have current networks learn them, meaning that they can in essence memorize the whole set without detecting true features. Of course they then cannot generalize at all, while the same net with the same complexity trained on the real training set can.
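A toy version of the random-label experiment, to keep the phenomenon in mind. This is not a deep net: it is an overparameterized random-feature model (fixed random ReLU layer, least-squares output layer), swapped in because it fits in a few lines. All sizes, the synthetic linear "true" labelling and the names are assumptions. The same model class is fitted once on real labels and once on coin-flip labels.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, n_feat = 10, 200, 2000, 2000

# Fixed random first layer (ReLU features); only the linear output layer is fitted.
R = rng.normal(size=(d, n_feat))
c = rng.normal(size=n_feat)

def features(X):
    return np.maximum(X @ R + c, 0.0) / np.sqrt(n_feat)

def fit_and_score(X_tr, y_tr, X_te, y_te):
    """Minimum-norm least-squares fit; returns (train accuracy, test accuracy)."""
    coef, *_ = np.linalg.lstsq(features(X_tr), y_tr, rcond=None)

    def accuracy(X, y):
        return float(np.mean(np.sign(features(X) @ coef) == y))

    return accuracy(X_tr, y_tr), accuracy(X_te, y_te)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))

# "Real" labels come from a hidden linear rule; random labels are coin flips.
w_true = rng.normal(size=d)
y_tr_true, y_te_true = np.sign(X_tr @ w_true), np.sign(X_te @ w_true)
y_tr_rand = rng.choice([-1.0, 1.0], size=n_train)
y_te_rand = rng.choice([-1.0, 1.0], size=n_test)

train_true, test_true = fit_and_score(X_tr, y_tr_true, X_te, y_te_true)
train_rand, test_rand = fit_and_score(X_tr, y_tr_rand, X_te, y_te_rand)
print(train_true, test_true, train_rand, test_rand)
```

With ten times more features than training points, both fits interpolate the training set, so training accuracy alone cannot distinguish learning from memorizing. Only the held-out accuracy differs, which is the point of the paragraph above.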

It would seem that deep nets basically never get stuck in bad minima.