The current paradigm for AI is entirely regression-based, and lacks mathematical rigour and formalism. Formulating a feedforward neural network is simple:

Let f: R -> R be an activation function, usually chosen to be the sigmoid (which maps into (0,1)), but you can have any nonlinearity here, like ReLU.

A neural network F with 1 hidden layer of N activation functions is a mapping of an M-dimensional vector x to a single real value, using N hidden weight vectors W_1, W_2, …, W_N (each in R^M) and an output weight vector W_final in R^N:

F(x) = f( W_final^{T} ( f(W_1^{T} x), …, f(W_N^{T} x) ) ).

Learning is just optimizing W against some cost function C(y_bar, x), usually C(y_bar, x) = (y_bar − F(x))^{2} / 2, over your training pairs (y_bar, x) (supervised learning). There is no theoretical reasoning behind this choice, other than the fact that this objective is differentiable, which is what gradient descent needs. Finding W is just a variation of hill climbing or Newton's method: since C and f are both differentiable, W is randomly initialized and updated using C′ and f′ on each sample.
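To make the formulation above concrete, here is a minimal NumPy sketch of the one-hidden-layer network and one gradient-descent step on C = (y_bar − F(x))^2 / 2. The function names (`forward`, `sgd_step`) and the learning rate are illustrative choices, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_final):
    # Hidden layer: N activations, one f(W_i^T x) per hidden unit.
    h = sigmoid(W_hidden @ x)                 # shape (N,)
    # Output: f(W_final^T h), a single real value.
    return sigmoid(w_final @ h), h

def sgd_step(x, y_bar, W_hidden, w_final, lr=0.1):
    """One gradient-descent update on C = (y_bar - F(x))^2 / 2."""
    y, h = forward(x, W_hidden, w_final)
    # dC/dy = -(y_bar - y); sigmoid'(z) = y * (1 - y).
    delta_out = -(y_bar - y) * y * (1.0 - y)            # scalar
    grad_final = delta_out * h                          # (N,)
    delta_hidden = delta_out * w_final * h * (1.0 - h)  # (N,)
    grad_hidden = np.outer(delta_hidden, x)             # (N, M)
    w_final -= lr * grad_final                          # update in place
    W_hidden -= lr * grad_hidden
    return (y_bar - y) ** 2 / 2

rng = np.random.default_rng(0)
M, N = 4, 8
W_hidden = rng.normal(size=(N, M)) * 0.5
w_final = rng.normal(size=N) * 0.5
x, y_bar = rng.normal(size=M), 1.0
losses = [sgd_step(x, y_bar, W_hidden, w_final) for _ in range(100)]
```

On this single training pair the loss shrinks steadily, which is all gradient descent promises here: local improvement, not a convergence guarantee.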

You can generalize this to convolutional neural networks, resnets, or whatever shit-net architecture you've concocted in Keras.

This definition of learning feels somewhat arbitrary. You have no guarantee of convergence with gradient descent: the loss surface defined by F has many local minima. Moreover, you can't determine mathematically why x got classified as F(x), or what the defining region is. That's the problem with using gradient descent as a learning rule. The problem I really have is that this formulation assumes your data is stationary and drawn from the distribution F is fitting. Why should you believe that in the first place? Worse yet, learning y1 and then immediately learning y2 causes W to change unpredictably, so you forget the y1 you just learned.

In most practical situations your data is non-stationary; you want your learning system to react in real time to changes in the world, including unexpected ones.

I think we need to move away from this paradigm and build a new one. I'm looking at a completely different inspiration for learning, one from psychology and neuroscience. Over the past 50 years these fields have produced an enormous body of experimental data, arguably more than any other science. The fundamental question I want to answer mathematically is: how do brains give rise to minds?

What mechanistic laws give rise to intelligence that lets us switch seamlessly between reading, writing, doing math, programming, walking, and talking, all at once, in real time, in a non-stationary world? There is a combinatorial explosion of optimization problems that we balance instantly (e.g. reading requires vision, object recognition, motor coordination, a plethora of natural-language tasks, and abstract thought), while modern technology can't even do a single one properly.

My intuition, which comes from looking at data on memory and associative learning in psychology and building mathematical models that can explain that data, is that top-down/bottom-up processing is the basis of biological intelligence. The data shows that we make predictions about the world, and through a series of matches and mismatches we form categories via exemplars. I want a mathematical understanding of this process.

My simple mathematical formulation: you start with an empty set of categories. During learning, an input vector is presented. Your categories compete for activation via a winner-take-all process, and the closest category within a threshold (in terms of the L1 norm) is chosen. If no such category exists, a new one is created. Learning means expanding the winning category to include this input. The top-down categories are activated by bottom-up input vectors.

My categories are axis-aligned hyperrectangles in the dimension of your input. Presenting an input vector with no matching category creates a new unit hyperbox around that input (a mismatch: it's an unexpected event, hence a new category). Learning grows the hyperbox (a match). I am constructing my learning rules to avoid the problems with deep learning (and regression): the hyperrectangles tell you exactly which features caused a classification, presenting input y2 doesn't make you forget y1, and the learning system converges. My hyperrectangles create general rules and exceptions to those rules.

Learning is finding the minimal set of general rules, plus exceptions to those rules, that perfectly fits your data. This is in stark contrast to regression-based approaches, which prefer a single model/function that avoids overfitting your data.
