<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Artificial Intelligence | Brainxyz</title>
	<atom:link href="https://www.brainxyz.com/tag/artificial-intelligence/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.brainxyz.com/tag/artificial-intelligence/</link>
	<description>Machine Learning, Artificial Intelligence, Brain, Neuroscience, AI</description>
	<lastBuildDate>Mon, 02 Jan 2023 16:02:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.1</generator>

<image>
	<url>https://www.brainxyz.com/wp-content/uploads/2020/08/cropped-new_icon4-1.png</url>
	<title>Artificial Intelligence | Brainxyz</title>
	<link>https://www.brainxyz.com/tag/artificial-intelligence/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">181902284</site>	<item>
		<title>Artificial Neural Networks &#124; Interpolation vs. Extrapolation</title>
		<link>https://www.brainxyz.com/machine-learning/ann/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ann</link>
					<comments>https://www.brainxyz.com/machine-learning/ann/#comments</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Mon, 07 Sep 2020 22:03:59 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[GPT-3]]></category>
		<category><![CDATA[neural network]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=1579</guid>

					<description><![CDATA[<p>Artificial Neural Networks (ANNs) are powerful inference tools. They can be trained to fit complex functions and then used to predict new (unseen) data outside their training set. Fitting the training data is relatively easy for ANNs because of their Universal Approximation capability. However, that does not mean ANNs can learn the rules as we...</p>
<p>The post <a href="https://www.brainxyz.com/machine-learning/ann/">Artificial Neural Networks | Interpolation vs. Extrapolation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Artificial Neural Networks (ANNs) are powerful inference tools. They can be trained to fit complex functions and then used to predict new (unseen) data outside their training set. Fitting the training data is relatively easy for ANNs because of their <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" target="_blank" rel="noreferrer noopener nofollow">Universal Approximation</a> capability. However, that does not mean ANNs can learn the rules as we humans do. Here we aim to show how well a trained ANN, which fits its training data accurately, can generalize to new and unseen data. We categorize the unseen data into two types:</p>



<ol class="wp-block-list"><li>Data points within the training range (that can be interpolated). </li><li>Data points outside the training range (that can be extrapolated). </li></ol>






<p>Take this example: <strong>|</strong> 1-&gt;1 <strong>|</strong> 2-&gt;4 <strong>|</strong> 3-&gt;<strong>?</strong> <strong>|</strong> 4-&gt;16 <strong>|</strong> 5-&gt;25 <strong>|</strong> 6-&gt;<strong>?</strong> <strong>|</strong> &#8230;</p>



<p>If we ask a human to predict 3-&gt;? and 6-&gt;? from the sequence above, most people answer correctly once they discover the rule that generates the set, in this example <strong>y = X<sup>2</sup></strong>. Predicting 3-&gt;9 is called interpolation, while predicting 6-&gt;36 is called extrapolation. For humans, this distinction hardly matters here: once we find the power rule, we can apply it to any other number in the sequence. Even if we jump to 100-&gt;?, the answer is straightforward (100<sup>2</sup>). For ANNs, however, the distinction between interpolation and extrapolation is crucial, and it sheds some light on the difference between learning in humans and learning in ANNs.</p>



<p>Here we demonstrate this difference by implementing a simple yet powerful ANN architecture with a single hidden layer, and we examine its generalization capability under various parameter settings. We randomly assign and freeze the weights of the first layer and train only the weights of the final layer using a closed-form solution. This method was popularized under the name <a rel="noreferrer noopener nofollow" href="https://en.wikipedia.org/wiki/Extreme_learning_machine" target="_blank">Extreme Learning Machine (ELM)</a> and has somewhat controversial origins. In our experience, this method trains quickly and gives very accurate results on low-dimensional datasets with shallow structure. We use Matlab for the demonstration because its syntax is close to linear algebra, which lets us implement the ANN from scratch in just a few lines of code. We start with the &#8216;hello world&#8217; of machine learning: training an ANN to solve the XOR problem. Below is the complete Matlab code:</p>



<pre class="wp-block-code"><code>X = &#91; 0, 0; 1, 1; 0, 1; 1, 0 ];          % Xor data
y = &#91; 0, 0, 1, 1 ];                      % targets

input = 2; neurons = 5;                  % parameters.
Wx = randn(input, neurons)*0.01;         % input-hidden weights (range ~ -0.01 to 0.01)
z = tanh(X * Wx);                        % 1st-Layer forward activation (tanh)
Wo = y * pinv(z');                       % Training output weights (closed form solution)
predictions = tanh(X * Wx) * Wo';        % Feedforward propagation | inference

disp(predictions)                        % display the predicted data </code></pre>



<p>Believe it or not, the above is all you need to construct and train a single-hidden-layer ANN (no external libraries needed). You can achieve comparable conciseness with Python + NumPy (<a rel="noreferrer noopener" href="https://github.com/hunar4321/Simple_ANN/blob/master/ELM.py" target="_blank">https://github.com/hunar4321/Simple_ANN/blob/master/ELM.py</a>). If you run this code, it outputs predictions very close to the targets, i.e., [0, 0, 1, 1]. What is more, the same lines of code can approximate any function, provided the hidden layer has enough neurons. Note, however, that for very high-dimensional inputs the pseudo-inverse in the closed-form solution becomes computationally expensive, and for very noisy datasets over-fitting is the usual enemy.</p>
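<p>For readers without Matlab, here is a rough NumPy sketch of the same idea (our own translation, not the linked script; the variable names and hidden-layer size are our choices). With more hidden units than samples, the closed-form solution fits the four XOR targets essentially exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR data: one sample per row
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0., 0., 1., 1.])

n_hidden = 20
Wx = rng.normal(size=(2, n_hidden))   # random, frozen input-hidden weights
z = np.tanh(X @ Wx)                   # hidden-layer activations
Wo = y @ np.linalg.pinv(z.T)          # closed-form output weights
predictions = np.tanh(X @ Wx) @ Wo    # feed-forward inference

print(np.round(predictions, 3))
```

The only training step is the pseudo-inverse; no gradient descent is involved.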



<p>Coming back to our quest, which is to examine the ability of such trained networks to interpolate and extrapolate unseen data: the XOR example is not useful here because all of the input data were used for training. A better example is to train these networks on the power rule <strong>y = X<sup>2</sup></strong> (our first example). Below is the complete Matlab code for this next example:</p>



<pre class="wp-block-code"><code>step = 2;                          % gap between consecutive training inputs

X = &#91;2:step:100]';                 % training data: even numbers 2, 4, ..., 100
y = X.^2;                          % training targets

inp = 1; neurons = 100;
Wx = randn(inp, neurons)*0.01;     % small random input-hidden weights (frozen)
z = tanh(X * Wx);                  % hidden-layer activations
Wo = y' * pinv(z');                % output weights (closed-form solution)
yhat = tanh(X * Wx) * Wo';         % predictions on the training set

Xt = &#91;1:step:121]';                % testing data: odd numbers 1, 3, ..., 121
yt = Xt.^2;                        % test targets

prediction = tanh(Xt * Wx) * Wo'; % inference on the test set

% visualizations
figure; hold on; plot(X, y,'og'); plot(X, yhat, '*r'); hold off;
legend('target', 'prediction'); title('y = X^2')
figure; hold on; plot(Xt, yt,'og'); plot(Xt, prediction, '*r'); hold off;
legend('target', 'prediction'); title('y = X^2')
xticks = 1:10:121;
set(gca,'XTick',xticks)</code></pre>



<p>The inputs for training are the even numbers from 2 to 100, and their squares are the target outputs. As Figure 1 shows, the network learned to fit and approximate the training data very well.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center" style="grid-template-columns:55% auto"><figure class="wp-block-media-text__media"><img fetchpriority="high" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/training.png" alt="" class="wp-image-1658" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/training.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/training.png 300w" sizes="(max-width: 560px) 100vw, 560px" /></figure><div class="wp-block-media-text__content">
<p class="has-text-align-center"></p>



<figure class="wp-block-table"><table><tbody><tr><td class="has-text-align-center" data-align="center">Input</td><td class="has-text-align-center" data-align="center">Prediction</td><td class="has-text-align-center" data-align="center">Target</td></tr><tr><td class="has-text-align-center" data-align="center">10</td><td class="has-text-align-center" data-align="center">99.23210</td><td class="has-text-align-center" data-align="center">100</td></tr><tr><td class="has-text-align-center" data-align="center">12</td><td class="has-text-align-center" data-align="center">144.1864</td><td class="has-text-align-center" data-align="center">144</td></tr><tr><td class="has-text-align-center" data-align="center">14</td><td class="has-text-align-center" data-align="center">196.7875</td><td class="has-text-align-center" data-align="center">196</td></tr><tr><td class="has-text-align-center" data-align="center">16</td><td class="has-text-align-center" data-align="center">256.7168</td><td class="has-text-align-center" data-align="center">256</td></tr><tr><td class="has-text-align-center" data-align="center">18</td><td class="has-text-align-center" data-align="center">324.1782</td><td class="has-text-align-center" data-align="center">324</td></tr></tbody></table><figcaption>Table 1: Few examples of training inference</figcaption></figure>
</div></div>



<p class="has-text-align-left">Figure 1: Fitting y = X<sup>2</sup> | training-set</p>
</div></div>



<p>We then tested the same network, trained on the even numbers, on the unseen odd numbers from 1 to 121. Figure 2 shows the network&#8217;s ability to generalize: the ANN predicts the odd numbers that lie within the training range very well, but it cannot extrapolate beyond that range; once the input passes 100, the predictions go wrong.</p>
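<p>This behavior can be reproduced in a few lines of NumPy (a sketch under our own parameter choices, here with 1,000 hidden neurons; exact errors vary with the random seed). An odd input inside the training range is predicted far more accurately than one beyond it:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.arange(2., 101., 2.)[:, None]         # training inputs: even numbers 2..100
y = (X ** 2).ravel()                         # targets: y = X^2

n_hidden = 1000
Wx = rng.normal(size=(1, n_hidden)) * 0.01   # small frozen input-hidden weights
Wo = y @ np.linalg.pinv(np.tanh(X @ Wx).T)   # closed-form output weights

def predict(x):
    """Feed-forward inference for a single scalar input."""
    return (np.tanh(np.array([[x]]) @ Wx) @ Wo).item()

inside = predict(51.0)     # odd number inside the training range (target 2601)
outside = predict(121.0)   # beyond the training range (target 14641)
```

Because tanh units saturate, the fitted surface flattens out past the training range, so the prediction at 121 falls well short of 121<sup>2</sup> while the prediction at 51 stays close to 51<sup>2</sup>.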



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center" style="grid-template-columns:54% auto"><figure class="wp-block-media-text__media"><img decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png" alt="" class="wp-image-1659" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png 300w" sizes="(max-width: 560px) 100vw, 560px" /></figure><div class="wp-block-media-text__content">
<p class="has-text-align-center"></p>



<figure class="wp-block-table"><table><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Input</strong></td><td class="has-text-align-center" data-align="center"><strong>Prediction</strong></td><td class="has-text-align-center" data-align="center"><strong>Target</strong></td></tr><tr><td class="has-text-align-center" data-align="center">11</td><td class="has-text-align-center" data-align="center">120.7115</td><td class="has-text-align-center" data-align="center">121</td></tr><tr><td class="has-text-align-center" data-align="center">13</td><td class="has-text-align-center" data-align="center">169.5647</td><td class="has-text-align-center" data-align="center">169</td></tr><tr><td class="has-text-align-center" data-align="center">15</td><td class="has-text-align-center" data-align="center">225.8335</td><td class="has-text-align-center" data-align="center">225</td></tr><tr><td class="has-text-align-center" data-align="center">17</td><td class="has-text-align-center" data-align="center">289.4791</td><td class="has-text-align-center" data-align="center">289</td></tr><tr><td class="has-text-align-center" data-align="center">19</td><td class="has-text-align-center" data-align="center">360.8756</td><td class="has-text-align-center" data-align="center">361</td></tr></tbody></table><figcaption>Table 2: Few examples of testing inference</figcaption></figure>
</div></div>



<p class="has-text-align-left">Figure 2: Fitting y = X<sup>2</sup> | test-set</p>
</div></div>



<p>If we increase the number of hidden neurons from 100 up to 100,000, we see a steady improvement in the ANN&#8217;s ability to predict the unseen odd numbers (Figure 3).</p>



<figure class="wp-block-gallery columns-3 is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png" alt="" data-id="1664" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png" data-link="https://www.brainxyz.com/?attachment_id=1664" class="wp-image-1664" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">neurons = 1000</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png" alt="" data-id="1663" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png" data-link="https://www.brainxyz.com/?attachment_id=1663" class="wp-image-1663" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">neurons = 10,000</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png" alt="" data-id="1662" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png" data-link="https://www.brainxyz.com/?attachment_id=1662" class="wp-image-1662" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">neurons = 
100,000</figcaption></figure></li></ul><figcaption class="blocks-gallery-caption">Figure 3: shows the effect of increasing the size of the hidden layer on ANN&#8217;s interpolation and extrapolation capabilities</figcaption></figure>



<p>Although increasing the number of hidden neurons improves interpolation and extrapolation alike, the network still fails to extrapolate to values far from the training range. This suggests that the power rule cannot be modeled by such an ANN no matter how large the hidden layer is. This is understandable: the weighted summation of the inputs (i.e., the dot product) cannot model multiplication among the inputs; it can only approximate it over a specific range.</p>



<p>It is also worth noting that the interpolation ability of ANNs remains very good even if we increase the step size (i.e., the gap) between the training-set numbers (Figure 4).</p>



<figure class="wp-block-gallery columns-3 is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png" alt="" data-id="1674" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png" data-link="https://www.brainxyz.com/?attachment_id=1674" class="wp-image-1674" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 4</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png" alt="" data-id="1673" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png" data-link="https://www.brainxyz.com/?attachment_id=1673" class="wp-image-1673" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 8</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png" alt="" data-id="1672" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png" data-link="https://www.brainxyz.com/?attachment_id=1672" class="wp-image-1672" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 
12</figcaption></figure></li></ul><figcaption class="blocks-gallery-caption">Figure 4: shows the effect of increasing the size of the step (gap) in-between the training data on ANN&#8217;s interpolation capability</figcaption></figure>



<h6 class="wp-block-heading"><strong>We can draw the following conclusions from our results above:</strong></h6>



<ul class="wp-block-list"><li>ANNs can fit the training set of our non-linear function, <strong>y = X<sup>2</sup></strong>, very well.</li><li>ANNs can fit the test set of the function above provided the test data lie within the range of the training set, i.e., ANNs are good at interpolation.</li><li>ANNs can interpolate unseen data well even if the gaps between the training points are large.</li><li>ANNs are bad at fitting test data that lie far outside the range of the training set.</li><li>Increasing the number of neurons in the hidden layer improves both the interpolation and the extrapolation capabilities of ANNs; however, the power rule cannot be modeled no matter how large the hidden layer is. It can only be approximated within a specific range.</li></ul>






<p>The last point marks an important distinction between how ANNs generalize and how humans generalize. That said, the way we present data to ANNs matters. For example, in deep learning, data are often presented as one-hot encodings, which treat samples as discrete categories instead of continuous values. Some newer architectures like Transformers (e.g., GPT-3) produce coherent text and compelling answers to questions, suggesting some rule-learning capability. To our knowledge, however, GPT-3 was also bad at learning multiplication (<a rel="noreferrer noopener nofollow" href="https://arxiv.org/abs/2005.14165" target="_blank">GPT-3 paper / Figure 3.10</a>) while being good at addition, and this behavior closely resembles that of a typical feed-forward ANN such as the one in our example.</p>



<p>Another interesting observation is that even though we increased the number of neurons beyond the number of samples, our network did not suffer from the bad effects of over-fitting. Over-fitting is usually problematic when the training dataset is not representative, which typically happens when there is a lot of variance due to external noise, i.e., non-interesting variance that is not part of the data structure itself. People who work with time-series signals that have a low signal-to-noise ratio (SNR), e.g., EEG and fMRI, usually prefer simpler machine learning methods such as SVMs and ridge regression, because over-parameterized ANNs tend to fit the external noise instead of the signal (a very bad side effect of over-fitting).</p>



<p>In contrast, in the deep learning world, over-parameterized ANNs seem to be the norm, especially in vision and NLP, because the training datasets there are usually large, representative, and clean. Since those datasets cover a wide range of structural variation (i.e., the training data are representative of the test data), it is no surprise that deep networks like CNNs, GANs, and Transformers learn to classify and generate new variations of interpolatable unseen data. This is similar to our example, where the ANN successfully predicted the unseen odd numbers that fell within the range of the training set.</p>



<p>In conclusion, we attribute the success of ANNs in vision and NLP fields to their good interpolation capability, especially when they are fed with large and representative training datasets. We also argue that rule learning and extrapolation beyond the training range are ANNs&#8217; weak points.</p>



<p>At Brainxyz, our focus is on learning algorithms that are capable of learning and generalizing in a controllable manner. We also aim for biological plausibility and efficiency. In future articles, we will test and compare PHUN, our in-progress ML algorithm, with ANNs and other ML methods. Stay tuned.</p>



<h6 class="wp-block-heading"><strong>Abbreviations:</strong></h6>



<ul class="wp-block-list"><li>ML: Machine Learning</li><li>ANNs: Artificial Neural Networks</li><li>CNNs: Convolutional Neural Networks</li><li>SVMs: Support Vector Machines</li><li>GANs: Generative Adversarial Networks</li><li>GPT-3: Generative Pre-trained Transformer 3</li><li>NLP: Natural Language Processing</li><li>PHUN: Predictive Hebbian Unified Neurons</li><li>fMRI: Functional Magnetic Resonance Imaging</li><li>EEG: Electroencephalogram</li></ul>



<p>The post <a href="https://www.brainxyz.com/machine-learning/ann/">Artificial Neural Networks | Interpolation vs. Extrapolation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/machine-learning/ann/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1579</post-id>	</item>
		<item>
		<title>Predict Pattern &#038; Top Down Approach</title>
		<link>https://www.brainxyz.com/blog/predict/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=predict</link>
					<comments>https://www.brainxyz.com/blog/predict/#respond</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Fri, 28 Aug 2020 21:06:56 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[game]]></category>
		<category><![CDATA[information theory]]></category>
		<category><![CDATA[minesweeper]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=1380</guid>

					<description><![CDATA[<p>Suppose you are in an exam and you have to answer a set of True or False questions. Also, suppose you do not know the correct answers to any of the questions (because they seem nonsense to you). In this situation, all you can do is random guessing and normally your resulted score is around...</p>
<p>The post <a href="https://www.brainxyz.com/blog/predict/">Predict Pattern &#038; Top Down Approach</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Suppose you are in an exam and you have to answer a set of True or False questions. Also, suppose you do not know the correct answers to any of the questions (because they seem nonsense to you). In this situation, all you can do is random guessing and normally your resulted score is around 50% by chance. Let&#8217;s say the teacher who marks your work gives you many more trials to answer the same set of questions. Also, he/she shows you your total score at the end of each trial. What will you do to reach the perfect score with a minimal number of trials?</p>



<p>One obvious strategy is brute force: you randomly change all your answers in each trial until you hit the full score by chance (Figure 1). With this strategy, if there are 10 questions, your chance of guessing all the answers correctly in a given trial is 1 in 1024 (2<sup>10</sup> possible answer combinations). Your winning chances decrease exponentially as the number of questions increases.</p>



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-3 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="415" height="451" data-id="1849" src="https://www.brainxyz.com/wp-content/uploads/2020/09/exam11.png" alt="" class="wp-image-1849" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/exam11.png 415w, https://www.brainxyz.com/wp-content/uploads/2020/09/exam11.png 276w" sizes="(max-width: 415px) 100vw, 415px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="415" height="451" data-id="1850" src="https://www.brainxyz.com/wp-content/uploads/2020/09/exam22.png" alt="" class="wp-image-1850" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/exam22.png 415w, https://www.brainxyz.com/wp-content/uploads/2020/09/exam22.png 276w" sizes="(max-width: 415px) 100vw, 415px" /></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="415" height="451" data-id="1851" src="https://www.brainxyz.com/wp-content/uploads/2020/09/exam33.png" alt="" class="wp-image-1851" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/exam33.png 415w, https://www.brainxyz.com/wp-content/uploads/2020/09/exam33.png 276w" sizes="(max-width: 415px) 100vw, 415px" /></figure>
<figcaption class="blocks-gallery-caption wp-element-caption">Figure 1. An example of 10 True or False questions answered by brute-force (3 trials are shown).</figcaption></figure>



<p></p>



<p>The alternative strategy is to change only one answer in each trial. If your total score increases, the change was correct; if not, it was incorrect, so you revert it. With this simple strategy, you are guaranteed to reach the perfect score in at most 10 trials (for 10 questions), massively fewer than with brute force.</p>
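<p>The one-change-per-trial strategy is easy to verify in code. Below is a small Python sketch (not from the post; the function and variable names are ours) that always recovers a 10-question answer key in at most 10 score peeks:</p>

```python
import random

def solve(answers, rng):
    """Flip one answer per trial; keep the flip only if the score increases."""
    guess = [rng.random() < 0.5 for _ in answers]
    score = sum(g == a for g, a in zip(guess, answers))
    trials = 0
    for i in range(len(answers)):
        if score == len(answers):
            break                     # already perfect
        guess[i] = not guess[i]       # change exactly one answer
        trials += 1
        new_score = sum(g == a for g, a in zip(guess, answers))
        if new_score > score:
            score = new_score         # the change was correct: keep it
        else:
            guess[i] = not guess[i]   # the change was wrong: revert it
    return guess, trials

rng = random.Random(0)
answers = [rng.random() < 0.5 for _ in range(10)]
guess, trials = solve(answers, rng)
```

Since flipping a single answer always moves the score by exactly one point, each peek settles one question for good.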



<p>To gain more insight into these kinds of problems, we ran some simulations and found that it is possible to reach the full score in even fewer trials by maximizing the information gain. This is achieved by changing more than one answer at a time. Maximizing information gain is a big topic, detailed in Shannon&#8217;s information theory; in this article we will not go into the theoretical details.</p>



<p>For educational and entertainment purposes, we turned part of our simulations into an Android game called &#8220;MineClear&#8221;. It is now available on Google&#8217;s Play Store. We also made a similar online version with a different interface (links at the end). Below are some screenshots of MineClear.<br>*Edit: MineClear is now discontinued, but we made a similar online game where you need to find the hidden True-and-False pattern, just like the example above. Here is the link to the online game.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow"><div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gameshow.png" alt="" class="wp-image-1952" width="572" height="491" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gameshow.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/09/gameshow.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/09/gameshow.png 768w" sizes="(max-width: 572px) 100vw, 572px" /><figcaption class="wp-element-caption">Figure 2: Screenshots of MineClear game</figcaption></figure></div></div></div>



<p>In MineClear, you need to figure out all the &#8216;Mine&#8217; and &#8216;Clear&#8217; squares by flagging or unflagging them correctly. To know whether you have made a correct guess, you must peek at your current score. What makes the game challenging is that you can only peek at your score a limited number of times. The simple strategy described above can get you through the first level. In the more challenging levels, however, you have fewer trials than squares, so that strategy cannot reveal all the answers: you will run out of your limited number of peeks.</p>



<p>Another possible strategy is to change two squares at once and then peek at your score. Say you have flagged two squares (or changed two False answers to True). Two outcomes are then possible for your total score:</p>



<ol class="wp-block-list" type="1">
<li>Your total score increases or decreases by two points (+/-2). This means the two changes you made were either both correct or both incorrect. This happens 50% of the time (Figure 3, situation 2).</li>



<li>Your total score does not change. This means one of the changes was correct and the other was not, so you need one more trial to find out which. This also happens 50% of the time (Figure 3, situation 1).</li>
</ol>



<p>As you can see, following the above strategy, half of the time you need one trial to resolve two squares, and half of the time you need two trials. That is 1.5 trials per two squares on average, about 25% fewer peeks than the one-change-at-a-time strategy, and usually enough to pass the second level of MineClear.</p>
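<p>The 25% figure is easy to check with a quick Monte Carlo estimate (a sketch of our own, not from the post): the expected cost is 0.5 &#215; 1 + 0.5 &#215; 2 = 1.5 peeks per pair, i.e., 0.75 peeks per square instead of 1:</p>

```python
import random

def peeks_for_pair(rng):
    """Peeks needed to resolve two unknown squares flipped together."""
    # each current guess is independently right or wrong with probability 1/2
    first_right = rng.random() < 0.5
    second_right = rng.random() < 0.5
    if first_right == second_right:
        return 1   # score moves by +/-2: both squares resolved in one peek
    return 2       # score unchanged: one extra peek tells the two apart

rng = random.Random(1)
n = 100_000
avg = sum(peeks_for_pair(rng) for _ in range(n)) / n   # close to 1.5
```

With 100,000 simulated pairs, the sample mean lands very close to the theoretical 1.5 peeks per pair.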


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2020/09/two_grid.png" alt="" class="wp-image-1967" width="455" height="352" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/two_grid.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/09/two_grid.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/09/two_grid.png 768w" sizes="(max-width: 455px) 100vw, 455px" /><figcaption class="wp-element-caption">Figure 3: Shows all the possible outcomes if you change 2 squares at once.</figcaption></figure></div>


<p>If you want to maximize the information gain even further, you have to change even more squares. For example, changing 4 squares at once gives you 16 guessing possibilities and 3 distinct situations (Figure 4). In the best-case scenario, you will be able to determine all 4 squares with only 1 trial, and this happens 2 out of 16 times (situation-3 in Figure 4). In the worst-case scenario, you might need up to 4 trials to determine all 4 squares correctly (situation-1 in Figure 4). Changing 4 squares at once can increase your chance of winning in the later levels of MineClear.</p>
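<p>These counts can be verified by enumeration. Each toggled square is either a correct flip (+1 to the score) or a wrong one (&#8722;1), so flipping k squares gives 2<sup>k</sup> equally likely combinations. A short sketch (illustrative code, not from the game):</p>

```python
from itertools import product
from collections import Counter

def score_change_distribution(k):
    """Distribution of total-score changes when k squares are toggled at once.

    Each toggled square independently contributes +1 (correct flip) or
    -1 (wrong flip) to the score change.
    """
    return Counter(sum(+1 if correct else -1 for correct in combo)
                   for combo in product([True, False], repeat=k))

dist = score_change_distribution(4)
# Of the 16 combinations, all-correct and all-wrong (a change of +/-4) each
# occur once, i.e. 2 out of 16 times, and the absolute change takes 3 distinct
# values (0, 2, 4) -- the 3 situations of Figure 4.
```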



<p>Please note that no strategy can guarantee a win every time if the number of trials is fewer than the number of squares. You need some luck on your side to win, but a good strategy maximizes your chance of winning. Without a good strategy, even luck will not save you.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="518" src="https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2-1024x518.png" alt="" class="wp-image-1954" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2.png 1024w, https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2.png 768w, https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2.png 1536w, https://www.brainxyz.com/wp-content/uploads/2020/09/possible-copy2.png 2000w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Figure 4: Shows all the possible outcomes if you change 4 squares at once.</figcaption></figure></div>


<p>We will not spoil all the possible strategies; we leave it to you to find ingenious ways to solve the problem. The take-home message is: to maximize the information gain, you first need a bird&#8217;s-eye view of the entire grid, and only then do you come down to the specifics. This is analogous to the top-down approach (Figure 5). With a top-down approach, you can visualize the problem as an inverted tree: to figure out the values of the nodes at the bottom, you first need to figure out the nodes at the top of the hierarchy. Changing a top node changes the nodes directly below it, but changing a node at the bottom does not change the nodes above it.</p>



<figure class="wp-block-image size-large is-resized is-style-rounded"><img loading="lazy" decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2020/09/top_down.png" alt="top down approach brainxyz" class="wp-image-1854" width="713" height="324" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/top_down.png 716w, https://www.brainxyz.com/wp-content/uploads/2020/09/top_down.png 300w" sizes="(max-width: 713px) 100vw, 713px" /><figcaption class="wp-element-caption">Figure 5: Top down approach where the top nodes have a wider field of view than the bottom ones</figcaption></figure>



<p>We think the game presented here is a simplified simulation of many real-life situations where you have to reach a goal with a minimal number of trials and errors. In many situations you cannot get specific feedback on the individual parts of your actions; instead, you get feedback on your work as a whole. For instance, you might be rejected after a job interview without being given specific reasons. The rejection represents your total score, and, just as in the game, it is your job to find out which parts were responsible for it and need to be improved.</p>



<p>In fact, in the evolutionary process, natural selection and survival of the fittest are also this kind of feedback: each organism is assessed as a whole. This is why the mutation rate is generally low. Because mutations happen randomly, widespread changes in the DNA of the next generation are usually fatal and can hardly improve the species, as negative mutations cancel out the positive ones. Small, incremental changes in DNA, on the other hand, are not fatal, and with natural selection the process of climbing Mount Improbable, or even reaching the global maximum, becomes possible (similar to our first simple strategy). Minimizing the number of trials needed to reach the correct answer is probably one of the most valuable things for survival. Our DNA cannot use the sophisticated top-down strategies we described here, but our brains can.</p>



<p><strong>Edit</strong>: MineClear is no longer available on the Play Store; instead, you can play it online here under the name Predict Hidden Pattern. You can also watch our new video about this topic and its relation to consciousness here: <a href="https://youtu.be/5qqxGwlUilU">https://youtu.be/5qqxGwlUilU</a><br></p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="MineClear: A New Brain Puzzle Game" width="720" height="405" src="https://www.youtube.com/embed/x-zOCOLGiW8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div></figure>









<div style="height:69px" aria-hidden="true" class="wp-block-spacer"></div><p>The post <a href="https://www.brainxyz.com/blog/predict/">Predict Pattern &#038; Top Down Approach</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/blog/predict/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1380</post-id>	</item>
	</channel>
</rss>
