<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Machine Learning | Brainxyz</title>
	<atom:link href="https://www.brainxyz.com/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.brainxyz.com/machine-learning/</link>
	<description>Machine Learning, Artificial Intelligence, Brain, Neuroscience, AI</description>
	<lastBuildDate>Tue, 10 Jan 2023 17:44:01 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>

<image>
	<url>https://www.brainxyz.com/wp-content/uploads/2020/08/cropped-new_icon4-1.png</url>
	<title>Machine Learning | Brainxyz</title>
	<link>https://www.brainxyz.com/machine-learning/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">181902284</site>	<item>
		<title>Artificial Life Simulation</title>
		<link>https://www.brainxyz.com/machine-learning/artificial-life/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=artificial-life</link>
					<comments>https://www.brainxyz.com/machine-learning/artificial-life/#respond</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Thu, 08 Sep 2022 23:08:04 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[artificial life]]></category>
		<category><![CDATA[cellular automata]]></category>
		<category><![CDATA[conway&#039;s game of life]]></category>
		<category><![CDATA[evolution]]></category>
		<category><![CDATA[fractals]]></category>
		<category><![CDATA[particle life]]></category>
		<category><![CDATA[simulation]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=2266</guid>

					<description><![CDATA[<p>Recently, I made an educational simulation project on what is known as Particle Life to showcase how complexity can arise from simplicity. Particle Life is like Conway&#8217;s Game of Life, but in Conway&#8217;s version the effects of the particles are confined to their immediate neighbors, while in these simulations particles have effects...</p>
<p>The post <a href="https://www.brainxyz.com/machine-learning/artificial-life/">Artificial Life Simulation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Recently, I made an educational simulation project on what is known as Particle Life to showcase how complexity can arise from simplicity. Particle Life is like Conway&#8217;s Game of Life, but in Conway&#8217;s version the effects of the particles are confined to their immediate neighbors, while in these simulations particles exert effects on each other over longer distances. Also, the interactions rely on Newtonian-like attraction/repulsion forces among the interacting particles.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="573" src="https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-1024x573.jpg" alt="" class="wp-image-2267" srcset="https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-scaled.jpg 1024w, https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-scaled.jpg 300w, https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-scaled.jpg 768w, https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-scaled.jpg 1536w, https://www.brainxyz.com/wp-content/uploads/2022/09/big_pic-scaled.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption>A diverse set of self-organizing patterns. All emerged from four particle types interacting with each other.</figcaption></figure>



<p></p>



<p>In our case, four particle types (green, red, white, yellow) interact with each other. Each has different attraction and repulsion properties toward the other particle types. The particles start at random positions and, over time, aggregate and move around, giving rise to interesting formations and behaviors. The result is a set of self-organizing patterns that look like living cells eating, chasing, or merging with each other.</p>



<p>I modified the algorithm to make it much simpler by removing collision detection and distance squaring, which improved performance and allowed testing thousands of particles in real time. I also added the ability to explore various parameters in real time. This let me discover some never-before-seen patterns emerging from very simple rules. The simulation changed my mind: after all, life-like patterns are not so difficult to produce, especially in the conditions of the early Earth, when the primordial soup was filled with the necessary ingredients.</p>
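<p>The simplified rule can be sketched in a few lines of NumPy. This is my own hedged restatement, not the project's code: the function name and the vectorized form are illustrative, and the single shared <code>g</code> is an assumption (in the actual simulation each pair of particle types gets its own <code>g</code>).</p>

```python
import numpy as np

def step(pos, vel, g, rmax=80.0, damping=0.5):
    """One update of the simplified rule: force magnitude g/d per pair
    (1/d falloff instead of 1/d**2, and no collision detection)."""
    diff = pos[:, None, :] - pos[None, :, :]    # pairwise displacements, shape (n, n, 2)
    d = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise distances
    F = np.zeros_like(d)
    mask = (d > 0) & (d < rmax)                 # interact only within the cutoff radius
    F[mask] = g / d[mask]                       # the simplified 1/d force law
    force = (F[:, :, None] * diff).sum(axis=1)  # net force on each particle
    vel = (vel + force) * damping               # half-damped velocity update
    return pos + vel, vel
```

<p>As in the JavaScript version, a negative <code>g</code> produces attraction and a positive <code>g</code> repulsion, and the 0.5 damping keeps velocities bounded.</p>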



<p>One thing to note: the simulation here also involves unidirectional attraction and repulsion. This asymmetry, which doesn&#8217;t exist at the atomic level, leads to some interesting chasing behavior. However, one can easily imagine chasing behavior emerging from symmetrical forces if we simulate a very large number of particles in a much larger space. For example, a large protein might carry more negative ions on one side than the other, or we can imagine a closed space with a negatively charged membrane: a positive particle will move toward the membrane and might be followed by a negative charge, imitating chasing. We, as local observers, might then model the chasing with an asymmetrical force without needing to model the membrane. This is good news, because it means we can model complex biochemical reactions to some degree of accuracy with much less computation. </p>



<p>Complex patterns emerging from simple relations is an interesting topic, but as a neuroscientist I am interested in the reverse process, where the brain tries to untangle the complexities and re-model the relationships among its neuronal connections. Our brain&#8217;s ability to model these relations is what allows us to imagine and predict.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="916" height="594" src="https://www.brainxyz.com/wp-content/uploads/2022/09/brain_rversal.jpg" alt="" class="wp-image-2278" srcset="https://www.brainxyz.com/wp-content/uploads/2022/09/brain_rversal.jpg 916w, https://www.brainxyz.com/wp-content/uploads/2022/09/brain_rversal.jpg 300w, https://www.brainxyz.com/wp-content/uploads/2022/09/brain_rversal.jpg 768w" sizes="(max-width: 916px) 100vw, 916px" /><figcaption>The human brain is able to model the relations among its neuronal connections</figcaption></figure>



<p></p>



<p>Simulation demos and a walk-through tutorial are available in this video:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="How to Simulate Artificial Life" width="720" height="405" src="https://www.youtube.com/embed/0Kx4Y9TVMGg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div></figure>



<p>You can find the project source code on GitHub: https://github.com/hunar4321/particle-life</p>



<p>You can also play with a live online demo here: https://hunar4321.github.io/particle-life/particle_life.html</p>



<p>The JavaScript code is as simple as this:</p>



<pre class="wp-block-code"><code>&lt;canvas id="life" width="500" height="500"&gt;&lt;/canvas&gt;
&lt;script&gt;
  //Hunar Ahmad @ brainxyz
  m = document.getElementById("life").getContext("2d");
  draw = (x, y, c, s) =&gt; {
    m.fillStyle = c;
    m.fillRect(x, y, s, s);
  };
  atoms = &#91;];
  atom = (x, y, c) =&gt; {
    return { x: x, y: y, vx: 0, vy: 0, color: c };
  };
  random = () =&gt; {
    return Math.random() * 400 + 50;
  };
  create = (number, color) =&gt; {
    group = &#91;];
    for (let i = 0; i &lt; number; i++) {
      group.push(atom(random(), random(), color));
      atoms.push(group&#91;i]);
    }
    return group;
  };
  rule = (atoms1, atoms2, g) =&gt; {
    for (let i = 0; i &lt; atoms1.length; i++) {
      fx = 0;
      fy = 0;
      for (let j = 0; j &lt; atoms2.length; j++) {
        a = atoms1&#91;i];
        b = atoms2&#91;j];
        dx = a.x - b.x;
        dy = a.y - b.y;
        d = Math.sqrt(dx * dx + dy * dy);
        if (d &gt; 0 &amp;&amp; d &lt; 80) {
          F = (g * 1) / d;
          fx += F * dx;
          fy += F * dy;
        }
      }
      a.vx = (a.vx + fx) * 0.5;
      a.vy = (a.vy + fy) * 0.5;
      a.x += a.vx;
      a.y += a.vy;
      if (a.x &lt;= 0 || a.x &gt;= 500) { a.vx *= -1; }
      if (a.y &lt;= 0 || a.y &gt;= 500) { a.vy *= -1; }
    }
  };
  yellow = create(200, "yellow");
  red = create(200, "red");
  green = create(200, "green");
  update = () =&gt; {
    rule(green, green, -0.32);
    rule(green, red, -0.17);
    rule(green, yellow, 0.34);
    rule(red, red, -0.1);
    rule(red, green, -0.34);
    rule(yellow, yellow, 0.15);
    rule(yellow, green, -0.2);
    m.clearRect(0, 0, 500, 500);
    draw(0, 0, "black", 500);
    for (i = 0; i &lt; atoms.length; i++) {
      draw(atoms&#91;i].x, atoms&#91;i].y, atoms&#91;i].color, 5);
    }
    requestAnimationFrame(update);
  };
  update();
&lt;/script&gt;</code></pre>



<p>This project was inspired by Jeffery Ventrella&#8217;s Clusters <a href="http://www.ventrella.com/Clusters/" target="_blank" rel="noreferrer noopener">http://www.ventrella.com/Clusters/</a>. I don&#8217;t have access to Ventrella&#8217;s code, but I believe the main difference between this project and other particle-life projects is that I didn&#8217;t implement collision detection, which made simulating thousands of particles possible in real time. </p>



<p>I also added GUI controls to change the parameters in real time, which allows easy fine-tuning &amp; exploration; as a result, I was able to find some never-before-seen patterns emerge from some extremely simple models of relations. The code here is probably an order of magnitude simpler than other Artificial Life code out there, because I started it solely as educational material for non-programmers and a general audience, to prove the point that complexity can arise from simplicity.</p>



<p>Another famous example of complexity arising from simplicity is the Mandelbrot set:</p>



<figure class="wp-block-embed is-provider-youtube wp-block-embed-youtube"><div class="wp-block-embed__wrapper">
<iframe title="How to Create a Universe with Notepad" width="720" height="405" src="https://www.youtube.com/embed/mzizK6ms-gY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div></figure>



<p>Related topics: <a href="https://www.youtube.com/hashtag/artificial">#artificial</a> <a href="https://www.youtube.com/hashtag/game">#game</a> <a href="https://www.youtube.com/hashtag/simulator">#simulator</a>, Particle Life Simulation, Primordial Soup &#8211; Evolution, Conway&#8217;s game of life, Cellular automata, Fractals, Self organizing patterns, Mandelbrot, JavaScript programming.</p><p>The post <a href="https://www.brainxyz.com/machine-learning/artificial-life/">Artificial Life Simulation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/machine-learning/artificial-life/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2266</post-id>	</item>
		<item>
		<title>Simulation: Life as a Survival Optimization Problem</title>
		<link>https://www.brainxyz.com/machine-learning/maxima/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=maxima</link>
					<comments>https://www.brainxyz.com/machine-learning/maxima/#respond</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Sat, 25 Sep 2021 19:22:49 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[evolution]]></category>
		<category><![CDATA[genetic algorithm]]></category>
		<category><![CDATA[local maxima]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=2210</guid>

					<description><![CDATA[<p>As someone who came to the Machine Learning world from a medical background, I couldn&#8217;t help relating being stuck at a Local Maximum to other life situations. So I decided to make a simulation project that helps visualize this problem from a biological and also a political perspective, where liberals and conservatives...</p>
<p>The post <a href="https://www.brainxyz.com/machine-learning/maxima/">Simulation: Life as a Survival Optimization Problem</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>As someone who came to the Machine Learning world from a medical background, I couldn&#8217;t help relating <strong>being stuck at a Local Maximum</strong> to other life situations. So I decided to make a simulation project that helps visualize this problem from a biological and also a political perspective, where liberals and conservatives compete to reach the global maximum (links at the end).<br></p>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2021/09/game_extinction1.gif" alt="" class="wp-image-2212" width="701" height="395"/><figcaption>Increasing External Threats Gradually to aid reaching Global Maximum</figcaption></figure></div>



<p>The geographical representation is a very nice way to visualize and relate these topics. For instance, internal struggle and existential threats can be seen as means of helping reach the Global Maximum. Also, sexual reproduction can be seen as a way to share information between two Local Maxima, which can help reduce the search space.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after-1024x576.png" alt="" class="wp-image-2214" width="702" height="394" srcset="https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after.png 1024w, https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after.png 300w, https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after.png 768w, https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after.png 1536w, https://www.brainxyz.com/wp-content/uploads/2021/09/life_before_after.png 1920w" sizes="(max-width: 702px) 100vw, 702px" /></figure></div>



<p class="has-text-align-left">One can even relate this problem to politics! For instance, you can think of a conservative as someone with a small adaptation step and a liberal as someone with a large adaptation step. Political systems like anarchy can be seen as a situation where no one sticks to any Local Maximum, whereas fascism can be seen as everyone sticking to a single Local Maximum.</p>
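<p>The step-size analogy can be made concrete with a toy hill climber. The sketch below is mine, not the project's simulation code; the landscape and the step sizes are arbitrary illustrations. A climber with small steps stays trapped on the local peak it starts on, while one with large steps can jump the valley to the taller peak:</p>

```python
import numpy as np

def hill_climb(f, x0, step, iters=500, seed=0):
    """Greedy hill climbing: accept a random perturbation only when it improves f."""
    rng = np.random.default_rng(seed)
    x, best = x0, f(x0)
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        if fc > best:          # keep the move only if it climbs
            x, best = cand, fc
    return x, best

# Toy landscape: a local peak of height ~1 at x = -1, a global peak of height ~2 at x = 3.
def f(x):
    return np.exp(-(x + 1) ** 2) + 2 * np.exp(-(x - 3) ** 2)

_, conservative = hill_climb(f, x0=-1.0, step=0.1)  # small steps: stuck on the local peak
_, liberal = hill_climb(f, x0=-1.0, step=5.0)       # large steps: can jump the valley
```

<p>The small-step climber cannot cross the valley because every intermediate point is worse than where it stands, while the large-step climber occasionally lands near the taller peak and keeps that gain.</p>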



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.brainxyz.com/wp-content/uploads/2021/09/lib_v_cons.gif" alt="" class="wp-image-2211" width="712" height="402"/><figcaption>Liberal vs. Conservative (Simulation)</figcaption></figure></div>



<p>We know that DNA-based life replicates and reproduces exponentially, but because of resource limitations it hits a ceiling, and through internal struggle it optimizes itself to survive better; hence, survival of the fittest becomes the objective of life.</p>



<p><strong>An illustrative video description of the simulations is here:</strong></p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Life Is Complicated by Local Maxima (Simulation)" width="720" height="405" src="https://www.youtube.com/embed/1p11-oggW1E?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div></figure>



<p>The purpose of this simulation is not to present state-of-the-art algorithms but to bridge the terms used in the Machine Learning world with those used in the biological and political worlds, through these toy simulations and easily digestible videos aimed at both audiences. I believe the Machine Learning world has much to offer the other fields, because life in essence is a survival optimization problem in which everything is complicated by being stuck at a Local Maximum.</p>



<p><strong>Here you can play with the simulation:</strong> https://simmer.io/@hunar/reaching-global-maxima</p>



<p><strong>Simulation Code:</strong> <a href="https://github.com/hunar4321/Reaching-global-maximum" rel="noreferrer noopener" target="_blank">https://github.com/hunar4321/Reaching-global-maximum</a></p>



<p><strong>Important Note:</strong><br>Some experts say that Local Maxima don&#8217;t matter in very high-dimensional landscapes. This is true if convergence speed doesn&#8217;t matter and if all dimensions have equal weights. However, we know that is not the case: convergence speed always matters in a competitive world like ours, since all life forms are in a tough race for survival. Also, not all dimensions carry the same weight; many can be ignored, are already pruned, or are not accessible. Therefore, the actual number of plausible dimensions is much smaller than the number of available dimensions.</p>



<p><strong>Related topics: </strong></p>



<p>Other related algorithms: the NEAT algorithm, Hill Climbing, Particle Swarm Optimization, etc. </p>



<p>Other optimization techniques that use calculus: Gradient Descent, Adam, Recursive Least Squares, etc. </p>



<p>Other related terms: Local minimum &amp; Global minimum (when minimizing the error) </p>



<p>Deep Learning techniques are good at avoiding getting stuck at a local maximum because they use many layers and lots of data. </p>



<p>Other learning methods: Hebbian Learning, Winner-take-all (WTA), etc. We at Brainxyz use Predictive Hebbian Unified Neurons (PHUN).</p><p>The post <a href="https://www.brainxyz.com/machine-learning/maxima/">Simulation: Life as a Survival Optimization Problem</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/machine-learning/maxima/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2210</post-id>	</item>
		<item>
		<title>Artificial Neural Networks &#124; Interpolation vs. Extrapolation</title>
		<link>https://www.brainxyz.com/machine-learning/ann/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ann</link>
					<comments>https://www.brainxyz.com/machine-learning/ann/#comments</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Mon, 07 Sep 2020 22:03:59 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[GPT-3]]></category>
		<category><![CDATA[neural network]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=1579</guid>

					<description><![CDATA[<p>Artificial Neural Networks (ANNs) are powerful inference tools. They can be trained to fit complex functions and then used to predict new (unseen) data outside their training set. Fitting the training data is relatively easy for ANNs because of their Universal Approximation capability. However, that does not mean ANNs can learn the rules as we...</p>
<p>The post <a href="https://www.brainxyz.com/machine-learning/ann/">Artificial Neural Networks | Interpolation vs. Extrapolation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Artificial Neural Networks (ANNs) are powerful inference tools. They can be trained to fit complex functions and then used to predict new (unseen) data outside their training set. Fitting the training data is relatively easy for ANNs because of their <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" target="_blank" rel="noreferrer noopener nofollow">Universal Approximation</a> capability. However, that does not mean ANNs can learn the rules as we humans do. Here we aim to show how well a trained ANN, which fits its training data accurately, can generalize to new and unseen data. We categorize the unseen data into two types:</p>



<ol class="wp-block-list"><li>Data points within the training range (that can be interpolated). </li><li>Data points outside the training range (that can be extrapolated). </li></ol>



<p></p>



<p>Take this example: <strong>|</strong> 1-&gt;1 <strong>|</strong> 2-&gt;4 <strong>|</strong> 3-&gt;<strong>?</strong> <strong>|</strong> 4-&gt;16 <strong>|</strong> 5-&gt;25 <strong>|</strong> 6-&gt;<strong>?</strong> <strong>|</strong> &#8230;</p>



<p>If we ask a human to predict 3-&gt;? and 6-&gt;? from the sequence above, most humans can answer correctly once they discover the rule that fits the set, in this example: <strong>y = X<sup>2</sup></strong>. The process of predicting 3-&gt;9 is called interpolation, while the process of predicting 6-&gt;36 is called extrapolation. In this specific example, the distinction between interpolation and extrapolation is not important for humans, because once they find the power rule they can apply it to any other number in the sequence. Even if we jump to 100-&gt;?, the answer is straightforward (100<sup>2</sup>). Nevertheless, the distinction is quite important for ANNs, and it offers some insight into the difference between learning in humans and learning in ANNs.</p>



<p>Here we demonstrate this difference by implementing a simple yet powerful ANN architecture with a single hidden layer and examining its generalization capability for various parameter settings. We randomly assign and freeze the weights in the first layer and only train the weights in the final layer using a closed-form solution. This method is popularized under the name <a rel="noreferrer noopener nofollow" href="https://en.wikipedia.org/wiki/Extreme_learning_machine" target="_blank">Extreme Learning Machine (ELM)</a> and has some controversial origins. In our experience, this method trains quickly and gives very accurate results for low-dimensional datasets with shallow structure. We use Matlab for the demonstration because its syntax is close to linear algebra and we can implement the ANN from scratch in just a few lines of code. First, we start with the hello world of Machine Learning: training an ANN to solve the XOR problem. Below is the complete Matlab code:</p>



<pre class="wp-block-code"><code>X = &#91; 0, 0; 1, 1; 0, 1; 1, 0 ];          % Xor data
y = &#91; 0, 0, 1, 1 ];                      % targets

input = 2; neurons = 5;                  % parameters.
Wx = randn(input, neurons)*0.01;         % small random input-hidden weights (std 0.01)
z = tanh(X * Wx);                        % 1st-Layer forward activation (tanh)
Wo = y * pinv(z');                       % Training output weights (closed form solution)
predictions = tanh(X * Wx) * Wo';        % Feedforward propagation | inference

disp(predictions)                        % display the predicted data </code></pre>



<p>Believe it or not, the above is all you need to construct and train a single-hidden-layer ANN (no external libraries required). You can achieve comparable conciseness with Python+NumPy ( <a rel="noreferrer noopener" href="https://github.com/hunar4321/Simple_ANN/blob/master/ELM.py" target="_blank">https://github.com/hunar4321/Simple_ANN/blob/master/ELM.py</a> ). If you run this code in your IDE, it will output predictions very close to the targets, i.e., [0, 0, 1, 1]. What is more, the same lines of code can approximate any function, provided there are enough neurons in the hidden layer. It should be noted, though, that for very large inputs the inverse operation of the closed-form solution is computationally expensive, and on very noisy datasets over-fitting is the usual enemy.</p>
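<p>For readers without Matlab, the construction might look like this in NumPy. This is a sketch mirroring the Matlab listing above, not the linked repository's exact code; one deliberate deviation is using unit-scale random weights instead of 0.01, which keeps the pseudo-inverse well conditioned:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)  # XOR inputs
y = np.array([0, 0, 1, 1], dtype=float)                      # targets

neurons = 10
Wx = rng.standard_normal((2, neurons))   # frozen random input-hidden weights
z = np.tanh(X @ Wx)                      # first-layer forward activation (tanh)
Wo = y @ np.linalg.pinv(z.T)             # closed-form solution for output weights
predictions = np.tanh(X @ Wx) @ Wo       # feedforward inference
```

<p>Only <code>Wo</code> is ever "trained"; the hidden layer is a fixed random feature map, exactly as in the Matlab version.</p>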



<p>Returning to our goal of examining how well such trained networks interpolate and extrapolate unseen data: the XOR example is not useful here because all of the input data is used for training. A better example is to use these networks to solve the power rule <strong>y = X<sup>2</sup></strong> (our first example). Following is the complete Matlab code for the next example:</p>



<pre class="wp-block-code"><code>step = 2;                          % step size i.e the gap between the training sequence

X = &#91;2:step:100]';                 % training data (even numbers) % 2, 4, 6,...
y = X.^2;                          % train targets

inp = 1; neurons = 100; 
Wx = randn(inp, neurons)*0.01;
z = tanh(X * Wx);
Wo = y' * pinv(z');
yhat = tanh(X * Wx) * Wo';

Xt = &#91;1:step:121]';               % testing data (odd numbers) % 1, 3, 5,...
yt = Xt.^2;                       % test targets

prediction = tanh(Xt * Wx) * Wo'; % inference

% visualizations
figure; hold on; plot(X, y,'og'); plot(X, yhat, '*r'); hold off; 
legend('target', 'prediction'); title('y = X^2')
figure; hold on; plot(Xt, yt,'og'); plot(Xt, prediction, '*r'); hold off; 
legend('target', 'prediction'); title('y = X^2')
xticks = 1:10:121;
set(gca,'XTick',xticks)</code></pre>



<p>The inputs for training are the even numbers from 2 to 100, and their squares are the target outputs. As you can see from Figure 1, the network learned to fit and approximate the training data very well. </p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center" style="grid-template-columns:55% auto"><figure class="wp-block-media-text__media"><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/training.png" alt="" class="wp-image-1658" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/training.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/training.png 300w" sizes="(max-width: 560px) 100vw, 560px" /></figure><div class="wp-block-media-text__content">
<p class="has-text-align-center"></p>



<figure class="wp-block-table"><table><tbody><tr><td class="has-text-align-center" data-align="center">Input</td><td class="has-text-align-center" data-align="center">Prediction</td><td class="has-text-align-center" data-align="center">Target</td></tr><tr><td class="has-text-align-center" data-align="center">10</td><td class="has-text-align-center" data-align="center">99.23210</td><td class="has-text-align-center" data-align="center">100</td></tr><tr><td class="has-text-align-center" data-align="center">12</td><td class="has-text-align-center" data-align="center">144.1864</td><td class="has-text-align-center" data-align="center">144</td></tr><tr><td class="has-text-align-center" data-align="center">14</td><td class="has-text-align-center" data-align="center">196.7875</td><td class="has-text-align-center" data-align="center">196</td></tr><tr><td class="has-text-align-center" data-align="center">16</td><td class="has-text-align-center" data-align="center">256.7168</td><td class="has-text-align-center" data-align="center">256</td></tr><tr><td class="has-text-align-center" data-align="center">18</td><td class="has-text-align-center" data-align="center">324.1782</td><td class="has-text-align-center" data-align="center">324</td></tr></tbody></table><figcaption>Table 1: Few examples of training inference</figcaption></figure>
</div></div>



<p class="has-text-align-left">Figure 1: Fitting y = X<sup>2</sup> | training-set</p>
</div></div>



<p>We tested the same network, after training on the even numbers, on predicting the unseen odd numbers from 1 to 121. Figure 2 shows the network&#8217;s ability to generalize to the unseen data. As you can note, the ANN predicts the odd numbers within the training range very well but cannot extrapolate beyond it: once the input passes 100, the predictions go wrong.</p>
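<p>The same experiment is easy to reproduce and quantify in NumPy. This sketch follows the parameters of the Matlab listing; the explicit split of the test error into inside- versus outside-the-training-range is my own addition for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(2, 101, 2, dtype=float)[:, None]   # training inputs: even numbers 2..100
y = (X ** 2).ravel()                             # training targets

neurons = 100
Wx = rng.standard_normal((1, neurons)) * 0.01    # frozen random input-hidden weights
z = np.tanh(X @ Wx)
Wo = y @ np.linalg.pinv(z.T)                     # closed-form output weights

Xt = np.arange(1, 122, 2, dtype=float)[:, None]  # test inputs: odd numbers 1..121
pred = np.tanh(Xt @ Wx) @ Wo
yt = (Xt ** 2).ravel()

inside = Xt.ravel() < 100                        # odd numbers within the training range
err_in = np.abs(pred[inside] - yt[inside]).mean()     # interpolation error
err_out = np.abs(pred[~inside] - yt[~inside]).mean()  # extrapolation error
```

<p>Running this shows the pattern described above: the mean error on the interpolated odd numbers stays small, while the mean error beyond 100 is far larger, because the saturated tanh features cannot track the growth of x<sup>2</sup> outside the training range.</p>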



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center" style="grid-template-columns:54% auto"><figure class="wp-block-media-text__media"><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png" alt="" class="wp-image-1659" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/testing.png 300w" sizes="(max-width: 560px) 100vw, 560px" /></figure><div class="wp-block-media-text__content">
<p class="has-text-align-center"></p>



<figure class="wp-block-table"><table><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Input</strong></td><td class="has-text-align-center" data-align="center"><strong>Prediction</strong></td><td class="has-text-align-center" data-align="center"><strong>Target</strong></td></tr><tr><td class="has-text-align-center" data-align="center">11</td><td class="has-text-align-center" data-align="center">120.7115</td><td class="has-text-align-center" data-align="center">121</td></tr><tr><td class="has-text-align-center" data-align="center">13</td><td class="has-text-align-center" data-align="center">169.5647</td><td class="has-text-align-center" data-align="center">169</td></tr><tr><td class="has-text-align-center" data-align="center">15</td><td class="has-text-align-center" data-align="center">225.8335</td><td class="has-text-align-center" data-align="center">225</td></tr><tr><td class="has-text-align-center" data-align="center">17</td><td class="has-text-align-center" data-align="center">289.4791</td><td class="has-text-align-center" data-align="center">289</td></tr><tr><td class="has-text-align-center" data-align="center">19</td><td class="has-text-align-center" data-align="center">360.8756</td><td class="has-text-align-center" data-align="center">361</td></tr></tbody></table><figcaption>Table 2: Few examples of testing inference</figcaption></figure>
</div></div>



<p class="has-text-align-left">Figure 2: Fitting y = X<sup>2</sup> | test-set</p>
</div></div>



<p>If we increase the number of hidden neurons from 1,000 to 100,000 (Figure 3), we can see a steady improvement in the ANN&#8217;s ability to predict the unseen odd numbers.</p>



<figure class="wp-block-gallery columns-3 is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png" alt="" data-id="1664" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png" data-link="https://www.brainxyz.com/?attachment_id=1664" class="wp-image-1664" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/1000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">neurons = 1000</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png" alt="" data-id="1663" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png" data-link="https://www.brainxyz.com/?attachment_id=1663" class="wp-image-1663" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/10000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">neurons = 10,000</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png" alt="" data-id="1662" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png" data-link="https://www.brainxyz.com/?attachment_id=1662" class="wp-image-1662" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/100000N.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption 
class="blocks-gallery-item__caption">neurons = 100,000</figcaption></figure></li></ul><figcaption class="blocks-gallery-caption">Figure 3: shows the effect of increasing the size of the hidden layer on ANN&#8217;s interpolation and extrapolation capabilities</figcaption></figure>



<p>Although increasing the number of hidden neurons improves the ANN&#8217;s capability for both interpolation and extrapolation, the network still fails to extrapolate values that are far from the training range. This suggests that the &#8216;power rule&#8217; cannot be modeled by an ANN no matter how large the hidden layer is. This is understandable because the weighted summation of the inputs (i.e. the dot product) cannot model multiplication among the inputs; it can only approximate it within a specific range. </p>
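<p>The setup described above can be sketched roughly as follows. This is a minimal sketch, not the exact script behind the figures; the layer size, learning rate, input scaling, and iteration count are illustrative assumptions:</p>

```python
import numpy as np

# Minimal sketch (illustrative hyperparameters): fit y = x^2 on even
# numbers, then probe interpolation (odd numbers in range) and
# extrapolation (inputs far outside the training range).
rng = np.random.default_rng(0)

Xtr = np.arange(2, 21, 2, dtype=float).reshape(-1, 1)  # training set: evens
ytr = Xtr ** 2
xs, ys = Xtr / 20.0, ytr / 400.0  # scale to ~[0, 1] for stable training

H = 100  # hidden neurons
w1 = rng.normal(0, 1, (1, H)); b1 = np.zeros(H)
w2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)
lr = 0.05

for _ in range(5000):  # plain gradient descent on mean squared error
    z = np.tanh(xs @ w1 + b1)
    yh = z @ w2 + b2
    e = yh - ys
    w2g = z.T @ e / len(xs); b2g = e.mean(0)
    dz = (e @ w2.T) * (1 - z ** 2)
    w1g = xs.T @ dz / len(xs); b1g = dz.mean(0)
    w2 -= lr * w2g; b2 -= lr * b2g
    w1 -= lr * w1g; b1 -= lr * b1g

def predict(x):
    z = np.tanh((np.asarray(x, dtype=float).reshape(-1, 1) / 20.0) @ w1 + b1)
    return ((z @ w2 + b2) * 400.0).ravel()

print(predict([11]))  # interpolation: close to 121
print(predict([40]))  # extrapolation: far from 1600, since tanh saturates
```

<p>Because the hidden units saturate outside the training range, the extrapolation error at x = 40 dwarfs the interpolation error at x = 11, mirroring the behavior in the figures.</p>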



<p>It is also worth noting that the interpolation ability of ANNs remains very good even if we increase the step size (i.e. the gap) between the training-set numbers (Figure 4).</p>



<figure class="wp-block-gallery columns-3 is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex"><ul class="blocks-gallery-grid"><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png" alt="" data-id="1674" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png" data-link="https://www.brainxyz.com/?attachment_id=1674" class="wp-image-1674" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap4.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 4</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png" alt="" data-id="1673" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png" data-link="https://www.brainxyz.com/?attachment_id=1673" class="wp-image-1673" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap8.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 8</figcaption></figure></li><li class="blocks-gallery-item"><figure><img loading="lazy" decoding="async" width="560" height="420" src="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png" alt="" data-id="1672" data-full-url="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png" data-link="https://www.brainxyz.com/?attachment_id=1672" class="wp-image-1672" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png 560w, https://www.brainxyz.com/wp-content/uploads/2020/09/gap12.png 300w" sizes="(max-width: 560px) 100vw, 560px" /><figcaption class="blocks-gallery-item__caption">step = 
12</figcaption></figure></li></ul><figcaption class="blocks-gallery-caption">Figure 4: shows the effect of increasing the size of the step (gap) in-between the training data on ANN&#8217;s interpolation capability</figcaption></figure>



<h6 class="wp-block-heading"><strong>We can draw the following conclusions from our results above:</strong></h6>



<ul class="wp-block-list"><li>ANNs can fit the training set of our non-linear function, <strong>y = X<sup>2</sup></strong>, very well. </li><li>ANNs can fit the testing set of the function above provided that the test data are within the range of the training set, i.e. ANNs are good at interpolation.</li><li>ANNs can interpolate the unseen data well even if the gaps between the training data are big.</li><li>ANNs are bad at fitting test data that are far outside the range of the training set.</li><li>Increasing the number of neurons in the hidden layer improves both the interpolation and extrapolation capabilities of ANNs; however, the power rule cannot be modeled no matter how large the hidden layer is. The power rule can only be approximated within a specific range.</li></ul>



<p></p>



<p>The last point is an important distinction between how ANNs learn to generalize and how we humans learn to generalize. However, the way we present the data to ANNs might matter. For example, in deep learning, the data are usually presented as one-hot encodings, which treat the samples as discrete categories instead of continuous values. Some new deep learning architectures like Transformers (e.g. GPT-3) have been shown to produce coherent text and generate compelling answers to questions, suggesting there might be some rule-learning capability. To our knowledge, GPT-3 was also bad at learning &#8216;multiplication&#8217; (<a rel="noreferrer noopener nofollow" href="https://arxiv.org/abs/2005.14165" target="_blank">GPT-3 paper / Figure 3.10</a>) but good at learning &#8216;addition&#8217;, and this sort of behavior closely resembles the behavior of a typical feed-forward ANN such as the one in our example.</p>



<p>Another interesting factor about ANNs is that even though we increased the number of neurons beyond the number of samples, our network did not suffer from the bad effects of over-fitting. Over-fitting is usually problematic when we have a non-representative training dataset. This typically happens when there is a lot of variance in the data due to external noise. By external noise, we mean the non-interesting variance that is not part of the data structure itself. People who work with time-series signals that have a low signal-to-noise ratio (SNR), e.g. EEG and fMRI, usually prefer simpler Machine Learning methods such as SVMs and Ridge Regression because over-parameterized ANNs are usually driven toward fitting the external noise instead of the signal (which is a very bad side effect of over-fitting).</p>
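<p>As a rough illustration of why simpler methods are preferred in low-SNR settings, ridge regression has a closed-form solution whose penalty term explicitly shrinks the weights and tames noise fitting. The data, dimensions, and penalty below are made up for illustration:</p>

```python
import numpy as np

# Illustrative sketch: ridge regression in closed form on a noisy target.
# The penalty lam shrinks the weights, which discourages fitting the noise.
rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=2.0, size=n)  # low-SNR observations

lam = 1.0  # ridge penalty (illustrative value)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)  # recovers weights close to w_true despite heavy noise
```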



<p>In contrast, in the deep learning world, over-parameterized ANNs seem to be the norm, especially for those working in the vision and NLP fields, because their training datasets are usually large, representative, and clean. Since the training datasets in those fields are large and include a wide range of structural variation (i.e. the training datasets are representative of the testing datasets), it is no surprise that deep learning networks like CNNs, GANs, and Transformers are capable of learning to classify and generate new variations of interpolatable unseen data. This is similar to our example, where the ANN was able to successfully predict the unseen odd numbers that fell within the range of the training set. </p>



<p>In conclusion, we attribute the success of ANNs in vision and NLP fields to their good interpolation capability, especially when they are fed with large and representative training datasets. We also argue that rule learning and extrapolation beyond the training range are ANNs&#8217; weak points.</p>



<p>At Brainxyz, our focus is on learning algorithms that are capable of learning and generalizing in a controllable manner. We also aim for biological plausibility and efficiency. In future articles, we will test and compare PHUN, our in-progress novel ML algorithm, against ANNs and other ML methods. Please stay tuned. </p>



<h6 class="wp-block-heading"><strong>Abbreviations:</strong></h6>



<ul class="wp-block-list"><li>ML: Machine Learning</li><li>ANNs: Artificial Neural Networks</li><li>CNNs: Convolutional Neural Networks</li><li>SVMs: Support Vector Machines</li><li>GANs: Generative Adversarial Networks</li><li>GPT-3: Generative Pre-trained Transformer 3</li><li>NLP: Natural Language Processing</li><li>PHUN: Predictive Hebbian Unified Neurons</li><li>fMRI: Functional Magnetic Resonance Imaging</li><li>EEG: ElectroEncephaloGram</li></ul>



<p>The post <a href="https://www.brainxyz.com/machine-learning/ann/">Artificial Neural Networks | Interpolation vs. Extrapolation</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/machine-learning/ann/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1579</post-id>	</item>
		<item>
		<title>Genetic Algorithm vs. Stochastic Gradient Descent</title>
		<link>https://www.brainxyz.com/machine-learning/genetic-algorthim/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=genetic-algorthim</link>
					<comments>https://www.brainxyz.com/machine-learning/genetic-algorthim/#respond</comments>
		
		<dc:creator><![CDATA[Brainxyz]]></dc:creator>
		<pubDate>Tue, 01 Sep 2020 21:48:56 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[backpropagation]]></category>
		<category><![CDATA[genetic algorithm]]></category>
		<category><![CDATA[gradient descent]]></category>
		<category><![CDATA[neural network]]></category>
		<guid isPermaLink="false">https://www.brainxyz.com/?p=1401</guid>

					<description><![CDATA[<p>Genetic Algorithm (GA) and Stochastic Gradient Descent (SGD) are well-known optimization methods and are used for learning in Neural Networks. There are various implementations of GA, however, most of them (e.g. Neat) are not directly comparable to SGD because these GA methods use point/localized mutations in their connections/weights. Geoffrey Hinton, in one of his videos...</p>
<p>The post <a href="https://www.brainxyz.com/machine-learning/genetic-algorthim/">Genetic Algorithm vs. Stochastic Gradient Descent</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p></p>



<p>Genetic Algorithm (GA) and Stochastic Gradient Descent (SGD) are well-known optimization methods and are used for learning in Neural Networks. There are various implementations of GA; however, most of them (e.g. NEAT) are not directly comparable to SGD because these GA methods use point/localized mutations of their connections/weights. Geoffrey Hinton, in one of his videos (Lecture 3.4), mentioned that GA randomly perturbs one weight at a time, which makes GA very inefficient compared to backpropagation. In this article, we show that GA can learn efficiently if we mutate all the weights simultaneously, provided that we choose a suitable mutation rate (just as SGD requires a suitable learning rate to work properly). </p>



<p>We cannot fairly compare learning methods that use localized mutations to SGD because in SGD all the weights within each layer are updated simultaneously, making it more efficient. Also, a fair comparison between GA and SGD should use identical Neural Network architectures where the only difference is in the optimization step. Hence, we present a conceptually simple GA implementation that is closely comparable to SGD. In this implementation, instead of making a point or localized mutation, we mutate all the network weights at once by adding a small random value to each weight. To implement this method, follow these simple steps:</p>



<ol class="wp-block-list">
<li>Implement a feed-forward Neural Network (just like you do for an SGD based Neural Network).</li>



<li>In the weight update step, mutate all the weights by adding a small random value to each weight.</li>



<li>Do feed-forward propagation twice, one with the mutated weights (child weights) and another with the original weights (parent weights).</li>



<li>Compare the performance of the parent weights to the child weights and keep the better ones for the next generation cycle.</li>



<li>Repeat steps 2 to 4 until you reach convergence.</li>
</ol>



<p></p>



<p>A complete Python implementation of the above algorithm is less than a page long. You can find it at the end of this article as a complete example that solves the XOR problem. This implementation is not a canonical GA, as it does not feature terms like chromosomes, crossover, etc. Conceptually, however, it is faithful to the evolutionary principles and strips GA down to its bare minimum.</p>



<p>The following images are comparisons between GA and SGD. We chose a function that has an interesting error map with a narrow groove toward the global minimum to showcase a more interesting comparison. As you can see (Figure 1-A), GA follows the path toward the global minimum like a drunken man, because GA is a stochastic method. What makes GA capable of reaching the target is the selection step we described in point 4 above. The selected (good) mutations are shown in Figure 1-B. In contrast, SGD-based optimization follows a strict rule: it steps in the direction of steepest descent even if that step is not toward the global minimum. This, in some cases, makes learning slow or even unstable if we do not choose the learning rate carefully (Figure 2).</p>
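<p>The contrast can be reproduced on any toy error surface. The quadratic below is an assumed stand-in for the surface in the figures, with a valley that is much steeper along one axis:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # toy error surface with a narrow valley along one axis (illustrative)
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] - 2.0) ** 2

# GA-style search: mutate both weights at once, keep the child if better
w_ga = np.zeros(2)
for _ in range(2000):
    child = w_ga + rng.uniform(-1, 1, 2) * 0.1  # mutation rate 0.1
    if loss(child) < loss(w_ga):
        w_ga = child

# plain gradient descent on the same surface
w_gd = np.zeros(2)
for _ in range(2000):
    grad = np.array([2.0 * (w_gd[0] - 3.0), 20.0 * (w_gd[1] - 2.0)])
    w_gd -= 0.01 * grad

print(w_ga, w_gd)  # both end up near the minimum at (3, 2)
```

<p>The GA trace wanders stochastically but still reaches the minimum thanks to selection, while gradient descent takes deterministic steps whose stability depends on the learning rate.</p>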



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-3 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1000" height="1000" data-id="1458" src="https://www.brainxyz.com/wp-content/uploads/2020/08/ga1.png" alt="" class="wp-image-1458" srcset="https://www.brainxyz.com/wp-content/uploads/2020/08/ga1.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga1.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga1.png 150w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga1.png 768w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A) GA &#8211; All mutations are shown</figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1000" height="1000" data-id="1404" src="https://www.brainxyz.com/wp-content/uploads/2020/08/ga2.png" alt="" class="wp-image-1404" srcset="https://www.brainxyz.com/wp-content/uploads/2020/08/ga2.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga2.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga2.png 150w, https://www.brainxyz.com/wp-content/uploads/2020/08/ga2.png 768w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">B) GA &#8211; Only good mutations are shown</figcaption></figure>
<figcaption class="blocks-gallery-caption wp-element-caption">Figure 1: Shows the steps of GA toward the global error minimum (deep blue area)</figcaption></figure>



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-4 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1000" height="1000" data-id="1459" src="https://www.brainxyz.com/wp-content/uploads/2020/08/sgd_small.png" alt="" class="wp-image-1459" srcset="https://www.brainxyz.com/wp-content/uploads/2020/08/sgd_small.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd_small.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd_small.png 150w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd_small.png 768w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A) SGD &#8211; learning rate: 0.001</figcaption></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1000" height="1000" data-id="1406" src="https://www.brainxyz.com/wp-content/uploads/2020/08/sgd.png" alt="" class="wp-image-1406" srcset="https://www.brainxyz.com/wp-content/uploads/2020/08/sgd.png 1000w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd.png 300w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd.png 150w, https://www.brainxyz.com/wp-content/uploads/2020/08/sgd.png 768w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">B) SGD &#8211; learning rate 0.003</figcaption></figure>
<figcaption class="blocks-gallery-caption wp-element-caption">Figure 2: Shows SGD steps toward the global minimum and its sensitivity to the learning rate, which makes SGD follow a zigzag path. In this case, learning rates above 0.005 fail to converge (they diverge instead)</figcaption></figure>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p>In general, GA-based methods can reach the minima in a similar number of steps to a naive SGD, provided that we choose a suitable mutation rate. </p>



<p>Another interesting point about the GA method presented here is its simplicity and generality. In fact, the above method can be used for reinforcement learning with minimal modification, unlike SGD-based methods, where reinforcement learning requires conceptually different learning strategies and implementations. Here we used GA to balance OpenAI&#8217;s gym Cartpole in 60 generations (Figure 3). The only difference this time is that we select the winning parent/child network at the end of each episode. The complete code can be found at the end of this article. Please note that these GA methods are sensitive to the initial weights: some initializations converge faster than others, and some even fail to converge (this is true for SGD-based learning too).</p>
</div></div>
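<p>The episode-level selection can be sketched on a toy control task. The environment below is a made-up linear stand-in, not the gym Cartpole, so the sketch stays self-contained; the policy size, mutation rate, and generation count are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(w):
    # toy deterministic environment: keep the drifting state s near zero
    s, ret = 0.5, 0.0
    for _ in range(50):
        a = np.tanh(w[0] * s + w[1])  # action from a tiny linear policy
        s = 0.9 * s + 0.2 * a         # the action steers the state
        ret -= s * s                  # reward favors staying near zero
    return ret

w = np.array([0.5, 0.5])              # a deliberately bad initial policy
for _ in range(300):                  # one parent/child selection per episode
    child = w + rng.uniform(-1, 1, 2) * 0.1
    if episode_return(child) > episode_return(w):
        w = child                     # keep the winner of this episode

print(episode_return(w))  # much better than the initial policy
```

<p>Note that the learning loop is identical to the supervised case; only the fitness signal changed from a prediction error to an episode return.</p>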



<p></p>



<figure class="wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-5 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="600" height="400" data-id="1462" src="https://www.brainxyz.com/wp-content/uploads/2020/08/after-learning-1.gif" alt="" class="wp-image-1462"/></figure>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="640" height="476" data-id="1463" src="https://www.brainxyz.com/wp-content/uploads/2020/08/GA_performance2.png" alt="" class="wp-image-1463" srcset="https://www.brainxyz.com/wp-content/uploads/2020/08/GA_performance2.png 640w, https://www.brainxyz.com/wp-content/uploads/2020/08/GA_performance2.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></figure>
<figcaption class="blocks-gallery-caption wp-element-caption">Figure 3: Open AI&#8217;s gym Cartpole balanced by a GA trained Neural Network. </figcaption></figure>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p></p>
</div></div>



<h6 class="wp-block-heading">Advantages of GA:</h6>



<ul class="wp-block-list">
<li>In a single iteration, the GA method described here needs two feed-forward propagations, while the SGD method needs a feed-forward plus a feed-backward propagation, which is computationally more demanding.</li>



<li>Although we admit that a carefully optimized SGD method can reach the minima in fewer iterations, GA is more parallelizable. Both GA and SGD are embarrassingly parallel in terms of matrix multiplications within each layer. However, with GA you can breed more than one child in parallel and select the best one. This can ultimately make GA reach the minima faster than SGD, given enough computational power to breed a massive number of children in parallel. </li>



<li>GA is a general-purpose learning method. It can be used for reinforcement learning, supervised learning, and feedback learning, all with minimal modification in its implementation.</li>



<li>The weight update rule presented here is so simple that this method is easy to implement in very complex network architectures with feedback and sideway connections among their neurons.</li>



<li>GA can work with non-differentiable functions while SGD cannot.</li>
</ul>
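<p>The multi-child idea from the second point above can be sketched like this; the population size and the toy objective are arbitrary choices for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    return np.sum((w - 1.0) ** 2)  # toy objective standing in for network error

w = np.zeros(10)
pop = 16                           # children bred per generation (assumed)
for _ in range(200):
    # each row is one child: all weights mutated at once
    children = w + rng.uniform(-1, 1, (pop, w.size)) * 0.05
    losses = np.array([loss(c) for c in children])
    best = children[np.argmin(losses)]
    if loss(best) < loss(w):       # selection: best child vs. parent
        w = best

print(loss(w))  # far below the initial loss of 10.0
```

<p>Since the children are independent, the inner loop over the population can be distributed across workers with no communication until the selection step.</p>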



<p></p>
</div></div>



<h6 class="wp-block-heading">Disadvantages of GA:</h6>



<ul class="wp-block-list">
<li>GA-based methods generally require more iterations to reach the minima compared to well-optimized SGD-based methods.</li>



<li>If the mutation rate is high, learning becomes unpredictable and noisy. If the mutation rate is low, learning becomes slow and can get stuck in local minima.</li>



<li>Poor performance on high-dimensional and complex data patterns (this needs further investigation, especially with CNN architectures, which we have not investigated here).</li>



<li>GA is a population-based learning method; therefore, we cannot take inspiration from it to learn more about the learning method used by our brains.</li>
</ul>



<p></p>



<p>At Brainxyz, our main focus is neither GA nor SGD, because these methods are iterative in nature and are not suitable for real-time learning. We are currently working on Predictive Hebbian Unified Neurons (PHUN), a real-time learning algorithm. In the future, we will post more comparisons between various learning algorithms, including PHUN, so please stay tuned.</p>



<p>Finally, for those interested, here is the complete Python code for solving the XOR problem using the GA method described in this article. The code only uses the numpy library and is just a few lines long! Conceptually, it cannot get any simpler than this. </p>



<pre class="wp-block-code"><code>import numpy as np
import matplotlib.pyplot as plt

## net structure and initialization (single layer neural network)
inp = 2; hidden = 5; out = 1;
mutation_rate = 0.03
w1 = np.random.uniform(-1,1, (inp, hidden))
w2 = np.random.uniform(-1,1, (hidden, out))

def feedforward(X, w1, w2):
    # feed forward propagation here it uses tanh as activation (can be sigmoid).
    z = np.tanh(X @ w1)
    yh = z @ w2
    return yh

def mutate(w, mutation_rate):
    # mutate the weights by adding small random values to them
    dw = np.random.uniform(-1,1, (w.shape)) * mutation_rate
    m_w = w + dw
    return m_w
       
def train(X, y, w1, w2, iterations):
    errs = &#91;]
    for i in range(iterations):
        
        m_w1 = mutate(w1, mutation_rate) #mutate input-hidden weights
        m_w2 = mutate(w2, mutation_rate) #mutate hidden-output weights
        
        yh1 = feedforward(X,   w1,   w2); #parent network
        yh2 = feedforward(X, m_w1, m_w2); #child network (mutated weights)
        
        err_parent = np.sum(np.abs(y - yh1.ravel())); #evaluate parent
        err_child =  np.sum(np.abs(y - yh2.ravel())); #evaluate child 
        
        if(err_child &lt; err_parent): # select better
            w1 = m_w1;
            w2 = m_w2;   
            
        errs.append(err_child)        
    return errs, w1, w2

## usage example: XOR data
X=np.asarray(&#91;&#91;0,0],&#91;1,1],&#91;0,1],&#91;1,0]]);
y=np.asarray(&#91;0,0,1,1]);

iterations = 1000 
errs, m_w1, m_w2 = train(X, y, w1, w2, iterations)

## check the results  
yhat = feedforward(X, m_w1, m_w2) 
print('actual:', y)
print('prediction:', yhat.ravel())      
plt.figure(1)
plt.plot(errs)
plt.title('errors')</code></pre>



<div class="wp-block-media-text has-media-on-the-right is-stacked-on-mobile is-vertically-aligned-center" style="grid-template-columns:auto 52%"><div class="wp-block-media-text__content">
<p><strong>Code output:</strong></p>



<p>actual: [0 0 1 1]<br>prediction: [0. 0.0023 1.0048 1.0047]</p>
</div><figure class="wp-block-media-text__media"><img loading="lazy" decoding="async" width="640" height="476" src="https://www.brainxyz.com/wp-content/uploads/2020/09/Figure_2.png" alt="" class="wp-image-1508 size-full" srcset="https://www.brainxyz.com/wp-content/uploads/2020/09/Figure_2.png 640w, https://www.brainxyz.com/wp-content/uploads/2020/09/Figure_2.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></figure></div>



<p>In the GitHub link below, you can find a more sophisticated implementation of the above code using object-oriented programming principles. We also added the flexibility to use more layers and neurons, and it solves both the Cartpole and XOR problems. <a rel="noreferrer noopener" href="https://github.com/hunar4321/Genetic_Algorithm" target="_blank">https://github.com/hunar4321/Genetic_Algorithm</a></p>



<h6 class="wp-block-heading"><strong>Edit (follow up):</strong></h6>



<ul class="wp-block-list">
<li>This is not a &#8220;state-of-the-art&#8221; implementation of GA. It is a toy example showcasing a specific comparison between bare-minimum versions of GA and SGD. Many scientists, including deep learning pioneers like Yoshua Bengio, encourage using toy examples because they can give valuable insights.</li>



<li>SGD-based methods benefit greatly from momentum and adaptive learning rates. However, theoretically, GA-based methods can enjoy these additions too, e.g. using momentum to accelerate toward good mutations.</li>
</ul>



<p></p>



<p>I also briefly explained this algorithm in the video below, at the 21:50 timestamp:</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="What Is Intelligence?" width="720" height="405" src="https://www.youtube.com/embed/HiXUNNbPDH4?start=1309&#038;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div></figure>



<p>The post <a href="https://www.brainxyz.com/machine-learning/genetic-algorthim/">Genetic Algorithm vs. Stochastic Gradient Descent</a> appeared first on <a href="https://www.brainxyz.com">Brainxyz</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.brainxyz.com/machine-learning/genetic-algorthim/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1401</post-id>	</item>
	</channel>
</rss>
