Genetic Algorithm vs. Stochastic Gradient Descent

Genetic Algorithm (GA) and Stochastic Gradient Descent (SGD) are well-known optimization methods and are used for learning in Neural Networks. There are various implementations of GA, however, most of them (e.g. Neat) are not directly comparable to SGD because these GA methods use point/localized mutations in their connections/weights. Geoffrey Hinton, in one of his videos (Lecture 3.4), mentioned that GA randomly perturbs one weight at a time. This makes GA very inefficient compared to backpropagation. In this article, we show that GA can learn efficiently if we mutate all the weights simultaneously provided that we choose a suitable mutation rate (similar to SGD which requires a suitable learning rate to work properly).

We cannot fairly compare learning methods that use localized mutations to SGD because in SGD all the weights within each layer are updated simultaneously making it more efficient. Also, a fair comparison between GA and SGD should use identical Neural Network architectures where the only difference is in the optimization step. Hence, we present a conceptually simple GA implementation that is closely comparable with SGD. In this implementation, instead of making a point or a localized mutation, we mutate all the network weights at once by adding a small random value to each weight. To implement this method, all you need to do is to follow these simple steps:

Implement a feed-forward Neural Network (just like you do for an SGD based Neural Network).
In the weight update step, mutate all the weights by adding a small random value to each weight.
Do feed-forward propagation twice, one with the mutated weights (child weights) and another with the original weights (parent weights).
Compare the performance of the parent weights to the child weights and keep the better ones for the next generation cycle.
Repeat steps from 2 to 4 until you reach convergence.

Complete python implementation of the above algorithm is less than a page long. You can find it at the end of this article which is a complete example that solves the XOR problem. This implementation is not canonical GA as it does not feature terms like chromosomes, crossover,…etc. However, conceptually, it is very truthful to the evolutionary principles and strips GA down to its bare minimum.

The following images are comparisons between GA and SGD. We chose a function that has an interesting error map with a narrow groove toward the global minimum to showcase a more interesting comparison. As you can see (Figure1-A), GA follows the path toward the global minimum like a drunken man and that is because GA is a stochastic method. What makes GA capable to reach the target is the selection method which we described in point 4 above. The selected (good) mutations are shown in Figure1-B. In contrast, SGD based optimization follows a strict rule. It steps toward the steepest descent even if the step is not in the direction of the global minimum. This, in some cases, makes learning slow or even unstable if we do not choose the learning rate carefully (Figure 2).

In general, GA based methods can reach the minima in a similar amount of steps to that of a naive SGD given that we choose a suitable mutation rate.

Another interesting point about the GA method we presented here is its simplicity and generality. In fact, the above method can be used for reinforcement learning with minimal modification, unlike SGD based methods where reinforcement learning requires conceptually different learning strategies and different implementations to make them work. Here we used GA to balance the Open AI’s gym Cartpole in 60 generations (Figure 3). All we did this time, differently, is to select the winning parent/child network at the end of each episode. The complete code implementation can be found at the end of this article. Please note that these GA methods are sensitive to the initial weights, some initialization converges faster than others and some initialization even fails to converge (this is true for SGD based learning too).

Figure 3: Open AI’s gym Cartpole balanced by a GA trained Neural Network.

Advantages of GA:

In a single iteration, the GA method described here needs two feed-forward propagations while the SGD method needs a feed-forward + a feed-backward propagation which is computationally more demanding.
Although we admit that a carefully optimized SGD method can reach the minima with fewer iterations, GA is a more parallelize-able method. Both GA and SGD are embarrassingly parallel in terms of matrix multiplications within each layer. However, with GA you can breed more than one child in parallel and select the best one. This can ultimately make GA reach the minima faster than SGD given that you have enough computational power to breed a massive number of children in parallel.
GA is a general-purpose learning method. It can be used for reinforcement learning, supervised learning, and feedback learning, all with minimal modification in its implementation.
The weight update rule we presented here is so simple making this method easy to implement in very complex network architectures that have feedback and sideway connections among its neurons.
GA can work with non-differentiable functions while SGD cannot.

Disadvantages of GA:

GA based methods generally require more iterations to reach the minima compared to well-optimized SGD based methods.
If the mutation rate is high, learning becomes unpredictable and noisy. If the mutation rate is low, learning becomes slow and can stick in local minima.
Bad performance for high dimensional and complex data patterns (this needs further investigation especially with CNN architecture which we have not investigated here).
GA is a population-based learning, therefore, we cannot take inspirations from this method to find more about the learning method used by our brains.

At Brainxyz, our main focus is not GA neither SGD because these methods are iterative in nature and are not suitable for real-time learning. We are currently working on Predictive Hebbian Unified Neurons (PHUN) which is a real-time learning algorithm. In the future, we will put more comparisons between various learning algorithms including PHUN, so please stay tuned with us.

Finally for those interested, here is a complete python code for solving the Xor problem using the GA method described in this article. The code only uses numpy library and its just a few lines of code! Conceptually, It cannot get any simpler than this.

import numpy as np
import matplotlib.pyplot as plt

## net structure and initialization (single layer neural network)
inp = 2; hidden = 5; out = 1;
mutation_rate = 0.03
w1 = np.random.uniform(-1,1, (inp, hidden))
w2 = np.random.uniform(-1,1, (hidden, out))

def feedforward(X, w1, w2):
    # feed forward propagation here it uses tanh as activation (can be sigmoid).
    z = np.tanh(X @ w1)
    yh = z @ w2
    return yh

def mutate(w, mutation_rate):
    # mutate the weights by adding small random values to them
    dw = np.random.uniform(-1,1, (w.shape)) * mutation_rate
    m_w = w + dw
    return m_w
       
def train(X, y, w1, w2, iterations):
    errs = []
    for i in range(iterations):
        
        m_w1 = mutate(w1, mutation_rate) #mutate input-hidden weights
        m_w2 = mutate(w2, mutation_rate) #mutate hidden-output weights
        
        yh1 = feedforward(X,   w1,   w2); #parent network
        yh2 = feedforward(X, m_w1, m_w2); #child network (mutated weights)
        
        err_parent = np.sum(np.abs(y - yh1.ravel())); #evaluate parent
        err_child =  np.sum(np.abs(y - yh2.ravel())); #evaluate child 
        
        if(err_child < err_parent): # select better
            w1 = m_w1;
            w2 = m_w2;   
            
        errs.append(err_child)        
    return errs, w1, w2

## usage example: XOR data
X=np.asarray([[0,0],[1,1],[0,1],[1,0]]);
y=np.asarray([0,0,1,1]);

iterations = 1000 
errs, m_w1, m_w2 = train(X, y, w1, w2, iterations)

## check the results  
yhat = feedforward(X, m_w1, m_w2) 
print('actual:', y)
print('prediction:', yhat.ravel())      
plt.figure(1)
plt.plot(errs)
plt.title('errors')

Code output:

actual: [0 0 1 1]
prediction: [0. 0.0023 1.0048 1.0047]

In the GitHub link below, you can find a more sophisticated implementation of the above code using object oriented programming principles. We also implemented additional flexibility for adding more layers and neurons to solve both Cartpole and XOR problems. https://github.com/hunar4321/Genetic_Algorithm

Edit (follow up):

This is not a “State of Art” implementation of GA. It is a toy example showcasing a specific comparison between the bare minimum versions of GA and SGD. Many scientists including deep learning pioneers like Yoshua Bengio encourage using toy examples because they can give valuable insights.
SGD based methods benefit greatly from momentum and adaptive learning. However, theoretically, GA based methods can enjoy these additions too, e.g. using momentum to accelerate toward good mutations.

I also briefly explained this algorithm in the video below at 21:50 timestamp: