Evolving String using Genetic Algorithm (Part 1 - The Experiment)
#post #python
This post originally appeared on Blog 2.0
There is this thing called genetic algorithm that simulates natural selection that guides the computer to evolve towards a certain behaviour. Let’s jump in and experiment a little bit. It would be a great python exercise anyway. This project is heavily inspired by Daniel Shiffman’s book The Nature of Code.
Ummm it’s just a fun project, so I guess it’s going to just be some casual writing.
First, let’s import some stuff
import random
from tqdm import tqdm
import pandas as pd
import itertools
Alright, let’s think about what should be included in this project.
A simple start is to try evolving a string of text. The characters are like nucleotides. We manually define some fitness functions that we can use to assign reproductive fitness to each individual in the population. Each generation, we somehow pull parents from a mating pool, cross over their genetic information in some way, do some mutation, and move on.
So… here are some brainstorm of what the experiment might look like
- population size =
1000
, etc. - mutation rate =
0.01
, etc. - options for target
- a long one
"one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die."
- or a shorter and easier one?
"multiply, vary, let the strongest live and the weakest die"
- a long one
- The character set
- maybe
list(" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?%&'(),-./:;")
- or
list(" abcdefghijklmnopqrstuvwxyz,.")
- maybe
- simulating fitness
- a straitforward one: total # of characters correct/total # of characters
- or some non-linear ones? like (Total # Characters Correct)^2
- or something as fancy as grammaticality in NLP? Pfff sorry I don’t want that computational cost but here looks like a promising post about it https://www.scribendi.ai/can-we-use-bert-as-a-language-model-to-assign-score-of-a-sentence/
- cross over strategy
- random crossing
- midpoint crossing
- …
Anyhow, let’s just say that we are only interested in the influence of population size, mutation rate, and cross over strategy. We’ll keep the other ones constant.
Before writing any code, I’ll take my turn making some hypotheses. Well these are just guesses; don’t take them seriously
- If the mutation rate is too high, evolution might fail to converge as random changes to the genotype overwrites most inherited information. If the mutation rate is too low, there will be little variation in the gene pool, making it hard for a shift in the population’s genetic makeup.
- A large population size allows more diverse genes in the population and thus a higher likelihood that a gene that gives individuals survival/reproductive advantage will be present in the population and get selected for. On the other hand, if the population size is too small, there’s less genetic variation and it would take more generations to reach higher fitness.
- Midpoint crossing allows higher genetic stability whereas random crossing introduces more variations?
The algorithm for each generation will roughly look like this:
- Declare constants
- The target configuration is modelled by a sentence (e.g. “one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die.”)
- The set of genetic materials is " abcdefghijklmnopqrstuvwxyz,."
- Generate a population with random genes
- Selection
- Calculate the fitness of each individual in the population; a fitness function could look like: fitness = number of matching letters / length of the gene segment
- Assign high mating probability for individuals with higher fitness and store the information in a mating pool
- Sample parents from the mating pool
- Reproduction
- Cross over the genes of two parents
- if midpoint crossing
- Generate a random midpoint
- Cut out the segment before the midpoint from parent 1’s gene
- Cut out the segment after the midpoint from parent 2’s gene
- combine the two segments
- if random crossing
- randomly change each character with probability = mutation rate
- if midpoint crossing
- Cross over the genes of two parents
- Mutation: each letter in the child’s gene has a probability of changing to another letter
This algorithm is repeated and the evolved genes are checked against the target gene
Since randomness is a part of the procedure (to simulate random mutations), it is expected that the evolutionary outcome for each trial will be different. Multiple trials are performed using each set of parameters to get varied results. Since this experiment is ran on a computer, multiple trials can be performed to increase the precision of the data.
Alright, we define two classes, DNA
, which contains information and functions that a genome would have. We also need an Experiment
class. Since we are running experiments with different independent variables, it probably helps to create an instance for each experiment so that everything is contained.
class DNA():
def __init__(self, target, char_set):
self.target = target
self.genome = ''.join([random.choice(char_set) for i in range(len(target))])
self.char_set = char_set
def count_matches(self):
assert len(self.genome) == len(self.target)
count = 0
for i in range(len(self.genome)):
if self.genome[i] == self.target[i]:
count += 1
return count
def calc_fitness(self):
self.fitness = self.count_matches() / len(self.target)
# self.fitness = self.count_matches() ** 2
def crossover(self, partner, strategy = "midpoint"):
# MIDPOINT CROSS
if strategy == "midpoint":
midpoint = random.randint(0, len(self.genome))
child_genome = self.genome[:midpoint] + partner.genome[midpoint:]
# RANDOM CROSS
elif strategy == "random":
child_genome = ''
for i in range(len(self.genome)):
if random.random() < 0.5:
child_genome += self.genome[i]
else:
child_genome += partner.genome[i]
# Rien
else:
raise 'Invalid strategy'
child = DNA(self.target, self.char_set)
child.genome = child_genome
return child
def mutate(self, mutation_rate):
self.genome = ''.join([random.choice(self.char_set) if random.random() < mutation_rate else c for c in self.genome])
class Experiment():
def __init__(self, target, population_size, mutation_rate, char_set, cross_strat):
self.target = target
self.population_size = population_size
self.mutation_rate = mutation_rate
self.char_set = char_set
self.cross_strat = cross_strat
self.generation = 0
self.population = [DNA(self.target, self.char_set) for i in range(self.population_size)]
self.num_target_hits = 0
def evolve(self, num_generations):
avg_fitnesses = []
num_target_hits = []
generation_num = []
pbar = tqdm(range(num_generations), leave=False, mininterval=1)
for i in pbar:
# selection
mating_pool = []
for individual in self.population:
individual.calc_fitness()
n = (individual.fitness) * 100
# n = (individual.fitness)
mating_pool += [individual for j in range(int(n))]
# reproduce
self.population = []
for j in range(self.population_size):
parent1 = random.choice(mating_pool)
parent2 = random.choice(mating_pool)
child = parent1.crossover(parent2, strategy = self.cross_strat)
child.mutate(self.mutation_rate)
if child.genome == self.target:
self.num_target_hits += 1
self.population += [child]
# increase generation count
self.generation += 1
# output every 10 percent
# if num_generations % int(num_generations / 10) == 0:
# pbar.set_description_str(child.genome)
# construct stats
avg_fitnesses.append(self.avg_fitness())
num_target_hits.append(self.num_target_hits)
generation_num.append(self.generation)
# return stats
return avg_fitnesses, num_target_hits, generation_num
def avg_fitness(self):
# sample result
avg_fitness = 0
for individual in self.population:
individual.calc_fitness()
avg_fitness += individual.fitness
# print(individual.genome, individual.fitness)
avg_fitness /= len(self.population)
return avg_fitness
def print_samples(self, num_samples):
for i in range(num_samples):
sample = random.choice(self.population)
sample.calc_fitness()
print(f'"{sample.genome}", fitness = {sample.fitness}')
def print_summary(self, num_samples):
print(f'AVG FITNESS = {self.avg_fitness()}')
print(f'NUM HITS: {self.num_target_hits}')
print()
self.print_samples(num_samples)
Alright, now it’s time to come up with the specific independent variables we will test and put them in a dictionary. There are many different combinations. I’ll just let python combine them automatically.
independents = {
'population_size': [50, 100, 150, 200, 500, 1000],
'mutation_rate': [0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1],
'cross_strat': ['midpoint', 'random'],
'target': ["multiply, vary, let the strongest live and the weakest die"],
'char_set': [list(" abcdefghijklmnopqrstuvwxyz,.")]
}
and here we define the number of trials of each experiment and the number of generations we let each one evolve for
num_trials = 5
max_iters = 1000
get a sample and see what happened. This looks like a promising parameter for an experiment
experiments = []
keys, values = zip(*independents.items())
experiments = [dict(zip(keys, v)) for v in itertools.product(*values)]
experiments[0]
{'population_size': 50,
'mutation_rate': 0.0005,
'cross_strat': 'midpoint',
'target': 'multiply, vary, let the strongest live and the weakest die',
'char_set': [' ',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z',
',',
'.']}
Dataframe for storing results
results = pd.DataFrame({
'experiment_id': [],
'trial_num': [],
'population_size': [],
'mutation_rate': [],
'cross_strat': [],
'target': [],
'char_set': [],
'generation_num': [],
'avg_fitnesses': [],
'num_target_hits': [],
})
And the loop to run each experiment! We should be done
# experimental loop
for exp_id, experiment in tqdm(enumerate(experiments), total=len(experiments)):
for trial_i in range(num_trials):
e = Experiment(
population_size = experiment['population_size'],
mutation_rate = experiment['mutation_rate'],
cross_strat = experiment['cross_strat'],
target = experiment['target'],
char_set = experiment['char_set'],
)
avg_fitnesses, num_target_hits, generation_num = e.evolve(max_iters)
result = pd.DataFrame({
'generation_num': generation_num,
'avg_fitnesses': avg_fitnesses,
'num_target_hits': num_target_hits,
})
result['experiment_id'] = exp_id
result['trial_num'] = trial_i
result['population_size'] = experiment['population_size']
result['mutation_rate'] = experiment['mutation_rate']
result['cross_strat'] = experiment['cross_strat']
result['target'] = experiment['target']
result['char_set'] = ''.join(experiment['char_set'])
results = pd.concat([results, result])
100%|██████████| 84/84 [1:53:36<00:00, 81.15s/it]
Wouah that took two hours to run. I guess I only expected something less than half an hour, but here we have the results. Let’s take a look and save it somewhere
results
experiment_id | trial_num | population_size | mutation_rate | cross_strat | target | char_set | generation_num | avg_fitnesses | num_target_hits | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 50.0 | 0.0005 | midpoint | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 1.0 | 0.052069 | 0.0 |
1 | 0.0 | 0.0 | 50.0 | 0.0005 | midpoint | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 2.0 | 0.062069 | 0.0 |
2 | 0.0 | 0.0 | 50.0 | 0.0005 | midpoint | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 3.0 | 0.064828 | 0.0 |
3 | 0.0 | 0.0 | 50.0 | 0.0005 | midpoint | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 4.0 | 0.077241 | 0.0 |
4 | 0.0 | 0.0 | 50.0 | 0.0005 | midpoint | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 5.0 | 0.080690 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 83.0 | 4.0 | 1000.0 | 0.1000 | random | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 996.0 | 0.156500 | 0.0 |
996 | 83.0 | 4.0 | 1000.0 | 0.1000 | random | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 997.0 | 0.157448 | 0.0 |
997 | 83.0 | 4.0 | 1000.0 | 0.1000 | random | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 998.0 | 0.157207 | 0.0 |
998 | 83.0 | 4.0 | 1000.0 | 0.1000 | random | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 999.0 | 0.160914 | 0.0 |
999 | 83.0 | 4.0 | 1000.0 | 0.1000 | random | multiply, vary, let the strongest live and the... | abcdefghijklmnopqrstuvwxyz,. | 1000.0 | 0.158828 | 0.0 |
420000 rows × 10 columns
results.to_csv('results.csv')
Cool. Onward to Part 2: Analysis!