Genetic Algorithms – Learn Python for Data Science #6


Hello World, it’s Siraj! In this video, we’re going to use genetic programming to identify if some energy is gamma radiation or not. I’m getting angry. Gamma rays! Augh! Nah, I wish. Data science is a way of thinking about discovery. A data scientist needs to decide the right question to ask, like “Who’s the best candidate to vote for in the US election?,” then decide what dataset to use, like tweet history of candidates and past endorsements of each candidate, and lastly decide what machine learning model to use on the data to discover the right answer. ♫ Life goes on! ♫ With the right data, computing power, and machine learning model, you can discover a solution to any problem, but knowing which model to use can be challenging for new data scientists. There are so many of them! That’s where genetic programming can help. Genetic algorithms are inspired by the Darwinian process of natural selection, and they’re used to generate solutions to optimization and search problems. They have three properties: selection, crossover, and mutation. You have a population of possible solutions to a given problem and a fitness function. Every iteration, we evaluate how fit each solution is with our fitness function. Then we select the fittest ones and perform crossover to create a new population. We take those children and mutate them with some random modification and repeat the process until we get the fittest or best solution. So take this problem, for instance. Let’s say you want to take a road trip across a bunch of cities. What’s the shortest possible path you could take to hit up each city once and then return back to your home city? This is popularly called the “traveling salesman problem” in computer science, and we can use a genetic algorithm to help us solve it. Let’s look at some high-level Python code. We have the number of generations set to 5,000 and the population size set to 100. So we start by initializing our population using our size parameter. 
Each individual in our population represents a different solution path. Then, for each generation, we compute the fitness of each solution and store it in our population fitness array. Now we'll perform selection by taking only the top 10% of the population, which are our shortest road trips, and produce offspring from them by performing crossover. Then we randomly mutate those offspring and repeat the process. As you can see in the animation, eventually we will get an optimal solution using this process, unlike Apple Maps. Alright, so how does this all fit into data science? Well, it turns out that choosing the right machine learning model and all the best hyperparameters for that model is itself an optimization problem. We're going to use a Python library called TPOT, built on top of scikit-learn, that uses genetic programming to optimize our machine learning pipeline. So after formatting our data properly, we need to know what features to input to our model and how we should construct those features. Once we have those features, we'll input them into our model to train on, and we'll want to tune our hyperparameters, or tuning knobs, to get the optimal results. Instead of doing all this ourselves through trial and error, TPOT automates these steps for us with genetic programming, and it will output the optimal code for us when it's done so we can use it later. So we're going to create a classifier for gamma radiation using TPOT after installing our dependencies, and then analyze the results. TPOT is built on the popular scikit-learn machine learning library, so we'll want to make sure that we have that installed first. Then we'll install pandas to help us analyze our data and numpy to perform math calculations. Our first step is to load our dataset. We'll use pandas' read_csv() method and set the parameter to the name of our saved CSV file. 
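The road-trip loop described above can be sketched in Python. The city coordinates, helper names, and mutation rate below are illustrative assumptions, not code shown in the video:

```python
# A rough sketch of the traveling-salesman genetic algorithm described
# above: selection of the top 10%, crossover, and random mutation.
import random

CITIES = [(0, 0), (1, 5), (2, 3), (5, 2), (6, 6), (8, 1)]  # (x, y) positions
GENERATIONS = 500        # the video uses 5,000
POPULATION_SIZE = 100

def route_length(route):
    """Total distance of visiting each city once and returning home."""
    total = 0.0
    for i in range(len(route)):
        x1, y1 = CITIES[route[i]]
        x2, y2 = CITIES[route[(i + 1) % len(route)]]
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total

def crossover(parent_a, parent_b):
    """Copy a slice of one parent, fill the rest in the other's order."""
    cut = random.randint(1, len(parent_a) - 1)
    child = parent_a[:cut]
    child += [city for city in parent_b if city not in child]
    return child

def mutate(route, rate=0.1):
    """With a small probability, swap two cities in the route."""
    route = route[:]
    if random.random() < rate:
        i, j = random.sample(range(len(route)), 2)
        route[i], route[j] = route[j], route[i]
    return route

# Initialize a population of random routes.
population = [random.sample(range(len(CITIES)), len(CITIES))
              for _ in range(POPULATION_SIZE)]

for _ in range(GENERATIONS):
    # Selection: keep the top 10% (shortest trips) as parents.
    population.sort(key=route_length)
    parents = population[:POPULATION_SIZE // 10]
    # Crossover + mutation: breed a fresh population from the parents.
    population = [mutate(crossover(*random.sample(parents, 2)))
                  for _ in range(POPULATION_SIZE)]

best = min(population, key=route_length)
print(best, route_length(best))
```

With only six cities this converges almost immediately; the same loop scales to larger city lists by raising the generation count.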
This is data collected from a scientific instrument called a “Cherenkov telescope” that measures radiation in the atmosphere, and these are a bunch of features of whatever type of radiation it picks up. Thanks, Putin! Since the data is already organized by class, we’ll shuffle it to get a better result. The iloc indexer of the telescope variable is pandas’ way of selecting rows by position in the index, and we’ll generate a sequence of random indices the size of our data using the permutation function of numpy’s ‘random’ submodule. Since all the instances are now randomly rearranged, we’ll just reset the indices so they are ordered even though the data is now shuffled, using the reset_index() method of pandas with the drop parameter set to True. We’ll now let our ‘tele’ variable know what our two classes are by mapping both of them to an integer with the map() method. So ‘g’ for “gamma” is set to 0; ‘h’ for “hadron” is set to 1. Let’s store those ‘Class’ labels, which we’re going to predict, in a separate variable called ‘tele_class’ and use the ‘values’ attribute to retrieve them. Before we train our model, we need to split our data into training and validation sets. We’ll use the train_test_split() method from scikit-learn that we imported to create the indices for both. The first parameter is the index of our dataset. We want both sets to have balanced classes, so we’ll set the ‘stratify’ parameter to our class label array. Then we’ll define what percent of our data we want to be training and testing with these last two parameters. We have a 75/25 split now in our data and we’re ready to train our model. We’ll initialize the ‘tpot’ variable using the ‘TPOT’ class with the number of generations set to 5. On a standard laptop with 4 gigs of RAM, it takes five minutes per generation to run, so this will take about 25 minutes. 
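The data-preparation steps above can be sketched as follows. A tiny inline DataFrame stands in for the telescope CSV, and its feature column names are assumptions; only the 'Class' column with 'g'/'h' labels comes from the video:

```python
# A minimal sketch of the shuffle, label-mapping, and split steps above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for: telescope = pd.read_csv('<your saved CSV file>')
telescope = pd.DataFrame({
    'fLength': np.random.rand(20),
    'fWidth':  np.random.rand(20),
    'Class':   ['g'] * 10 + ['h'] * 10,   # rows arrive grouped by class
})

# Shuffle the rows by position, then reset the index so it is ordered
# again even though the data is now shuffled.
tele = telescope.iloc[np.random.permutation(len(telescope))]
tele = tele.reset_index(drop=True)

# Map the two class labels to integers: 'g' (gamma) -> 0, 'h' (hadron) -> 1.
tele['Class'] = tele['Class'].map({'g': 0, 'h': 1})
tele_class = tele['Class'].values

# 75/25 stratified split, producing index arrays for both sets.
training_indices, validation_indices = train_test_split(
    tele.index, stratify=tele_class, train_size=0.75, test_size=0.25)
```

Passing `tele.index` to train_test_split splits row *indices* rather than the rows themselves, which is handy when you later need to slice several aligned arrays with the same split.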
This is so TPOT’s genetic algorithm knows how many iterations to run for, and we’ll set ‘verbosity’ to 2, which just means “Show a progress bar in the terminal during the optimization process.” Then we can call our fit() method on our training data to let it perform optimization using genetic programming. The first parameter is the training feature set, which we’ll retrieve from our ‘tele’ variable along the first axis for every training index. The second parameter is our training class set, which we’ll retrieve from our ‘tele’ variable like so. We can compute the testing error for validation using TPOT’s score() method, with the validation feature set as the first parameter and the validation class set as the second. We’ll export the computed Python code to the pipeline.py file using this method and name it in the parameter as a string. Let’s demo this thing. After training, we’ll see that after five generations, TPOT chose the gradient boosting classifier as the most accurate machine learning model to use. It also shows the optimal hyperparameters, like the learning rate and number of estimators, for us. ♫ Yeah, boyyy! ♫ So, to break it down: with the right amount of data, computing power, and machine learning model, you can discover a solution to any problem. Genetic algorithms replicate evolution via selection, crossover, and mutation to find an optimal solution to a problem, and TPOT is a Python library that uses genetic programming to help you find the best model and hyperparameters for your use case. The winner of the coding challenge from the last video is Peter Mitrano. He added some great Deep Dream samples to his repository, and even Deep Dream’d my own video. Badass of the week! And the runner-up is Kyle Jordaan. Good job stitching all the Deep Dream’d frames together with one line of code. The challenge for this video is to use TPOT and a climate change dataset that I’ll provide to predict the answer to a question you decide. 
This will be great practice in learning to think like a data scientist. Post your GitHub link in the comments and I’ll announce the winner next time. For now, I’ve got to stay fit to reproduce, so thanks for watching.

92 thoughts on “Genetic Algorithms – Learn Python for Data Science #6”

  1. thank you for evolutionary algorithm & genetic programming video, i really like this section of artificial intelligence, do you think you could have neural nets be the population and there be cross-over between neural nets and mutation in the neural-layers until you find the best solution? I know neural nets are more popular right now but i hope evo. algo. and gen. prog. make a come-back

  2. I had trouble installing tpot, it was missing scipy dependency and I had to install in manually but that failed too 🙂 Finally this was the fix: http://stackoverflow.com/a/36158157/329869 and after this tpot was happy to install 🙂

  3. i had "sudo pip install tpot" but then got "ImportError: cannot import name TPOT"?

    https://github.com/llSourcell/genetic_algorithm_challenge/issues/1

  4. Hey Siraj! Really love your videos, dude, you're doing a pretty good job at this kind of stuff

    Can you give an explanation or do a video about Deep Multilayer Perceptron? I stumble upon on an article about it and couldn't really get the main point

  5. Aren't Genetic Algorithms ineffective compared to other learning methods using gradient descent? Is it just to play around with it?
    Please make a/more Video(s) about Q-Learning. 🙂

  6. Siraj you are the most animated person i know lol. i can understand your videos without audio. i particularly like the pantomime of "Darwinian process" at second 53.

  7. just for updates: from tpot import tpot ; didn't find TPOT in there; from sklearn.model_selection import train_test_split; the latter avoids the deprecation warning of sklearn.cross_validation. still working on the rest. Thanks for the vid, nice topic to touch on.

  8. in demo.py, "training_indicss" should change to "training_indices".

    Siraj, i really love your videos, big help for me, but please make sure your code runs before uploading it to github, that is very very important for other people

  9. I wish i knew better so i could help tpot get GPU support….
    This is so fucking amazing that i cant even find words.

  10. I found this challenge very hard, but I tried to make a couple of questions to show that I did think the assignment through.

    https://github.com/mickvanhulst/genAlgoSira <– Link to assignment on git

    Hope this is enough! 🙂

  11. Hi, Siraj! Thanks for the fun challenge. Check out my TPOT implementation here: https://github.com/nhrigby/genetic_algorithm_challenge

  12. Hey Sirajology,
    ML experts seem to be as rare as unicorns. I am looking for collaborators for combining art and ML and need someone able to create "Deep Dream" images. Any tips on people looking for a job?

    Best regards,
    Tony

  13. Thanks man~ Very cool and easy to understand. Better than reading a whole book. BTW, would u mind listing the code of the shortest path example as well? Thx

  14. Love your videos! Can you do one on using NN for compression? https://www.quora.com/What-is-the-potential-of-neural-networks-in-data-compression

  15. Hey Siraj, I'm really enjoying your videos. I'm just getting into machine learning and your videos have been one of my favorite sources. I just joined your Patreon. I know it is not much money so I wanted to comment here too with some encouragement. You are the first person I've ever supported on Patreon.

    I'm rooting for you to get fully funded. You are really good at what you are doing and it would be a shame if you had to stop prematurely. It is clear that you really enjoy it.

  16. I've been having issues understanding how to install dependencies. Is there any quick advice you can give me about how one would do that?

  17. You set me on the path of data science, and i've learned so much since then. Now, seeing your videos again I realize that I understand everything, every single line… and i didn't even know python back then! And it's like what, 6 months from the day i started??
    Thanks Siraj and thanks internet!!!!

  18. There's a big difference between Genetic Algorithms and Genetic Programming. GA uses an array as the data structure for an individual whereas GP uses a tree structure. Most people use GA because it's easier to code for. GP is more powerful due to the tree structure and can be applied to more complicated problems, but it requires more upfront work to code the base framework.

    Try not to use the two terms interchangeably as you are letting people believe they are the same thing when they're not.

  19. Try it with Holographic Research Strategy!

    https://www.researchgate.net/publication/222818603_Evaluation_of_catalyst_library_optimization_algorithms_Comparison_of_the_Holographic_Research_Strategy_and_the_Genetic_Algorithm_in_virtual_catalytic_experiments

    https://www.researchgate.net/publication/244321412_Holographic_research_strategy_for_catalyst_library_design

  20. sk-learn has an algorithm cheat sheet, and even a website you can go to and click on various algorithm types to decide which is best. Does Tensorflow have anything like that that you know of?

  21. I literally never comment on videos, but I had to today. I've powered through your data scientist playlist this afternoon, and I am absolutely overwhelmed with the quality and depth of the information you're giving us… you're like an ancient alien teaching us mere mortals about electricity… thanks so much

  22. I am having problem with this line of code:
    tele['Class']=tele['Class'].map({'g':0, 'h':1})

    It shows:
    KeyError: 'Class'

    Anyone knows why and how to fix it?

  23. You talk too fast, and also, umm, sir, you should use a simpler data set, like maybe how to determine the best dog to buy or something

  24. Siraj Great job!
    you're explaining so concisely and precisely! I can hardly add anything to it.
    If you could make any video combining genetic programming with monte carlo tree search, it would be highly appreciated.
    MCTS is the subject of my thesis and I really wanna see if combining it with GP is possible or not.
    Thanks

  25. Do you have a video where you show how the output from TPOT is used to configure a machine learning analysis of the gamma dataset?

  26. Can you post the code for the traveling salesman genetic algorithm problem…I wanted to take a look at that out of curiosity?

  27. Is applying a genetic algorithm to statistical learning a good idea? Doesn't that just throw out any theory (and thought) to building sound methodology? I like his explanation of genetic algorithms but he lost me after suggesting this should be applied to statistical model building.

    "Data+Compute+ML You can solution any problem"… So, because of that, I want to find a solution for neural regeneration with those things. What can make new neurons? How? What differentiates them? Wonder if ML can discover this…
    Can you make a video like this? (Sorry my english)

  29. Hey, great videos first of all! I am very much new to this field of study and I am totally lost as to where I should start from and how to proceed. If possible please show me a proper way here. Thank you.

  30. Should tpot be imported with
    from tpot import TPOTClassifier
    or
    from tpot import TPOTRegressor
    depending whether you want classification or regression?
    "from tpot import TPOT" doesn't work.
    See https://epistasislab.github.io/tpot/using/

  31. Holy Smokes!!! It works. I only had to make two small changes to Siraj's code, and I believe that it was due to changes in "tpot". Kudos to the mighty Siraj!

    Printout:
    Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
    Optimization Progress: 7%|██▏ | 42/600 [05:17<1:39:08, 10.66s/pipeline]

    It increases my PC blower motor speed as it runs. Too bad I don't have 100 Cuda pipelines. Maybe I'll try AWS next time.

  32. For those facing import issues, try (depending upon problem statement):

    from tpot import TPOTClassifier
    OR
    from tpot import TPOTRegressor

    Doc Link: https://epistasislab.github.io/tpot/using/

  33. You are too fast to understand for a beginner …. You surely have great ideas about things, yet you make such entertaining videos with so much info in them that it gets hard to understand and thus not that motivating …..

Leave a Reply

Your email address will not be published. Required fields are marked *