For years and years, I put food on the table developing conventional software.

One day, I started working on a problem that completely changed my career.

Here is that problem. Here is the solution.

Hopefully, this will also bring the best out of you.

🧵👇
One day I happened to run across a database that had a bunch of data splitting people into two different categories.

I needed to answer a simple question:

Given a person's information, could I determine which of the two categories it belonged to?

👇
I quickly realized that solving this problem needed a little more than what I knew.

I didn't know at the time that looking for an answer would lead me to a completely different world.

This was my very first Machine Learning problem.

Let's take a look at it!

👇
I'm not gonna take you through the ugliness of that original problem. Instead, I'll use a pretty good substitute that lends itself for a nicer explanation.

For the full story, check out this link: https://github.com/svpino/machine-learning/blob/master/titanic.ipynb

Below, you can see the full summary of the solution.

👇
1⃣ Start by downloading the dataset

We are going to work with a dataset that contains the information of 1,309 passengers of the Titanic.

Given the data of a passenger, our goal is to determine whether it survived or died during the wreck.

👇
2⃣ Spend some time analyzing the data

Using pandas, we can load the data and print some interesting information about it.

Understanding the data is a fundamental step to determine what would be a good approach to solve the problem.

👇
3⃣ Let's now prepare de datasets

On this problem, we are going to use a Decision Tree.

To train a model, we first need to prepare the datasets by splitting the data into two.

The result of this step will be the input to our decision tree.

👇
4⃣Building the Decision Tree

Using Scikit-Learn we can build a decision tree that's capable of predicting who survived and who died based on the data of a passenger.

With such a simple algorithm, we can get 86% accurate answers on the data of the training set! 🤯

👇
5⃣ Evaluating the results

After we train the model, we can use the test dataset we created before to evaluate it and ensure that we have something that works.

We can print a classification report to display a bunch of statistics about our model.

84% accurate answers 🕺💃!

👇
6⃣Displaying the tree

Finally, we can visualize the decision tree that was created during the training process.

Notice how these are simple conditions! The magic, of course, is that the algorithm came up with them automatically.

👇
If you have 15 minutes to spare, go through the solution that I published here: https://github.com/svpino/machine-learning/blob/master/titanic.ipynb.

If you have been wanting to start looking into something new and cool, this is your chance.

Today is the national "Let's Learn Machine Learning" Day. Take advantage of it!
You can follow @svpino.
Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: