Vinod Puttamadegowda

"Survival Predictions on the Titanic: A Machine Learning Adventure"

Introduction


The sinking of the RMS Titanic is a tragic event that has captivated the world for over a century. In this blog post, we'll explore how to build a machine learning model to predict the survival of passengers on the Titanic. We will use the Kaggle Titanic dataset and a Random Forest Classifier to make these predictions.

Data Exploration


Data Loading

Let's start by loading the dataset and taking a quick look at its structure.
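Something like the following pandas snippet covers this step; the file names and variable names are assumptions based on the standard Kaggle competition download.

import pandas as pd

# Load the Kaggle Titanic data (file names assume the standard competition download)
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Quick look at the structure: first rows, column types, and missing-value counts
print(train_data.head())
train_data.info()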





Understanding the Features

The dataset contains various features such as "Pclass," "Sex," "SibSp," "Parch," and more. Each feature provides valuable information that can help us predict passenger survival.
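One quick, illustrative way to see why these features matter is to check survival rates against them, assuming the train_data frame loaded above.

# Survival rate by passenger class and by sex (values near 1 mean most passengers survived)
print(train_data.groupby("Pclass")["Survived"].mean())
print(train_data.groupby("Sex")["Survived"].mean())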



Data Preprocessing


Feature Selection

We selected "Pclass," "Sex," "SibSp," and "Parch" as our features. These features were chosen based on domain knowledge and their potential impact on survival.
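A sketch of the selection and encoding step, assuming the categorical "Sex" column is one-hot encoded with pd.get_dummies so the model receives numeric inputs.

features = ["Pclass", "Sex", "SibSp", "Parch"]

# One-hot encode "Sex"; the numeric columns pass through unchanged
X = pd.get_dummies(train_data[features])
y = train_data["Survived"]
X_test = pd.get_dummies(test_data[features])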



Model Building


Splitting the Data

Before training our model, we split our data into training and validation sets. This allows us to assess the model's performance before making predictions on the test dataset.




The `random_state` parameter ensures reproducibility: with a fixed seed, the split produces the same training and validation sets on every run, which makes debugging easier, keeps model evaluation consistent, and is especially helpful when following along in an educational setting.
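A sketch of the split using scikit-learn's train_test_split; the 80/20 split and the seed value shown here are assumptions.

from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled data for validation; random_state fixes the seed
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)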


Model Training

We fit the Random Forest Classifier on the training set.
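A minimal training sketch; the hyperparameters shown (n_estimators, max_depth, random_state) are illustrative defaults rather than the exact values used.

from sklearn.ensemble import RandomForestClassifier

# Fit the Random Forest on the training split
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)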



Model Evaluation


Model Validation

We evaluated the model on the validation set, where it achieved 75% accuracy.
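Validation accuracy can be computed with scikit-learn's accuracy_score, roughly as follows.

from sklearn.metrics import accuracy_score

# Compare validation-set predictions against the true labels
val_predictions = model.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, val_predictions))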



Making Predictions


Final Model Training

After validation, we retrained the model on the full training dataset before making predictions on the test dataset.
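A sketch of this final step, assuming the same illustrative hyperparameters as above and the standard Kaggle submission format (PassengerId, Survived).

# Retrain on all labelled data, then predict on the Kaggle test set
final_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
final_model.fit(X, y)
predictions = final_model.predict(X_test)

# Write a submission file in the format the competition expects
submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": predictions})
submission.to_csv("submission.csv", index=False)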



Contribution


Our model achieved a test accuracy of 76.794%, a slight decrease from the previous model's 77.511%. This dip is the result of a deliberate choice to guard against overfitting with a more robust three-way data split (training, validation, and test). The resulting score is a more honest estimate of real-world performance, reflecting our focus on the model's generalization and practical utility.




Conclusion


In this blog post, we explored the Titanic dataset, performed data preprocessing, built a Random Forest Classifier, and made predictions on the test dataset. The model showed promising results, with 75% accuracy on the validation set and 76.794% on the test set, figures that should track real-world performance reasonably well.


Predicting survival on the Titanic is a classic machine learning task that provides an opportunity to learn and practice essential data science skills. We encourage you to continue experimenting with feature engineering, hyperparameter tuning, and other techniques to further improve the model's performance.

Thank you for joining us on this machine learning adventure, and we hope you enjoyed this blog post!


