This repository contains a Python script that builds, trains, and evaluates a machine learning model using the XGBoost classifier. The model is designed to predict the target column based on the other features in a dataset. It performs hyperparameter tuning, model evaluation, and saves the trained model for future use.
To run this script, you need Python 3.11.9 installed. Install the required libraries using pip:
```bash
pip install scikit-learn polars yellowbrick xgboost hyperopt pandas pyarrow shap numpy==2.0.2
```
To assess the model for overfitting, we use a learning curve: the model is trained on progressively larger subsets of the data, and we observe how its evaluation metric changes. Here we use the Area Under the ROC Curve (AUC), which is well suited to our imbalanced labels.
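A minimal sketch of how such a curve can be produced with yellowbrick's `LearningCurve` visualizer (the dataset and hyperparameters below are stand-ins, not the script's exact values):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier
from yellowbrick.model_selection import LearningCurve

# Stand-in binary-classification dataset; the script uses the heart-disease data.
X, y = load_breast_cancer(return_X_y=True)

viz = LearningCurve(
    XGBClassifier(eval_metric="logloss"),
    scoring="roc_auc",                     # AUC copes better with imbalanced labels
    train_sizes=np.linspace(0.1, 1.0, 8),  # progressively larger training subsets
    cv=5,
)
viz.fit(X, y)  # trains once per subset per fold, recording train and CV scores
viz.show()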
As more training data is added, the model's cross-validation score improves, and the training score decreases slightly. This behavior indicates that the model is generalizing well and is not overfitting. The training score is not consistently at 1, and the cross-validation score is improving with more data, showing that the model is learning effectively from the increased data.
In short, the model is neither underfitting nor overfitting, and the upward trend of the cross-validation curve suggests it would continue to improve with more data.
The confusion matrix provides insights into the performance of the model by comparing true labels with predicted labels:

|            | Predicted: 0 | Predicted: 1 |
|------------|--------------|--------------|
| Actual: 0  | 95           | 7            |
| Actual: 1  | 7            | 31           |
These counts show that the model is more accurate at identifying cases without heart disease (95 of 102 correct) but still performs well in detecting heart disease cases (31 of 38 correct).
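A minimal sketch of how such a matrix can be computed with scikit-learn (again on a stand-in dataset and an untuned classifier, so the counts will differ from those above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Same stand-in dataset as above, with a held-out test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))  # rows: actual, columns: predicted
```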
The classification report provides detailed metrics, including precision, recall, and F1 score.
These high scores indicate that the model has a good balance of precision and recall, minimizing both false positives and false negatives.
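Continuing the sketch above, the report comes straight from scikit-learn:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the model fitted above.
print(classification_report(y_test, model.predict(X_test)))
```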
The ROC curve demonstrates the performance of the model in distinguishing between classes. Both the training and test AUC values are close to 0.9, indicating that the model has high discriminative power and maintains similar performance on both training and test sets.
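Continuing the same sketch, both curves can be overlaid for comparison:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Overlay train and test ROC curves on one set of axes to compare them.
ax = plt.gca()
RocCurveDisplay.from_estimator(model, X_train, y_train, name="train", ax=ax)
RocCurveDisplay.from_estimator(model, X_test, y_test, name="test", ax=ax)
plt.show()
```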
Using XGBoost's built-in feature importance, we identify the features that contribute most to the model's predictions. Features such as `ca` (number of major vessels colored by fluoroscopy), `thalach` (maximum heart rate achieved), `exang` (exercise-induced angina), and `slope` (slope of the peak exercise ST segment) significantly impact the model's output. These features have high normalized gain values, indicating their importance in the model's decision-making process.
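Continuing the sketch, normalized gain values can be read directly from the fitted booster (the feature names will be the stand-in dataset's, not `ca`, `thalach`, etc.):

```python
import pandas as pd

# Gain-based importances from the fitted booster, normalized to sum to 1.
gain = pd.Series(model.get_booster().get_score(importance_type="gain"))
print((gain / gain.sum()).sort_values(ascending=False).head(10))
```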
The SHAP waterfall plot explains how individual features contribute to a specific prediction. For example, the feature `ca` had a significant positive impact, suggesting a higher likelihood of heart disease. Other features, such as `cp` (chest pain type), also influenced the prediction, showing the model's reasoning behind its decision.
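Continuing the sketch with the shap library (the row index 0 below is an arbitrary example):

```python
import shap

explainer = shap.Explainer(model)     # dispatches to TreeExplainer for XGBoost
shap_values = explainer(X_test)       # one explanation per test row
shap.plots.waterfall(shap_values[0])  # contributions behind a single prediction
```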
The SHAP beeswarm plot shows the impact of each feature across all predictions. The feature `ca` consistently showed a strong influence: higher values of `ca` increased the predicted probability of heart disease, while lower values were associated with a lower risk, demonstrating how the model uses these features to make predictions.
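Reusing `shap_values` from the previous sketch:

```python
# Global view: each dot is one test row, positioned by SHAP value
# and colored by the feature's value.
shap.plots.beeswarm(shap_values)
```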