Modeling Loans’ Probability of Default Using Machine Learning

How to treat credit risk with advanced machine learning techniques

Introduction and Business Need

Estimating the Probability of Default is one of the main tasks financial institutions undertake: before granting a loan, they need to gauge the likelihood that the borrower will default. Without a sound methodology for calculating this probability, an institution risks heavy losses that can create systemic risk and harm both the institution and the wider economy. Accurate prediction of default risk has long been a crucial subject for banks and other lenders, and the availability of open-source data and large datasets, together with advances in computational and algorithmic data analytics techniques, has renewed interest in this risk prediction task.

About the Dataset and Data Analysis

It is fair to say that we have all witnessed the rapid growth of data over the last decade. Data has become the fuel of the 21st century, used to meet business requirements.

a. Explanatory Variables Cleansing and Preprocessing

After applying dimensionality reduction to the dataset’s 144 variables, 26 features remained in the final selection.

b. Descriptive Analytics

Before going into the predictive models, it’s always fun to compute some statistics to get a global view of the data at hand.
The first question that comes to mind concerns the default rate. For this dataset, we find a default rate of 20.3%, which is high compared with an ordinary portfolio under normal circumstances (5–10%). Risky portfolios usually translate into high interest rates, as shown in Fig.1.

Fig.1: Interest Rate Distribution
Fig.2: Distribution of Invested Amount by Loan Purpose
Fig.3: Proportion of Fully Paid VS Charged-Off Loans Over Borrowers Home Ownership
Fig.4: Default Rate and Average Annual Income VS the Grade Level
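
As a minimal sketch of how a default rate like the one quoted above could be computed, the snippet below derives a binary target from a loan status column and averages it, both overall and per grade. The column names (`grade`, `loan_status`), the status labels and the toy records are assumptions made for illustration; they are not taken from the dataset used in this article.

```python
import pandas as pd

# Toy records; column names and values are illustrative assumptions.
loans = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "loan_status": [
        "Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Fully Paid",
        "Charged Off", "Fully Paid", "Charged Off", "Fully Paid", "Fully Paid",
    ],
})

# Binary target: 1 for a charged-off (defaulted) loan, 0 otherwise.
loans["default"] = (loans["loan_status"] == "Charged Off").astype(int)

# The portfolio-level default rate is the mean of the binary target.
overall_default_rate = loans["default"].mean()

# Default rate per grade, the kind of breakdown plotted in Fig.4.
default_rate_by_grade = loans.groupby("grade")["default"].mean()

print(f"Overall default rate: {overall_default_rate:.1%}")
print(default_rate_by_grade)
```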

c. Multicollinearity Assessment

Multicollinearity is mainly caused by including a variable that is computed from other variables in the dataset. It makes the regression coefficients hard to estimate precisely and weakens the statistical power of the applied model. Multicollinearity can be detected with the variance inflation factor (VIF), which quantifies how much the variance of an estimated coefficient is inflated by correlation among the explanatory variables.
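
To illustrate the idea, here is a minimal sketch of a VIF computation using only NumPy. Each feature is regressed on all the others, and its VIF is 1 / (1 - R²) of that regression. The function name `variance_inflation_factors` and the synthetic columns are assumptions made for this example, not the code used in the project. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        # Regress column j on all the other columns plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        # VIF_j = 1 / (1 - R_j^2)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic example: x3 is almost exactly the sum of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
x4 = rng.normal(size=500)  # unrelated to the other columns
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3, x4]))
print(vifs)
```

In this toy data the first three columns receive very large VIFs because each can be reconstructed from the other two, while the unrelated fourth column stays close to 1.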

About the Machine Learning Models

To predict the Probability of Default and reduce credit risk, we applied two supervised machine learning models from two different generations: logistic regression and Extreme Gradient Boosting (XGBoost).

Fig.5: Linear Regression Model VS Logistic Regression Model
Fig.6: Extreme Gradient Boost Model
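
To make the contrast in Fig.5 concrete, the sketch below shows how a logistic regression turns a linear score into a probability: the linear part can take any real value, and the logistic (sigmoid) function squeezes it into the interval (0, 1). The coefficients and the borrower's feature values are hypothetical numbers chosen for illustration only.

```python
import math

def linear_score(features, weights, intercept):
    """Linear part of the model: intercept + sum of weight * feature."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def sigmoid(z):
    """Logistic function, mapping any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two standardized features.
weights = [0.8, -1.2]
intercept = -1.0

# Hypothetical borrower described by the same two features.
borrower = [1.5, 0.5]

score = linear_score(borrower, weights, intercept)
pd_estimate = sigmoid(score)  # estimated probability of default
print(f"score = {score:.2f}, estimated PD = {pd_estimate:.3f}")
```

A plain linear regression would report the raw score, which can fall below 0 or above 1 and therefore cannot be read as a probability.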

Model Development and Comparison

a. Workflow

The figure below shows the supervised machine learning workflow that we followed, from the original dataset to training and validating the model. We implemented this workflow in Python, one of the most widely used programming languages for data science and machine learning.

Fig.7: Supervised Machine Learning Workflow
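
A workflow of this kind can be sketched in a few lines with scikit-learn. The snippet below is an illustration under stated assumptions, not the project's actual pipeline: the data is synthetic (generated with `make_classification` to mimic 26 features and roughly 20% defaults), and the specific steps shown (stratified split, scaling, logistic regression) are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: 26 features, roughly 20% positives.
X, y = make_classification(
    n_samples=2000, n_features=26, n_informative=8,
    weights=[0.8, 0.2], random_state=42,
)

# Hold out a stratified test set so both splits keep a similar default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42,
)

# The scaler is fitted on the training split only, inside the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of the positive class.
pd_test = model.predict_proba(X_test)[:, 1]
print(pd_test[:5])
```

Keeping the preprocessing inside the pipeline avoids leaking information from the test set into the training step.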

b. Validation and Comparison

Benchmark studies recommend using at least three performance measures to evaluate credit scoring models, namely the ROC AUC and metrics calculated from the confusion matrix (e.g. accuracy, recall, F1-score).

Table 1: Evaluation Metrics
Fig.8: Confusion Matrices for Each Model
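
For readers who want to see how these measures relate to one another, here is a minimal pure-Python sketch. The labels and scores are made up for illustration and are unrelated to the results in Table 1; the function names are likewise hypothetical. The ROC AUC is computed as the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen non-defaulted one, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = default) and predicted probabilities of default.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.2, 0.9, 0.1, 0.8]

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Note that the confusion-matrix metrics depend on the chosen threshold, whereas the ROC AUC is computed from the scores directly and does not.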

Conclusion and Future Orientation

In this article, we trained and compared two well-performing machine learning models, even though modeling the probability of default has always been considered a challenge for financial institutions. To further improve this work, it is important to interpret the obtained results in order to determine the main driving features of credit default. The final steps of this project are deploying the model and monitoring its performance as new records are observed.

For more information about us, feel free to check our website.

VALOORES BI & AI is an open Analytics platform that spans all aspects of the Analytics life cycle, from Data to Discovery to Deployment.
