Credit Scoring (III)
Asif Rajani
Business & People Leader | Finance & Risk Expert | Social Elevator Mechanic
“Medieval man was a cog in a wheel he did not understand; modern man is a cog in a complicated system he thinks he understands.”
Nassim Nicholas Taleb, The Bed of Procrustes
This is the third part of my article on credit scoring. The previous two can be found here:
To reduce the number of variables in a model, and hence make it more concise and faster to evaluate, one should perform variable selection. In logistic regression, the variable selection procedure is based on the following statistical hypothesis test:
H0: βi = 0 versus HA: βi ≠ 0
In logistic regression, the test statistic is:
χ² = (β̂i / s.e.(β̂i))²
Under the null hypothesis H0, this statistic follows a chi-square distribution with 1 degree of freedom.
This test statistic will reject the null hypothesis H0 if the estimated coefficient β̂i is high in absolute value compared to its standard error s.e.(β̂i).
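As a minimal sketch of this idea, the snippet below computes the statistic assuming the usual Wald form, that is, the squared ratio of the estimated coefficient to its standard error. The function name and the numeric estimates are hypothetical and serve only as an illustration.

```python
def wald_statistic(beta_hat: float, std_err: float) -> float:
    """Wald chi-square statistic for testing H0: beta_i = 0.

    Assumes the usual form (beta_hat / std_err) ** 2.
    """
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    return (beta_hat / std_err) ** 2


# Hypothetical estimates, for illustration only.
# A coefficient far from zero relative to its standard error gives a large statistic.
strong = wald_statistic(beta_hat=0.8, std_err=0.2)
# A coefficient close to zero relative to its standard error gives a small statistic.
weak = wald_statistic(beta_hat=0.1, std_err=0.2)

print(f"strong variable: {strong:.2f}")
print(f"weak variable:   {weak:.2f}")
```

The sign of the coefficient does not matter here: because the ratio is squared, only its size relative to the standard error drives the statistic.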
Based on the value of the test statistic, we calculate the p-value, which is the probability of obtaining a value at least as extreme as the one observed, assuming the null hypothesis is true. In practice, the p-value is compared against a significance level; commonly used levels are 0.01, 0.05 and 0.10, and the smaller the p-value, the stronger the evidence that the variable is significant.
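The p-value calculation can be sketched with the Python standard library alone. For a chi-square distribution with 1 degree of freedom, the upper-tail probability equals erfc(sqrt(x / 2)), because such a variable is the square of a standard normal. The function names and the default significance level of 0.05 are assumptions made for this example.

```python
import math


def chi2_1df_p_value(statistic: float) -> float:
    """Upper-tail probability P(X > statistic) for X ~ chi-square(1)."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    # chi-square(1) is Z**2 with Z standard normal, so
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Reject H0 when the p-value falls below the chosen significance level."""
    return p_value < alpha


# Hypothetical test statistics, for illustration only.
for stat in (16.0, 3.0, 0.25):
    p = chi2_1df_p_value(stat)
    print(f"statistic={stat:5.2f}  p-value={p:.4f}  significant at 5%: {is_significant(p)}")
```

In day-to-day work a statistics package would report these p-values directly; the point of the sketch is only to show how the statistic, the p-value and the significance level relate.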
Various variable selection procedures can now be used based on the p-value. An important point is that as the number of variables increases, the search space grows exponentially: with n variables, the number of possible non-empty variable subsets is 2^n − 1. With 4 variables, for example, there are already 15 possible subsets. Below you can find a graphical representation of the different possible subsets for a case with 4 variables:
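To make the size of the search space concrete, here is a small sketch that enumerates every non-empty subset of 4 variables and checks the count against 2^n − 1. The variable names are hypothetical placeholders.

```python
from itertools import combinations


def non_empty_subsets(variables):
    """Yield every non-empty subset of the given variables as a tuple."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


# Hypothetical variable names, for illustration only.
variables = ["age", "income", "tenure", "utilisation"]
subsets = list(non_empty_subsets(variables))

# The count matches the formula 2^n - 1.
assert len(subsets) == 2 ** len(variables) - 1
print(len(subsets))  # 15

# The growth is exponential: each extra variable roughly doubles the count.
for n in (4, 10, 20, 30):
    print(n, 2 ** n - 1)
```

Fitting a model for every subset quickly becomes impractical as n grows, which is the motivation for heuristic search.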
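To keep the search space under control, heuristic search procedures are required. Using the p-values, the variable space can be navigated in three possible ways: forward selection (start from an empty model and add variables one at a time), backward elimination (start from the full model and remove variables one at a time), and stepwise selection (a combination of both).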
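One common p-value heuristic of this kind, forward selection, can be sketched as follows. This is an illustration under stated assumptions, not a production routine: `fit_p_values` is a hypothetical callback that fits a logistic regression on the given variables and returns a p-value per variable, and the stopping threshold `alpha` is an assumed parameter.

```python
from typing import Callable, Dict, List, Sequence


def forward_selection(
    candidates: Sequence[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Greedy forward selection driven by p-values.

    Starts from an empty model and, in each round, adds the candidate
    with the smallest p-value, as long as that p-value is below alpha.
    """
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        best_var, best_p = None, alpha
        for var in remaining:
            # Hypothetical callback: refit the model with this extra variable
            # and read off the p-value of the newly added variable.
            p = fit_p_values(selected + [var])[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_var is None:
            break  # no remaining variable is significant at level alpha
        selected.append(best_var)
        remaining.remove(best_var)
    return selected


# Toy stand-in for a real model fit: fixed, made-up p-values per variable.
TOY_P = {"a": 0.001, "b": 0.20, "c": 0.03}


def toy_fit(variables: List[str]) -> Dict[str, float]:
    return {v: TOY_P[v] for v in variables}


print(forward_selection(["a", "b", "c"], toy_fit))  # ['a', 'c']
```

In a real setting the p-values change each time the model is refitted, so the callback would wrap an actual logistic regression fit. The greedy search fits at most on the order of n² models rather than 2^n − 1, at the price of possibly missing the best subset.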
Besides statistical significance, at least three other criteria should be considered when selecting the variables:
Source: Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS, 2016