History of credit scoring – If it works, use it!
A bit of historical context needed by all fellow modelers involved in the development and implementation of credit risk models in financial institutions, from the pen of one of the greatest experts in the field of credit scoring models.
Source: A survey of credit and behavioral scoring: forecasting financial risk of lending to consumers by Lyn C. Thomas
Credit scoring is essentially a way of recognizing the different groups in a population when one cannot see the characteristic that separates the groups but only related ones. This idea of discriminating between groups in a population was introduced in statistics by Fisher (1936). He sought to differentiate between two varieties of iris by measurements of the physical size of the plants and to differentiate the origins of skulls using their physical measurements. David Durand (1941) was the first to recognize that one could use the same techniques to discriminate between good and bad loans. His was a research project for the US National Bureau of Economic Research and was not used for any predictive purpose. At the same time some of the finance houses and mail order firms were having difficulties with their credit management. Decisions on whether to grant loans or send merchandise had been made judgmentally by credit analysts for many years. However, these credit analysts were being drafted into military service and there was a severe shortage of people with this expertise. So, the firms got the analysts to write down the rules of thumb they used to decide to whom to grant loans (Johnson, 1992). These rules were then used by non-experts to help make credit decisions, one of the first examples of expert systems. It did not take long after the war ended for some folk to connect these two events and to see the benefit of statistically derived models in lending decisions. The first consultancy was formed in San Francisco by Bill Fair and Earl Isaac in the early 1950s, and their clients at that time were mainly finance houses, retailers and mail order firms.
The arrival of credit cards in the late 1960s made the banks and other credit card issuers realize the usefulness of credit scoring. The number of people applying for credit cards each day made it impossible both in economic and manpower terms to do anything but automate the lending decision. When these organizations used credit scoring, they found that it also was a much better predictor than any judgmental scheme and default rates would drop by 50% or more — see Myers and Forgy (1963) for an early report on such success or Churchill, Nevin and Watson (1977) for one from a decade later. The only opposition came from those like Capon (1982) who argued ‘that the brute force empiricism of credit scoring offends against the traditions of our society’. He felt that there should be more dependence on credit history, and it should be possible to explain why certain characteristics are needed in a scoring system and others are not. The event that ensured the complete acceptance of credit scoring was the passing of the Equal Credit Opportunity Acts (ECOA, 1975, 1976) in the US in 1975 and 1976. These outlawed discriminating in the granting of credit unless the discrimination could be statistically justified. It is not often that lawmakers provide long term employment for anyone but lawyers, but this ensured that credit scoring analysis was to be a growth profession for the next 25 years.
After the implementation of the Equal Credit Opportunity Acts, there were a number of papers critical of the discriminant analysis/regression approach (Eisenbeis, 1977, 1978). These criticised the fact that the rule is only optimal for a small class of distributions (a point refuted by Hand, Oliver & Lunn (1996)). Others like Capon (1982) criticised the development and implementation of credit scoring systems in general because of the bias of the sample, its size, the fact that the system is sometimes overridden and the fact that there is no continuity in the score, so that at a birthday someone could change their score by several points. These issues were aired again in the review by Rosenberg and Gleit (1994). Empiricism has shown though that these scoring systems are very robust in most actual lending situations, a point made by Reichert et al. (1983) and reinforced by experience (Johnson, 1992).
This has proved to be the case and still is the case. In the 1980s the success of credit scoring in credit cards meant that banks started using scoring for their other products like personal loans, while in the last few years scoring has been used for home loans and small business loans. Also, in the 1990s the growth in direct marketing has led to the use of scorecards to improve the response rate to advertising campaigns. In fact, this was one of the earliest uses in the 1950s when Sears used scoring to decide to whom to send its catalogues (Lewis, 1992).
Advances in computing allowed other techniques to be tried to build scorecards. In the 1980s logistic regression and linear programming, the two main stalwarts of today’s card builders, were introduced. More recently, artificial intelligence techniques like neural networks have been piloted. At present the emphasis is on changing the objectives from trying to minimize the chance a customer will default on one particular product (so-called ‘default scoring’) to looking at how the firm can maximize the profit it can make from that customer (so-called ‘profit scoring’). Moreover, the original idea of estimating the risk of defaulting has been augmented by scorecards which estimate response (how likely is a consumer to respond to a direct mailing of a new product), usage (how likely is a consumer to use a product), retention (how likely is a consumer to keep using the product after the introductory offer period is over), attrition (will the consumer change to another lender), and debt management (if the consumer starts to become delinquent on the loan how successful are various approaches to prevent default).
Credit scoring nowadays is based on statistical or operational research methods. The statistical tools include discriminant analysis, which is essentially linear regression; a variant of this called logistic regression; and classification trees, sometimes called recursive partitioning algorithms. The operational research techniques include variants of linear programming. Most scorecard builders use one of these techniques or a combination of the techniques. Credit scoring also lends itself to a number of different non-parametric statistical and AI modelling approaches. Ones that have been piloted in the last few years include the ubiquitous neural networks, expert systems, genetic algorithms and nearest neighbor methods. It is interesting that so many different approaches can be used on the same classification problem. Part of the reason is that credit scoring has always been based on a pragmatic approach to the credit granting problem. If it works, use it! The object is to predict who will default, not to give explanations for why they default or to answer hypotheses on the relationship between default and other economic or social variables. That is what Capon (1982) considered to be one of the main objections to credit scoring in his critique of the subject.
So how are these various methods used?
A sample of previous applicants is taken, which can vary from a few thousand to as high as hundreds of thousands, (not a problem in an industry where firms often have portfolios of tens of millions of customers). For each applicant in the sample, one needs their application form details and their credit history over a fixed period — say 12 or 18 or 24 months. One then decides whether that history is acceptable, i.e., are they bad customers or not, where a definition of a bad customer is commonly taken to be someone who has missed three consecutive months of payments. There will be a number of customers where it is not possible to determine whether they are good or bad because they have not been customers long enough or their history is not clear. It is usual to remove this set of ‘intermediates’ from the sample. One question is what a suitable time horizon for the credit scoring forecast is — the time between the application and the good/bad classification. The norm seems to be twelve to eighteen months. Analysis shows that the default rate as a function of the time the customer has been with the organization builds up initially and it is only after twelve months or so (longer usually for loans) that it starts to stabilize. Thus, any shorter a horizon is underestimating the bad rate and not reflecting in full the types of characteristics that predict default. A time horizon of more than two years leaves the system open to population drift in that the distribution of the characteristics of a population change over time, and so the population sampled may be significantly different from that the scoring system will be used on. One is trying to use what are essentially cross-sectional models, i.e., ones that connect two snapshots of an individual at different times, to produce models that are stable when examined longitudinally over time. The time horizon — the time between these two snapshots — needs to be chosen so that the results are stable over time. 
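The good/bad labelling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 12-month minimum history and the boolean "missed payment" encoding are hypothetical choices, and the sketch treats only too-short histories as ‘intermediates’ (it does not try to model histories that are "not clear" for other reasons).

```python
from typing import Dict, List


def label_customer(missed: List[bool], min_months: int = 12, bad_run: int = 3) -> str:
    """Classify one payment history as 'bad', 'good' or 'intermediate'.

    missed[i] is True when the payment due in month i was missed.
    min_months and bad_run are illustrative defaults, not fixed standards.
    """
    run = 0  # length of the current streak of consecutive missed payments
    for was_missed in missed:
        run = run + 1 if was_missed else 0
        if run >= bad_run:
            return "bad"  # e.g. three consecutive missed payments
    if len(missed) < min_months:
        return "intermediate"  # not a customer long enough to judge
    return "good"


def build_sample(histories: Dict[str, List[bool]]) -> Dict[str, str]:
    """Label every customer and drop the 'intermediates' from the sample."""
    labels = {cid: label_customer(h) for cid, h in histories.items()}
    return {cid: lab for cid, lab in labels.items() if lab != "intermediate"}
```

Calling `build_sample` on a dictionary of customer histories returns only the customers that can be labelled ‘good’ or ‘bad’, which is the set a scorecard would then be fitted on.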
Another open question is what proportion of ‘goods’ and ‘bads’ to have in the sample. Should it reflect the proportions in the population or should it have equal numbers of ‘goods’ and ‘bads’. Henley (1995) discusses some of these points in his thesis. Credit scoring then becomes a classification problem where the input characteristics are the answers to the application form questions and the results of a check with a credit reference bureau and the output is the division into ‘goods’ and ‘bads’.
Overview of behavioral scoring
Behavioral scoring systems allow lenders to make better decisions in managing existing clients by forecasting their future performance. The decisions to be made include what credit limit to assign, whether to market new products to these particular clients, and if the account turns bad how to manage the recovery of the debt. The extra information in behavioral scoring systems compared with (application) credit scoring systems is the repayment and ordering history of this customer. Behavioral scoring models split into two approaches: those which seek to use the credit scoring methods but with these extra variables added, and those which build probability models of customer behavior. The latter also split into two classes depending on whether the information to estimate the parameters is obtained from the sample of previous customers or is obtained by Bayesian methods which update the firm’s belief in the light of the customer’s own behavior. In both cases the models are essentially Markov chains in which the customer jumps from state to state depending on his behavior. In the credit scoring approaches to behavioral scoring one uses the credit scoring variables and includes others which describe the behavior. These are obtained from the sample histories by picking some point of time as the observation point. The time preceding this (say the previous 12 months) is the performance period, and variables are added which describe what happened then: average balance, number of payments missed, etc. A time some 18 months or so after the observation point is taken as the performance point and the customer’s behavior by then is assessed as good or bad in the usual way. Hopper and Lewis (1992) give a careful account of how behavioral scoring systems are used in practice and how new systems can be introduced.
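As a rough illustration of the Markov chain view of customer behavior mentioned above, the sketch below estimates transition probabilities between account states from a sample of previous customers (the first of the two classes, not the Bayesian one). The state encoding (number of payments currently overdue) and the function name are assumptions made for the example, not something prescribed by the source.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def estimate_transitions(
    histories: Sequence[Sequence[Hashable]],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate P(next state | current state) by counting observed month-to-month moves."""
    counts = defaultdict(Counter)
    for states in histories:
        # Pair each month's state with the following month's state.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        matrix[current] = {nxt: n / total for nxt, n in row.items()}
    return matrix


# Hypothetical monthly states: number of payments currently overdue.
sample_histories = [
    [0, 0, 1, 0],
    [0, 1, 2, 3],
    [0, 0, 0, 0],
]
transitions = estimate_transitions(sample_histories)
```

Each row of `transitions` sums to one and gives the estimated chance of moving from one delinquency state to another in a month; states that are never left in the sample simply have no row.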
Hopper and Lewis advocate the Champion vs. Challenger approach, where new systems are run on a subset of the customers and their performance is compared with that of the existing system. This makes the point yet again that it takes time to recognize whether a scoring system is discriminating well. The choice of time horizon is probably even more critical for behavioral scoring systems than for credit scoring systems. Behavioral scoring is trying to develop a longitudinal forecasting system by using cross-sectional data, i.e., the state of the clients at the end of the performance period and at the end of the outcome period. Thus, the time between these periods will be crucial in developing robust systems. Experimentation (and data limitations) usually suggest a 12- or 18-month period. Some practitioners use a shorter period, say 6 months, and then build a second scoring system to estimate which sort of behavior at six months will lead to the client eventually defaulting and define this 6-month behavior as ‘bad’ in the main scorecard. One can use older data for the second scorecard while using almost current data for the main scorecard.
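A minimal sketch of how a Champion vs. Challenger split might be set up is given below. Everything in it is an assumption made for illustration: the hash-based assignment, the 10% challenger share, and the function names are hypothetical and are not taken from Hopper and Lewis (1992).

```python
import hashlib
from typing import Dict, Iterable, Tuple


def assign_arm(customer_id: str, challenger_share: float = 0.1) -> str:
    """Deterministically route a customer to the 'champion' or 'challenger' system.

    Hashing the customer id keeps the assignment stable over time, so the same
    customer is always scored by the same system during the trial.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "challenger" if bucket < round(challenger_share * 10_000) else "champion"


def bad_rate_by_arm(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the observed bad rate per arm from (arm, is_bad) outcome records."""
    totals: Dict[str, int] = {}
    bads: Dict[str, int] = {}
    for arm, is_bad in records:
        totals[arm] = totals.get(arm, 0) + 1
        bads[arm] = bads.get(arm, 0) + (1 if is_bad else 0)
    return {arm: bads[arm] / totals[arm] for arm in totals}
```

In practice the outcome records only become available once the outcome period has elapsed, which is why the comparison takes 12 to 18 months to become informative.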
Credit and behavioral scoring have become established as major tools in forecasting financial risk in consumer lending and in helping organizations cope with the risk of default in consumer lending. Once an organization takes up statistically based and operational-research-based credit scoring, it hardly ever returns to judgment-based methods (Lewis, 1992). In practice, the fears of Capon (1982) and the difficulties alluded to in Rosenberg and Gleit (1994) have been allayed. As scoring usage expands to newer areas, mortgage scoring for example, there may be reasons why it should be combined with judgmental systems or ones based on ‘loan to value’ of the secured item, which traditionally have proved successful. The organization needs to identify what risk it wishes to protect against and whether scoring is the appropriate technique for quantifying that risk.
Can we expect the entry of the progressivist 'social justice' culture into the regulation of credit scoring?
But there are still some social issues related to the practical use of credit scoring as a forecasting tool. For example, it is illegal to use some characteristics (race, sex, religion), but that does not prevent some authors (Crook, 2000) from suggesting that there are surrogate variables which mean scoring systems do discriminate in these areas. Other authors (Chandler & Ewert, 1976) argue that the relationship of these banned characteristics with other allowed characteristics allows the very discrimination which one is seeking to avoid.
Progress in incorporating economic effects would mean scorecards would be more robust to changes in the economic environment and so could be used for longer time periods before having to be rebuilt. Profit scoring would allow organizations to have a tool that is more aligned to their overall objective than the present tools which estimate the risk of consumers defaulting. However, if these developments are successful there may well be major impacts on the credit industry and on consumers. For the industry, those with the best models of consumer behavior will make the best profit, so there will be strategic advantages in having models which best analyze the wealth of data coming through. Firms that are confident in their models will start cherry picking and going for the most profitable customers. The subsequent price changes will lead to a levelling of the profits, but it will also lead to a standardization between financial and retail organizations about the types of consumers they want. Thus, some people will be able to borrow from all and will be the target of most organizations, but there may be an underclass of consumers who cannot borrow (certainly not in the plastic card market) and who are not targeted for any marketing. With lending and retailing becoming more automated, these consumers will face growing disadvantages, and this may lead to some governments acting in the name of social justice.