Artificial Neural Network for credit risk modeling.
Shashi Dhungel
AI | ML | Analytics | Data Science | Software | Product Development | Optimization | Leadership
For the love of data science.
I ended up being a datacrat because I enjoy finding mathematical representations of mundane problems. “If you do what you love, you’ll never work a day in your life,” said Marc Anthony. I believe it is true. My work allows me to do what I love doing. I love making estimates and predictions. Some estimates are right and some are wrong. I do that for a living, and I also do it in my everyday life. Many of us do it unconsciously; we make estimates and predictions every day. For example: how much water do I use in the shower every day? How much do I spend on gas every month? How long will it take me to get there from here? We are confronted with questions like these every day.
The US Energy Information Administration (EIA) makes many long-term and short-term forecasts of oil and energy prices. More often than not these forecasts end up being wrong. I chose the EIA example to highlight that models can be useful even when their actual predictions are wrong.
“Essentially, all models are wrong, but some are useful.” - George Box.
One can formulate mathematical equations in many ways to answer questions like the ones above. There are many ways to tackle mundane questions with mathematics, and with enough practice you get good at it. In Malcolm Gladwell's words, it takes 10,000 hours of practice before you become good at something.
The company I work for was moving from one part of the city to another. The new location would shave approximately 4 miles each way off my daily commute! I chose miles to represent the change for one good reason. I could have chosen the time saved to express my excitement, but I chose miles. For some people four miles takes five minutes, for others it may take 15, and it will be different for many more; but 4 miles is 4 miles for everyone. One of the early lessons I learned in my data science career is that choosing the right data helps a lot in data storytelling.
It turns out that moving a corporate office with about 200 employees requires a lot of planning. Anyway, the move gave me an opportunity to work from home for the whole week.
I had one goal in mind - build an artificial neural network from scratch in Python.
I was excited!
A week deep in artificial neural nets.
I was interested in building a neural net that could predict with better accuracy than the logistic regression, decision tree and boosting models I have been using regularly in my work. I have built neural nets in R in the past, and I was aware of at least two hyperparameters I could tune and their effects on stabilizing the graph. That neural net had not meaningfully outperformed the logistic regression model on the dataset at the time, so I chose logistic regression and quietly forgot about the neural net. One of the big reasons for choosing logistic regression was its simplicity. Data science is often viewed as a ‘black box,’ and complex, intricate models are not the place to begin.
But this time I was determined. Over the next six days I worked out of the kitchen island in my house. Armed with a pot of coffee every morning, I started the day at around 7:45 AM and worked through 5:00 PM religiously, and the week ended up becoming a six-day work week of more than 70 hours. I was in full force trying to understand and build an artificial neural network to predict credit risk. Although I was working from home, I actually spent less time with my family, less time cooking and eating dinner, helping with homework and doing household chores. I found myself reading about neural nets in every setting.
The internet is a cornucopia of learning resources!
Thanks to open source ideas and the internet, freely available learning resources can cater to the needs of anyone. Stack Overflow, Stack Exchange, Quora, YouTube and MIT OpenCourseWare all provided a lot of fuel to help me understand and build the artificial neural net. There were numerous other webpages, tutorials and cheat sheets that I cruised through. The online Python community has good documentation on the language and its libraries. I ended up spending a lot of time coding in Python using Jupyter notebooks. I used pandas, NumPy and TensorFlow extensively, and the Matplotlib library was very useful for visualizing data at several steps. This exercise required me to revisit partial derivatives, the chain rule, slopes and tangents, and matrix operations, among others. There will always be some linear algebra in data science, and ANNs use it too. The sigmoid function, randomness and probability estimates are ubiquitous, and I found ANNs to have more than a healthy dose of them.
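As a toy illustration of the math involved, here is a minimal NumPy sketch of the sigmoid activation and its derivative, the pieces that the chain rule and partial derivatives get applied to during backpropagation. The function names and random data are mine, for illustration only, and are not taken from the actual model.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into (0, 1); the activation used in each layer."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid, which the chain rule uses during backpropagation."""
    s = sigmoid(z)
    return s * (1.0 - s)

# A tiny forward pass for one output node: inputs times weights, plus bias,
# pushed through the activation.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 made-up examples with 3 features each
W = rng.normal(size=(3, 1))          # weights for a single output node
b = np.zeros((1, 1))                 # bias
probabilities = sigmoid(X @ W + b)   # predicted probabilities, shape (4, 1)
```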
Casting the credit risk problem as an image recognition problem.
TensorFlow is one of the most popular libraries for building deep neural networks that predict image classes, and many of its canonical examples are image classification problems. While the problem I am trying to solve is also a classification problem, with only two classes, it was difficult for me to cast it as a digit recognition problem. If, however, I can formulate the credit risk problem in matrix form, then I can use the same matrix operations to minimize the loss and increase the accuracy of the function generating the result. I care less about what the function is and more about how far off I end up from the expected value. So if I can make a neural net that gets closer and closer to the expected value through that repetitive process, then I have found a neural net that will work in my setting, and there is a chance I will end up with a better model to predict credit risk than what I have in hand. The models I build are between 55 and 60 percent accurate; it is quite difficult to predict human behavior from socioeconomic data.
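To make the idea concrete, here is a rough sketch of what a two-class credit risk network could look like, written against the Keras API of a recent TensorFlow 2.x release rather than the lower-level code I actually wrote; the feature count, layer sizes and learning rate are placeholders.

```python
import tensorflow as tf

n_features = 20  # placeholder; the real count comes from the cleaned data

# Two hidden layers with sigmoid activations and a single sigmoid output node
# giving the probability of a bad credit outcome; binary cross-entropy is the
# loss being minimized, much as categorical cross-entropy is in the image examples.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="sigmoid", input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# model.fit(X_train, y_train, epochs=50, batch_size=256,
#           validation_data=(X_val, y_val))
```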
I concluded that if I could build a matrix of my data in the same shape as the image recognition NumPy arrays that TensorFlow works with, then I could apply the same concepts downstream. I expected my input data to have more variability across dimensions, which might make the learning process slower, but I had the advantage of using fewer dimensions; I was leaving that for my 10 GB of RAM to compute. So if I could formulate my data in matrix form, I was all set! It took quite a bit of thinking and some matrix calculations before I was able to take my data and cast it into an array like the ones used in TensorFlow.
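A simplified sketch of that casting step follows, using made-up column names: the feature columns become a float32 matrix and the binary label becomes a two-column one-hot array, mirroring the flattened pixels and ten-class digit labels of the image tutorials.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned frame: a few numeric features plus a 0/1 default flag.
df = pd.DataFrame({
    "income":        [52000, 34000, 78000, 41000],
    "utilization":   [0.31, 0.88, 0.12, 0.65],
    "delinquencies": [0, 3, 0, 1],
    "default":       [0, 1, 0, 1],
})

# Features become a float32 matrix (rows = accounts, columns = attributes),
# analogous to the flattened pixel matrix in the image tutorials.
X = df.drop(columns="default").to_numpy(dtype=np.float32)

# Labels become a two-column one-hot array, mirroring the ten-class digit
# labels but with only "good" and "bad" outcomes.
y = np.eye(2, dtype=np.float32)[df["default"].to_numpy()]

print(X.shape, y.shape)   # (4, 3) (4, 2)
```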
Data wrangling is an art, and it requires a lot of patience and a healthy dose of creativity!
More often than not, data science boils down to data quality and quantity. Since data became ‘Big Data’ we also talk about the Vs (Velocity, Variety, Volume, etc.). I started with a sas7bdat file of about 6 GB. The variety in this context was more organic: the data were not collected uniformly, so some values were ‘a’ and some were ‘A’ even when they meant the same thing. They lacked consistency in physical, logical and conceptual data modeling. There were problems associated with UTF encoding and, like many datasets we deal with every day, there was a lot of missing data. If it had been a dataset ready for a scatter matrix, the volume would probably have been one-third. In a nutshell, I was attempting to build this model from a dataset that was generated by different people, in different states, in different operating systems and for different purposes. The data needed a lot of cleaning and normalization.
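Roughly, the cleaning looked like the following pandas sketch; the file name is hypothetical and the real rules were more involved.

```python
import pandas as pd

# Hypothetical file name; pandas reads sas7bdat directly, and an explicit
# encoding helps with the UTF issues mentioned above.
df = pd.read_sas("credit_history.sas7bdat", format="sas7bdat", encoding="latin-1")

# Make categorical values consistent so that 'a' and 'A' mean the same thing.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip().str.lower())

# Fill missing numeric values with the column median, then rescale each
# column to [0, 1] so no single feature dominates the gradient updates.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
```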
The ANN outperformed my best model by 7 points.
I had two things in mind on Monday morning: I wanted to understand the conceptual idea behind the artificial neural net, and I wanted to apply it programmatically to solve a credit risk problem. At the end of the 70-hour sprint I had built a neural net with better accuracy than all the other models I have been testing for credit risk prediction. The accuracy of the neural net was 7 points better than my go-to logistic regression model.
It was a thrilling experience. I still have not exhausted all the possibilities for fine-tuning the neural net I built, but I certainly have built a better model with the resources at my disposal. There are many dimensions along which it can be improved further, one of the easiest being more computing power, so that the learning rate of the NN can be reduced further or the number of nodes increased. There are many ways I think I can fine-tune its hyperparameters. A week was not enough for me to master the neural net, but it certainly has reinvigorated my curiosity.
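The kind of tuning loop I have in mind is sketched below. It assumes a hypothetical helper that rebuilds a small Keras-style model for each setting and that training and validation splits already exist; the grid values are arbitrary.

```python
import tensorflow as tf

def build_model(learning_rate, hidden_nodes, n_features=20):
    """Hypothetical helper that rebuilds a small network for each grid point."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_nodes, activation="sigmoid",
                              input_shape=(n_features,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Grid over the two hyperparameters called out above; X_train, y_train,
# X_val and y_val are assumed to come from the cleaned dataset.
results = {}
for lr in (0.1, 0.01, 0.001):
    for nodes in (32, 64, 128):
        model = build_model(lr, nodes)
        model.fit(X_train, y_train, epochs=30, batch_size=256, verbose=0)
        _, accuracy = model.evaluate(X_val, y_val, verbose=0)
        results[(lr, nodes)] = accuracy

print("best (learning rate, nodes):", max(results, key=results.get))
```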
Notes: I wanted to put the code with explanations on GitHub, but given that the process and data are both proprietary, I am not able to publish it. I will post again if I am able to create a repository for the ANN I have built. Picture credits to tensorflow.com.