How to add ML classifier feature to your product in 4 days

This is the second blog post in the series that started with "Creation of ethscore.net in 16 days". You don't have to read the first one; I tried to write them to be as independent of each other as possible. Even with blog posts, I am thinking about dependencies!

In this post, I will explain how you can add a basic ML classifier feature to your product in just 4 days. In this case, I wanted to add a trust score for an Ethereum wallet ID: a measure of how much you can trust that account.

Day - 1

First, I had to find data. At this point, I had not even decided whether I would go for a supervised machine learning algorithm or an unsupervised one; I wanted to make that decision based on what kind of data I could find. After a long day of researching data online, I found a list of around 700 Ethereum wallet IDs that had been reported as fraudulent. I randomly checked a few of them; it was mostly phishing fraud. This was a good start, but I needed the detailed data of these accounts, so I wrote a script to fetch it from https://etherscan.io/

I copied below a snippet of the etherscan.io API integration code, in case you need it.

import json
import requests

# i is a dict with an "address" key; ETHERSCAN_API_KEY holds your own etherscan.io API key
etherscan_transactions_url = ("https://api.etherscan.io/api?module=account&action=txlist"
    "&address=" + i['address'] + "&startblock=0&endblock=99999999"
    "&page=1&offset=10000&sort=asc&apikey=" + ETHERSCAN_API_KEY)
response = requests.get(etherscan_transactions_url)
transactions_json = response.json()

# Write the full transaction history to one JSON file per wallet
with open(output_filepath + i['address'] + ".json", 'w') as fp:
    fp.write(json.dumps(transactions_json))

The code above creates one file per Ethereum wallet ID, containing its full transaction history. However, I needed summarised, structured data that I could feed into the training model, so I wrote another script to scan each file and calculate summary features such as the total number of transactions.
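If it helps, here is a minimal sketch of what that aggregation script could look like. The "result", "from", "to" and "value" fields are what etherscan's txlist endpoint returns; the exact feature columns and file names are just illustrative, not necessarily the ones I ended up with.

import csv
import glob
import json
import os

# Build one row of summary features per wallet file
rows = []
for path in glob.glob(output_filepath + "*.json"):
    address = os.path.basename(path)[:-len(".json")]
    with open(path) as fp:
        txs = json.load(fp).get("result", [])
    sent = [t for t in txs if t["from"].lower() == address.lower()]
    received = [t for t in txs if t["to"].lower() == address.lower()]
    rows.append({
        "address": address,
        "total_transactions": len(txs),
        "sent_count": len(sent),
        "received_count": len(received),
        "total_value_wei": sum(int(t["value"]) for t in txs),
        "unique_counterparties": len({t["to"] for t in sent} | {t["from"] for t in received}),
    })

with open("features.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)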

Day - 2

With the two scripts above, I had a CSV file ready to train the model. Or did I? Of course not, because in order to train the model, I needed normal accounts as well as the fraudulent ones; I could not train the model with only fraudulent account data. To get them, I found a contract that is used to buy NFTs and decided that anyone who bought an NFT through that contract would count as a normal account. Now, you might think a fraudster could use the same account both for phishing and for personal investment. However, that is rarely the case: cyber criminals generally do not use the same account for their personal usage (e.g. to buy NFTs) and for their criminal activities. I used the same two scripts above to create the data for the normal Ethereum wallet IDs, as sketched below.
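In case it is useful, here is a rough sketch of that step: the same txlist endpoint is queried for the NFT contract, and the senders are treated as normal wallets. The contract address is a placeholder, and the labeling/merge at the end assumes the feature script above has already produced one CSV per group; adapt the names to your own setup.

import pandas as pd
import requests

NFT_CONTRACT = "0x0000000000000000000000000000000000000000"  # placeholder for the contract I picked

url = ("https://api.etherscan.io/api?module=account&action=txlist"
       "&address=" + NFT_CONTRACT + "&startblock=0&endblock=99999999"
       "&page=1&offset=10000&sort=asc&apikey=" + ETHERSCAN_API_KEY)
txs = requests.get(url).json().get("result", [])

# Everyone who sent a transaction to the NFT contract counts as a "normal" wallet
normal_addresses = sorted({t["from"].lower() for t in txs})

# After running the two scripts above for both groups, label and merge the features
fraud_df = pd.read_csv("fraud_features.csv").assign(fraud=1)
normal_df = pd.read_csv("normal_features.csv").assign(fraud=0)
pd.concat([fraud_df, normal_df]).to_csv("training_data.csv", index=False)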

After having my data ready in a single CSV file, with a "fraud" column set to 1 or 0 to label each row, I was ready to train my model. I looked around AWS and GCP to see if I could achieve something very quickly. However, they both proved too complicated for the basic model I was trying to create. AWS offers ready-made models, but they are for very specific use cases. For example, if you want to check whether a credit card transaction is fraudulent, you can start using AWS Fraud Detector with just a few clicks, and the same goes for detecting the creation of fake accounts. However, if you want to build something custom, you have to build an entire flow: S3, SageMaker, write your code in SageMaker (maybe using XGBoost), create a Lambda function, expose your Lambda function through API Gateway, and solve all of the configuration problems you would face along the way. I needed something quick and simple.

So, I decided to use Dataiku. I knew Dataiku from one of my consultancy engagements, where I was responsible for setting up Dataiku for a global mining company's data science function, so I knew how easy it was to train a model and create an API service for scoring. It took me only a few hours to feed in my CSV file and create the flow below.

[Image: Dataiku flow screenshot]

Dataiku has a feature to compare different algorithms and report their success rates. In my case, all three supervised algorithms performed almost the same, so I chose the simplest one: logistic regression. I found this feature very powerful. Normally, if I had to do all of that from scratch, it would have taken me weeks; with Dataiku, it took only a few hours.
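Dataiku does this comparison visually, but to give a sense of what it amounts to, here is a rough local equivalent using scikit-learn. This is not what Dataiku runs internally, and the two algorithms I pit against logistic regression here are just examples, not necessarily the ones Dataiku compared in my flow; the CSV and column names follow the sketches above.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("training_data.csv")
X = df.drop(columns=["address", "fraud"])
y = df["fraud"]

# Compare a few supervised classifiers with 5-fold cross-validated ROC AUC
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")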

Day - 3

The last step was to expose an API to classify a given data set as fraud or not. This proved to be complicated, as the documentation missed the requirement to create an extension service in order to be able to create an API. It took me half a day to find out what the problem was. Luckily, the Dataiku Forum was very responsive. If you are interested in seeing the problem I had and the solution, this is the post I created in the Dataiku Community.

I quickly integrated my Lambda function with the Dataiku API service. The service provides not only the "1" or "0" classification but also the probability of that decision. I used the probability as well, which helped me create a score rather than a black-and-white result. I, shamefully, added a few hard-coded rules on top of the scoring. Hopefully, when I improve the ML model, I will be able to remove this hard-coded interference.
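Here is a sketch of that Lambda-side integration. The endpoint URL, the response shape (a prediction plus per-class probabilities), and the hard-coded rule at the end are all assumptions for illustration; check the Dataiku API node documentation for the exact request and response contract of your deployment.

import json
import urllib.request

# Placeholder URL for the deployed Dataiku API service endpoint
DATAIKU_PREDICT_URL = "https://my-dataiku-node.example.com/public/api/v1/ethscore/fraud/predict"

def score_wallet(features):
    # Ask the Dataiku API service for a classification and its probability
    payload = json.dumps({"features": features}).encode()
    request = urllib.request.Request(
        DATAIKU_PREDICT_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        result = json.load(response)["result"]  # assumed response shape

    fraud_probability = result["probas"]["1"]  # probability of the "fraud" class

    # Use the probability to build a 0-100 trust score instead of a yes/no answer
    score = round((1 - fraud_probability) * 100)

    # An example of the kind of hard-coded rule mentioned above (illustrative)
    if features.get("total_transactions", 0) == 0:
        score = min(score, 50)  # cap brand-new wallets at a neutral score
    return score

I used urllib from the standard library in this sketch because, unlike requests, it is available in the AWS Lambda Python runtime without packaging extra dependencies.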

If you want to learn more about Dataiku or need some help to get started, let me know. I will be happy to help.

Day - 4

Now it was time to add the visualisation to the page. It felt like I had made the right choice in using MUI, because it had the exact component I needed to show the result. I added a new React component that takes the score information and displays it as below.

[Image: ethscore.net screenshot]

This was the second blog post of the "Creation of ethscore.net in 16 days" series. In the next blog post, the third one, I will share the AWS architecture of this solution.
