At TheDelatacube.ai I am working with a client who has all their historical data on on-premise servers. As they go digital, it is imperative that they move their data to the cloud. However, the MD-CEO told me, 'I heard the cloud ultimately costs a lot, and that's why our IT team suggested we continue with our on-premise servers.' I assured him that what he heard about cloud cost has indeed happened to many organizations; however, there are ways to keep it under control, and that is why cloud computing is not everyone's cup of tea: it requires proper knowledge.
Today I read similar points in one of the data science group chats I am part of. I have been using cloud infrastructure for the last 6 years and I faced the same challenge earlier. At the same time, with proper processes we were able to bring roughly 80% of that cost under control. So I thought of sharing my learnings on how to control cloud cost, especially if you are a B2C company with millions of consumers coming to your platform and performing billions of events that your data science team analyzes to support business decision making.
- Analyze - Every cloud provider offers a detailed breakdown of the cost. They even offer an hourly breakdown of usage, so you can identify when consumption is high and correlate it with the jobs running at that time. To analyze better, we gave each team member an individual account so that we could monitor who was contributing how much computation cost and, therefore, which kinds of analyses were costing us the most. As we know, the devil is in the details, and this granular cost analysis will help you identify the ghost (a small sketch of such a breakdown is included after this list).
- Data Indexing - One of the biggest reasons our computation cost was going high was that our data was not well indexed. This is a basic step, yet we were a data-scientist-heavy team and not everyone possessed enough data engineering knowledge. The cost guzzlers are the 'search' and 'sort' queries. If data is well indexed, say by year-month-date, and you are reading data for a specific date range, the search will be fast, thus reducing the computation cost. For different purposes, you need to index the data on different parameters before using it (see the partitioning sketch after this list).
- Optimized Query Writing - When we analyzed the cost data, we found that a few members of our team were contributing 80% of the cost, while others were contributing much less. And this was not correlated with the complexity of their projects or the amount of data they were processing. The difference was that the first set of people were writing amateur code. For example, say a table has 10 columns and 10 million rows. One of the columns is the consumer category, Free vs Paid, where 70% are free consumers and 30% are paid consumers, and I want to analyze only the paid consumers. The amateur coder will read the full table into memory and then filter it down to the paid consumers, while the optimized coder will read only the filtered data by using a where clause at the time of reading. If the data is also indexed by date and consumer category, then reading the 30% paid consumers for a date range will be very fast and will cost roughly 70% less than the amateur code (see the before/after query sketch after this list).
- Training - Data scientists come from various backgrounds and not everyone has been trained in data structures and computational algorithms, as these are mostly taught in computer science courses and not in most statistics or mathematics courses. Therefore, irrespective of seniority, we arranged a 'Big Data Management Academy' where we trained everyone joining the team on the basics of handling big data. They were given hands-on tests along with buddies to help; once they cleared the tests, they got access to the cloud accounts. We also created a group of young champions who could help others when they got stuck.
- Build Individual Thresholds - Since we had a separate account for each data scientist, we also set an upper bound on computing units. For example, if you use Databricks, DBU is the unit, and we offered x DBUs on large machines to every data scientist as a default. If someone's code does not execute within that limit, or takes too long, they reach out to the Champions, who scrutinize the code and find ways to optimize it, or else increase the DBUs or offer an XL machine. Those upgrades are on demand and are reduced back to the default once the job is over. That way, we knew an upper bound on the monthly cost, which was well within our budget (see the usage-monitoring sketch after this list).
- Do You Need All the Data? - This is a question we need to ask ourselves while analyzing data. Big data has given us a lot of information; however, more data does not always mean more information. Data scientists fancy large data, yet often the same analysis could have been done on the last 30 days of data instead of 180 days. Or, when we are building the first version of a model, we are doing a lot of experimentation and do not need the complete 6-12 months of data. During initial development of an analysis or model, we may use only data from the recent past (say 7-30 days) or take a 10-20% stratified random sample from the last 12 months, which is good enough for the initial experimentation. Therefore, having a sandbox facility where development work happens on smaller data is important. We set this up as well, which reduced cost further (see the sampling sketch after this list).
- Pre-computed Data Tables - Last but not least, when we looked at the data KPIs that most consumer-specific models use, we found a 60% overlap among them. Since different data scientists were working on different projects, each of them was individually computing the same KPIs for their modeling purposes. We collected those common KPIs, built a data pipeline to precompute them during lean hours, and kept them in a master data table. Data scientists working on any consumer-specific modeling work can now simply fetch those values from the pre-computed table, which reduces both time and computation cost (see the KPI pipeline sketch after this list).
- Don't Forget Data Storage - You must have heard that on the cloud, data storage cost is minimal. However, this is a cumulative cost. For example, if you are collecting 10 TB of incremental data every month, the first month you pay for 10 TB, the second month for 20 TB, the third month for 30 TB, and so on. So, over time, the storage cost also grows. Thus, data archiving and an archival policy are an important part of cloud cost reduction. Apart from archiving, we should revisit all the temp tables created during various experiments; those that are now redundant should be permanently deleted (see the archival-rule sketch after this list).
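To make the 'Analyze' point concrete, here is a minimal pandas sketch of the kind of breakdown we looked at, assuming a hypothetical hourly cost export. The file name and columns below are made up; every provider's billing export has its own schema, so adapt accordingly.

```python
import pandas as pd

# Hypothetical hourly billing export; real exports (AWS CUR, Databricks usage
# logs, GCP billing export, etc.) have different schemas - adapt the column names.
costs = pd.read_csv("cost_export.csv", parse_dates=["hour"])

# Who is contributing how much computation cost?
by_user = costs.groupby("user")["cost_usd"].sum().sort_values(ascending=False)

# When is consumption high? Correlate spikes with the jobs running at that time.
by_hour_of_day = costs.groupby(costs["hour"].dt.hour)["cost_usd"].sum()

# Which kinds of analyses are costing the most?
by_job = costs.groupby("job_name")["cost_usd"].sum().nlargest(10)

print(by_user.head(10))
print(by_hour_of_day)
print(by_job)
```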
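For 'Data Indexing': on most cloud data lakes, the practical way to 'index' by year-month-date is partitioning. Below is a minimal PySpark sketch; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

events = spark.read.parquet("s3://my-bucket/raw/events/")   # hypothetical path

# Write the data partitioned by year/month/day so that date-range reads only
# touch the relevant folders instead of scanning the full history.
(events
 .withColumn("year",  F.year("event_date"))
 .withColumn("month", F.month("event_date"))
 .withColumn("day",   F.dayofmonth("event_date"))
 .write.mode("overwrite")
 .partitionBy("year", "month", "day")
 .parquet("s3://my-bucket/curated/events/"))

# A date-range read now prunes partitions instead of scanning everything.
jan_first_week = (spark.read.parquet("s3://my-bucket/curated/events/")
                  .where("year = 2023 AND month = 1 AND day BETWEEN 1 AND 7"))
```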
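For 'Optimized Query Writing', here is the Free vs Paid example again as a hedged PySpark sketch; paths, dates and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-at-read").getOrCreate()

# Amateur version: pull the full 10-million-row table locally, then filter.
all_rows = spark.read.parquet("s3://my-bucket/curated/consumers/").toPandas()
paid_slow = all_rows[all_rows["consumer_category"] == "Paid"]

# Optimized version: push the filter into the read itself, so only ~30% of the
# rows (and, if the data is partitioned by date and category, only the relevant
# files) are ever scanned.
paid_fast = (spark.read.parquet("s3://my-bucket/curated/consumers/")
             .where((F.col("consumer_category") == "Paid") &
                    (F.col("event_date").between("2023-01-01", "2023-01-31"))))
```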
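For 'Build Individual Thresholds', here is a very small sketch of the monitoring side, assuming a hypothetical per-user DBU usage export. The file, columns and limit are illustrative; Databricks exposes usage through its billable-usage logs, whose actual schema differs.

```python
import pandas as pd

DEFAULT_DBU_LIMIT = 500   # illustrative default monthly allotment per data scientist

# Hypothetical per-user usage export with 'user' and 'dbus' columns.
usage = pd.read_csv("dbu_usage_this_month.csv")

monthly = usage.groupby("user")["dbus"].sum()
over_budget = monthly[monthly > DEFAULT_DBU_LIMIT]

for user, dbus in over_budget.items():
    # In our process this triggers a code review by the Champions before any
    # temporary increase of the limit or a bigger machine is granted.
    print(f"{user} used {dbus:.0f} DBUs (default limit {DEFAULT_DBU_LIMIT}) - needs review")
```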
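For 'Do You Need All the Data?', here is a sketch of the two sandbox options, recent data only or a stratified sample; paths, columns and fractions are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sandbox-sample").getOrCreate()

events = spark.read.parquet("s3://my-bucket/curated/events/")   # hypothetical path

# Option 1: develop on the recent past only (say the last 30 days).
recent = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 30))

# Option 2: a 10% stratified sample over the last 12 months, stratified by
# consumer category so the Free/Paid mix is preserved.
last_year = events.where(F.col("event_date") >= F.add_months(F.current_date(), -12))
sample = last_year.sampleBy("consumer_category",
                            fractions={"Free": 0.10, "Paid": 0.10}, seed=42)

# Persist the sample into the sandbox area for cheap experimentation.
sample.write.mode("overwrite").parquet("s3://my-bucket/sandbox/events_sample/")
```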
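For 'Pre-computed Data Tables', here is a sketch of what such a precompute job can look like; the KPI names, columns and paths are illustrative, not our actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kpi-precompute").getOrCreate()

events = spark.read.parquet("s3://my-bucket/curated/events/")   # hypothetical path

# Precompute KPIs that most consumer-level models share, e.g. 30-day activity,
# spend and recency per consumer.
last_30d = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 30))

consumer_kpis = (last_30d
                 .groupBy("consumer_id")
                 .agg(F.countDistinct("event_date").alias("active_days_30d"),
                      F.count("*").alias("events_30d"),
                      F.sum("revenue").alias("revenue_30d"),
                      F.max("event_date").alias("last_seen_date")))

# Scheduled during lean hours; downstream models just read this master table.
consumer_kpis.write.mode("overwrite").parquet("s3://my-bucket/features/consumer_kpis/")
```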
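For 'Don't Forget Data Storage': if your data happens to sit in Amazon S3 (only an example; other clouds have equivalent lifecycle features), archival and clean-up rules can be automated as below. The bucket name, prefixes, storage class and retention periods are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle policy: archive raw events after 180 days and
# automatically expire sandbox/temp experiment outputs after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                      # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-sandbox-temp-tables",
                "Filter": {"Prefix": "sandbox/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    },
)
```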
Let me know if you found this helpful. Contact me if you are facing similar challenges; I will be happy to help.
Data and AI Architecture & Platform Leadership
Storage cost also includes the cost of transactions in storage. If the underlying data formats are not optimized, it increases cost on the data lake (especially with Delta Lake formats). Using the right type of VM and cluster for the job is an additional point.
Design Thinking for making Marketing Customer Centric|Coalescing Brand and Performance for Customer Lead Business Growth| MarTech and Adtech Expertise to evangelise Customer Journey |Data Intelligence |@IIMB|@MIT
My thoughts: spot on, Ujjyaini Mitra. Comparing it to a library or a dictionary: unless data is sorted in a manner that facilitates easy retrieval, the computing process takes time and a lot of resources. Having said that, isn't the most commonly used format for data ingestion sequential, with a timestamp?
Learning Data architecture ...& more
Also, I believe there is a basic difference between partitioning, clustering (aka bucketing) and indexing. Say you partition 10 years of transactional data by (YYYY, MM, DD), in that hierarchy, and the data is not considerably skewed. In the worst case your maximum search is about (10 + 12 + 30/31) ≈ 52 partitions, and if you still have 100 million records in that day, you can do clustering, i.e. create, say, 200 buckets, where a hash is calculated on the PK/SK and each record is sent to a particular bucket. Hence, retrieving a record for a given transaction id / customer id consumes roughly (((1/10)/12)/31)/200 of the time of a sequential search. Indexing, however, comes from the OLTP world, where you define the PK and SK before even loading the data. It is slower because every time a record is inserted, NULL constraints, duplicate constraints, etc. have to be checked. I think this is one of the reasons why Snowflake doesn't force you to put any index while loading the data (while taking it offline in dbt). It seems my understanding of indexing from a data engineering point of view is a little different from your data science point of view. Let me know if I have misunderstood.