At TheDelatacube.ai I am working with a client who has all their historical data on on-premise servers. As they go digital, it is imperative that they move their data to the cloud. However, the MD-CEO told me, 'I heard the cloud ultimately costs a lot, and that's why our IT team suggested we continue with our on-premise servers.' I assured him that what he heard about cloud cost has indeed happened to many organizations; however, there are ways to keep it under control, and that is why cloud computing is not everyone's cup of tea: it requires proper knowledge.
Today I read similar points in one of the data science group chats I am part of. I have been using cloud infrastructure for the last 6 years and I faced the same challenge earlier. At the same time, with proper processes we were able to bring roughly 80% of that cost under control. So I thought of sharing my learnings on how to control cloud cost, especially if you are a B2C company with millions of consumers coming to your platform and performing billions of events that your data science team analyzes to support business decision making.
- Analyze - Every cloud provider offers a detailed breakdown of the cost. They even offer an hourly breakdown of usage, so you can identify when consumption is high and correlate it with the jobs running at that time. To analyze better, we gave each team member an individual account so that we could monitor who was contributing how much computation cost and, therefore, which kinds of analyses were costing us the most. As we know, the devil is in the details, and this granular cost analysis will help you identify the ghost (a small sketch of such a breakdown is included after this list).
- Data Indexing - One of the biggest reasons our computation cost was going high was that our data was not well indexed. This is a basic step, yet we were a data-scientist-heavy team and not everyone possessed enough data engineering knowledge. The cost guzzlers are the 'search' and 'sort' queries. If data is well indexed, say by year-month-date, and you are reading data for a specific date range, the search will be fast, thus reducing the computation cost. For different purposes, you need to index the data on different parameters before using it (see the partitioning sketch after this list).
- Optimized Query Writing - When we analyzed the cost data, we found that a few members of our team were contributing 80% of the cost, while others were contributing much less. And this was not correlated with the complexity of their projects or the amount of data they were processing. The difference was that the first set of people were writing amateur code. For example, say a table has 10 columns and 10 million rows. One of the columns is the consumer category, Free vs Paid, where 70% are free consumers and 30% are paid consumers, and I want to analyze only the paid consumers. The amateur coder will read the full table into memory and then filter it down to the paid consumers, while the optimized coder will read only the filtered data by using a where clause at the time of reading. If the data is also indexed by date and consumer category, then reading the 30% paid consumers for a date range will be very fast and will cost roughly 70% less than the amateur code (see the before/after query sketch after this list).
- Training - Data scientists come from various backgrounds and not everyone has been trained in data structures and computational algorithms, as these are mostly taught in computer science courses and not in most statistics or mathematics courses. Therefore, irrespective of seniority, we arranged a 'Big Data Management Academy' where we trained everyone joining the team on the basics of handling big data. They were given hands-on tests along with buddies to help; once they cleared the tests, they got access to the cloud accounts. We also created a group of young champions who could help others when they got stuck.
- Build Individual Thresholds - Since we had a separate account for each data scientist, we also set an upper bound on computing units. For example, if you use Databricks, DBU is the unit, and we offered x DBUs on large machines to every data scientist as a default. If someone's code does not execute within that limit, or takes too long, they reach out to the Champions, who scrutinize the code and find ways to optimize it, or else increase the DBUs or offer an XL machine. Those upgrades are on demand and are reduced back to the default once the job is over. That way, we knew an upper bound on the monthly cost, which was well within our budget (see the usage-monitoring sketch after this list).
- Do You Need All the Data? - This is a question we need to ask ourselves while analyzing data. Big data has given us a lot of information; however, more data does not always mean more information. Data scientists fancy large data, yet often the same analysis could have been done on the last 30 days of data instead of 180 days. Or, when we are building the first version of a model, we are doing a lot of experimentation and do not need the complete 6-12 months of data. During initial development of an analysis or model, we may use only data from the recent past (say 7-30 days) or take a 10-20% stratified random sample from the last 12 months, which is good enough for the initial experimentation. Therefore, having a sandbox facility where development work happens on smaller data is important. We set this up as well, which reduced cost further (see the sampling sketch after this list).
- Pre-computed Data Tables - Last but not least, when we looked at the data KPIs that most consumer-specific models use, we found a 60% overlap among them. Since different data scientists were working on different projects, each of them was individually computing the same KPIs for their modeling purposes. We collected those common KPIs, built a data pipeline to precompute them during lean hours, and kept them in a master data table. Data scientists working on any consumer-specific modeling work can now simply fetch those values from the pre-computed table, which reduces both time and computation cost (see the KPI pipeline sketch after this list).
- Don't Forget Data Storage - You must have heard that on the cloud, data storage cost is minimal. However, this is a cumulative cost. For example, if you are collecting 10 TB of incremental data every month, the first month you pay for 10 TB, the second month for 20 TB, the third month for 30 TB, and so on. So, over time, the storage cost also grows. Thus, data archiving and an archival policy are an important part of cloud cost reduction. Apart from archiving, we should revisit all the temp tables created during various experiments; those that are now redundant should be permanently deleted (see the archival-rule sketch after this list).
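To make the 'Analyze' point concrete, here is a minimal pandas sketch of the kind of breakdown we looked at, assuming a hypothetical hourly cost export. The file name and columns below are made up; every provider's billing export has its own schema, so adapt accordingly.

```python
import pandas as pd

# Hypothetical hourly billing export; real exports (AWS CUR, Databricks usage
# logs, GCP billing export, etc.) have different schemas - adapt the column names.
costs = pd.read_csv("cost_export.csv", parse_dates=["hour"])

# Who is contributing how much computation cost?
by_user = costs.groupby("user")["cost_usd"].sum().sort_values(ascending=False)

# When is consumption high? Correlate spikes with the jobs running at that time.
by_hour_of_day = costs.groupby(costs["hour"].dt.hour)["cost_usd"].sum()

# Which kinds of analyses are costing the most?
by_job = costs.groupby("job_name")["cost_usd"].sum().nlargest(10)

print(by_user.head(10))
print(by_hour_of_day)
print(by_job)
```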
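For 'Data Indexing': on most cloud data lakes, the practical way to 'index' by year-month-date is partitioning. Below is a minimal PySpark sketch; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

events = spark.read.parquet("s3://my-bucket/raw/events/")   # hypothetical path

# Write the data partitioned by year/month/day so that date-range reads only
# touch the relevant folders instead of scanning the full history.
(events
 .withColumn("year",  F.year("event_date"))
 .withColumn("month", F.month("event_date"))
 .withColumn("day",   F.dayofmonth("event_date"))
 .write.mode("overwrite")
 .partitionBy("year", "month", "day")
 .parquet("s3://my-bucket/curated/events/"))

# A date-range read now prunes partitions instead of scanning everything.
jan_first_week = (spark.read.parquet("s3://my-bucket/curated/events/")
                  .where("year = 2023 AND month = 1 AND day BETWEEN 1 AND 7"))
```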
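For 'Optimized Query Writing', here is the Free vs Paid example again as a hedged PySpark sketch; paths, dates and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-at-read").getOrCreate()

# Amateur version: pull the full 10-million-row table locally, then filter.
all_rows = spark.read.parquet("s3://my-bucket/curated/consumers/").toPandas()
paid_slow = all_rows[all_rows["consumer_category"] == "Paid"]

# Optimized version: push the filter into the read itself, so only ~30% of the
# rows (and, if the data is partitioned by date and category, only the relevant
# files) are ever scanned.
paid_fast = (spark.read.parquet("s3://my-bucket/curated/consumers/")
             .where((F.col("consumer_category") == "Paid") &
                    (F.col("event_date").between("2023-01-01", "2023-01-31"))))
```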
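For 'Build Individual Thresholds', here is a very small sketch of the monitoring side, assuming a hypothetical per-user DBU usage export. The file, columns and limit are illustrative; Databricks exposes usage through its billable-usage logs, whose actual schema differs.

```python
import pandas as pd

DEFAULT_DBU_LIMIT = 500   # illustrative default monthly allotment per data scientist

# Hypothetical per-user usage export with 'user' and 'dbus' columns.
usage = pd.read_csv("dbu_usage_this_month.csv")

monthly = usage.groupby("user")["dbus"].sum()
over_budget = monthly[monthly > DEFAULT_DBU_LIMIT]

for user, dbus in over_budget.items():
    # In our process this triggers a code review by the Champions before any
    # temporary increase of the limit or a bigger machine is granted.
    print(f"{user} used {dbus:.0f} DBUs (default limit {DEFAULT_DBU_LIMIT}) - needs review")
```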
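For 'Do You Need All the Data?', here is a sketch of the two sandbox options, recent data only or a stratified sample; paths, columns and fractions are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sandbox-sample").getOrCreate()

events = spark.read.parquet("s3://my-bucket/curated/events/")   # hypothetical path

# Option 1: develop on the recent past only (say the last 30 days).
recent = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 30))

# Option 2: a 10% stratified sample over the last 12 months, stratified by
# consumer category so the Free/Paid mix is preserved.
last_year = events.where(F.col("event_date") >= F.add_months(F.current_date(), -12))
sample = last_year.sampleBy("consumer_category",
                            fractions={"Free": 0.10, "Paid": 0.10}, seed=42)

# Persist the sample into the sandbox area for cheap experimentation.
sample.write.mode("overwrite").parquet("s3://my-bucket/sandbox/events_sample/")
```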
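For 'Pre-computed Data Tables', here is a sketch of what such a precompute job can look like; the KPI names, columns and paths are illustrative, not our actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kpi-precompute").getOrCreate()

events = spark.read.parquet("s3://my-bucket/curated/events/")   # hypothetical path

# Precompute KPIs that most consumer-level models share, e.g. 30-day activity,
# spend and recency per consumer.
last_30d = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 30))

consumer_kpis = (last_30d
                 .groupBy("consumer_id")
                 .agg(F.countDistinct("event_date").alias("active_days_30d"),
                      F.count("*").alias("events_30d"),
                      F.sum("revenue").alias("revenue_30d"),
                      F.max("event_date").alias("last_seen_date")))

# Scheduled during lean hours; downstream models just read this master table.
consumer_kpis.write.mode("overwrite").parquet("s3://my-bucket/features/consumer_kpis/")
```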
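For 'Don't Forget Data Storage': if your data happens to sit in Amazon S3 (only an example; other clouds have equivalent lifecycle features), archival and clean-up rules can be automated as below. The bucket name, prefixes, storage class and retention periods are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle policy: archive raw events after 180 days and
# automatically expire sandbox/temp experiment outputs after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                      # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-sandbox-temp-tables",
                "Filter": {"Prefix": "sandbox/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    },
)
```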
Let me know if you found this helpful. Contact me if you are facing similar challenges; I will be happy to help.
Data and AI Architecture & Platform Leadership
Storage cost also includes the cost of transactions in storage. If the underlying data formats are not optimized, it increases cost on the data lake (especially with Delta Lake formats). Using the right type of VM and cluster for the job is an additional point.
Design Thinking for making Marketing Customer Centric|Coalescing Brand and Performance for Customer Lead Business Growth| MarTech and Adtech Expertise to evangelise Customer Journey |Data Intelligence |@IIMB|@MIT
My thoughts: spot on, Ujjyaini Mitra. Comparing it to a library or a dictionary: unless data is sorted in a manner that facilitates easy retrieval, the computing process takes time and a lot of resources. Having said that, isn't the most commonly used format for data ingestion sequential, with a timestamp?
Learning Data architecture ...& more
Also, I believe there is a basic difference between partitioning, clustering (aka bucketing) and indexing. Say you partition 10 years of transactional data by (YYYY, MM, DD), in that hierarchy, and the data is not considerably skewed. In the worst case your maximum search is about (10 + 12 + 30/31) ≈ 52 partitions, and if you still have 100 million records in that day, you can do clustering, i.e. create, say, 200 buckets, where a hash is calculated on the PK/SK and each record is sent to a particular bucket. Hence, retrieving a record for a given transaction id / customer id consumes roughly (((1/10)/12)/31)/200 of the time of a sequential search. Indexing, however, comes from the OLTP world, where you define the PK and SK before even loading the data. It is slower because every time a record is inserted, NULL constraints, duplicate constraints, etc. have to be checked. I think this is one of the reasons why Snowflake doesn't force you to put any index while loading the data (while taking it offline in dbt). It seems my understanding of indexing from a data engineering point of view is a little different from your data science point of view. Let me know if I have misunderstood.