Future-proofing your data analytics
TL;DR: Separate your code, compute, and storage to future-proof your data.
Ok, ok, I’ve been around a while now. I’ve been in BI since 1998 at MicroStrategy, and spent many moons implementing data warehouses for customers until 2020, when I joined what looked to me like the next great data company, Databricks.
In the early to mid-2000s there was a lot of discussion about separating compute from storage. Back then, the biggest data warehouse servers were massive machines that required the data to sit on disks directly attached to the CPUs processing it. If you hit the limits of that compute, you either had to divide up the data and your processing, or add another machine, which required copying the data over to it.
It made sense that if you separated your data from the compute processing it, you could add CPUs without having to replicate the data. Different jobs could act on the same data, each with its own compute. At the time, Snowflake was the leader in championing this message in the data warehouse community, but Spark was exercising the same concepts, as I learned from Yahoo’s talk at MicroStrategy World in 2014.
Since then, the growth of data processing on distributed computing has skyrocketed. Queries that took hours now take minutes, and we will never go back. But the need for migrations remains today, forcing customers to invest heavily to move their processing from one vendor to another. Given my experience, I’m adding a new adage now:
Separate your code, compute, and storage to future-proof your data!
Lesson 1: Use open formats to store your data
When cloud vendors got started back in the early 2000s, they delivered some really huge game changers. You could easily spin up virtual machines, as many and as large as you needed. They provided unlimited, super-cheap storage in which you could store anything.
You could run anything on Linux and process it at scale.
With open formats like Parquet (which led to Delta, Iceberg, and, to some extent, Hudi), customers have the flexibility to use not only almost any compute, but almost any code libraries or languages.
This is true as long as the data sits on a standard disk or a cloud object store, like S3, ADLS, or GCS, and not in a proprietary store, like Snowflake’s internal storage, Azure’s OneLake, Redshift, or other vendor-exclusive formats. Keep your options open: make your data portable and able to run anywhere. Use open formats on simple storage.
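To make that concrete, here is a minimal sketch, assuming pandas, pyarrow, and DuckDB are installed; the file name and columns are illustrative. It writes data once in Parquet and then queries those same bytes with a completely different engine:

```python
# A minimal sketch: write once in an open format, query with any engine.
# Requires: pip install pandas pyarrow duckdb
import pandas as pd
import duckdb

# Write a small table to plain storage in Parquet (an open format).
# The local path is illustrative; it could just as easily be s3://bucket/orders.parquet.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 7.25]})
df.to_parquet("orders.parquet")

# A different engine (DuckDB) reads the same file directly, with no copy or import.
result = duckdb.query("SELECT SUM(amount) AS total FROM 'orders.parquet'").df()
print(result)
```

Swap DuckDB for Spark, Trino, or pandas itself and the data does not move; that is the whole point of open formats on simple storage.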
Lesson 2: Use open code libraries that use open languages
Back in 2016, I helped run a PoC migrating my company’s data warehouse, hosted on Oracle, to Snowflake. Using a package called Ora2Pg (yes, Pg for Postgres, but Postgres is mostly ANSI compliant, and so is Snowflake), I was able to migrate our largest tenant to Snowflake in less than a week.
Having code written in an open language, such as ANSI SQL, drastically simplifies migrations, because it runs on MySQL, Postgres, Databricks SQL, Snowflake, Redshift, and so many others. Of course, each platform extends its function library at the demand of customers, providing functionality beyond the ANSI standard. You should weigh the value of using these extended functions, but more than likely, the same customer demand will push other platforms to adopt them over time, expanding your flexibility in the future.
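Here is a minimal sketch of that portability, assuming DuckDB is installed (sqlite3 ships with Python); the table and query are illustrative. The exact same ANSI SQL strings run unchanged on two different engines:

```python
# A minimal sketch: one set of ANSI SQL statements, two different engines.
# Requires: pip install duckdb (sqlite3 is in the standard library)
import sqlite3
import duckdb

DDL = "CREATE TABLE sales (region TEXT, amount REAL)"
INSERT = "INSERT INTO sales VALUES ('east', 100.0), ('west', 250.0)"
QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region"

for con in (sqlite3.connect(":memory:"), duckdb.connect()):
    con.execute(DDL)
    con.execute(INSERT)
    print(con.execute(QUERY).fetchall())  # identical results from both engines
```

Stay close to the standard and a migration becomes mostly a matter of repointing connections rather than rewriting logic.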
In the same vein, an open language like Python has been very resilient to change in the data world. Open source machine learning libraries and data manipulation tools support Python so broadly, and it is so popular among data practitioners, that we can rely on it well into the future.
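As a small illustration, assuming scikit-learn and pandas are installed, the same open-source Python code can train on the illustrative Parquet file from Lesson 1, whether it runs on a laptop or a cluster node:

```python
# A minimal sketch: open Python libraries working over an open data format.
# Requires: pip install pandas pyarrow scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the open-format file from Lesson 1 (path is illustrative).
df = pd.read_parquet("orders.parquet")

# Fit a trivial model; the script has no ties to any vendor's platform.
model = LinearRegression().fit(df[["order_id"]], df["amount"])
print(model.coef_)
```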
Lesson 3: Keep your code separate from applications and store it in repositories
As data warehousing and business intelligence grew as an industry, every single tool had its own repository. To reliably version your code and make it restorable, you had to figure out how to export the code and save it as part of a branch in a source code repository.
So, building on Lesson 2: if your code is written in open languages and stored in an external repository, you can use the same logic in multiple places, independent of any one application. That completes the strategy, separating code, compute, and storage, and it gives you ultimate flexibility.
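A minimal sketch of what that looks like in practice, assuming a hypothetical daily_revenue.sql file is versioned in your repository and DuckDB is installed; the file name and engine choice are illustrative:

```python
# A minimal sketch: business logic lives in the repo, not inside any one tool.
# Requires: pip install duckdb
from pathlib import Path
import duckdb

# Load a query that is versioned alongside the rest of your code
# (assumed here to be a single SELECT statement).
sql = Path("sql/daily_revenue.sql").read_text()

# Today the engine is DuckDB; tomorrow the same file could feed another engine.
con = duckdb.connect()
print(con.execute(sql).fetchall())
```

Because the SQL file is just text in git, it gets branching, review, and history for free, and no single platform holds it hostage.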
I know it's probably funny for software developers to read this, but believe me, folks are still constantly weighing the pros and cons of migrating from one platform to another. The biggest con is often technical debt: their business logic and data are all tied up and able to operate only on a single platform. If, in the process of their next migration, they also manage to follow these recommendations and separate their code from the applications that run it, they will reduce the risk of trying new technologies and remain flexible to adapt and change well into the future.