Anatomy Of A Data Stack (2024 Update)
Data is all around us. In fact, according to DataCamp, there are around 40,000 bytes of data for every grain of sand on Earth. Finding, storing and sorting this data is a huge challenge for any organisation, but crucial if they want to turn that data into smarter decisions.
What Is A Data Stack?
A data stack is essentially all the different tools you use to organise, transform, visualise and analyse data. There is no such thing as a ‘standard data stack’, but most are made up of the following elements:
Data Sources
A data source is anything that produces digital information: a file, a program, a website and so on. Every organisation uses multiple data sources every day, and often has to combine metrics from several of them to get the answers it needs.
Data Extraction
Data extraction is the process of obtaining relevant data from your source(s) and essentially copying it into a new place.
It sounds simple but not all data sources are neat and tidy! Typically you cannot edit the data source itself, which means you can only extract the data that is available, at the time it is available, in the format it is available.
Data sources can be complex, poorly documented, or both, which means that determining which data points need to be extracted, and when that data will refresh, can pose a challenge.
When you set up an extraction layer, you are creating an automated process that copies the relevant data from your source into some form of storage. Several tools can help with this, such as Fivetran, Airbyte and Stitch, which come with pre-built connectors. These tools let you extract popular metrics from popular sources without writing any code. They often charge based on how many rows of data you extract, so it is important, strategically, to copy only the data you will need.
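If you do end up writing a connector yourself, the extraction logic is usually a simple loop: request a page of records, keep only the fields you need, and stage them somewhere your storage layer can pick up. Below is a minimal Python sketch assuming a hypothetical orders API; the endpoint, credential and field names are placeholders, not a real service.

```python
import csv
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
API_KEY = "YOUR_API_KEY"                       # placeholder credential


def extract_orders(since: str, out_path: str = "orders_raw.csv") -> int:
    """Copy order records created after `since` into a local CSV staging file."""
    rows, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"created_after": since, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        rows.extend(batch)
        page += 1

    # Write only the fields needed downstream (keeps row counts and costs down).
    fields = ["id", "created_at", "customer_id", "total_amount"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    print(extract_orders(since="2024-01-01"))
```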
Data Storage
All of the data you have copied will need to be stored somewhere. This could be a data warehouse, data lake or data lakehouse. The difference between the three is that a data warehouse stores structured data (i.e. data that is formatted and ready to use), a data lake stores unstructured data (i.e. data in its original format), and a lakehouse gives you the ability to do both.
No matter which option you choose, it is important to consider how you set up and organise your data storage, how scalable it is and, with that, what it will cost. The most popular options we work with include Google BigQuery, Snowflake, Amazon Redshift, Microsoft Azure and Databricks.
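As an illustration of how a staged extract reaches storage, here is a hedged sketch using the google-cloud-bigquery client to append the CSV from the extraction step into a raw table. The project, dataset and table names are placeholders; Snowflake, Redshift and the other warehouses have equivalent bulk loaders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical project, dataset and table names - replace with your own.
TABLE_ID = "my-project.raw_data.orders"


def load_orders_csv(path: str = "orders_raw.csv") -> None:
    """Append the staged extract into a raw BigQuery table."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # let BigQuery infer the schema for the raw layer
    )
    with open(path, "rb") as f:
        job = client.load_table_from_file(f, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to finish
    print(f"Loaded {job.output_rows} rows into {TABLE_ID}")
```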
Data Modelling
You have copied your data from its source into your data storage. There, it goes through a process of transformation called data modelling. This allows you to combine metrics from different sources and apply calculations to them en masse, automatically. These data models can then be easily consumed by reporting tools or data analysts.
Creating data models is one of the more complex and time consuming tasks when putting together your data stack. One data model can easily have 200 lines of logic associated with it, all of which need to be reconciled. Doing this diligently takes time, and we always recommend that two people work on modelling so they can check each other’s work.
Data models are crucial if you want to centralise and align metrics across different channels, and later they are needed for automating workflows or moving on to LLMs and AI.
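To make the idea concrete, here is a toy model in Python/pandas that blends ad spend from one source with orders from another to produce channel-level CPA and ROAS. In practice this logic usually lives as SQL in the warehouse (often managed with a tool such as dbt), and the column names here are illustrative assumptions rather than a prescribed schema.

```python
import numpy as np
import pandas as pd


def build_channel_performance(ad_spend: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Toy model: blend ad platform spend with revenue from the orders source
    into one table of cost-per-acquisition (CPA) and return on ad spend (ROAS)."""
    revenue = (
        orders.groupby(["date", "channel"], as_index=False)
              .agg(orders=("order_id", "count"), revenue=("total_amount", "sum"))
    )
    model = ad_spend.merge(revenue, on=["date", "channel"], how="left").fillna(0)

    # Guard against divide-by-zero: channels with no orders or no spend get NaN.
    model["cpa"] = model["spend"] / model["orders"].replace(0, np.nan)
    model["roas"] = model["revenue"] / model["spend"].replace(0, np.nan)
    return model
```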
Data Analysis
Where most 'business users' rely on dashboards and charts to understand their data at a top level, the analysis layer is where you can deep-dive into specific topics in much greater detail. It is typically used by data analysts looking for robust answers to business questions, as opposed to the 'day-to-day' metrics that might be surfaced in a dashboard. If you end up repeating an analysis frequently, it should be moved to the reporting layer as an automated dashboard.
For this layer, you need an easy workflow for your analysts to explore data and present insights. Jupyter notebooks are a good option, as they are widely used and have kernels for many programming languages. Using a programming language and notebooks over Excel also brings transparency to data processing, enables automated re-runs (you only need to change the underlying dataset) and allows you to handle larger datasets. You might still need to befriend your marketing team or brush up on your presentation-building skills when it comes to sharing these insights with the wider business.
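As a flavour of what that notebook workflow looks like, here is a short, hypothetical analysis that reads an export of the modelled channel table and plots how cost-per-acquisition has moved by channel. The file and column names are assumptions carried over from the modelling sketch above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical extract of the modelled channel_performance table.
df = pd.read_csv("channel_performance.csv", parse_dates=["date"])

# Question: which channels are getting cheaper or more expensive over time?
monthly_cpa = (
    df.assign(month=df["date"].dt.to_period("M").dt.to_timestamp())
      .groupby(["month", "channel"], as_index=False)["cpa"].mean()
)

monthly_cpa.pivot(index="month", columns="channel", values="cpa").plot(marker="o")
plt.title("Average CPA by channel")
plt.ylabel("CPA")
plt.tight_layout()
plt.show()
```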
Reporting
This is the layer that most of your business users will be familiar with. Here you are taking your modelled data and turning it into charts, graphs and dashboards which make information easy to understand. One of the key purposes of reporting is to automate repeated data requests so that your data team can then focus on finding deeper insights and generating value. Dashboards are a great tool for everyday monitoring of key stats that will guide your business.
While many people are familiar with dashboards, there is an art to creating a great one. A good dashboard should allow you to compare and investigate metrics easily. It should provide context as to whether a number is good, bad or unusual and, where possible, should serve up automated alerts for any key actions that need to be taken. We generally advise that less is more with a dashboard, so the first step in creating a great one is to work with stakeholders and understand which metrics really matter to them. Remember that anything you build will have an ongoing maintenance cost associated with it, so it is always better to have 10 key dashboards rather than 100.
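As a simple illustration of an automated alert, the sketch below checks yesterday's figures against an agreed threshold and posts any breaches to a Slack channel via an incoming webhook. The webhook URL, metric and threshold are all placeholders, and most BI tools can do something similar natively.

```python
import pandas as pd
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook


def check_cpa_alert(df: pd.DataFrame, threshold: float = 50.0) -> None:
    """Flag any channel whose CPA on the latest day exceeded the agreed threshold."""
    latest = df["date"].max()
    breaches = df[(df["date"] == latest) & (df["cpa"] > threshold)]
    for _, row in breaches.iterrows():
        message = f":warning: {row['channel']} CPA was {row['cpa']:.2f} on {latest:%Y-%m-%d}"
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
```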
As your visualisation tool is often the ‘face’ of data within your business, it is crucial that all of the information in it is 100% right; anything less will damage trust in the data. While many visualisation tools come with some level of data modelling, 173tech would not advise you to use it. All data manipulation should be done in the modelling layer, as what is available in visualisation tools tends to be basic and can cause slow-loading dashboards.
There are many different visualisation tools each with their strengths and weaknesses. The main tools that we recommend are Tableau, Power BI, Metabase and Looker (not Looker Studio).
Data Science
Data Science can be considered an advanced and more comprehensive extension of data modelling, involving not just the construction of models but also a broader range of data processing and analytical techniques.
Data science tackles much larger and more complex datasets, often referred to as "big data." The primary goal is to uncover hidden patterns, extract valuable insights, and make accurate predictions. Due to the sheer volume and complexity of the data, machine learning algorithms are frequently employed to automate and enhance the analysis process. These algorithms can efficiently handle massive datasets and perform sophisticated calculations that would be impractical with traditional methods. The accuracy of these machine learning models tends to improve as they are exposed to more data, allowing them to learn and adapt over time. These more advanced algorithms and models are the foundations you need in order to implement AI later on.
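To ground this, here is a minimal churn-prediction sketch using scikit-learn on a hypothetical customer table produced by the modelling layer. The feature and label names are illustrative assumptions, not a prescribed approach.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical modelled customer table with behavioural features and a churn label.
customers = pd.read_csv("customer_features.csv")
features = ["orders_last_90d", "days_since_last_order", "avg_order_value", "support_tickets"]

X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["churned"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Probability of churn for each held-out customer, plus a simple quality check.
churn_scores = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, churn_scores))
```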
Data Activation
So far we have taken data from different sources, modelled it and served it up visually in dashboards or reports. That still means someone needs to take this information and perform an action, which presents a barrier: information living in one system needs to be actioned in another. Reverse ETL is a process that takes the data you have already modelled and sends it back into the applications your teams use. This might be, for example, applying a churn flag to your CRM so your sales team can intervene, or sending Customer Lifetime Value back to your ad providers to improve their algorithms.
Where you have data models, reverse ETL can help make them a lot more actionable. There are only two tools we currently recommend for this: Census and Hightouch.
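Under the hood, a reverse ETL sync boils down to reading from a modelled table and writing to the destination's API. The sketch below pushes a churn flag to a hypothetical CRM endpoint; the URL, credential and field names are placeholders, and tools like Census and Hightouch handle this mapping for you without code.

```python
import pandas as pd
import requests

CRM_API_URL = "https://api.examplecrm.com/v1/contacts"  # hypothetical CRM endpoint
CRM_API_KEY = "YOUR_CRM_API_KEY"                         # placeholder credential


def push_churn_flags(scored_customers: pd.DataFrame, threshold: float = 0.7) -> None:
    """Write a churn_risk flag back onto CRM contacts so the sales team can intervene."""
    at_risk = scored_customers[scored_customers["churn_score"] >= threshold]
    for _, row in at_risk.iterrows():
        requests.patch(
            f"{CRM_API_URL}/{row['customer_id']}",
            headers={"Authorization": f"Bearer {CRM_API_KEY}"},
            json={"properties": {"churn_risk": True}},
            timeout=10,
        )
```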
Orchestration
Last, but not least, sitting on top of everything is data orchestration. This is the tool that manages and coordinates all of the different processes happening within the data stack. You can think of it almost like a taskmaster or an automated to-do list.
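For example, a minimal Apache Airflow DAG (one common orchestration tool) that runs the extract, load and modelling steps in order once a day might look like the sketch below. The pipeline module and its functions are hypothetical stand-ins for the steps described above, and the schedule argument assumes Airflow 2.4 or later.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module holding the callables sketched in the sections above.
from pipeline import build_models, extract_orders, load_orders_csv

with DAG(
    dag_id="daily_data_stack",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run the whole pipeline once a day
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_orders,
        op_kwargs={"since": "2024-01-01"},
    )
    load = PythonOperator(task_id="load", python_callable=load_orders_csv)
    model = PythonOperator(task_id="model", python_callable=build_models)

    # Each step only starts once the previous one has finished successfully.
    extract >> load >> model
```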
Bringing Your Data Stack To Life
There are a lot of different considerations when it comes not only to selecting the right tools, but also to setting those tools up in the right way. If you are looking to get started with analytics or need to migrate to a more modern setup, why not get in touch with the friendly team here at 173tech? We are always happy to hop on an informal call and offer impartial advice and opinions.