Harmony in your data pipelines
"The Beat Goes on" - Poster available to buy on https://www.redbubble.com/i/poster/the-beat-goes-on-by-jrvillustrates/54789491.E40HW

Creating harmony in your data pipelines starts with creating harmony in your data team. Not unlike a choir, this requires perfecting the arrangement and timing of your star performers.

Before we start conducting, we first need an idea of who those performers are. Luckily, there are really only two archetypes for us to worry about: the modeller and the engineer. One is obsessed with data, the other with code. Too much or too little of either can knock things out of tune, and what I hope follows is a recipe for the perfect harmony of modelling and engineering.

The Modeller

Let’s get to know our archetypes, starting with the modeller.

Our modeller, let’s call her Cher, is intrinsically motivated by making data useful. She obsesses over understanding the structure, meaning and quality of data and will get very frustrated if someone can’t tell her how the data will be used. Cher likes interrogating system owners and pointing out all the things that make trusting their data impossible. She is also obsessed with having good documentation and will talk excessively about data lineage. Cher is good friends with both Kimball and Inmon and will often be heard telling stories of their frolics in the 80s, when they had great fun doing this long before people started getting their heads stuck in the clouds. Cher’s younger friends are in awe of the so-called early days, when ‘The original Stars’ rocked the stages and a test run of a pipeline might have taken a few days, maybe even weeks, to complete.

Cher is, however, well aware of her weakness and is much more successful when performing a duet with Sonny the engineer.

The Engineer

Sonny is wired very differently to Cher. Whilst he exists in the realm of data, Sonny doesn’t actually like dealing with data. To Sonny, data is messy and frustrating, and because he’s a perfectionist, data quite often feels a bit overwhelming. It’s this feeling of unease that motivates Sonny to engineer systems that make dealing with the data easier, less noisy and more palatable. This drive for simplicity and control can result in an awful lot of code, sometimes more code than data, and that’s when things can start to get unbalanced.

In my experience, one of the classic reasons data pipelines and platform projects start to fail is that our engineers don’t understand that their role in the performance is to allow our modellers to thrive. Ultimately, if our modellers aren’t able to perform, then our audience will be left underwhelmed and leaving bad reviews.

To make that crystal clear: if we aren’t producing new, consumable data products at a steady and reliable pace, then the platform has failed. Our modeller should be spending 90% of their time modelling and regularly celebrating the release of new data products with their users.

Signs of harmony

In a data team, harmony exists when our modeller Cher is able to model quickly and spends most of her time modelling, not navigating platform complexities and hitting technology roadblocks. Our engineer Sonny spends his time relentlessly making things easier for our modeller and regularly gets praise for making life easier for everyone else. He’ll also be found smiling to himself as he knocks a few minutes off a pipeline run, or a few pounds or percent off the daily cost of running the platform.

Some people might read this and think “I do both” or “our data engineers do both”, and that is pretty normal, but more often than not preferences and biases are at play and causing havoc. Your teams might find themselves either lost in data or lost in code and, in either case, not achieving their core objective of building useful data products for end consumers.

Signs of dissonance

After reading this, take a look at your data platform and your teams and consider whether you have harmony between engineers and modellers. Five signs of dissonance include:

  1. Your teams are spending less time modelling data than coding platforms
  2. Your engineers don’t consider your modellers to be their end users
  3. You have more platform code (Python, YAML, etc.) than data code (SQL)
  4. Your modellers aren't empowered to work end to end; they need engineers to trigger things, move things around and control releases
  5. Your team can’t remember why they started developing the pipeline in the first place
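
Sign 3 is the easiest of these to put a rough number on. As a minimal sketch, assuming `.py`, `.yaml` and `.yml` files stand in for platform code and `.sql` files for data code (that mapping, and the function names below, are assumptions for illustration rather than any standard), you could count non-blank lines in a repository like this:

```python
from pathlib import Path

# Assumed mapping of file extensions to the two kinds of code in sign 3.
PLATFORM_EXTENSIONS = {".py", ".yaml", ".yml"}
DATA_EXTENSIONS = {".sql"}


def count_code_lines(root):
    """Count non-blank lines under root, split into 'platform' and 'data'."""
    totals = {"platform": 0, "data": 0}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in PLATFORM_EXTENSIONS:
            category = "platform"
        elif suffix in DATA_EXTENSIONS:
            category = "data"
        else:
            continue  # ignore docs, images and anything else
        text = path.read_text(encoding="utf-8", errors="ignore")
        totals[category] += sum(1 for line in text.splitlines() if line.strip())
    return totals


def platform_outweighs_data(totals):
    """True when there is more platform code than data code (sign 3)."""
    return totals["platform"] > totals["data"]
```

Calling `platform_outweighs_data(count_code_lines("path/to/your/repo"))` gives a crude yes or no. Line counts are a blunt instrument, so treat the result as a conversation starter rather than a verdict.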

The beat goes on

To conclude, like any good duet, our performers might not always get along or understand how to play to their own strengths, especially when things start getting bigger and more complex. When things are unravelling, we need our conductor, a.k.a. our data architect, to step in and provide leadership. Our data architect is able to transcend the modeller and engineer archetypes, seeing the big picture and understanding what the audience needs in order to conduct the perfect harmony.

Finally, if you take nothing else away from this, just remember: “The beat goes on.”

Thanks for reading,

Robert Wadsworth - Director of Data and Artificial Intelligence @ Kin + Carta Europe

#Data #DataEngineering #DataModelling #DataPlatform #DataPipeline
