Should Data Engineers be Domain Competent?
Shammy Narayanan
Chief Solution Architect | 10x Cloud Certified | Founder - Celebrating Life | Adjunct Professor at VIT | Author
Data and SARS have one thing in common: both are scary. Much like the rush to get vaccinated, enterprises are in unforgiving haste to assemble data teams, chasing a Midas-like dream of monetizing their data. The influencer engine is working overtime to fuel this golden dream, but the question we refuse to ask ourselves is: are we building the data team right? Does our data strategy have sanity built into it? To find the answers, we need not look outward; a candid conversation with our teams and a peek into the pile of production issues will spotlight our fundamentally flawed approach.
A traditional data engineer views a table with one million records as relational rows to be crunched, transported and loaded to a different destination. An application programmer approaches the same table as a set of member records or pending claims that affect lives. The former is a purely technical view; the latter is human-centric. These drastically different lenses are the genesis of data silos.
Let's start with a common data incident: how often have we witnessed crippling performance caused by a poorly chosen index? For anyone who has spent time in IT, the answer is a non-zero integer. It happens because indices are built on the columns the DBAs perceive as vital, with scant regard for the application's true access paths. The result is gradual performance degradation that eventually forces re-indexing, usually after a series of nagging customer complaints about slowness. Isn't this scenario a powerful testament to how a lack of domain knowledge leads directly to distressed customers?
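The gap between the DBA's view and the application's real access path can be made concrete in a few lines. The sketch below uses an in-memory SQLite table; the claims schema, column names and index names are invented for illustration, not taken from any real system. The point is simply that the optimizer only helps when an index matches how the application actually queries:

```python
import sqlite3

# Hypothetical claims table, for illustration only; the schema and
# index names are assumptions, not from any real application.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claims (
        claim_id     INTEGER PRIMARY KEY,
        member_id    INTEGER,
        created_date TEXT,
        status       TEXT
    )
""")

# The DBA-centric index on the "obvious" column...
conn.execute("CREATE INDEX idx_created ON claims(created_date)")
# ...versus an index matching the application's true access path:
# "find the pending claims for this member".
conn.execute("CREATE INDEX idx_member_status ON claims(member_id, status)")

def plan(query):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(row[-1] for row in rows)

app_query = "SELECT * FROM claims WHERE member_id = 42 AND status = 'PENDING'"
print(plan(app_query))  # the plan uses idx_member_status, not idx_created
```

If only `idx_created` existed, the same query would fall back to a full table scan, which is exactly the slow-creep degradation described above.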
Graduating further, take a closer look at the partitioning strategy of your critical databases; I would bet my paycheck that 90% of table partitions are based on a date column rather than on actual access parameters. Such a mindless, bookish strategy drains the spool, hogs the CPU and renders the application unresponsive the moment a join is executed. Could we have built it right the first time? Not until the data analysts understand the application well. Deploying and celebrating such an ill-designed application is like claiming a surgery succeeded while the patient lies dead on the table.
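Why the choice of partition key matters can be sketched in plain Python. In a real database this decision lives in DDL (`PARTITION BY ...`); here the routing is simulated, and the record shape and partition count are assumptions made up for the example. A member-keyed scheme lets a per-member query prune down to one partition, while a date-keyed scheme forces that same query to touch every partition:

```python
from collections import defaultdict

N_PARTITIONS = 4  # assumed partition count, illustrative only

def partition_by_member(record):
    """Domain-aware key: co-locates all of a member's rows."""
    return record["member_id"] % N_PARTITIONS

def partition_by_date(record):
    """Bookish default: spreads rows by month, ignoring access patterns."""
    month = int(record["created_date"][5:7])
    return month % N_PARTITIONS

# Four claims for one member, spread across four months.
records = [
    {"member_id": 42, "created_date": f"2024-0{m}-01"} for m in range(1, 5)
]

by_member, by_date = defaultdict(list), defaultdict(list)
for r in records:
    by_member[partition_by_member(r)].append(r)
    by_date[partition_by_date(r)].append(r)

# Partitions that must be scanned to answer "all claims for member 42":
print(len(by_member))  # 1 -> partition pruning works
print(len(by_date))    # 4 -> every partition touched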
领英推荐
The same ignorance gets carried over to the data transformation/processing. Usually, the load balancers are configured to balance the incoming data load. This simple approach works fine as long as the data source is homogeneous; however, in real-time, we have data from heterogeneous sources with conflicting and varying priorities. In such instances, our approach needs to prioritize the criticality. For example, records used for MIS reporting can wait compared to a transaction waiting for pre-authorization. Such smartness in data ontology has to be inbuilt, and it can be done only by a team that understands the domain. On any given day, a low-performing smart pipeline is far preferable to a high-throughput pipeline built on FIFO. I can keep enumerating myriad of such use cases ranging from inefficient APIs and incompetent data invalidation strategies to miserable database locks. All these testaments are not the product of technical incompetencies but the direct impact of the flawed strategy to isolate data teams and treat them as pureplay "Technical Powerhouse."
When we advocate domain knowledge, let's not relegate it to a few Business Analysts who are tasked to translate a set of high-level requirements into user stories, rather domain knowledge implies that every data engineer gets a grip on the intrinsic understanding of how functionality flows and what it tries to accomplish. Of course, this is easier to preach than practice, as expecting a data team to understand thousands of tables and millions of rows is akin to expecting them to navigate a freeway in peak time on the reverse gear with blindfolds; it will be a disaster.
When its amply evident that Data teams need domain knowledge, its also imperative that centralized data teams are not delivering efficient results; Embedding a Data team as part of an application team appears to be the most viable solution; this is where the concept of Data Mesh that is fast evolving, and its sexiness is seducing the enterprises. The next wave of maturity is to move cautiously and swiftly from a centralized mode to a federated model where data teams are de-centralized. Yet, strategic layers such as Data Governance, security and compliance stay under a common umbrella. Will this be the silver bullet for all our problems? We hope and sincerely wish so, but we cannot guarantee it. As data and analytics evolve from the dark underbelly of the IT landscape, we are in to witness more such surprises and twists convoluting this complicated maze. For engineers like me, such a whirlwind is what makes working in Data an exciting and exuberant challenge.
Digital Transformation Leader | I Empower my clients to Accelerate their performance and Elevate their Leadership | Best Selling Author
1 年Shammy , You have well brought out the need for domain competency for data engineers .