Circus of assessment, architecture, sizing and modeling in Data Management
On receiving my back-of-the-envelope estimate of 5-6 people for six months, the business head was mad at me.
- This was for building a data repository and ingestion pipelines for 50+ streaming and micro-batch data sources
- A mixed bag of data sources - some received via sophisticated API calls, others scraped from emails or websites, while most arrived as CSV/XLS files (see the cataloging sketch after this list)
- Some streamed in throughout the day, while others were loaded weekly or monthly
- Most were daily, while the critical ones streamed in every few minutes
- There was also a lot of processed and semi-processed historical data (10+ years) stored in MS Access databases with ad hoc designs
- There was no dearth of software - database and pipeline-building tools were readily available.
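To make the heterogeneity concrete, here is a minimal sketch (in Python, purely illustrative - the field names, categories and sample values are my assumptions, not details from the actual engagement) of the kind of per-source catalog entry such a mixed bag of sources calls for:

```python
from dataclasses import dataclass
from enum import Enum


class Delivery(Enum):
    API = "api"
    EMAIL_SCRAPE = "email_scrape"
    WEB_SCRAPE = "web_scrape"
    FILE_DROP = "file_drop"        # csv / xls received as files


class Frequency(Enum):
    STREAMING = "streaming"        # every few minutes
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"


@dataclass
class SourceProfile:
    """One entry in a source inventory of the kind an assessment would produce."""
    name: str
    delivery: Delivery
    frequency: Frequency
    fmt: str                       # e.g. "json", "csv", "xls"
    est_rows_per_day: int          # volumetric estimate
    avg_row_bytes: int             # volumetric estimate
    history_years: float           # back-data to migrate (e.g. the old MS Access stores)
    owner: str                     # business contact for usage/application questions


# Hypothetical entries only -- the real values are exactly what the assessment gathers
inventory = [
    SourceProfile("trades_feed", Delivery.API, Frequency.STREAMING, "json",
                  est_rows_per_day=2_000_000, avg_row_bytes=400,
                  history_years=2, owner="ops"),
    SourceProfile("branch_returns", Delivery.FILE_DROP, Frequency.MONTHLY, "xls",
                  est_rows_per_day=5_000, avg_row_bytes=250,
                  history_years=10, owner="finance"),
]
```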
Along with my ball-park estimate, I also added five key messages:
- An assessment (a 2-3 week exercise) of the environment, covering data sources and structures, usage/applications, technology and volumetric information, would be needed
- A well-defined scope, architecture, detailed approach, and a project plan with headcount and skills would be delivered as the output of the assessment exercise
- A sizing exercise would be needed to determine infrastructure needs (storage, compute, network) and Db sizing (a rough sketch of such a calculation follows this list)
- Data modeling/Db design of the data repository was the crux of the project
- Typical project execution would involve one-time high-level requirements & design, followed by iterative low-level design -> development -> testing -> deployment
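To make the sizing point concrete, here is a rough back-of-the-envelope sketch (again in Python and purely illustrative - the per-source figures, compression ratio, overhead multiplier and retention period are all assumptions to be replaced with numbers gathered during the assessment) showing how per-source volumetrics roll up into a storage estimate:

```python
# Back-of-the-envelope storage sizing -- every figure and multiplier below is an
# assumption, to be replaced with what the assessment exercise actually finds.

sources = [
    # (name, est_rows_per_day, avg_row_bytes, history_years)
    ("trades_feed",    2_000_000, 400,  2),
    ("branch_returns",     5_000, 250, 10),
    # ... one tuple per source from the assessment inventory
]

COMPRESSION_RATIO = 0.4   # assumed compressed/columnar storage factor
OVERHEAD_FACTOR   = 2.0   # assumed indexes, staging copies, replicas
RETENTION_YEARS   = 5     # assumed forward retention requirement

daily_bytes = sum(rows * row_bytes for _, rows, row_bytes, _ in sources)
history_bytes = sum(rows * row_bytes * 365 * hist
                    for _, rows, row_bytes, hist in sources)

projected_bytes = ((daily_bytes * 365 * RETENTION_YEARS) + history_bytes) \
                  * COMPRESSION_RATIO * OVERHEAD_FACTOR

print(f"Raw daily ingest : {daily_bytes / 1e9:.2f} GB/day")
print(f"Projected storage: {projected_bytes / 1e12:.2f} TB over {RETENTION_YEARS} years")
```

A similar pass over message rates, concurrent users and query patterns would drive the compute and network side of the sizing.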
His contentions were
- Why can't you just take the data sources one by one and ingest them into some Db, starting tomorrow?
- What is the need for any kind of assessment exercise?
- Why do any elaborate sizing now? We will add capacity as and when needed
- What is the need for senior data modelers in this assignment?
- Why don't you load a couple of data sources in a data repository and show us?
His contentions were irksome and seemed unreasonable, but wearing my data-practitioner hat I conveyed the relevance of the assessment and sizing activities. I don't think I was very successful in pitching my proposition, and the discussion trailed off.
Interestingly, I did find merit in two of his points:
- taking one data source at a time and
- loading a couple of data sources as a Proof-of-Concept
Why did I have reservations about these two points? I couldn't put a finger on it.
- Was there any harm in doing a PoC with a couple of data sources?
- Why couldn't we ingest the data sources one by one and finish the activity?
- Did we really need all the circus of assessment, architecture, sizing and modeling?
On the third day after my interaction with the business head, while I was crossing Vashi bridge, it struck me!
Yes, executing the ingestion of data sources one by one (i.e. the software development of ETL pipelines in this scenario) is perfectly fine, but the preceding steps of assessment, high-level requirements and design are indispensable for a successful project. Without defining a blueprint - architecture, sizing and approach - there is a high probability of ending up with yet another ad hoc, randomly designed data repository.
Jumping straight into execution in such data management scenarios does show immediate short-term gains, but they are soon eroded by rework, patchwork fixes and, eventually, badly designed, poorly performing data assets.
The business head's contention was valid, but only for the second half of the project: implement piecemeal, but plan, size and design holistically.
PoCs are great when it's a new piece of technology, process or methodology - new to the market, new to the industry, or maybe just new to the organization. They help unearth challenges, validate hypotheses, provide exposure and build confidence. PoCs are a good vehicle to demonstrate capability and can act as a precursor to a project, but because of their tactical emphasis they miss out on the overall plan and design elements. If a PoC is segued into an actual project, reusing its code and structures as-is, the project is most likely heading towards a catastrophe.
Data modelers are the midfielders of data management projects. Like soccer midfielders, they are generally more experienced and have good game insight.
Like midfielders in soccer
- data modelers support the forwards (visualization and data science developers)
- can fallback to support the backs (data engineers)
- generally control the game - feed the visualization developers (forwards) and thwart/soften attacks on the backs (data engineers)
Once the introspective analysis was done, I was content in the knowledge that I had dispensed the right advice as a data practitioner.
Coming back to the scenario - the Business Head went ahead with his direct-implement approach, i.e. without much assessment, sizing or architecture.
What do you think might have happened? Did they face challenges, or were they successful all the way?
Share your views and perceptions.
P.S.: The outcome was quite interesting; I will share it in a couple of days as a follow-up.
Data Professional | Author | Career Counselor
Appreciate your responses - Ravindra Nurukurthi, Kiran Cavale, Tanuj Govalkar and Chiranjeev Singh Sabharwal. As promised in the original article, here is what actually happened. The Business Head coerced his primary IT services team to implant 2 ETL developers and went ahead with his approach, i.e. take one data source at a time and ingest it into a data repository. They did derive some immediate success, but it didn't last long. Within 6-7 months they were looking for "new insight" to solve their data management problems. Being fair to the Business Head, apart from an inappropriate strategy for handling data management projects, they had a couple of other challenges, viz.:
- Data sensitivity and security - their data was highly sensitive and thus needed to be kept under high security (on an isolated network) even within the organization.
- People and change management issues - folks at the level below the Business Head had severe job insecurity and (fear of yielding) control issues.
I hope that when they have a go at it again, they do so with the right data management approach.
Manager - Digital Solutions at Worley
I think in such scenarios, planning, assessment and sizing also happen in an iterative fashion. Quick wins/PoCs help us showcase capabilities to the decision makers and get an in-principle agreement for the initiative; however, when the actual work starts it will definitely add some amount of rework. It's always better to "fail fast" and identify root causes earlier in the project than to wait and watch. The only drawback of doing it this way is limited control over timelines and, sometimes, budget.
Delivery Associate Director @ NTT Data
Nimish, you should implement this using a combination of Data Mesh and Data Vault 2.0.
Data Science | Gen BI + GenAI Evangelist| GCP & Azure | Data Platform Modernization & Transformation | Competency and Capability Development I Cloud Analytics & BI Consulting | IIM-Indore | DU
Can't agree more, Nimish. While both approaches have their pros and cons - especially the instant gratification of two data sources going live in a short time using the PoC approach - the amount of rework and retrofitting which needs to be done later outweighs the initial benefits. Remember the saying: "failing to plan is planning to fail".
Data & Analytics, Digital
Hi Nimish - you may like to look at Data Vault 2.0 as an approach to this problem; this YouTube video covering client experiences is useful: https://www.youtube.com/watch?v=3PXJrD4GkbA