Circus of assessment, architecture, sizing and modeling in Data Management
On receiving my back-of-the-envelope estimate of 5-6 people for six months, the business head was mad at me.
- This was for building a data repository and ingestion pipelines for 50+ streaming and micro-batch data sources
- A mixed bag of data sources - some received via sophisticated API calls, others scraped from emails or websites, while most arrived as CSV/XLS files (see the cataloging sketch after this list)
- Some streamed in throughout the day, while others were loaded weekly or monthly
- Most were daily, while the critical ones streamed in every few minutes
- There was also a lot of processed and semi-processed historical data (10+ years) stored in MS Access databases with ad hoc designs
- There was no dearth of software - database and pipeline-building tools were readily available.
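To make the heterogeneity concrete, here is a minimal sketch (in Python, purely illustrative - the field names, categories and sample values are my assumptions, not details from the actual engagement) of the kind of per-source catalog entry such a mixed bag of sources calls for:

```python
from dataclasses import dataclass
from enum import Enum


class Delivery(Enum):
    API = "api"
    EMAIL_SCRAPE = "email_scrape"
    WEB_SCRAPE = "web_scrape"
    FILE_DROP = "file_drop"        # csv / xls received as files


class Frequency(Enum):
    STREAMING = "streaming"        # every few minutes
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"


@dataclass
class SourceProfile:
    """One entry in a source inventory of the kind an assessment would produce."""
    name: str
    delivery: Delivery
    frequency: Frequency
    fmt: str                       # e.g. "json", "csv", "xls"
    est_rows_per_day: int          # volumetric estimate
    avg_row_bytes: int             # volumetric estimate
    history_years: float           # back-data to migrate (e.g. the old MS Access stores)
    owner: str                     # business contact for usage/application questions


# Hypothetical entries only -- the real values are exactly what the assessment gathers
inventory = [
    SourceProfile("trades_feed", Delivery.API, Frequency.STREAMING, "json",
                  est_rows_per_day=2_000_000, avg_row_bytes=400,
                  history_years=2, owner="ops"),
    SourceProfile("branch_returns", Delivery.FILE_DROP, Frequency.MONTHLY, "xls",
                  est_rows_per_day=5_000, avg_row_bytes=250,
                  history_years=10, owner="finance"),
]
```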
Along with my ball-park estimate, I also added five key messages:
- An assessment (a 2-3 week exercise) of the environment, covering data sources and structures, usage/applications, technology and volumetric information, would be needed
- A well-defined scope, architecture, detailed approach, and a project plan with headcount and skills would be delivered as the output of the assessment exercise
- A sizing exercise would be needed to determine infrastructure needs (storage, compute, network) and Db sizing (a rough sketch of such a calculation follows this list)
- Data modeling/Db design of the data repository was the crux of the project
- Typical project execution would involve one-time high-level requirements & design, followed by iterative low-level design -> development -> testing -> deployment
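To make the sizing point concrete, here is a rough back-of-the-envelope sketch (again in Python and purely illustrative - the per-source figures, compression ratio, overhead multiplier and retention period are all assumptions to be replaced with numbers gathered during the assessment) showing how per-source volumetrics roll up into a storage estimate:

```python
# Back-of-the-envelope storage sizing -- every figure and multiplier below is an
# assumption, to be replaced with what the assessment exercise actually finds.

sources = [
    # (name, est_rows_per_day, avg_row_bytes, history_years)
    ("trades_feed",    2_000_000, 400,  2),
    ("branch_returns",     5_000, 250, 10),
    # ... one tuple per source from the assessment inventory
]

COMPRESSION_RATIO = 0.4   # assumed compressed/columnar storage factor
OVERHEAD_FACTOR   = 2.0   # assumed indexes, staging copies, replicas
RETENTION_YEARS   = 5     # assumed forward retention requirement

daily_bytes = sum(rows * row_bytes for _, rows, row_bytes, _ in sources)
history_bytes = sum(rows * row_bytes * 365 * hist
                    for _, rows, row_bytes, hist in sources)

projected_bytes = ((daily_bytes * 365 * RETENTION_YEARS) + history_bytes) \
                  * COMPRESSION_RATIO * OVERHEAD_FACTOR

print(f"Raw daily ingest : {daily_bytes / 1e9:.2f} GB/day")
print(f"Projected storage: {projected_bytes / 1e12:.2f} TB over {RETENTION_YEARS} years")
```

A similar pass over message rates, concurrent users and query patterns would drive the compute and network side of the sizing.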
His contentions were
- Why can't you just take the data sources one by one and ingest them into some Db, starting tomorrow?
- What is the need for any kind of assessment exercise?
- Why do any elaborate sizing now? We will add capacity as and when needed
- What is the need for senior data modelers in this assignment?
- Why don't you load a couple of data sources in a data repository and show us?
His contentions were irksome and seemed unreasonable, but wearing my data-practitioner hat I conveyed the relevance of the assessment and sizing activities. I don't think I was very successful in pitching my proposition, and the discussion trailed off.
Interestingly, I did find merit in two of his points:
- taking one data source at a time and
- loading a couple of data sources as a Proof-of-Concept
Why did I have reservations about these two points? I couldn't put a finger on it.
- Was there any harm in doing a PoC with a couple of data sources?
- Why couldn't we ingest the data sources one by one and finish the activity?
- Did we really need all the circus of assessment, architecture, sizing and modeling?
On the third day after my interaction with the business head, while I was crossing Vashi bridge, it struck me!
Yes, executing the ingestion of data sources one by one (i.e. the software development of ETL pipelines in this scenario) is perfectly fine, but the preceding steps of assessment, high-level requirements and design are indispensable for a successful project. Without defining a blueprint - architecture, sizing and approach - there is a high probability of ending up with yet another ad hoc, randomly designed data repository.
Jumping straight into execution in such data management scenarios does show immediate short-term gains, but they are soon eroded by rework, patchwork fixes and, eventually, badly designed, poorly performing data assets.
The business head's contention was valid, but only for the second half of the project: implement piecemeal, but plan, size and design holistically.
PoCs are great when it's a new piece of technology, process or methodology - new to the market, new to the industry, or maybe just new to the organization. They help unearth challenges, validate hypotheses, provide exposure and build confidence. PoCs are a good vehicle to demonstrate capability and can act as a precursor to a project, but because of their tactical emphasis they miss out on the overall plan and design elements. If a PoC is segued into an actual project, reusing its code and structures as-is, the project is most likely heading towards a catastrophe.
Data modelers are the midfielders of data management projects. Like soccer midfielders, they are generally more experienced and have good game insight.
Like midfielders in soccer
- data modelers support the forwards (visualization and data science developers)
- can fallback to support the backs (data engineers)
- generally control the game - feed the visualization developers (forwards) and thwart/soften attacks on the backs (data engineers)
Once the introspective analysis was done, I was content in the knowledge that I had dispensed the right advice as a data practitioner.
Coming back to the scenario - the Business Head went ahead with his direct-implement approach, i.e. without much assessment, sizing or architecture.
What do you think might have happened? Did they face challenges, or were they successful all the way?
Share your views and perceptions.
P.S.: The outcome was quite interesting; I will share it in a couple of days as a follow-up.
Data Professional | Author | Career Counselor
Appreciate your responses - Ravindra Nurukurthi, Kiran Cavale, Tanuj Govalkar and Chiranjeev Singh Sabharwal. As promised in the original article, here is what actually happened. The Business Head coerced his primary IT services team to implant 2 ETL developers and went ahead with his approach, i.e. take one data source at a time and ingest it into a data repository. They did derive some immediate success, but it didn't last long. Within 6-7 months they were looking for "new insight" to solve their data management problems. Being fair to the Business Head, apart from an inappropriate strategy for handling data management projects, they had a couple of other challenges, viz.:
- Data sensitivity and security - their data was highly sensitive and thus needed to be kept under high security (on an isolated network) even within the organization.
- People and change management issues - folks at the level below the Business Head had severe job insecurity and (fear of yielding) control issues.
I hope that when they have a go at it again, they do so with the right data management approach.
Manager - Digital Solutions at Worley
I think in such scenarios, planning, assessment and sizing also happen in an iterative fashion. Quick wins/PoCs help us showcase capabilities to the decision makers and get an in-principle agreement for the initiative; however, when the actual work starts it will definitely add some amount of rework. It's always better to "fail fast" and identify root causes earlier in the project than to wait and watch. The only drawback of doing it this way is limited control over timelines and, sometimes, budget.
Delivery Associate Director @ NTT Data
Nimish, you should implement this using a combination of Data Mesh and Data Vault 2.0.
Data Science | Gen BI + GenAI Evangelist| GCP & Azure | Data Platform Modernization & Transformation | Competency and Capability Development I Cloud Analytics & BI Consulting | IIM-Indore | DU
Can't agree more, Nimish. While both approaches have their pros and cons - especially the instant gratification of two data sources going live in a short time using the PoC approach - the amount of rework and retrofitting which needs to be done later outweighs the initial benefits. Remember the saying: "failing to plan is planning to fail".
Data & Analytics, Digital
Hi Nimish - you may like to look at Data Vault 2.0 as an approach to this problem; this YouTube video covering client experiences is useful: https://www.youtube.com/watch?v=3PXJrD4GkbA