A Day in the Life of a Data Engineer
Srivatsan Srinivasan
Chief Data Scientist | Gen AI | AI Advocate | YouTuber (bit.ly/AIEngineering)
It was the summer of 2004. We were tasked with solving a data mystery that no one else in the industry had attempted: a bunch of newly published banking regulations (Basel II) that we had to model and comply with in time to meet the regulatory timeline. Fifteen years on, we have solved many data stories/crimes wearing the hats of Data Janitor, Data Detective, Data Engineer and Data Scientist, together and separately, with various customers and for multiple organizations.
Over the weekend I got a chance to catch up with my partner in data investigation, Sudipta BasuRay, to get his answers on some of the new trends, the role of data engineering in data science, and advice for newcomers entering the data engineering space.
Sudipta is a data engineering leader and hands-on practitioner who has enabled some of the Fortune 100 companies in their data-driven journey. He specializes in designing and delivering data-driven products, multi-tenant data engineering solutions, and low-latency streaming solutions, both on cloud and on-premise.
Let's get started on the Q&A with Sudipta BasuRay, Senior Director and Chief Architect, AI&A, Cognizant Technology Solutions.
Where do you see the future of the on-premise big data stack? With companies like Cloudera and MapR struggling to turn profitable, and a huge push in the enterprise for cloud-native apps, do you see data and cloud native crossing paths?
The Big Data stack emerged because it allowed us to expand on the distributed processing paradigm. The next frontier of scaling was to separate compute from storage, which cloud and containerized execution turbocharged.
We are yet to solve the problem of access latencies on blob storage, and yet to make advances in enabling an intelligent cache sitting between compute and storage. Managing state within unbounded streams has progressed a lot, yet it is still not simple when it comes to design.
My opinion is that, going into the future, we have to challenge our erstwhile foundation that all data has to be together, rethink data gravity, and critically question why a workload is batch and what moving data from Point A to Point B achieves that couldn't be achieved in situ.
Monoliths have made way for microservices. Data can be anywhere, as long as it is accessible, controlled and exchanged in a common way. Every processing pipeline that we build should scale elastically: add more containers and it scales up for as long as the peak lasts, then scales back down.
Our distributed data pipelines and event processing streams stand to gain in scalability, reliability, operational monitoring and response latency, as cloud-native disciplines allow a lot of these to be addressed naturally.
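To make the "scale out for the peak, then scale back down" idea concrete, here is a minimal sketch of how a scaling decision could be computed from a queue backlog. It is an illustration only, not how any particular orchestrator works: the function name `desired_workers`, the throughput figure and the worker limits are all assumptions made up for this example.

```python
import math


def desired_workers(backlog, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=20):
    """Hypothetical rule: size the worker pool so the current backlog
    drains within the target window, clamped to [min_workers, max_workers]."""
    if backlog <= 0:
        return min_workers
    # Events one worker can clear within the target window.
    capacity_per_worker = per_worker_rate * target_drain_seconds
    needed = math.ceil(backlog / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))


# Simulated backlog over time: quiet, ramp-up, peak, cool-down, quiet.
backlogs = [0, 500, 5000, 20000, 3000, 0]
plan = [desired_workers(b, per_worker_rate=50, target_drain_seconds=10)
        for b in backlogs]
print(plan)  # [1, 1, 10, 20, 6, 1]
```

The pool grows while the peak lasts, hits the assumed ceiling of 20 containers, and falls back to the minimum once the backlog clears.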
Can you talk about the importance of data engineering in the data science cycle, and what does a data engineer's typical day look like?
Dissect the data science life cycle and you will realize that data science is only a small part of the life cycle we name after it. The long tail of data preparation, the engineering effort of sustaining the discovery life cycle, and the analytic infrastructure to vend the insights make up over 60-65% of the effort. Without the right data, available in the right structure and within the desired latency, the material benefits of data science won't show up - putting all of that together is what Data Engineering is all about.
A day in the life of a data engineer is about harvesting data from upstream systems using SQL, programming pipelines to translate it into feature-rich, meaningful prepared datasets, and chaining them all together into the analytical delivery infrastructure.
From exploring the data, to designing and coding a pipeline that embeds data structures that perform and scale, to coding the service that encapsulates the ML model responding to a business problem, to spinning up the infrastructure that serves the model - each of these is a little part of the day in the life of a data engineer.
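The harvest, prepare and chain steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: the in-memory SQLite database stands in for an upstream source, and the `transactions` and `customer_features` tables and the `extract`/`transform`/`load` function names are hypothetical.

```python
import sqlite3


def extract(conn):
    """Harvest aggregated rows from the (hypothetical) upstream table with SQL."""
    return conn.execute(
        "SELECT customer_id, COUNT(*) AS n_txn, SUM(amount) AS total "
        "FROM transactions GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()


def transform(rows):
    """Turn the raw aggregates into a small feature set per customer."""
    return [
        {"customer_id": cid, "n_txn": n, "total": total, "avg_amount": total / n}
        for cid, n, total in rows
    ]


def load(conn, features):
    """Write the prepared features where a downstream model or service can read them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_features "
        "(customer_id INTEGER PRIMARY KEY, n_txn INTEGER, total REAL, avg_amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_features "
        "VALUES (:customer_id, :n_txn, :total, :avg_amount)",
        features,
    )
    conn.commit()


# Stand-in for an upstream system: an in-memory database with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

# Chain the stages together.
load(conn, transform(extract(conn)))
features = conn.execute(
    "SELECT * FROM customer_features ORDER BY customer_id"
).fetchall()
print(features)  # [(1, 2, 40.0, 20.0), (2, 1, 5.0, 5.0)]
```

Keeping each stage a separate function is one way to let stages be tested, monitored and scaled independently; a real pipeline would typically swap SQLite for the actual source and an orchestration framework.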
What do you see as the top challenges within the enterprise that typically derail data science initiatives?
- The ability to get timely access to fit-for-purpose data that can support the desired analytical outcomes.
- Building reliable data pipelines and weaving the results of data science initiatives into a common event processing fabric requires data engineering, data science and software engineering disciplines to converge. "Them vs. Us" thinking and the so-called boundaries continue to inhibit scaled AI pursuits.
- Hard-wired systems ignore the natural evolution of data; systems that respond to changing data behavior still exist only on paper.
- More often than not, for our own selfish reasons, we have kept Data and Analytics as separate disciplines/specializations, away from Applications.
These boundaries prevent us from taking advantage of insights with a short shelf life, and we end up creating separate architecture blueprints to solve what should have been done upstream.
With your experience leading large data engineering initiatives, what do you suggest budding data engineers should learn?
I would reflect on my journey of two decades as a data practitioner. Back in 2000, there was a choice: Open Systems vs. Mainframe. For those who got onto Open Systems, the choice evolved into Java/J2EE programming vs. SQL and databases. Eventually, those who chose databases found themselves challenged by ETL tools that took away the last remnants of programming, and the divide was complete.
The advent of Big Data around 2013 brought programming back into the data world, and over time it has morphed into Spark, Scala and Python. Lost in the maze of SQL vs. programming was one important but latent need: data structures and algorithms. With it, the differentiation between programmers and engineers became pronounced. Engineers converge all three disciplines without a bias toward one over the others.
Long story short - Data Engineering equals all of the below, along with a shade of programming skill on at least one cloud:
- Strong developer skills. (I personally like listening to the Developer Advocates of Google and their like more than to leaders who talk about trends in AI and industry impacts.) Remember, at the end of the day everything is "code".
- SQL: this is the lingua franca of data, let's accept it. Good or bad, mundane or classy, we can't forget SQL basics and set theory.
- Algorithms and data structures, which we thought had a place only in our engineering days, will find increased applicability in bringing scale and efficiency to what we build and in templatizing/productizing it.
In this ever-changing and evolving landscape, what should one do to remain focused on data engineering career goals?
As we embark on this journey, or for those who are already on it, we need to remind ourselves each day:
- Data Engineering is not sexy and catchy, but it's something enterprises can't do without in their AI realization journey. Minus the hype, discover the true workload challenges.
- Experience beats education. Be hungry to learn, even if it means resetting the clock; it will help, and everything adds up. It takes a lifetime to master, with every day being a learning experience.
- Work in a place that requires you to do everything at one point or another - infrastructure, monitoring, coding, tuning. The more hats you wear, the more points you collect towards being full stack!
- Community, community! Stay connected with the community; this is the new normal for learning, collaborating and growing.
- There is no right technology and no future-proof technology - what is right today will eventually become legacy. The fundamentals of distributed data processing and scaling do not change, only the means do. Invest time and effort to keep up with what's happening.
And above all, choose a mentor who can guide you.
You can connect with Sudipta on LinkedIn - https://www.dhirubhai.net/in/sudipta-basuray/
Also, you can follow my YouTube channel where I will be posting information on data science, data engineering and anything on data - https://www.youtube.com/channel/UCwBs8TLOogwyGd0GxHCp-Dw