A Day in the Life of a Data Engineer

Srivatsan Srinivasan

Chief Data Scientist | Gen AI | AI Advocate | YouTuber (bit.ly/AIEngineering)

Published: September 17, 2019

It was the summer of 2004. We were tasked with solving a data mystery that no one else in the industry had attempted: a bunch of newly published banking regulations (Basel II) that we had to model and comply with by the regulatory deadline. Fifteen years on, we have solved many data stories and crimes wearing the hats of Data Janitor, Data Detective, Data Engineer and Data Scientist, together and separately, with various customers and for multiple organizations.

Over the weekend I got a chance to catch up with my partner in data investigation, Sudipta BasuRay, to get his answers on some of the new trends, the role of data engineering in data science, and advice for newcomers entering the data engineering space.

Sudipta is a data engineering leader and hands-on practitioner who has enabled some of the Fortune 100 companies in their data-driven journey. He specializes in designing and delivering data-driven products, multi-tenant data engineering solutions, and low-latency streaming solutions, both in the cloud and on-premise.

Let's get started on the Q&A with Sudipta BasuRay, Senior Director and Chief Architect, AI&A, Cognizant Technology Solutions.

Where do you see the future of the on-premise big data stack? With companies like Cloudera and MapR struggling to turn profitable, and a huge push in the enterprise for cloud-native apps, do you see data and cloud native crossing paths?

The Big Data stack emerged because it allowed us to expand on the distributed processing paradigm. The next frontier of scaling was to separate compute from storage, which cloud and containerized execution turbocharged.

We have yet to solve the problem of access latency on blob storage, and we have yet to make real advances in intelligent caches that sit between compute and storage. Managing state within unbounded streams has progressed a lot, yet it is still not simple to design.
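To make the point about stream state concrete, here is a minimal sketch of keeping per-window state over an unbounded stream. Everything in it is an assumption made for illustration: the `TumblingWindowCounter` class, its parameters, and the simplification of treating the highest event time seen so far as the watermark are not taken from any particular streaming engine. It only shows why the design gets subtle: state must be held until late events can no longer arrive, and then released.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per key in fixed-size, non-overlapping event-time windows.

    Hypothetical illustration: the watermark is simply the highest event
    time seen so far, which real engines refine considerably.
    """

    def __init__(self, window_size, allowed_lateness=0):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.state = defaultdict(int)      # (window_start, key) -> running count
        self.watermark = float("-inf")     # highest event time observed

    def process(self, event_time, key):
        """Fold one event into state; return any windows this event closes."""
        window_start = event_time - event_time % self.window_size
        window_end = window_start + self.window_size
        # An event is dropped if its window was already finalized.
        if window_end + self.allowed_lateness <= self.watermark:
            return []
        self.state[(window_start, key)] += 1
        self.watermark = max(self.watermark, event_time)
        return self._close_expired_windows()

    def _close_expired_windows(self):
        """Emit and evict windows that can no longer receive events."""
        closed = []
        for start, key in sorted(self.state):
            if start + self.window_size + self.allowed_lateness <= self.watermark:
                closed.append((start, key, self.state.pop((start, key))))
        return closed


counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
for event_time in (1, 3, 12, 8, 16, 4):
    for window_start, key, count in counter.process(event_time, "sensor-a"):
        print(f"window starting at {window_start} for {key}: {count} events")
```

The event at time 8 arrives out of order but within the allowed lateness, so it is still counted. The event at time 4 arrives after its window has been closed and is dropped. Choosing that lateness is a trade-off between completeness and how long state has to be kept.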

My opinion is that, going into the future, we have to challenge our erstwhile foundation that all data has to be together. We have to rethink data gravity, critically question why a workload is batch, and ask what we achieve by moving data from Point A to Point B that we couldn't achieve in situ.

Monoliths have made way for microservices. Data can be anywhere, as long as it is accessible, controlled and exchanged in a common way. Every processing pipeline we build should scale the same way: add more containers and it scales up for as long as the peak lasts, then scales back down.
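The scale-up-then-scale-down behaviour described above can be sketched as a small sizing rule. This is only an illustration under stated assumptions: the `desired_workers` function, the backlog metric, the per-worker capacity and the min/max bounds are all hypothetical, and a real deployment would normally delegate this decision to the platform's autoscaler.

```python
import math


def desired_workers(backlog, per_worker_capacity, min_workers=1, max_workers=20):
    """Return how many containers to run for the current backlog.

    Hypothetical rule: one worker per `per_worker_capacity` pending items,
    clamped to the [min_workers, max_workers] range.
    """
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    needed = math.ceil(backlog / per_worker_capacity)
    return max(min_workers, min(max_workers, needed))


# A made-up load trace: quiet, ramping up to a peak, then draining again.
backlog_trace = [50, 400, 1800, 2500, 900, 100]
plan = [desired_workers(backlog, per_worker_capacity=100) for backlog in backlog_trace]
print(plan)  # [1, 4, 18, 20, 9, 1]
```

The worker count follows the peak and falls back once it passes, which is the property the pipeline design has to preserve: no step in the pipeline should hold state that stops a container from being added or removed.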

Our distributed data pipelines and event processing streams stand to gain in scalability, reliability, operational monitoring and response latency, as cloud-native disciplines allow a lot of these to be addressed naturally.

Can you talk about the importance of data engineering in the data science cycle, and what does a data engineer's typical day look like?

Dissect the data science life cycle and you will realize that data science is only a small part of what we name the life cycle after. The long tail of data preparation, the engineering effort of sustaining the discovery life cycle, and the analytic infrastructure to vend the insights make up over 60-65% of the effort. Without the right data, available in the right structure and within the desired latency, the material benefits of data science won't show up. Putting all of that together is what data engineering is all about.

A day in the life of a data engineer is about harvesting data from upstream systems using SQL, programming pipelines that translate it into feature-rich, meaningful prepared datasets, and chaining them all together into the analytical delivery infrastructure.
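As a rough illustration of that flow, the sketch below harvests rows with SQL, derives features in code, and chains the two steps into one pipeline. The `transactions` table, the sample rows, the feature names and the `HIGH_VALUE_THRESHOLD` rule are all invented for the example, and an in-memory SQLite database stands in for the upstream system.

```python
import sqlite3

HIGH_VALUE_THRESHOLD = 100.0  # hypothetical business rule, for illustration only


def load_sample_data(conn):
    """Stand-in for an upstream source: a tiny, made-up transactions table."""
    conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [("c1", 20.0), ("c1", 80.0), ("c2", 5.0)],
    )


def harvest(conn):
    """Harvest step: let SQL do the set-based aggregation per customer."""
    cursor = conn.execute(
        """
        SELECT customer_id,
               COUNT(*)    AS txn_count,
               SUM(amount) AS total_amount
        FROM transactions
        GROUP BY customer_id
        ORDER BY customer_id
        """
    )
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def add_features(rows):
    """Preparation step: derive features that downstream models can use."""
    return [
        {
            **row,
            "avg_amount": row["total_amount"] / row["txn_count"],
            "is_high_value": row["total_amount"] >= HIGH_VALUE_THRESHOLD,
        }
        for row in rows
    ]


def run_pipeline(conn):
    """Chain the steps: harvested rows flow straight into feature preparation."""
    return add_features(harvest(conn))


conn = sqlite3.connect(":memory:")
load_sample_data(conn)
features = run_pipeline(conn)
for row in features:
    print(row)
```

Keeping each step a plain function that takes and returns rows makes the chain easy to test in isolation and easy to move to a distributed engine later.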

From exploring the data, to designing and coding a pipeline that embeds data structures that perform and scale, to coding the service that encapsulates the ML model responding to a business problem, to spinning up the infrastructure that serves the model: each of these is a little part of a day in the life of a data engineer.
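The "service that encapsulates the ML model" can be pictured as a thin layer that validates a request, calls the model and shapes the response. The sketch below is a deliberately simplified assumption: `ThresholdModel` is a hypothetical stand-in for a trained model, and `handle_request` is a framework-free handler that a web framework would wrap in practice.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model loaded from storage."""

    def predict(self, features):
        return 1 if features["total_amount"] >= 100 else 0


def handle_request(model, body):
    """Validate a JSON request body, score it, and return (status, JSON body)."""
    try:
        payload = json.loads(body)
        features = {"total_amount": float(payload["total_amount"])}
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        return 400, json.dumps({"error": "expected a numeric total_amount"})
    return 200, json.dumps({"prediction": model.predict(features)})


model = ThresholdModel()
print(handle_request(model, '{"total_amount": 150}'))
print(handle_request(model, "not json"))
```

Separating the handler from the model keeps the serving contract stable while the model behind it is retrained or replaced.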

What do you see as the top challenges within the enterprise that typically derail data science initiatives?

The ability to get timely access to fit-for-purpose data that can support the desired analytical outcomes.
Building reliable data pipelines and weaving the results of data science initiatives into a common event processing fabric requires data engineering, data science and software engineering disciplines to converge. "Them vs. us" thinking and the so-called boundaries continue to inhibit scaled AI pursuits.
Hard-wired systems ignore the natural evolution of data; systems that respond to changing data behavior still exist only on paper.
More often than not, for our own selfish reasons, we have kept data and analytics as separate disciplines and specializations, away from applications.

These boundaries prevent us from taking advantage of insights with a short shelf life, and we end up creating separate architecture blueprints to solve what should have been done upstream.

With your experience leading large data engineering initiatives, what do you suggest budding data engineers should learn?

I would reflect on my two-decade journey as a data practitioner. Back in 2000 there was a choice: open systems vs. mainframe. For those who got on open systems, the choice evolved into Java/J2EE programming vs. SQL and databases. Eventually, those who chose databases found themselves challenged by ETL tools that took away the last little remnants of programming, and the divide was complete.

The advent of Big Data circa 2013 brought programming back into the data world, and over time it has morphed into Spark, Scala and Python. Lost in the maze of SQL vs. programming was one important but latent need: data structures and algorithms. The differentiation between programmers and engineers also became pronounced. Engineers converge all three disciplines without a bias toward one over the others.

Long story short: data engineering equals all of the below, along with a shade of programming skill on at least one cloud:

Strong developer skills. (I personally like listening to the developer advocates of Google and the like more than to leaders who talk about trends in AI and industry impacts.) Remember, at the end of the day everything is "code".
SQL: this is the lingua franca of data, let's accept it. Good or bad, mundane or classy, we can't forget SQL basics and set theory.
Algorithms and data structures, which we thought had a place only in our engineering days, will find increased applicability in bringing scale and efficiency to what we build, and in templatizing and productizing it.
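One way to see how the three skills meet is the join, which SQL expresses declaratively and which an engine has to implement with a concrete algorithm. The sketch below is an illustrative assumption, not how any specific engine works: it compares a naive nested-loop inner join with a hash join built on a dictionary, using made-up `customers` and `orders` rows.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Inner join by comparing every left row with every right row: O(n * m)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Inner join using a hash index on the right side: roughly O(n + m)."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)          # build phase: group right rows by key
    return [                             # probe phase: look up each left row
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]


customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Chen"},
]
orders = [
    {"customer_id": 1, "order_total": 40},
    {"customer_id": 3, "order_total": 15},
    {"customer_id": 1, "order_total": 25},
]

assert nested_loop_join(customers, orders, "customer_id") == hash_join(
    customers, orders, "customer_id"
)
for row in hash_join(customers, orders, "customer_id"):
    print(row)
```

Both functions return the same rows, but only the hash-based version stays practical as the inputs grow, which is the kind of scale and efficiency gain that data structures bring to a pipeline.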

In this ever-changing and evolving landscape, what does one do to remain focused on data engineering career goals?

As we embark on this journey, or for those who are already on it, we need to remind ourselves each day:

Data engineering is not sexy or catchy, but it is something enterprises can't do without in their AI realization journey. Minus the hype, discover the true workload challenges.
Experience beats education. Be hungry to learn: even if it means resetting the clock, it will help, and everything adds up. It takes a lifetime to master, with every day being a learning experience.
Work in a place that requires you to do everything at one point or another: infrastructure, monitoring, coding, tuning. The more hats you wear, the more points you collect towards being full stack!
Community, community! Stay connected with the community; this is the new normal for learning, collaborating and growing.
There is no right technology, and there is no future-proof technology. What is right today will eventually become legacy. The fundamentals of distributed data processing and scaling do not change, only the means do. Invest time and effort to keep up with what's happening.

And above all, choose a mentor who can guide you.

You can connect with Sudipta on LinkedIn - https://www.dhirubhai.net/in/sudipta-basuray/

You can also follow my YouTube channel, where I will be posting information on data science, data engineering and anything related to data - https://www.youtube.com/channel/UCwBs8TLOogwyGd0GxHCp-Dw

Matthew Brown

AI System Support ][ UX/UI/Technical Writing ][ Database/Operations/Product/Support System Admin

5 years ago

#datsplat #aftdat #careers

Deepanjan Jha

Over 18 years of experience in data governance, data architecture, data modelling, Master Data Management & ETL tools. Experienced in design and development of AWS based data platform

5 years ago

Wonderful insights. This stresses points that many hopeful data engineers might already be thinking about and urges them to act on their thinking.

Shailendra Chaudhary

Machine Learning Enthusiast | Actively looking for opportunities

5 years ago

@Amey Nawar
