Two Critical Dimensions of Data Engineering: Orchestration and Observability
Kirk Borne, Ph.D.
LinkedIn Top Voice, Thinkers360 Top 25 Overall Thought Leader, Founder of Data Leadership Group (Data Scientist. Top Influencer. Speaker. Trainer. Consultant. Astrophysicist). Advisor to PrimeAI and other AI startups.
If you have been active professionally in the data space for any length of time, then you know that the business implementation of any data-fueled product requires a team effort and numerous moving parts (i.e., various pieces of the internal infrastructure and user-facing components that must work together). I personally have been in different roles on such teams over my career, where I have seen the vital necessity of well-choreographed collaboration among those different components.
I have always worked with data. I started as an academic research scientist (specifically, as an astronomer), then served as project scientist on a large space telescope’s data system, and then as project manager overseeing a multi-team staff who designed, developed, and operated data systems for a suite of different astronomy satellite projects. Ultimately, I migrated from astronomy research to applying machine learning, data science, and analytics to data everywhere: research, teaching, mentoring, advising (data and AI teams, executives, and startups), social promoting, blogging, and working on projects for many different clients and industries.
In the beginning, I was developing code and implementing algorithms on my desktop computer for my own data analysis and data exploration tasks: focusing on the data of my science. But as my career roles progressed, so did my appreciation of the other dimensions of data: the science of all data (data science) and data engineering. As a lifelong learner, I still had much to learn!
I noticed an all-too-common gap between teams working on science and models, those working on data infrastructure, and those working on user-facing systems. I learned how massively important data engineering is for project success, particularly at enterprise scale, whereas data science and modeling tasks are just one step in the data workflow and in the data+AI=value equation.
What I initially experienced only in my astronomy projects, I saw on a grand scale for many different clients and industries. The same diversity of data engineering requirements, tasks, and data flows existed elsewhere, in every industry. There was no single vanilla-wrapper data engineering solution for all the different teams, tasks, use cases, outputs, and data products.
All through these years, one aspect of data engineering was always critical (not necessarily a bottleneck, but certainly a gate through which all projects and products had to pass). And that aspect was data orchestration. We may not have used that expression (orchestration), but the criticality of choreography was there, and the importance of having a well-rehearsed and harmonious "orchestra" was there.
That brings us to 2025 and a new podcast that I recently listened to. This is the first of many podcasts to appear in the new Microsoft “Tech Innovators Spotlight Podcast” series. The series is devoted to exploring the exhilarating journeys of passionate practitioners in the cutting-edge world of AI innovation. Speakers will include leaders of companies of all sizes who are leveraging AI and Microsoft Azure to drive their success in this rapidly emerging, growing, and evolving universe of transformative technologies. Learn how the speakers and their companies are using these technologies to transform industries worldwide and to bring rich experiences to customers. Think of the Tech Innovators Spotlight as the podcast where technology meets transformation.
The conversation in this first podcast of the series is all about data orchestration and about one more thing of critical importance. That other thing is data observability. Observability was not a term that we used in the early days, and the necessity for it was not specifically identified, but the need for it was implicit. Fortunately, observability is now explicitly incorporated into the best of data orchestration platforms and services.
I particularly see observability as having two complementary roles: it is a thing that organizations need to do (the "what") and it is an organizational strategy (the "why").
Data orchestration and data observability are what this new podcast focuses on, and I was quickly drawn into the lively discussion. The host of the podcast is Vrushali Soni, MBA, from Microsoft, and the guest speaker is Julian LaNeve, the CTO of Astronomer.io.
It was immediately natural for me to be very interested in a company called Astronomer, and I was even more excited to listen to these experts go deep into the latest developments in capabilities and platforms for data orchestration and data observability. Astronomer's two solutions for data orchestration and data observability are Airflow (the industrial-strength packaged implementation of Apache Airflow for business and industry users) and Astro Observe (their observability solution), respectively.
Airflow is an open-source data orchestration tool in the Apache ecosystem. It is the most popular data orchestration tool in use today, is downloaded millions of times each month, and is one of the top Apache Software Foundation projects. As one of the best data engineering tools available in the market, Airflow enables efficient, effective, governed, quality-assured, and smooth execution of diverse data flows in large organizations, with the ability to handle thousands of different pipelines, teams, use cases, data sources, data outputs, and end-users. The team at Astronomer recently celebrated the 10th anniversary of the initial release of Airflow.
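To make the idea of orchestration concrete, here is a minimal sketch of its core job: running tasks in an order that respects their dependencies. It uses only the Python standard library and is not Airflow's API. The task names, the `PIPELINE` mapping, and the `run_pipeline` helper are all hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "train_model": {"transform"},
    "publish_report": {"transform"},
}


def run_pipeline(dependencies, run_task):
    """Run every task once, only after all of its upstream tasks have run."""
    completed = []
    # static_order() yields tasks so that dependencies always come first.
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        completed.append(task)
    return completed


order = run_pipeline(PIPELINE, run_task=lambda name: print(f"running {name}"))
```

A production orchestrator adds scheduling, retries, parallelism, and alerting on top of this ordering guarantee, which is where the "thousands of pipelines" scale comes from.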
Astro Observe, a new observability product from Astronomer, helps their customers monitor the reliability and quality of their data inputs, pipelines, and outputs. This tool is particularly beneficial for organizations with extensive data ecosystems, as it provides insights into data flow and dependencies across multiple teams.
Astro Observe provides critical monitoring and visibility into those data flows: the who, what, when, where, how, and why for any specific project or task. Who is using a specific set of data? What specific data parcels are they using? When did they use it (including which version of the data was being used - the data provenance)? Where did the data (inputs and output products) flow across the organization, between team members, and to end-users? How was the data handled and used (including updates, manipulations, and uses within data science, machine learning, analytics, and AI pipelines)? Why was the data used (specifically, how did the data use align with business objectives, goals, mission, or other enterprise activities)?
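One way to picture those six questions is as fields on a single access record. The sketch below is my own illustration, not Astro Observe's data model. The `DataAccessEvent` class, its field names, and the sample events are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataAccessEvent:
    """Hypothetical record answering who/what/when/where/how/why for one data use."""
    who: str        # user, team, or service that touched the data
    what: str       # dataset (data parcel) identifier
    when: datetime  # time of the access
    version: str    # dataset version in use (the provenance)
    where: str      # downstream destination of the data
    how: str        # operation performed (read, write, transform, ...)
    why: str        # business objective the use supports


def events_for_dataset(events, dataset):
    """Return the events that touch one dataset, oldest first."""
    return sorted((e for e in events if e.what == dataset), key=lambda e: e.when)


events = [
    DataAccessEvent("analytics-team", "orders", datetime(2025, 1, 2, tzinfo=timezone.utc),
                    "v2", "churn_model", "read", "retention forecast"),
    DataAccessEvent("etl-service", "orders", datetime(2025, 1, 1, tzinfo=timezone.utc),
                    "v2", "warehouse", "write", "daily load"),
    DataAccessEvent("finance-team", "invoices", datetime(2025, 1, 3, tzinfo=timezone.utc),
                    "v1", "dashboard", "read", "quarterly close"),
]

history = events_for_dataset(events, "orders")
print([e.who for e in history])  # ['etl-service', 'analytics-team']
```

Keeping the "why" alongside the technical fields is what lets the same record serve both roles of observability: the operational "what" and the strategic "why".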
Another key component of Astronomer’s data engineering solutions is to identify and call out duplications. That includes duplicate copies of the data, duplicate (overlapping and/or identical) uses of the data within the enterprise, and duplicate data products generated from the data across the enterprise.
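As a rough illustration of the simplest of those cases, exact duplicate copies of data, the sketch below groups datasets by a hash of their content. This is an assumed approach for illustration, not a description of how Astronomer detects duplication. The `find_duplicate_datasets` function and the sample file names are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicate_datasets(datasets):
    """Group dataset names whose byte content is identical."""
    by_digest = defaultdict(list)
    for name, content in datasets.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only the groups that contain more than one dataset.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical datasets, keyed by path, with raw bytes as content.
datasets = {
    "sales/orders.csv": b"id,amount\n1,10\n2,20\n",
    "finance/orders_copy.csv": b"id,amount\n1,10\n2,20\n",
    "sales/customers.csv": b"id,name\n1,Ada\n",
}

duplicates = find_duplicate_datasets(datasets)
print(duplicates)  # [['finance/orders_copy.csv', 'sales/orders.csv']]
```

Content hashing only catches byte-identical copies. Overlapping uses and near-duplicate data products need lineage information rather than hashes.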
Here are a few more details about the podcast. As I mentioned above, Julian LaNeve is the CTO of Astronomer, a data infrastructure company that specializes in helping organizations manage and run data pipelines at scale using Apache Airflow. Julian discusses the growing importance of data orchestration and data observability in the context of AI and how Astronomer is positioned to support businesses in leveraging their data effectively for AI.
Needless to say, AI is the hottest tech on planet Earth right now, and data is the critical fuel for AI. We know the expression "garbage in, garbage out". Well, Astronomer delivers exactly the opposite: "high-quality data in, high-quality data products out", with visibility, quality assurances, timeliness, optimization, provisioning, monitoring, and automation in the orchestration and observability functions.
Julian highlights the rapid adoption of generative AI, particularly large language models (LLMs), and the necessity of integrating these models with up-to-date data. (The language models are only as up to date as the data that they are trained on.) He explains the concept of retrieval-augmented generation (RAG), where LLMs pull relevant information from current sources to provide accurate responses, emphasizing that the success of AI applications hinges on timely and reliable data pipelines.
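The retrieval step in RAG can be sketched in a few lines. The toy example below ranks documents by word overlap with the question and places the best match into the prompt. Real systems typically use embedding-based search and then send the prompt to an LLM, which is omitted here. The function names and sample documents are assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Return the top_k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Assemble a prompt that grounds the question in retrieved context."""
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical documents kept current by a data pipeline.
documents = [
    "The nightly orders pipeline finished at 02:00 UTC.",
    "The marketing dashboard is refreshed every Monday.",
]

prompt = build_prompt("When did the orders pipeline finish?", documents)
print(prompt)
```

The sketch also shows why pipelines matter here: if the `documents` list is stale, the retrieved context is stale, and so is the answer.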
The podcasts in the “Tech Innovators Spotlight Podcast” series will highlight valuable insights around successes and learnings that these leaders have experienced on their AI and innovation journeys. “Learnings” includes lessons from projects that succeeded and from projects that failed, a welcome component in any discussion of innovation: “fail fast to learn fast.” Accordingly, this first episode addresses the challenges organizations face when transitioning AI models from prototype to production. Julian notes that many companies struggle with this transition due to the complexities of data management. He advocates for a data engineering-first approach, which involves building robust data pipelines that can feed reliable data into AI models, thereby facilitating a smoother path to production.
Rounding out the podcast discussion are highlights of the amazing success of the Microsoft-Astronomer partnership in delivering orchestration and observability solutions to their joint and individual clients, including Airflow and Astro Observe in concert with Microsoft Fabric and Azure Cloud.
Julian explains how this collaboration enhances the discoverability of Astronomer's services within the Azure ecosystem, making it easier for companies to access data engineering capabilities. He emphasizes the importance of orchestration in managing data at scale and how Astronomer’s solutions address this need.
Throughout the episode, Julian expresses excitement about the future of AI and the critical role that data will play in shaping its trajectory. He encourages listeners to explore Astronomer's offerings, including a free trial that allows potential customers to experience the platform firsthand.
In summary, this podcast episode provides valuable insights into the intersection of data engineering and AI, highlighting the importance of reliable data pipelines, orchestration, observability tools, and strategic partnerships in driving successful AI initiatives.
To learn more about all these things, listen to this first of many podcasts in the Tech Innovators Spotlight series at this link. As the tagline for the series states, “be inspired by the unique value propositions behind groundbreaking products and see how AI is shaping the future, one innovation at a time. Tech Innovators Spotlight brings you real-world stories about the power of AI, along with valuable insights around successes and learnings that these leaders have experienced on this journey.”
Check out the first episode’s conversation about AI and innovation in the Microsoft-Astronomer partnership now, and then come back later for more episodes in the series.
Serial Entrepreneur & Founding Partner at CRONUTS.DIGITAL | Innovating Business Growth Through AI & Digital Strategies | Expert in Scaling & Transforming SMEs
4 weeks ago: Sounds like a must-listen! AI innovation in the enterprise is moving fast, and orchestration and observability are critical for scalability and reliability. One challenge I see is that many organizations focus on building AI models but underestimate the importance of data pipelines and monitoring. Without strong data engineering, even the best models fail in real-world deployment.
Follow me for emerging tech, leadership and growth topics | World Champion turned Cyberpreneur | Co-Founder & CEO, TRUSTBYTES
1 month ago: The 'Tech Innovators Spotlight' podcast delves into the future of enterprise AI.
TechMode.io Co-Founder, Tech Enthusiast, B2B Marketer, Content Creator. Follow me on X @Chels_LA
1 month ago: Thanks for sharing, Kirk Borne, Ph.D., this is a great read out! The #TechInnovatorsSpotlight is an informative and entertaining podcast. It's definitely worth a listen.
Co-Founder at TechMode.io | Technology Thought Leader and Content Creator | B2B Marketing Expert
1 month ago: Great write up here, Kirk Borne, Ph.D. #TechInnovatorsSpotlight is a fantastic listen, well worth checking out!
TIME “man of action” | Tech, Digital Transformation, and Marketing Strategist | Tech For Good. | Author. | Rutgers U adjunct. | Mayor Emeritus. | Attorney. | Keynote Speaker. | Veteran. | Sustainability. | SDGs
1 month ago: Thanks for sharing your deep experience and keen insights, Kirk Borne, Ph.D. Looking forward to the series!