Architecting Reliable Platforms for the AI-Driven Future
Generated with Google Gemini

It’s been a while since my last update, and I’m excited to reconnect with you through this new edition.

During this period of silence, I’ve been deeply engaged in the fascinating world of Artificial Intelligence (AI) and building some complex data platforms. As the initial hype around AI has started to settle, it has given me a valuable opportunity to learn, experiment, and interact with various industry leaders, product owners, consulting partners and innovative product companies.

I’ve been keeping a close eye on the exciting innovations in the AI landscape, including remarkable offerings such as Google’s Gemini, Meta’s Llama 3, Anthropic’s Claude 3, OpenAI’s GPT-4, and of course, Apple Intelligence.

While it’s still early to predict who will secure a larger market share, my analysis suggests that Nvidia and OpenAI are the new players making significant strides. Other players, meanwhile, seem to be focusing on maintaining their economic moat by enhancing user experiences through AI, at least in the short to medium term. For instance, it’s unlikely that existing O365 customers would switch to Google Apps just because Google's AI Gemini outperforms Microsoft Copilot. Similarly, the introduction of generative art capabilities in Adobe Firefly is unlikely to cause a mass migration from other favored photo editors.

Interestingly, I foresee a potential shift in the search engine market. Google’s dominance may be challenged in the medium to long term as Bing is making impressive progress within the enterprise sector.

On a personal note, I have discontinued my OpenAI subscription and am now fully utilizing WebUI-Ollama-Llama3, Gemma, Mixtral, along with Microsoft Copilot for tasks within our company’s firewall. I’m also encouraging my mentees to integrate Cody and Codellama into their development workflows for routine tasks.

An AI system is a three-legged stool

Visualize a robust three-legged stool, where each leg signifies a crucial element. The seat of the stool is a metaphor for the AI system, which depends on all three legs for stability and functionality.

  1. Foundational Data: This leg stands for high-quality, dependable data, which forms the bedrock upon which AI models are constructed. It underscores the significance of precise and comprehensive data in steering AI’s decision-making capabilities.
  2. AI Models: This leg represents the advanced algorithms and machine learning techniques that empower AI systems to learn from data and make predictions or decisions. It highlights the necessity for sturdy AI models capable of effectively analyzing and processing data.
  3. User Experience: This leg symbolizes intuitive, user-friendly interfaces that are vital for effectively conveying AI’s insights and outputs to humans (end users). It stresses the importance of designing AI systems that are user-friendly and deliver valuable results.

I use the term “foundational” with great care because without high-quality data and a functioning data governance model, no AI project can succeed.

What makes a foundational data platform great?

The three architectural cornerstones that support any enterprise-grade data platform (or any application, for that matter) are Reliability, Scalability, and Maintainability.

Let’s start with Reliability.

Picture a symphony orchestra, where each musician contributes to a harmonious melody. What happens if one musician hits a wrong note? The symphony doesn’t halt; it continues, albeit with a slight hiccup. This is the essence of system reliability - the system continues to function even when things don’t go as planned. These hiccups, or ‘faults’, are distinct from ‘failures’. A failure is when the entire system ceases to function, while a fault is a dip in performance or an error in one component, but the overall system continues to operate. Building reliable systems is much like orchestrating a fault-tolerant symphony, where the melody persists, despite the occasional off-key note.

Faults can stem from various sources. Hardware faults can occur in disks, memory, interface cards, power systems, networks, CPUs/GPUs, and so on.

For example, the average hard disk has a Mean Time To Failure (MTTF) of 10 years. So, in a system with 10K disks, you can expect about 2.74 disks to fail per day. Reliable systems address this by incorporating redundancy, like RAID configurations. With the advent of virtualization and cloud technologies, we now have a whole new set of parameters to consider for redundancy and fault tolerance.
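
To make that arithmetic concrete, here is a minimal Python sketch of the same back-of-the-envelope estimate. The constants are the illustrative figures above (a 10-year MTTF and 10,000 disks), not measurements from a real fleet, and failures are assumed to be independent.

```python
# Back-of-the-envelope estimate of daily disk failures in a large fleet.
# Assumes independent failures and a fleet kept at constant size.

MTTF_YEARS = 10        # assumed mean time to failure of a single disk
FLEET_SIZE = 10_000    # assumed number of disks in the platform

failures_per_disk_per_day = 1 / (MTTF_YEARS * 365)
expected_failures_per_day = FLEET_SIZE * failures_per_disk_per_day

print(f"Expected disk failures per day: {expected_failures_per_day:.2f}")
# -> Expected disk failures per day: 2.74
```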

Software faults could be dormant bugs that only surface under specific conditions (like a leap year or a World Cup final), or newer classes of bugs introduced by AI systems, such as hallucinations. Faults can also arise from end users doing something unusual that wasn’t anticipated during the design phase. And as applications are increasingly built as microservices, a single, seemingly insignificant service in the middle of a critical flow can trigger cascading errors, much like the butterfly effect.
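
One common way to stop such a fault from cascading into a full failure is to retry a flaky dependency briefly and then degrade gracefully. The sketch below is a simplified Python illustration of that idea; the service names, retry count, and fallback behaviour are hypothetical, and a production system would typically reach for a proper timeout and circuit-breaker library instead.

```python
import time

def call_with_fallback(primary, fallback, retries=2, delay_seconds=0.2):
    """Try a flaky dependency a few times, then degrade gracefully instead of failing the flow."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(delay_seconds)  # brief pause before retrying
    return fallback()  # serve a degraded-but-safe result

def fetch_recommendations():
    # Hypothetical downstream call; here it always fails to simulate a fault.
    raise TimeoutError("recommendation-service did not respond")

def cached_recommendations():
    return ["top-sellers"]  # stale but harmless fallback content

print(call_with_fallback(fetch_recommendations, cached_recommendations))
# -> ['top-sellers']  (the critical flow keeps working; only one widget degrades)
```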

So, how do we measure Reliability?

  • Mean Time Between Failures (MTBF): The average time between failures. For instance, a banking app with an MTBF of 10 hours means that, on average, 10 hours elapse between one failure and the next.
  • Failure Rate: The number of failures per unit of time. For example, a web-based ticketing system experiencing 5 failures per week has a failure rate of roughly 0.03 failures per hour (5 failures divided by the 168 hours in a week); the short sketch after this list shows how such figures fall out of a simple incident log.
  • Mean Time To Failure (MTTF): The average time to the first failure. For instance, a payment processing system with an MTTF of 500 hours means that on average, it takes 500 hours for a failure to occur.
  • Availability: The percentage of time the software is operational and available for use. For example, an e-commerce platform with an availability of 99.5% means that it is operational and available for use 99.5% of the time. This means the system is expected to have a downtime of approximately 43.8 hours in a year.
  • Reliability Function: A mathematical function R(t) that gives the probability the system operates without failure up to time t. For instance, R(1 hour) = 0.99 means a software component has a 99% chance of surviving its first hour of operation (equivalently, a 1% chance of failing within it).
  • Fault Tolerance: The ability of the software to continue operating despite the presence of faults or errors. For example, a distributed database system with fault tolerance means that even if one node fails, the other nodes can continue to operate and provide data without interruption.
  • Error-Free Operations: The percentage of time the software operates without errors. For example, a scientific simulation software with an error-free operations rate of 99.8% means that it operates without errors for approximately 99.8% of its runtime.
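
To see how a few of these figures fall out of raw operational data, here is a minimal Python sketch over a made-up one-week incident log; the downtime numbers are purely illustrative.

```python
# Compute failure rate, MTBF, and availability from a simple incident log.
OBSERVATION_HOURS = 168                  # one week of wall-clock time (assumed window)
downtime_hours = [0.5, 0.2, 1.3]         # made-up downtime for each of three incidents

failures = len(downtime_hours)
uptime = OBSERVATION_HOURS - sum(downtime_hours)

failure_rate = failures / OBSERVATION_HOURS   # failures per hour over the window
mtbf = uptime / failures                      # average operating time between failures
availability = uptime / OBSERVATION_HOURS     # fraction of the window the system was up

print(f"Failure rate : {failure_rate:.3f} failures/hour")
print(f"MTBF         : {mtbf:.1f} hours")
print(f"Availability : {availability:.2%}")
# -> Failure rate : 0.018 failures/hour, MTBF : 55.3 hours, Availability : 98.81%
```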

I recommend checking this excellent source for further details on these metrics.

In subsequent editions, I will delve into the other two crucial architectural considerations: Scalability and Maintainability. Stay tuned!

Andre Mueninghoff

Business IT Senior Program Manager | Business Process Service Delivery Manager | Director - Technology

4 months ago

Excellent blog…and excellent point, “I use the term “foundational” with great care because without high-quality data and a functioning data governance model, no AI project can succeed.”

Jeffrey H. Dobin

Responsible AI & Privacy-Tech Evangelist | Podcast Host | Marathon Pull-Up Athlete

5 months ago

Reliability, trust and safety are key!
