How to Architect for Data Consumption
Credits: Timasu (CC0) https://pixabay.com/en/analytics-google-analytics-1925495/

This is my pet peeve: technical architects are building systems and applications that make data analysis complicated, error-prone, and inefficient. Data consumption should be a first-class requirement of any system that is built. Here is how we could architect differently.

Technical systems architects, myself included until recently, are used to building systems with considerations such as development time, robustness, and evolvability. Analytics was an afterthought. Having spent a few years crunching data at various scales, I see the world differently: I see barriers to analytics throughout these systems. Here are a few thoughts on the nature of those barriers and how to address them.

Understanding the Nature of Data Analysis

In order to architect for data, we first need to understand the factors that can make data analysis complex:

  1. Dataset Uncertainty. The specific datasets required for individual analytic tasks are determined by the business need, and cannot always be predicted ahead of time.
  2. Time Uncertainty. The amount of time available depends on business timelines.
  3. Unknowns. Any hidden assumptions or contexts make analytics error-prone.
  4. Data Validity. Most analysis assumes data is internally consistent. Errors are embarrassing and will trigger conflict with the engineering team. Reconciliations are very expensive as well.
  5. Deep Changes. Some analysis involves conducting experiments, which can entail deep changes to system architecture.
  6. Data Updates. Analysis sometimes involves updates to data. The interfaces have to be scalable because the updates tend to be detailed, touching every record (e.g., a customer segment for each customer).
  7. Untracked Data and Tooling. Analysts use small and large freestanding tools that are not integrated with the larger platforms. However, these tools and their output have to be incorporated into the larger system.
  8. Inefficiency. Business questions are often repetitive and have predictable structures. They often look at the same data and use standard methods.

All Data Should Be (Eventually) Consumable

Any data being captured will be about the service being offered, the system delivering the service, or the consumption of the service. Optimizing and adapting a business requires understanding every aspect of the business and acting on it. It is inevitable that chasing business questions will eventually lead to every corner of the data.

If some data is never consumed, that is a signal too. The data and application should be examined for the continued relevance of the original business objective that led to collecting the data in the first place. The analytical process, too, needs to be audited for soundness.

Architecture Should Cover Analytical Systems

Often, the analytics tools and processes are assessed to be “outside the system” being architected, and are therefore not accounted for. Yet architectural changes impact the dependent analytical tooling, processes, and results. The output of the analytical process is often untracked. And the execution of the analytical process, especially on larger datasets, is usually efficient only when it works within the computational framework provided by the application.

Architects should consider analytical frameworks to be a component of the system, with two-way information exchange with the rest of the system, and should provide frameworks to compute, store, and index analytical artifacts. This allows the analytical process to scale with data, people, and questions, and reduces confusion.
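
As a concrete illustration, here is a minimal sketch in Python of an artifact store that indexes analytical outputs alongside the datasets they were derived from. All names, fields, and URIs are illustrative assumptions, not a prescribed design:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Artifact:
    """One output of an analytical process, tracked like any other system data."""
    name: str
    payload_uri: str   # where the computed result lives (table, file, ...)
    inputs: list[str]  # upstream datasets this artifact was derived from
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ArtifactStore:
    """Index of analytical artifacts so they can be discovered and audited."""
    def __init__(self) -> None:
        self._artifacts: dict[str, Artifact] = {}

    def register(self, artifact: Artifact) -> None:
        self._artifacts[artifact.name] = artifact

    def derived_from(self, dataset: str) -> list[Artifact]:
        """Find every artifact that depends on a given dataset."""
        return [a for a in self._artifacts.values() if dataset in a.inputs]

store = ArtifactStore()
store.register(Artifact("churn_scores_2024q1", "s3://analytics/churn.parquet",
                        inputs=["orders", "support_tickets"]))
print([a.name for a in store.derived_from("orders")])
```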

Data Integrity and Quality Guarantees

The architecture should specify and guarantee integrity checks for the data. Integrity issues, if discovered late, not only invalidate analysis but also make analysts (and worse, decision-makers) distrust the data provided by the system. These guarantees should be checked continuously.
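
For example, integrity guarantees can be expressed as executable checks that run on a schedule. A minimal sketch in Python with pandas; the tables and columns (orders, customers, order_id, amount, customer_id) are illustrative assumptions:

```python
import pandas as pd

def check_integrity(orders: pd.DataFrame, customers: pd.DataFrame) -> list[str]:
    """Return a list of integrity violations; an empty list means the guarantees hold."""
    violations = []
    # Uniqueness: order_id must be a key.
    if orders["order_id"].duplicated().any():
        violations.append("duplicate order_id values")
    # Domain constraint: order amounts cannot be negative.
    if orders["amount"].lt(0).any():
        violations.append("negative order amounts")
    # Referential integrity: every order must point at a known customer.
    orphans = ~orders["customer_id"].isin(customers["customer_id"])
    if orphans.any():
        violations.append(f"{orphans.sum()} orders reference unknown customers")
    return violations
```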

Quality is a slightly different challenge. Any ambiguity or gaps in the information collected, such as timestamps and unstructured text, have an impact on the availability and value of analytics output. Architectural tradeoffs impacting data quality, scope, and availability should be coordinated with the analytics team.

Data Discoverability and Accessibility

Analysts are unlikely to read through design documents and code. The architecture should provide interfaces that allow users to discover what data is stored in the system, and standardized interfaces to access it, short of going to the database. These interfaces should be self-describing, comprehensive, and evolve with the rest of the system. This also implies that the architecture should come with a data governance structure, with implementation, management, and auditing as built-in capabilities.
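
One way to picture such a self-describing interface is a simple dataset registry with keyword search. A sketch, with hypothetical dataset names and fields:

```python
from dataclasses import dataclass

@dataclass
class DatasetDescriptor:
    name: str
    description: str
    owner: str
    columns: dict[str, str]  # column name -> human-readable meaning

# A registry the application populates as part of its normal lifecycle.
REGISTRY = [
    DatasetDescriptor(
        name="orders",
        description="One row per confirmed customer order",
        owner="commerce-team",
        columns={"order_id": "unique order key", "amount": "order total in USD"},
    ),
]

def discover(keyword: str) -> list[DatasetDescriptor]:
    """Self-describing lookup: what data exists, without reading code or schemas."""
    kw = keyword.lower()
    return [d for d in REGISTRY
            if kw in d.name.lower() or kw in d.description.lower()]

print([d.name for d in discover("order")])
```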

Data Usability

Data cannot be used without metadata that provides context. Metadata includes not only the semantics (meaning) of the data but also lineage, assumptions, dependencies, accuracy, and other information. This metadata is useful in determining the appropriateness, value, and scope of analytical processes. For example, if a certain database column is being deprecated, analysts can update their tools and processes to ignore that column.

This metadata should be explicitly managed through the entire lifetime of the system, and the architecture should provide or support a data catalog that organizes it. As the volume, lifetime, and diversity of data increase, the catalog becomes more and more valuable in focusing attention on the right data. Additional services such as search may be required to enable quick discovery of relevant metadata.
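
A minimal sketch of such a catalog in Python, including the column-deprecation example above; the tables, columns, and fields are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    semantics: str                           # what the value means
    lineage: str = "unknown"                 # where the value comes from
    assumptions: list[str] = field(default_factory=list)
    deprecated: bool = False

# Catalog keyed by (table, column).
catalog = {
    ("orders", "amount"): ColumnMetadata(
        semantics="order total in USD, tax included",
        lineage="derived from payment gateway settlement events",
        assumptions=["currency already converted to USD"],
    ),
    ("orders", "legacy_region"): ColumnMetadata(
        semantics="sales region (pre-2023 scheme)",
        deprecated=True,  # analysts should ignore this column
    ),
}

def usable_columns(table: str) -> list[str]:
    """Columns an analyst can safely build on: documented and not deprecated."""
    return [col for (tbl, col), meta in catalog.items()
            if tbl == table and not meta.deprecated]

print(usable_columns("orders"))  # ['amount']
```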

Data Lifecycle and Management

Application data keeps evolving with the architectural and implementation changes made during the lifetime of the application. Storage and access mechanisms may change with decisions about what data to keep, discard, or archive. Bugs in applications may require data to be modified or deprecated. Security policy changes may impact the accessibility and coverage of data.

All these changes impact the scope, depth, defensibility, and reproducibility of analysis. The architecture should provide mechanisms such as callback hooks and lineage tracking to enable analytical tooling to discover the impact of changes and find ways of coping with them.
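
A callback-hook mechanism can be as simple as a publish/subscribe registry keyed by dataset. A hypothetical sketch (the event shape and dataset names are assumptions):

```python
from collections import defaultdict
from typing import Callable

# Hook registry: analytical tools subscribe to change events
# on the datasets they depend on.
_hooks: defaultdict[str, list[Callable[[dict], None]]] = defaultdict(list)

def on_change(dataset: str, callback: Callable[[dict], None]) -> None:
    """An analytics pipeline registers interest in a dataset it reads."""
    _hooks[dataset].append(callback)

def publish_change(dataset: str, event: dict) -> None:
    """Called by the application whenever a dataset's schema or contents change."""
    for callback in _hooks[dataset]:
        callback(event)

on_change("orders", lambda e: print(f"orders changed: {e['kind']} - rerun reports"))

# The application announces a change; dependent tooling finds out immediately.
publish_change("orders", {"kind": "column_deprecated", "column": "legacy_region"})
```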

Support for Right Abstractions

Business questions, and the data that addresses them, are often predictable and repetitive. It is not uncommon to repeat an analysis for different timeframes, products, or geographies. Duplication in datasets or in the structure of analyses can be discovered over time, and the right abstractions (interfaces and data models) can be created to reduce or eliminate it. The metadata discovery interface should include these abstractions as well, along with their context.
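
For instance, a family of near-identical revenue reports can collapse into one parameterized function. A sketch with assumed column names (order_date, amount):

```python
import pandas as pd

def revenue_report(orders: pd.DataFrame, *, period: str = "Q",
                   dimension: str = "product") -> pd.DataFrame:
    """One abstraction instead of many near-identical one-off scripts:
    the timeframe and grouping dimension are parameters, not copies."""
    return (orders
            .assign(period=orders["order_date"].dt.to_period(period))
            .groupby(["period", dimension])["amount"]
            .sum()
            .reset_index())

# The "same" question for different slices becomes a parameter change:
# revenue_report(orders, period="M", dimension="geography")
# revenue_report(orders, period="Q", dimension="product")
```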

Bulk Data Interfaces

Often, analyses involve constructing detailed models over a large number of records. For example, customer profiles have one record for each of possibly millions of customers, and the system must be updated with these changes. The architecture should provide bulk interfaces to enable efficient updates, and should support interfaces at multiple levels of granularity.
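
A bulk interface batches record-level updates into a few round trips instead of one call per record. A sketch using a hypothetical customer-segment update:

```python
from typing import Iterable

def update_customer_segments(pairs: Iterable[tuple[str, str]],
                             batch_size: int = 10_000) -> None:
    """Bulk interface: apply (customer_id, segment) updates in batches
    rather than one record-level API call per customer."""
    batch = []
    for customer_id, segment in pairs:
        batch.append((segment, customer_id))
        if len(batch) >= batch_size:
            _flush(batch)
            batch = []
    if batch:
        _flush(batch)

def _flush(batch: list[tuple[str, str]]) -> None:
    # In a real system this would be a single bulk statement, e.g.
    # executemany("UPDATE customers SET segment = ? WHERE id = ?", batch)
    print(f"applied {len(batch)} segment updates in one round trip")

update_customer_segments([("cust-001", "high_value"), ("cust-002", "churn_risk")])
```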

Experiments

Experimentation is increasingly common: product variants are presented to end-customers in systematic ways to help analysts understand end-customer sensitivity to product attributes such as pricing and size. This creates multiple control paths through the entire system, and significantly complicates the architecture needed.

Architecting with an awareness of the scope, depth, and likelihood of experiments will reduce the need for convoluted methods or patches to the architecture.
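
One common mechanism (not necessarily the author's) is deterministic hash-based bucketing, which lets every layer of the system route a customer to the same variant without shared per-customer state. A sketch:

```python
import hashlib

def assign_variant(customer_id: str, experiment: str,
                   variants: list[str]) -> str:
    """Deterministic bucketing: the same customer always sees the same
    variant for a given experiment, without storing per-customer state."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Every layer of the system can route on the same assignment,
# avoiding ad-hoc control paths per experiment.
print(assign_variant("cust-001", "pricing-2024", ["control", "discount-10"]))
```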

Summary

Analytics used to be an optional add-on. It is increasingly a core component as we make applications more intelligent and responsive. Thinking through the nature of analytics in the context of the business, both today and over the long term, and the kinds of capabilities that will enable efficient and effective analytics, will lead to better overall systems.

Adding data consumption to the goals of a system’s architecture will increase work in the short term, in the form of new interfaces and mechanisms, but the “data debt” has to be paid at one time or another. Putting frameworks and approaches in place early on will reduce the long-term costs.

Technical architects often make assumptions about the use cases of the data, the knowledge and skills of the data user, and the mechanisms to be provided. The data ecosystem is evolving rapidly, and it is best if these assumptions are explicitly identified and tested constantly so that systems stay in sync with emerging needs.

(Hat tip to Premkumar for highlighting a gap)

Dr. Venkata Pingali is an academic turned entrepreneur, and co-founder of Scribble Data. Scribble aims to reduce friction in consuming data through automation.

Benjamin Carter

Manager of Data Analytics at UFCU

3y

Having experienced every issue discussed from the other side of the table, I just want to say thank you for this article. It is spot on. So many problems can easily be negated by involving the correct people in appropriate conversations at the right time. I can tell from your writing that you ensure this happens to the best of your ability, and I am certain your users appreciate you.

Ravindra Kompella

Vice President, Technology

7y

Well authored! This, coupled with measuring accuracy and discipline in entering manual data, will ensure a healthy data-entry lifecycle, devoid of noise, and will help downstream engines immensely in coming up with accurate analyses/predictions.

Venkat Terugu

Engineer | Founder & CEO @ Ciphercode.ai – Brand-Centric Customer Trust Platform | CTO | Cyber security | Digital Transformation Leader | AI & SaaS Innovator | Startups, Strategic GTM, execution | #Entrepreneurship

7y

Excellent article, very fascinating! It helped me to see things from a data perspective. Yes, all consumable data elements and objects should be thought through at every block of the architecture, with simple interfaces defined. Traditionally in embedded designs, the data part is ignored at the initial stage of solution design. As you rightly said, this should be the first-class requirement. I believe Scribble is in business at the right time; India is slowly moving from data-poor to data-rich, and Scribble has a lot to do there. Wish you all the best!

Naveen Negi

Search | Analytics | SaaS

7y

By the way, there are a lot of modern data applications that have been architected pretty well for consumption. I can count many, but a cursory glance at Salesforce's APIs is a good example. However, as I said, no one size will fit all; basically, however good the data layer or interface, one will always need middleware in some form or another to ensure all applications in the enterprise can consume the data.

Naveen Negi

Search | Analytics | SaaS

7y

Building a data layer that caters to all sorts of consumers is a holy grail; it is nice to talk about in concept, but the fact of the matter is that there are business-critical applications and interfaces that use their own formats and protocols, and they are all incompatible. That is why there has always been a need for adapters, integration layers, and now API-based integration products. Technology will keep changing, but the fundamental nature of the problem won't.
