Data Management in the Age of AI/GPT—One Size Does Not Fit All

By Geoffrey Moore

Author – The Infinite Staircase: What the Universe Tells Us About Life, Ethics, and Mortality

Given all the excitement around ChatGPT, everyone and their mother is talking about the importance of data: how it is the new oil, the lifeblood of artificial intelligence, the key to data-driven decision-making, the fuel for automating real-time operations, etc., etc., etc. All of which is to say, data has become a core asset and needs a formal management program dedicated specifically to its issues.

To date, however, data management has largely been the purview of the storage sector, where all data at the end of the day is just a lot of 0’s and 1’s. God bless the storage folks, for without them we would be nowhere, but as IT accelerates into the future, data is no longer a homogeneous category. Like everything else, it needs to be parsed.

Here is my cut at it.

[Figure: four-quadrant diagram of the data categories discussed below]

Beginning with quadrant 1, the data we are most familiar with comes from our Systems of Record. This data is focused on recording transactions and is both structured, typically in relational database tables, and curated, subject to audit on an annual basis. For the first two decades of my career, this was the only data that was computerized and accessible to end users. As it accumulated, it spawned management information systems, which were subsequently extended by data warehouses and data marts, which were in turn interrogated by a host of business intelligence tools for both visualization and analysis. This effort continues to this day as business professionals seek insights to guide strategy, adjust management priorities, and revise resource allocation. That said, there has been no material disruption in this quadrant during this century, so our attention today is properly focused on the other three quadrants.

The data in quadrant 2 comes from the logs of any device, application, or system that records its own operations, of which computers themselves represent a substantial subset. The function of a log file is to record every incident from the simplest click to an operational outage or a malicious hack. The data is structured around time stamps and recorded linearly, but it is not curated in any other way.

Initially, log files were used primarily for forensic purposes, to investigate and correct malfunctions. Today, however, as machine learning has grown in capacity and capability, log data is increasingly mined for patterns that can detect trends and trigger real-time responses to them. High-frequency stock trading was an early application of this kind, followed by dynamic placement of digital ads, security alerts, and fraud detection, and now increasingly predictive maintenance. It takes significant investment to train and tune these programs, but fortunately, reinforcement learning and related technologies are now letting computers conduct much of this effort on their own. From a technology adoption perspective, because there is still a lot of friction to overcome to get up and running, it is early days, but wherever the use case involves sufficiently high stakes, you can be sure these applications will proliferate rapidly.
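To make the idea of mining time-stamped log data for patterns concrete, here is a minimal sketch. It is not machine learning proper: it substitutes a simple rolling-window statistical check for a trained model, and the function name `detect_spikes`, the window size, the threshold, and the sample readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_spikes(events, window=5, threshold=3.0):
    """Yield (timestamp, value) pairs that deviate sharply from recent history.

    `events` is an iterable of (timestamp, value) pairs in time order.
    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` values.
    """
    recent = deque(maxlen=window)  # sliding window of the latest values
    for timestamp, value in events:
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield timestamp, value
        recent.append(value)


# Hypothetical log: steady readings, then one sharp spike at t=10.
log = [(t, 100 + (t % 3)) for t in range(10)] + [(10, 500), (11, 101)]
alerts = list(detect_spikes(log))
print(alerts)
```

In a real deployment the flagged events would feed an alerting or control system, and the statistical check would typically be replaced by a model trained on archived logs.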

From a data management perspective, the key principle is that we should archive log data, not throw it away. This was not the case in earlier decades, when storage costs made the practice prohibitive, but once Google showed us the way, and once the hyperscalers helped drive down the cost of storage, it became the new normal. That said, the data is not precious at the atomic level; it just needs to be available en masse to train machine learning applications.

The data from quadrant 3 represents natural language data from two different domains. The first comes from the World Wide Web and its vast ecosystem of supporting apps. This massive trove of publicly shared information from every walk of life puts the collective psyche of the human race on display, making it accessible to anyone from anthropologists to marketing professionals to public sector enterprises seeking to mine it for insights. The second major domain of natural language data comes from all the internal communications threads that enable any given enterprise to develop, execute, and monitor its business and operating plans. This is, in effect, its knowledge base.

The data in both domains is structured around events, communicated in natural language via email and collaborative applications, and in neither domain is it curated. Instead, there is simply history, tracing a never-ending flow of dialog that underpins an ever-evolving set of commitments adjusting to an ever-changing marketplace of circumstances. One would like to convert all this knowledge into a curated and consultable knowledge base, but the river never stops flowing, and sooner or later we humans get tired or bored or distracted. As a result, our curation quality begins to flatline or even degrade, and the project, although not exactly abandoned, is reluctantly sidelined.

But wait, there’s hope. Welcome to quadrant 4, the fourth data domain, the home of curated knowledge. We got our first instance of self-organizing curated knowledge at scale thanks to the work of Jimmy Wales and his colleagues at Wikipedia. Its open-source collaboration enabled an unprecedented advance in the speed and scale of knowledge curation, and for many of us, myself included, it has been a game-changer, enabling us to write authoritatively without actually being authorities ourselves. But this effort too is ultimately dependent upon human-powered processes, and the wear and tear of perpetual maintenance eventually takes its toll here as well.

That’s what makes ChatGPT and Generative AI such an exciting development. Technically, they are grounded in Large Language Models that enable the development of increasingly knowledgeable replies based on mathematical manipulation of tokenized word sets. But pragmatically, what generative chat programs are actually delivering are self-organizing knowledge bases that never get tired, bored, or distracted from their self-curation efforts. Instead, they continue to get better and better over time, asymptotically approaching world-class expertise.

The key to this success, it should be noted, is the quality and validity of the body of texts they are consuming in the process. The more expert your system purports to be, the more any “garbage in” can pose a real liability. That makes data management a top priority for this quadrant. As we learned from the quality movement, you cannot “inspect in” quality after the fact. In the case of AI GPT, it is too late to edit the output. Instead, one must control the corpora of texts that are input to training and development. Much of that input, it should be said, comes from an ongoing dialog between the prompter and the application, so keeping a log of that dialog must also be part of the overall data management strategy. But this data really is worth curating (it is your enterprise knowledge base), so investing here to secure AI GPT success is a high ROI bet.
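The two practices described above, controlling what goes into the corpus and keeping a log of the prompt dialog, can be sketched in a few lines. This is only an illustration under stated assumptions: the `Document` type, the source labels, the approved-source rule, and the JSON Lines record layout are all hypothetical, not a description of how any particular product works.

```python
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # hypothetical label for where the text came from
    text: str


def select_corpus(documents, approved_sources):
    """Keep only non-empty documents whose source is on the approved list."""
    return [
        d for d in documents
        if d.source in approved_sources and d.text.strip()
    ]


def log_exchange(stream, timestamp, prompt, reply):
    """Append one prompt/reply exchange to the stream as a JSON Lines record."""
    record = {"timestamp": timestamp, "prompt": prompt, "reply": reply}
    stream.write(json.dumps(record) + "\n")


# Hypothetical inputs: two trusted documents (one blank) and one untrusted.
docs = [
    Document("engineering-wiki", "How to rotate service credentials."),
    Document("unverified-forum", "Rumor about the product roadmap."),
    Document("engineering-wiki", "   "),
]
corpus = select_corpus(docs, approved_sources={"engineering-wiki"})

dialog_log = io.StringIO()  # stands in for an append-only log file
log_exchange(
    dialog_log,
    "2023-05-01T09:00:00Z",
    "How do I rotate credentials?",
    "See the engineering wiki page on credential rotation.",
)
```

Filtering at the input is the point of the sketch: the untrusted and blank documents never reach the corpus, and each exchange is kept as a dated record that can itself be reviewed and curated later.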

Key Takeaways

  1. When we talk about the need to manage our data, realize we are talking about four different categories of data, each of which warrants its own management system. We have reliable best practices for managing the data in quadrant 1, and frankly, we are still finding our way with the other three.
  2. The data in log files, once it has been exposed to intense machine learning, has already proven to yield optimization insights that can be translated algorithmically into real-time performance improvements. The challenge is that the life cycle of our industrial systems is long, and most of the equipment in our installed base was never designed to report its data externally. The amount of labor it takes to retrofit these systems is extensive, so for the time being, you need a strong ROI case. Energy savings from retrofitting residential or commercial buildings don’t make the cut at the present time, but pharmaceutical manufacturing lines and hyperscale data centers do, in part because there are additional value propositions at stake, including the impact on service level quality and the loss of operational expertise to retirement.
  3. The data in the public domain is too vast and heterogeneous to manage. Enterprises can still make use of it, but only through carefully controlled experiments and verification. Sentiment analysis, brand perception, and influencer power analysis are all use cases here, but the signal-to-noise ratio is too low to bet heavily on their conclusions.
  4. By contrast, the data inside enterprise communications systems represents an exceptional opportunity to create powerful knowledge bases. There are governance issues to work through here, specifically around employee privacy and enterprise entitlement, but the goal is too worthy to let them hold you back. The more knowledgeable your workforce is, the more value you can contribute to the world. There really is no excuse not to lean in here.

That’s what I think. What do you think?


Steven Spieczny

Kognic | Accelerating Embodied AI

10 months ago

I might expand the ownership of data as introduced here well beyond IT. That is too narrow. This is particularly true for your 2nd quadrant, where we might also expand beyond "log files" as the best representation of what sits in this bucket. A specific example, from a large and exploding segment, is data used to train the "machine" from complex sensor fusion (LiDAR, Radar, Camera, et al.) for use in automated and, increasingly, autonomous mobility. These datasets are product management artifacts used to power the products used in applied AI and are rarely (currently) managed by IT. Same for Marketing data, more often than not mirroring that of the most dynamic of IT-managed data pools, but generated, captured, and parsed in separate, shadow infrastructures. It's a brave new world. Better data = Better AI.

Kajol Patel

Partner Alliance Marketing Operations at Data Dynamics

1 year ago

Great insights, Geoffrey Moore! Managing log data for real-time performance improvement, leveraging enterprise communications for knowledge bases, and the transformative potential of ChatGPT are key takeaways. Exciting times ahead! #DataManagement #AI #ChatGPT

John Murphy

mTuitive, Inc. Founder, CFO & Director

1 year ago

Once again, your ability to simplify an extremely complex challenge to explain is amazing. Thank you for at least the 100th time.

Brian Wood, CISSP

Senior Technical Business Strategy Analyst | Strategic Insights, Emerging Technologies, Market Research

1 year ago

There is still the potential bias issue: willful (deliberate distortion of input to the model) and "accidental" (misplaced decimal place, wrong units of measure). Is there such a thing as "machine unlearning" in the event of discovery of bad input?

Gaurav Vaid

The Product Guy | Championing "Purpose Driven Innovation" | 3X Top LinkedIn Voice | Founding Partner @ Venturis Inc with the stated mission of "Bridging The Valleys" | Global Citizen

1 year ago

Great insights, as always!! I could not agree more that regulating the output is too late, so the focus has to be on the input. As you note, for enterprises, controlling the input is a less onerous problem than controlling the input in the public domain. I do think that the problem of scale will present itself here as well, and once an enterprise crosses a threshold, controlling input will become more challenging. Perhaps a playbook keyed to the maturity cycle and scale of an enterprise may be needed; however, I fully agree that there really is no excuse not to lean in here.
