Data Management in the Age of AI/GPT—One Size Does Not Fit All
Geoffrey Moore
Author, speaker, advisor, best known for Crossing the Chasm, Zone to Win and The Infinite Staircase. Board Member of nLight, WorkFusion, and Phaidra. Chairman Emeritus Chasm Group & Chasm Institute.
Given all the excitement around ChatGPT, everyone and their mother is talking about the importance of data, how it is the new oil, the lifeblood of artificial intelligence, the key to data-driven decision-making, the fuel for automating real-time operations, etc., etc., etc. All of which is to say, data has become a core asset and needs a formal management program dedicated specifically to its issues.
To date, however, data management has largely been the purview of the storage sector, where all data at the end of the day is just a lot of 0's and 1's. God bless the storage folks, for without them we would be nowhere, but as IT accelerates into the future, data is no longer a homogeneous category. Like everything else, it needs to be parsed.
Here is my cut at it: a two-by-two that sorts data along two dimensions, by whether it is machine-structured or natural language, and by whether or not it is curated.
Beginning with quadrant 1, the data we are most familiar with comes from our Systems of Record. This data is focused on recording transactions and is both structured, typically in relational database tables, and curated, subject to audit on an annual basis. For the first two decades of my career, this was the only data that was computerized and accessible to end users. As it accumulated, it spawned management information systems, which were subsequently extended by data warehouses and data marts, which were in turn interrogated by a host of business intelligence tools for both visualization and analysis. This effort continues to this day as business professionals seek insights to guide strategy, adjust management priorities, and revise resource allocation. That said, there has been no material disruption in this quadrant during this century, so our attention today is properly focused on the other three quadrants.
The data in quadrant 2 comes from the logs of any device, application, or system that records its own operations, of which computers themselves represent a substantial subset. The function of a log file is to record every incident, from the simplest click to an operational outage or a malicious hack. The data is structured around time stamps and recorded linearly, but it is not curated in any other way.
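To make the shape of this data concrete, here is a minimal sketch in Python of parsing one such record; the "timestamp level message" line format is an assumption for illustration, not a standard.

```python
# A minimal sketch of the structure described above: log records keyed by
# timestamp, appended linearly, with no curation beyond that ordering.
# The line format is assumed (a common syslog-like shape), not prescribed.
from datetime import datetime

def parse_log_line(line: str) -> dict:
    """Split a 'TIMESTAMP LEVEL MESSAGE' line into its parts."""
    timestamp, level, message = line.split(" ", 2)
    return {
        "timestamp": datetime.fromisoformat(timestamp),
        "level": level,
        "message": message,
    }

record = parse_log_line("2024-05-01T12:00:03 ERROR payment gateway timeout")
print(record["timestamp"], record["level"])
```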
Initially, log files were used primarily for forensic purposes, to investigate and correct malfunctions. Today, however, as machine learning has grown in capacity and capability, log data is increasingly mined for patterns that can detect trends and trigger real-time responses to them. High-frequency stock trading was an early application of this kind, followed by dynamic placement of digital ads, security alerts, and fraud detection, and now, increasingly, predictive maintenance. It takes significant investment to train and tune these programs, but fortunately, reinforcement learning and related technologies are now letting computers conduct much of this effort on their own. From a technology adoption perspective, it is early days, because there is still a lot of friction to overcome to get up and running, but wherever the use case involves sufficiently high stakes, you can be sure these applications will proliferate rapidly.
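As a rough illustration of the pattern-mining idea, here is a naive sketch that watches a stream of parsed records (as in the sketch above) and fires an alert when the error rate in a recent window spikes. The window size and threshold are invented for illustration; a production system would use a learned model rather than a fixed rule.

```python
# A naive sketch of mining a log stream for a trend and triggering a
# real-time response. Assumes records shaped like the previous snippet.
from collections import deque

WINDOW = 100           # number of recent records to consider (illustrative)
ERROR_THRESHOLD = 0.2  # alert when >20% of the window is errors (illustrative)

recent = deque(maxlen=WINDOW)

def observe(record: dict) -> bool:
    """Feed one log record; return True when an alert should fire."""
    recent.append(record["level"] == "ERROR")
    error_rate = sum(recent) / len(recent)
    return len(recent) == WINDOW and error_rate > ERROR_THRESHOLD
```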
From a data management perspective, the key principle is that we should archive log data, not throw it away. This was not the case in earlier decades, when storage costs made the practice prohibitive, but once Google showed us the way, and once the hyperscalers helped drive down the cost of storage, it became the new normal. That said, the data is not precious at the atomic level; it just needs to be available en masse to train machine learning applications.
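A minimal sketch of the archive-don't-throw-away principle follows, assuming local .log files and a hypothetical cold-storage mount; real deployments would typically target object storage, but the mechanics are the same: compress into cheap bulk storage rather than deleting outright.

```python
# A minimal sketch of archiving aged log files instead of deleting them.
# The directory paths are hypothetical, for illustration only.
import gzip
import shutil
from pathlib import Path

def archive_logs(log_dir: str, archive_dir: str) -> None:
    """Gzip every .log file into the archive directory, then remove the original."""
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    for log_file in Path(log_dir).glob("*.log"):
        target = Path(archive_dir) / (log_file.name + ".gz")
        with open(log_file, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        log_file.unlink()

archive_logs("/var/log/myapp", "/mnt/cold-storage/myapp")
```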
The data from quadrant 3 represents natural language data from two different domains. The first comes from the World Wide Web and its massive ecosystem of supporting apps. This trove of publicly shared information from every walk of life puts the collective psyche of the human race on display, making it accessible to anyone from anthropologists to marketing professionals to public sector enterprises seeking to mine it for insights. The second major domain of natural language data comes from all the internal communications threads that enable any given enterprise to develop, execute, and monitor its business and operating plans—in effect, its knowledge base.
The data in both domains is structured around events, communicated in natural language via email and collaborative applications, and in neither domain is it curated. Instead, there is simply history, tracing a never-ending flow of dialog that underpins an ever-evolving set of commitments adjusting to an ever-changing marketplace of circumstances. One would like to convert all this knowledge into a curated and consultable knowledge base, but the river never stops flowing, and sooner or later we humans get tired or bored or distracted. As a result, our curation quality begins to flatline or even degrade, and the project, although not exactly abandoned, is reluctantly sidelined.
But wait, there’s hope. Welcome to quadrant 4, the fourth data domain, the home of curated knowledge. We got our first instance of self-organizing curated knowledge at scale thanks to the work of Jimmy Wales and his colleagues at Wikipedia. Its open, collaborative model enabled an unprecedented advance in the speed and scale of knowledge curation, and for many of us, myself included, it has been a game-changer, enabling us to write authoritatively without actually being authorities ourselves. But this effort, too, is ultimately dependent upon human-powered processes, and the wear and tear of perpetual maintenance eventually takes its toll here as well.
That’s what makes ChatGPT and Generative AI such an exciting development. Technically, it is grounded in Large Language Models, which generate increasingly knowledgeable replies through mathematical manipulation of tokenized word sets. But pragmatically, what generative chat programs are actually delivering are self-organizing knowledge bases that never get tired, bored, or distracted from their self-curation efforts. Instead, they continue to get better and better over time, asymptotically approaching world-class expertise.
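To give a feel for what "mathematical manipulation of tokenized word sets" means in practice, here is a toy sketch; the vocabulary and probabilities are made up for illustration, whereas real LLMs use learned subword tokenizers, vocabularies of roughly 100,000 tokens, and billions of parameters.

```python
# A toy sketch: text is mapped to integer tokens, and generation reduces to
# repeatedly scoring candidate next tokens. All values here are invented.
vocab = {"data": 0, "is": 1, "the": 2, "new": 3, "oil": 4, "code": 5}

def tokenize(text: str) -> list[int]:
    """Map each whitespace-separated word to its integer token id."""
    return [vocab[word] for word in text.lower().split()]

tokens = tokenize("data is the new")   # -> [0, 1, 2, 3]

# A real LLM outputs a probability for every token in the vocabulary;
# generation appends the most likely (or a sampled) token and repeats.
fake_next_token_probs = {"oil": 0.85, "code": 0.10, "data": 0.05}
next_word = max(fake_next_token_probs, key=fake_next_token_probs.get)
print(next_word)  # "oil"
```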
The key to this success, it should be noted, is the quality and validity of the body of texts they consume in the process. The more expert your system purports to be, the more any “garbage in” poses a real liability. That makes data management a top priority for this quadrant. As we learned from the quality movement, you cannot “inspect in” quality after the fact. In the case of AI GPT, it is too late to edit the output. Instead, one must control the corpora of texts that are input to its training and development. Much of that input, it should be said, comes from the ongoing dialog between the prompter and the application, so keeping a log of that dialog must also be part of the overall data management strategy. But this data really is worth curating—it is your enterprise knowledge base—so investing here to secure AI GPT success is a high-ROI bet.
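As a sketch of what logging the prompter-application dialog might look like, here is a minimal Python example that appends each exchange to a JSON Lines file for later curation; the file name and record fields are assumptions, not a prescribed schema.

```python
# A minimal sketch of capturing prompter/application dialog so it can later
# be reviewed, curated, and fed back into training. Fields are illustrative.
import json
from datetime import datetime, timezone

DIALOG_LOG = "enterprise_dialog.jsonl"  # hypothetical store

def log_exchange(prompt: str, response: str, user: str) -> None:
    """Append one prompt/response pair to the dialog log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "prompt": prompt,
        "response": response,
        "reviewed": False,  # flipped to True once a curator has vetted it
    }
    with open(DIALOG_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```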
That’s what I think. What do you think?
Kognic | Accelerating Embodied AI
10 months ago: I might expand the ownership of data as introduced here well beyond IT. That is too narrow. This is particularly true for your 2nd quadrant, where we might also expand beyond "log files" as the best representation of what sits in this bucket. A specific example from a large and exploding segment is data used to train the "machine" from complex sensor fusion (LiDAR, radar, camera, et al.) for use in automated and, increasingly, autonomous mobility. These datasets are product management artifacts used to power the products used in applied AI and are rarely (currently) managed by IT. The same goes for marketing data, which more often than not mirrors the most dynamic of IT-managed data pools but is generated, captured, and parsed in separate, shadow infrastructures. It's a brave new world. Better data = better AI.
Partner Alliance Marketing Operations at Data Dynamics
1 year ago: Great insights, Geoffrey Moore! Managing log data for real-time performance improvement, leveraging enterprise communications for knowledge bases, and the transformative potential of ChatGPT are key takeaways. Exciting times ahead! #DataManagement #AI #ChatGPT
mTuitive, Inc. Founder, CFO & Director
1 year ago: Once again, your ability to simplify an extremely complex challenge is amazing. Thank you, for at least the 100th time.
Senior Technical Business Strategy Analyst | Strategic Insights, Emerging Technologies, Market Research
1 year ago: There is still the potential bias issue... willful (deliberate distortion of input to the model) and "accidental" (misplaced decimal place, wrong units of measure)... is there such a thing as "machine unlearning" in the event of discovery of bad input?
The Product Guy ? Championing "Purpose Driven Innovation" ? 3X Top LinkedIn Voice ? Founding Partner @ Venturis Inc with the stated mission of "Bridging The Valleys" ? Global Citizen
1 year ago: Great insights, as always!! I could not agree more that regulating the output is too late, so the focus has to be on the input. As you note, for enterprises, controlling the input is a less onerous problem than controlling it in the public domain. I do think that the problem of scale will present itself here as well, and once an enterprise crosses a threshold, controlling input will become more challenging. Perhaps a playbook keyed to the maturity cycle/scale of an enterprise may be needed; however, I fully agree that there really is no excuse not to lean in here.