Reducing Friction with DataOps
Paul Lewis
Chief Technology Officer @ Pythian | CIO & CTO | Technology Leadership; Data, Cloud & AI Strategic Impact
In most cases, businesses are intolerant of inefficiency of any kind, especially when it takes the form of people, process, or technology bottlenecks. Businesses expect their day-to-day operations, their hierarchies and organizational structures, their supply chains, their IT operations, etc. to run as smoothly and efficiently – to be as frictionless – as possible. These are reasonable and laudable goals.
On the other hand, we can easily imagine scenarios in which friction functions as a kind of necessary, essential, and – in the proper context – advantageous constraint for certain purposes.
At a basic level, friction is what makes Usain Bolt go. It’s what enables Simone Biles to stick her spectacular landings. Most of us can easily imagine how friction operates as a positive factor in the design of mechanical parts, too: tires, brakes, transmissions, flywheels, belt-driven pulleys, etc., all depend on friction to work. In the same way, friction is a key force in fluid mechanics: it’s friction, for example, that helps govern the flow of liquid – water, oil, etc. – in a closed conduit, such as a pipe.
Like it or not, friction fulfills an important, positive function in any system.
*
This is as good a segue as any to the crux of this post, which has to do with data management and data governance. The point I want to make is two-fold. First, friction is actually a useful concept in both disciplines. Governance, for example, is the inevitable product of friction between two or more opposing forces or entities – e.g., divergent values, priorities, purposes, etc. – that come into conflict with one another. In a sense, governance is friction – albeit friction of a necessary and essential kind.
Second, friction is a positive force – a feature, not a bug – because it makes certain things hard to change. If this point seems counter-intuitive, think, for a moment, about your high-value investments. Consider your management information systems (MIS) infrastructure, for example. Some companies pour millions (or tens of millions) of dollars into designing and maintaining their MIS infrastructures. As much as companies like to complain about the state of their data mart, data cube, and ETL assets, the fact remains that managers and directors, along with senior and C-level executives, depend on operational reports, dashboards, KPIs, scorecards, etc. to support day-to-day business decision making. As a result, IT doesn’t introduce new data sources, develop new KPIs, devise new business rules, or deploy new dashboards without thoroughly testing and troubleshooting them. This is one reason it takes so long to provision new data or analytics in a conventional data warehouse architecture. Another reason is that – until recently – IT just didn’t have other options. Now it does. I’ll say more about this below.
If it seems as if IT’s priorities are at odds with the self-service ethic, it’s because they are. This is less of a problem than most of us realize, however! Think about it: we’re used to framing the relationship between data management/data governance and self-service as antagonistic – i.e., as a zero-sum collision of top-down authority with bottom-up insurgency. Thanks to technological and economic disruption, it’s possible – and more helpful – to see it as a both-and relationship, not a zero-sum one.
**
First, some obligatory background. In the main, self-service tools were a response to users’ frustration with unnecessary and inessential forms of IT friction, such as overly restrictive data management and data governance controls. Self-service forced IT to reassess both disciplines: what was necessary and essential in data management and data governance – e.g., reusable, governed data transformation routines that produce consistent, well-managed data – is less important in other contexts. Consistent, conformed data is less necessary for self-service in general, as well as (to cite a few examples) for analytic discovery, data science, data mining, and machine learning. By contrast, access to data – preferably raw, non-conformed data – is essential for these and similar disciplines.
Fast-forward to the present: IT is under pressure to incorporate data from more and varied sources into the operational reports, dashboards, scorecards, etc. that power critical day-to-day business decision making. There’s new pressure to enhance these MIS assets with new KPIs and new (non-SQL-based) analytics. There’s pressure to use these assets to accommodate new practices – such as self-service discovery, data science, and machine learning research (a combination of both) – and as part of new development paradigms for which they’re fundamentally ill-suited.
That’s the crux of the problem. IT is under pressure to accommodate an entirely different data management paradigm: DataOps. Traditional data management presupposed a centralized site for data access and data processing. This was the data warehouse. In the DataOps paradigm, the data warehouse is one among many data sources and data processing engines. Data isn’t just created in a diversity of contexts, it is accessed, moved, and processed in a diversity of contexts, too.
DataOps assumes that data access and data processing are distributed: e.g., that data may originate at the enterprise edge and must be processed before being vectored to its many and varied destinations. That data engineers and data scientists will build multi-stage pipelines which vector data to and from multiple data sources and/or processing engines – sometimes concurrently. That data will stream continuously from IoT and telemetry signalers, on- and off-premises RESTful services, and the like.
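To make that concrete, here is a minimal sketch in plain Python – an illustration, not a prescription, with all stage, field, and sink names hypothetical – of the kind of multi-stage, fan-out pipeline described above: records stream in from an edge source, pass through a processing stage, and are vectored to more than one destination.

```python
import json
from typing import Iterable, Iterator

def read_edge_stream(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: parse telemetry records as they arrive (e.g., from an IoT gateway or REST feed)."""
    for line in raw_lines:
        yield json.loads(line)

def enrich(records: Iterator[dict]) -> Iterator[dict]:
    """Stage 2: light processing near the source, before the data is vectored onward."""
    for rec in records:
        rec["temp_f"] = rec["temp_c"] * 9 / 5 + 32  # hypothetical derived field
        yield rec

def fan_out(records: Iterator[dict], sinks) -> None:
    """Stage 3: route each record to multiple destinations (here, sequentially; in practice, often concurrently)."""
    for rec in records:
        for sink in sinks:
            sink(rec)

# Hypothetical destinations: a raw landing zone and an analytics feed.
landing_zone, analytics_feed = [], []
raw = ['{"device": "pump-7", "temp_c": 41.5}']
fan_out(enrich(read_edge_stream(raw)), [landing_zone.append, analytics_feed.append])
```

In a production pipeline, the sinks would be object storage, a message queue, a processing engine, and so on; the point is the shape of the flow, not the particular destinations.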
Retrofitting an MIS infrastructure for DataOps involves, at a minimum, introducing radical changes into well-understood and strictly governed data integration processes. And that’s just the beginning. Legacy data management was conceived with strictly structured – i.e., relational – data in mind. Legacy data integration was largely SQL-driven. It operated almost exclusively against databases or other SQL-compliant data sources – especially OLTP systems. DataOps expects data type diversity, too. OLTP and other systems are still critical sources of data – and intelligence – to be sure. But these systems, even OLTP systems, aren’t the only game in town. And they aren’t necessarily confined to the enterprise core: increasingly, OLTP applications, too, are shifting to the cloud.
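As an illustration of that diversity (a minimal sketch, with hypothetical field names), consider how differently a fixed-schema OLTP row and a semi-structured IoT payload arrive, and what it takes to land both in one common record shape:

```python
import json

def from_oltp_row(row: tuple) -> dict:
    """Strictly structured source: column order and types are fixed by the relational schema."""
    order_id, customer_id, amount = row
    return {"source": "oltp", "order_id": order_id,
            "customer_id": customer_id, "amount": amount}

def from_iot_payload(payload: str) -> dict:
    """Semi-structured source: fields may be missing or nested, so we read defensively."""
    doc = json.loads(payload)
    return {"source": "iot",
            "device": doc.get("device", "unknown"),
            "vibration": doc.get("metrics", {}).get("vibration")}

# Both shapes land in a common downstream format.
records = [
    from_oltp_row((1001, "C-42", 199.99)),
    from_iot_payload('{"device": "press-3", "metrics": {"vibration": 0.07}}'),
]
```

A SQL-driven legacy integration stack handles the first case superbly; it is the second case – and the dozens of variants behind it – that it was never designed for.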
Unlike MIS, DataOps presupposes a federated model. You can’t square this circle by retrofitting it.
***
So how do you square this circle? How do you accommodate a set of demands and priorities that seem to be at cross purposes with one another? The solution as I see it is actually pretty simple: you cut the Gordian Knot. You don’t even try to bend your existing MIS infrastructure to support fundamentally new (and, in many cases, radically incompatible) DataOps-like practices and use cases.
The sensible thing to do is to draw a square around your circle – i.e., to create an environment just for DataOps. You can exploit disruptive technologies – e.g., public and private cloud services; converged infrastructure; distributed object storage; etc. – to build a parallel DataOps environment that supports self-service and similar use cases. Even better, your parallel environment would consume exactly the same data that flows into your governed MIS systems – extracted from the same OLTP sources, with the same transformations applied. But it won’t impact – or, above all, change – your MIS infrastructure.
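A minimal sketch of that idea (hypothetical names throughout): the data is extracted once from the shared OLTP source; the governed transformation feeds MIS staging exactly as before, while the parallel environment receives the same feed – here keeping both the conformed records and, in keeping with the earlier point about discovery, a raw copy:

```python
def conform(rec: dict) -> dict:
    """Stand-in for the governed transformation the MIS pipeline already applies."""
    return {"customer_id": rec["cust"].strip().upper(), "revenue": round(rec["rev"], 2)}

def load(extract, mis_staging, dataops_zone) -> None:
    """One extract, two destinations: the MIS side of the pipeline is unchanged by the new consumer."""
    for rec in extract:
        conformed = conform(rec)
        mis_staging.append(conformed)  # governed MIS side, exactly as before
        dataops_zone.append({"raw": dict(rec), "conformed": conformed})  # parallel side

mis_staging, dataops_zone = [], []
load([{"cust": " c-42 ", "rev": 199.987}], mis_staging, dataops_zone)
```

The design choice that matters is the direction of dependency: the DataOps environment depends on the existing extract, but nothing in the MIS pipeline depends on – or is altered by – the DataOps environment.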
Best of all, you can slipstream new sources of data and new types of analytics from the parallel DataOps environment into your bread-and-butter MIS infrastructure once they’re proven to be useful and mature. Call it whatever you like: a data lake, a data hub, a data refinery, etc.; the important thing is that your DataOps environment is ideally suited for self-service discovery, data science, ML, artificial intelligence (AI), and other similar practices. Like it or not, DataOps just requires a fundamentally different kind of friction – i.e., different constraints and controls with respect to how data is sourced, how rapidly it is provisioned, how it is engineered and/or changed, how it is used, etc. – than core MIS.
To sum up, there’s essential friction in both environments. It’s just different kinds of friction. Some controls (e.g., controls that govern the use of personally identifiable information, or PII) will be consistent across both environments. Even so, the parallel DataOps environment might make use of automated mechanisms (such as data masking/hashing or differential privacy algorithms) to enable data scientists and ML engineers to use data that contains PII – without exposing the PII in a way that is inconsistent with governance goals or controls. Some contractual and regulatory requirements will also be consistent across both environments. The friction doesn’t go away. But it is appropriate to the environment and to the practices and use cases it is supposed to support. Moreover, friction dictates the pace at which MIS and DataOps converge: the parallel DataOps environment lets organizations diversify the insights they produce, as well as carefully and deliberately incorporate this diversity into their core MIS processes. In this way, for example, the organization can introduce predictive and prescriptive analytics into its fragile (and tightly controlled) MIS infrastructure.
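To illustrate the masking/hashing mechanism mentioned above (differential privacy is a separate, more involved technique), here is a minimal sketch using a keyed hash – HMAC-SHA-256 – so that data scientists can still join and count on an identifier without ever seeing the raw PII. The field names and key handling are assumptions for illustration only:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical; in practice, held in a secrets manager outside the environment
PII_FIELDS = {"email", "ssn"}

def mask_pii(rec: dict) -> dict:
    """Replace PII values with a keyed hash: joins and distinct counts still work,
    but the original identifier cannot be read back out of the masked value."""
    masked = {}
    for field, value in rec.items():
        if field in PII_FIELDS:
            masked[field] = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()
        else:
            masked[field] = value
    return masked

print(mask_pii({"email": "a@example.com", "ssn": "123-45-6789", "region": "EMEA"}))
```

Because the same key yields the same digest for the same input, the masked field remains a stable join key across datasets – which is usually all the data scientist actually needed from the PII in the first place.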
We’re transitioning to a model in which we manage, secure, and control data in a federated way. At Hitachi Vantara, we like to call this “Edge to Outcomes.” Let friction and the feedback it generates help you navigate this shift. Let friction help you determine the pace at which you shift your MIS systems, processes, workloads, etc. to other contexts – be it the on-premises private cloud, the virtual private cloud, the public cloud, etc. – or, conversely, the pace at which they get enriched with data or analytics that originate in the cloud. Let friction help you determine the pace at which you use IoT data from the edge to enrich operational reports, KPIs, dashboards, and other business-critical MIS assets. Let friction help you determine the pace at which you make data and analytics available to suppliers, partners, or clients – or, conversely, consume data and analytics from these external sources. Let friction help you determine which data you need to keep in place – i.e., in an on-premises database or in inexpensive cloud object storage – where it can be processed in situ. Let friction be your friend, not your foe.
I think you get at what people are most apt to miss about data work and data management: namely, that data work isn’t in primary (binary) opposition to data management. Neither perspective (self-service/DataOps or data management) is right or wrong, good or bad, valuable or worthless. In other words, the problem, as I see it – and as you articulate it – isn’t primarily one of balancing the needs and priorities of one constituency (e.g., data governance) against those of another (e.g., data science). It isn’t a contest, let alone a zero-sum one. Both constituencies, in spite of their divergent purposes and priorities, can be simultaneously valid and vindicated. It’s likewise possible for both constituencies to (in so many words) do their respective things. You capture this nicely.

The challenge for people who advocate (and agitate) for change comes from the absolutists in either camp. The pragmatic center should hold. Data management can have its ordered, governed, regulated data integration (DI) processes, its consistent (audited, transparent, regular) reports, its validated (audited, transparent, regular) analytics, and so on. But data scientists and other practitioners who work with data can and should have access to the data they need, more or less when they need it. And these needs are not necessarily opposed to one another. As you note, a data scientist doesn’t necessarily need PII as such – i.e., in its raw, unmasked, or unexpurgated form. In DI, for example, we’ve long used techniques to make sensitive data available to consumers who need it. It’s likewise easier than ever to incorporate these techniques into data integration/engineering processes – increasingly, in fact, they’re built into DI tools, into databases themselves, and into analytical front-end tools; at a basic level, they can easily be incorporated into a data flow pipeline. (This, too, is becoming common.)

To the degree that it’s advantageous to enhance decision-making with new analytics, this should be done. “Friction,” as you so aptly describe it, is as good a guide as any in this regard. But don’t forget the function of grease, oil, etc.! Sometimes things have to be accelerated, whether we’re quite ready for them or not. We’ve now got pragmatic technologies – and, in the emerging DataOps paradigm, a sort of pragmatic methodology – that make this practicable.
CTO | CIO | Advisor | AI & Innovation Leadership | Cybersecurity & Digital Transformation | Growth & Optimization Strategist
Interesting article indeed. "Retrofitting an MIS infrastructure for DataOps ..." is cool. Bottom line, any company that can master a smooth data journey, knocking down friction points at silos or systems, is going to do well. The customer experience from the order-to-cash journey will become much better. Translated: better business and customer satisfaction all around.
Enterprise Architect at Sirius Computer Solutions - a CDW Company
Well said, Paul. As data flows through an organization, those who are successfully transforming quickly do not have the time to retrofit their existing MIS environment to leverage the external people, process, and technology necessary for success. DataOps is a practice that must be deployed in parallel, integrating with existing assets, not building yet another silo of data or technology. This is in large part because existing environments were not built to handle the new data types, volumes, and speeds at which we now want to consume data.