Delivering New Business Success Requires a Unified Data Space
Data drives Digital Transformation, M. Stadtmueller

Nothing in business is passive. Every day that a business operates is driven by a strategy. Even having no strategy is a strategy. Whatever that strategy is, it drives growth. Whether it is a food truck business deciding on food type, location, or truck type, or a global energy company choosing sources, means, and markets, strategy, choices, and decisions take place, and their results drive growth (or the lack thereof).

With digital transformation, the presence of strategy in business growth does not change. Growth is always about choosing new or better products and services, new or better customer interactions, or new or better ways of doing business. A strategy leveraging Digital Transformation is merely the newest, fastest, least expensive way to deliver growth. Jeanne Ross at MIT/Sloan gives a strong definition of Digital Strategy as an “integrated business strategy inspired by the capabilities of powerful readily accessible technologies like social, mobile, analytics, cloud, and internet of things and responsive to constantly changing market conditions.” (link)

Social and mobile are usually focused on new or better customer interactions, analytics and cloud on new or better ways of doing business, and the Internet of Things on innovative anticipatory capabilities, based on pervasive connectivity and data gathering, that deliver new or better products and services. But the root of all of this is data. Social and mobile customer interactions rely on data flowing to and from social and mobile platforms to drive new and better interactions. Cloud and analytics drive new or better ways of doing business from data that is provided or enabled. And the Internet of Things drives new products and services from the data that IoT provides and presents. Moreover, and maybe more importantly, social, mobile, analytics, cloud, and IoT together create more expressive data at unbounded volumes. Data is both the root and a result.

Getting a handle on data in business is not new.

However, harnessing data for digital transformation is new. Databases, Data Warehouses, and more recently Data Lakes have made their mark and become established business capabilities. Meanwhile, recent advances in Artificial Intelligence have become a key capability for digitally inspired business initiatives. In AI, models are trained on datasets that first have to be created and fused from source data. Databases, Data Warehouses, and Data Lakes are not specifically designed for, or well suited to, training models from such created “virtual” datasets.

Indeed, while AI continues to capture the imagination of the public and rightly demands the attention of enterprises and businesses of all types, a recent KDNuggets article sums up the current situation nicely: “Academic papers are almost entirely focused on new and improved models, with datasets usually chosen from a small set of public archives. Everyone I know who uses deep learning as part of an actual application spends most of their time worrying about the training data instead.” (https://www.kdnuggets.com/2018/06/improve-training-data-how.html) The article includes a picture that humorously contrasts the challenge with data in business (in this case Tesla) with that in academia. (Side note: I hear from many academics that data is no less a challenge there as well.)

Lisha Li from a presentation by Andrej Karpathy

Challenges in business with data

The first challenge with business data is the oft-mentioned data silo. Data silos are viewed as a negative result of outdated systems and processes, or of the creation of many separate data stores for individual initiatives, mergers, or acquisitions. While this is true, it is only part of the story. Data sometimes has to be put into silos to meet security and compliance requirements. For example, there might be an obvious benefit to merging customer purchase history with external data sources. However, the price a customer paid for products and services can be very sensitive, and therefore there are good business reasons to keep it in the silo. So, while disparate, costly, and outdated silos are an issue, they are not the only issue. Sensitive, secure, and compliant data has to be siloed because the systems cannot provide role-based or authorization-based access to individual data elements once they are merged.

The second challenge is that missing and messy data is a de facto part of business data. Clean datasets are a prerequisite for modern AI-based models, but most business data is not clean. And the data science techniques used to deal with missing and messy data can have negative consequences, because those null values (“NAs”) and errors often carry meaning in themselves.
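One common way to avoid throwing away the meaning of a missing value is to fill the gap but keep a flag recording that it was missing. The following is a minimal sketch of that idea in Python, not a description of how Lucd handles missing data; the `impute_with_indicator` helper, the `income` field, and the sample records are all hypothetical.

```python
from statistics import median


def impute_with_indicator(rows, field):
    """Fill missing values of `field` with the median of the observed values,
    and add a '<field>_was_missing' flag so the fact of missingness survives."""
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    out = []
    for r in rows:
        missing = r.get(field) is None
        new = dict(r)  # copy so the source records are left untouched
        new[field] = fill if missing else r[field]
        new[f"{field}_was_missing"] = missing
        out.append(new)
    return out


# Hypothetical records: a missing income might mean "declined to answer",
# which can itself be a useful signal for a model.
customers = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 3, "income": 61000},
]
cleaned = impute_with_indicator(customers, "income")
```

A model trained on `cleaned` can still learn from the `income_was_missing` column, whereas a plain fill or a dropped row would erase that information.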

The third challenge is the scaling and cost of data initiatives. The systems used to capture, secure, and harness data have traditionally been very expensive, and efforts to merge or migrate data are almost always large, long, and costly. Very often the effort to capture, secure, and harness data becomes intractable in business.

Why Data Lakes, Databases, Data Warehouses do not address the needs of digital transformation

It is common in business discussions to dwell on the limitations or license costs of databases, the investments required for Data Warehouses, or how Data Lakes often seem to become data swamps. The reality is that all these platforms are very good at the purposes they serve; however, none is well suited to the goals of Digital Transformation and leveraging AI. For a business to deliver on a strategy leveraging Digital Transformation, data must be used to detect, classify, segment, predict, or recommend something that can then be applied to existing or new business applications and processes.

In AI, this is called “creating a model”, “training the model on data”, and finally “serving the trained model”. In order to do this, data must be captured, secured, and harnessed into a “dataset” that a model can be trained on. Wikipedia describes attributes of a dataset here: https://en.wikipedia.org/wiki/Data_set. “Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question.” In general, AI requires a dataset of “well organized” data.

The problem with this in business is two-fold:

First, where well formed data already exists, most businesses have already leveraged business intelligence systems. While these systems may be costly and may not deliver the modeling accuracy of an AI platform, they are already in use and serving a purpose.

Second, and more importantly, the opportunity for Digital Transformation lies in new relationships that can be gleaned from seemingly disparate data. For example, geographic weather data can be related to retail store sales data by location, as well as to current traffic patterns and conditions, to better predict inventory requirements for retail stores (or, for that matter, the location or type of stores). This data can and does come from multiple data locations, it comes in different modalities, it is often incomplete and not well formed, and it often has unique security and handling requirements that vary across the collected data.

The challenge for businesses is that it is impractical or too costly to do this data ingestion and transformation in databases or data warehouses and not doable in a data lake.

A Dataspace is required for Digital Transformation empowered with AI

A dataspace is required to overcome these challenges. Dataspaces allow for abstractions in data that can reduce effort and overcome data integration challenges. 

To address the aforementioned challenges, Lucd has developed our Unified Data Space leveraging Accumulo NoSQL. 

The Lucd Unified Data Space (UDS) is a self-describing object store with fine-grained security. It balances efficiency, flexibility, and readability when storing data objects. Data access is efficient because a single access is all that is required to retrieve any object. Decoding object values and metadata information is simple and straightforward. The data space leverages Accumulo and is flexible because there are no restrictions on what data can be stored. And it is readable because data is stored in text form wherever possible; only truly binary data is stored in binary form. This allows developers, analysts, and administrators to browse and even insert data into the data space using existing command-line tools.

In the Unified Data Space, one row represents one data object. An object may represent a data file, a database row, or any other entity composed of attributes. The attribute is the fundamental unit of data and storage in UDS. Attributes are grouped into objects based on the row ID. Attributes are primitive data types which can be grouped into complex structures internally and then aggregated to form a complete object representation.

Names give attribute values meaning. In UDS, attribute names represent the hierarchical structure of the data model using dot (.) notation. Examples in this document will use the following object definition:

Rather than storing an entire row as a single element, individual attributes are stored as cells. Reads from and writes to Accumulo are done one cell at a time.

Cells are stored as key-value pairs, where the key and value are comprised of the following elements:

The Unified Data Space stores objects in a table called dataspace. The dataspace table is the primary object store. Other tables contain analytic output and indexes that all refer to objects in the dataspace table.
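To make the object-to-cell idea concrete, here is a minimal sketch in Python of flattening a nested object into one cell per attribute, with dot-notation names grouped under a row ID. It is an illustration under stated assumptions, not the UDS implementation: the `Cell` fields, the `to_cells` helper, the sample customer object, and the `FINANCE` label are all hypothetical, and the way UDS actually lays attributes out in Accumulo keys is not reproduced here.

```python
from typing import Any, Dict, Iterator, List, NamedTuple, Tuple


class Cell(NamedTuple):
    row_id: str      # identifies the object this attribute belongs to
    name: str        # attribute name in dot notation
    visibility: str  # visibility mark guarding this one attribute
    value: str       # stored as text wherever possible


def flatten(obj: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dot.notation.name, primitive value) pairs for a nested dict."""
    for key, val in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            yield from flatten(val, name)
        else:
            yield name, val


def to_cells(row_id: str, obj: Dict[str, Any],
             visibility_for: Dict[str, str]) -> List[Cell]:
    """Turn one object into one cell per attribute (empty mark by default)."""
    return [
        Cell(row_id, name, visibility_for.get(name, ""), str(val))
        for name, val in flatten(obj)
    ]


# Hypothetical object: only the price attribute carries a restrictive mark.
customer = {
    "name": {"first": "Ada", "last": "Lovelace"},
    "purchase": {"item": "widget", "price": 19.99},
}
cells = to_cells("customer-0001", customer, {"purchase.price": "FINANCE"})
```

Because every attribute is its own cell, a sensitive value such as `purchase.price` can carry its own access rule while the rest of the object stays broadly readable.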

The Visibility mark contains the rules governing which users have access to each attribute. Boolean operators are used to specify combinations of access controls. For example, assume that a system has access controls labeled A, B, C and D. The visibility mark “(A&B)|D” says that a user must have A and B access privileges or the D access privilege in order to access this attribute. The list of access control marks, the rules governing access to attributes and the method to map users to authorization lists all depend on the specific requirements of the customer.
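The rule for “(A&B)|D” can be checked mechanically. The sketch below is a small, self-contained evaluator written for illustration; it is not Accumulo's own parser, and the function name `is_visible`, the choice to give & precedence over | when no parentheses are present, and the treatment of an empty mark as unrestricted are assumptions made here.

```python
import re

_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a visibility mark into labels, operators, and parentheses."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = _TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {expr[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def is_visible(expr, auths):
    """Return True if a user holding the labels in `auths` satisfies `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # assumption: an empty mark means no restriction
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse the right side, then combine
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths  # a label is satisfied if the user holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

With this sketch, `is_visible("(A&B)|D", {"A", "B"})` and `is_visible("(A&B)|D", {"D"})` both hold, while a user with only `{"A"}` is refused.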

The Unified Data Space empowers a Virtual Dataset

When delivering business value from data, available data is searched and that search yields a “result set”. But that result set may span many databases, data stores, data types, data modalities, and formats, and these require data transformation before they can be used for models. How to perform these transformations, and where and how to store the results, can be challenging from a compute, storage, and security point of view. With a Unified Data Space like the one deployed in Lucd, and leveraging the Lucd EDA (exploratory data analysis) capability, the gap between “the messy data required for valuable digital transformation” and “the well formed data required for AI” is overcome through the creation of a “Virtual Dataset”.

The Virtual Dataset empowered by the Lucd Unified Data Space becomes the dynamic glue to rapidly bring data to models. An existing or new model may require data from many different datasets. When envisioning the data needed to feed the model, it is easy in the Unified Data Space to search, identify, and tag that data, and then join it into the Virtual Dataset needed for that unique model.

The Virtual Dataset (VDS) can then either be saved virtually (as references to locations in the UDS) or materialized as a separate dataset. This Virtual Dataset can then be readily used to train models. But, more importantly, as models change or new models are incorporated, that VDS can be recalled, and the corresponding transformations can also be recalled and edited.
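The distinction between saving a dataset virtually and materializing it can be sketched in a few lines of Python. This is an illustrative sketch only: the `VirtualDataset` class, the in-memory `store` dictionary standing in for the data space, the row IDs, and the `add_temp_c` transformation are all hypothetical and do not represent Lucd's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class VirtualDataset:
    """References to stored objects plus the transformations to apply."""
    name: str
    row_ids: List[str]  # references into the object store, not copies
    transforms: List[Callable[[Row], Row]] = field(default_factory=list)

    def materialize(self, store: Dict[str, Row]) -> List[Row]:
        """Build a separate, concrete dataset by resolving the references."""
        rows = []
        for rid in self.row_ids:
            row = dict(store[rid])  # copy so the stored object is untouched
            for transform in self.transforms:
                row = transform(row)
            rows.append(row)
        return rows


# Hypothetical stand-in for the data space: sales fused with weather.
store = {
    "store-17/2018-06-01": {"sales": 1200.0, "temp_f": 86.0},
    "store-17/2018-06-02": {"sales": 900.0, "temp_f": 68.0},
    "store-42/2018-06-01": {"sales": 400.0, "temp_f": 50.0},
}


def add_temp_c(row: Row) -> Row:
    """Example transformation: derive Celsius from Fahrenheit."""
    return {**row, "temp_c": round((row["temp_f"] - 32) * 5 / 9, 1)}


vds = VirtualDataset(
    name="store-17-weather",
    row_ids=["store-17/2018-06-01", "store-17/2018-06-02"],
    transforms=[add_temp_c],
)
training_rows = vds.materialize(store)
```

Because the virtual form holds only references and a list of transformations, recalling it later and editing `vds.transforms` yields a new training set without copying or re-ingesting the underlying data.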

As business requirements and model requirements change, new or different data is required, and new Virtual Datasets need to be created rapidly. The Lucd Unified Data Space empowers this capability and therefore provides the agility needed to put data and models to work in time to meet business opportunities.

Conclusion

Jeff Bezos has stated “The only sustainable advantage you can have over others is agility, that’s it. Because nothing else is sustainable, everything else you create, somebody else will replicate.”

The fastest, most efficient, and most cost-effective way to create is through digital transformation that is driven by data. But the valuable data that drives differentiation is messy, while the models needed to transform the business require well-organized data. The Lucd Unified Data Space crosses this chasm and allows businesses to implement Enterprise AI.
