Including Dark Data as part of your DataOps Strategy
Think about it… If just 5 percent of your employees did not return to work tomorrow, how much information and knowledge would you lose? How much of your critical business information is stored only in your employees' heads, and how much is part of an enterprise-wide catalog?
Effective communication is a key success factor in any organization, and ambiguity and misunderstanding can have serious consequences. Let's take an example. The query "Q3 Revenue for NA from gold partners" seems relatively clear, but for the people responsible for generating those numbers and for the stakeholders making decisions based on them, it is ambiguous. What date range is Q3 in my company? Which regions make up NA? Who are the gold partners? Which source has the most recent revenue data? In other words, data ambiguity creates bottlenecks for democratizing data, and without democratization it is hard to achieve agility and speed in business.
Need for a Business Glossary
Based on feedback from numerous IBM engagements, we found that:
- knowledge workers spent more than 25% of their time searching for information to do their jobs
- half of the information returned in search results was not useful
- more than 60% said that internal company information and terminology could add significant value to their productivity
These three findings highlight the value of a business glossary, at the company-wide and departmental levels, that provides consistent business definitions for knowledge workers. It is estimated that only one fifth of data is publicly searchable; the rest resides behind corporate firewalls. And apart from the structured and semi-structured data belonging to "systems of record" and "systems of engagement", most of that content is not being leveraged for insights.
What is Dark Data?
Dark data is defined as the information assets organizations collect, process, and store during regular business activities but generally fail to use for other purposes such as analytics, business relationships, and direct monetization.
Figure 1: Snippet of a regulatory document showing the complex nature of the information stored
I would categorize this so-called dark data into:
1. Publicly available content such as government standards and regulations, which are cumbersome to leverage in your business processes
2. Internally produced information assets such as DOUs (Documents of Understanding), MOUs (Memorandums of Understanding), contracts, licenses, and intellectual property materials
3. Logical data models
The goal is to convert this dark data into a well-formed taxonomy of knowledge and ingest it into an enterprise-wide catalog, so that it becomes easily searchable, relates to business processes, and helps solve complex business problems.
AI-powered Glossary Generation on Cloud Pak for Data
Cloud Pak for Data is an AI-powered DataOps platform. It helps provide knowledge workers with the tools and technology to create an information architecture that can be used to infuse AI capabilities into their business processes. Clients can use Cloud Pak for Data to build governed data lakes using IBM catalog and data virtualization technologies. A first step in organizing data from disparate sources could be for a data steward to initiate a process in which machine learning discovers metadata, classifies it, and assigns appropriate business terms to make data assets meaningful. For example, AI can figure out that the column "add1" holds street addresses and assign it the term "US street address". However, many clients do not have a rich set of business glossaries for their department or the domain they support, and creating a glossary is a time-consuming task requiring human expertise that may be hard to find. This is a classic problem where AI comes in handy, and we introduced two new automated techniques to generate glossaries that can be useful to the business.
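To make that concrete, here is a tiny, hypothetical sketch of rule-based data-class detection in Python. The real platform uses trained machine-learning classifiers; the patterns, term names, and sample values below are my own illustrative assumptions, not Cloud Pak for Data APIs.

```python
# Hypothetical sketch: assign a business term to a column by testing
# sampled values against data-class patterns. Illustrative only; the
# platform itself learns these classifications with ML.
import re

DATA_CLASSES = {
    "US street address": re.compile(
        r"^\d+\s+\w+(\s+\w+)*\s+(st|street|ave|avenue|rd|road)\.?$", re.I),
    "US ZIP code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify_column(sample_values):
    """Return a business term if most sampled values match its pattern."""
    for term, pattern in DATA_CLASSES.items():
        hits = sum(bool(pattern.match(v.strip())) for v in sample_values)
        if hits / len(sample_values) >= 0.8:  # simple confidence threshold
            return term
    return None

# The "add1" column from the example above (made-up sample values):
print(classify_column(["120 Main St", "77 Ocean Ave", "9 Elm Road"]))
# -> US street address
```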
Technique 1: Generate a glossary from a PDF
A glossary consists of important terms, their descriptions, and control statements (policies and rules), categorized by subjects that convey the themes of the documents. Take the example of a regulatory document for the CCPA, the California Consumer Privacy Act. It contains paragraphs that reference other articles and sections on different pages, making it difficult for humans to parse and interpret. Some regulations, such as CECL (Current Expected Credit Loss), do not define important concepts but rather explain them using accounting tables. FDA regulations use tables to store important data. What makes the process more challenging is the variation of definitions from regulation to regulation. For example, the definition of the term "third party" can vary as shown below.
Figure 2: Definition of the same term varies from document to document
With natural language processing (NLP) technology, machines have become better at extracting key concepts while maintaining the context of the document. Optical Character Recognition (OCR) technology helps deal with multi-format documents, including those with tables. We use BERT (Bidirectional Encoder Representations from Transformers), which applies an attention mechanism (one that learns contextual relations between words or parts of words) to language modeling. A bidirectionally trained language model can develop a deeper sense of language context and flow than single-direction language models.
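As a small illustration of what bidirectional context buys us, the sketch below pulls contextual vectors for the same word from a standard pre-trained BERT model via the Hugging Face transformers library. The model name and sentences are my own assumptions for demonstration, not part of the Cloud Pak for Data pipeline.

```python
# Sketch: the same word ("bank") receives different contextual vectors
# from BERT depending on the sentence around it. Assumes the Hugging
# Face transformers and torch packages; illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The bank approved the loan application.",  # financial sense
    "They had a picnic on the river bank.",     # geographic sense
]
bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, num_tokens, 768)
    position = inputs.input_ids[0].tolist().index(bank_id)
    vectors.append(hidden[0, position])

# A similarity well below 1.0 shows context changed the representation.
similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"similarity between the two 'bank' vectors: {similarity.item():.3f}")
```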
Figure 3: Multi-step process of extracting knowledge from dark data
Other key innovations in this process include rule extraction, which is essential for actionable outcomes because most of these documents hold a set of policies and rules that drive governance or compliance workflows. Topic modeling is then applied to group the terms, descriptions, and rules into logical categories of relevant topics. Applying LDA (Latent Dirichlet Allocation) produces a list of topics with a weight for each word in a topic, and the top "k" words in each topic are used to represent it. The number of topics generated and the number of words representing each topic (k) depend on the document's vocabulary size. The next step is to extract context-based word embeddings using BERT and choose the word closest to the centroid of the topic as the topic label. The result is a set of neatly extracted terms, descriptions, and rules, as shown below.
Figure 4: Cloud Pak for Data workflow to build knowledge from dark data
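For readers who want to see the mechanics, here is a simplified sketch of the topic-labeling step under stated assumptions: scikit-learn's LatentDirichletAllocation for the topic model and the sentence-transformers package for BERT-family embeddings, with each top word embedded in isolation as a simplification. The document snippets, model name, and parameters are illustrative, not the production pipeline.

```python
# Sketch: label each LDA topic with the top-k word whose embedding is
# closest to the centroid of the topic's word embeddings. Assumes
# scikit-learn and sentence-transformers; everything is illustrative.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

documents = [
    "A consumer may request deletion of personal information collected.",
    "The business shall disclose categories of personal information sold.",
    "Expected credit losses are estimated over the contractual term.",
    "The allowance for credit losses reflects historical loss experience.",
]

# Step 1: LDA yields topics as weighted distributions over words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

# Step 2: embed each topic's top-k words and pick the word closest to
# the centroid of those embeddings as the topic label.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
k = 5
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:k]]
    embeddings = encoder.encode(top_words)        # shape (k, dim)
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    label = top_words[int(distances.argmin())]
    print(f"topic {topic_idx}: label={label!r}, top words={top_words}")
```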
Technique 2: Generate a glossary from logical data models
Ideally, logical data models, used to document the information architecture of an enterprise, represent business constructs with textual nomenclature and map naturally to a glossary of terms. In practice, however, this is not the case in many enterprises: most have data assets made up of piles of tables and columns that represent purely technical artifacts. This technique understands the value hidden in those assets and allows data stewards to discover the metadata as an on-ramp to building a governed data lake. An AI model employs a two-step process to generate a useful business glossary from the data model:
1. Automatically build an abbreviation dictionary based on:
a) Column description
Column name: TRANS_RTN_CD
Column description: Provides return codes associated with transactions assigned to each carrier by the prepaid vendor: 001 and 000
b) Previous term assignments to that column
c) A common abbreviation list - the system maintains a dictionary of common abbreviations in a business domain, which is used to expand abbreviated forms.
2. Intelligently decide the correct expanded form and present a confidence score
The term "Tx" might refer to "Tax", "Transaction", or something else. The machine uses features such as the source (description, previously assigned term, preloaded dictionary), the number of assigned terms, edit distance, domain, and context to pick the right expansion, and it keeps humans in the loop so it can learn from expert acknowledgement.
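Here is a minimal sketch of that expansion step under stated assumptions: a small hand-built common-abbreviation dictionary and a heuristic confidence score that rewards candidates confirmed by the column description and whose letters align with the abbreviation. The real model's features and weights are not shown here; everything below is illustrative.

```python
# Sketch: expand column-name abbreviations and attach a confidence
# score. Dictionary entries, weights, and the scoring heuristic are
# illustrative assumptions, not the actual Cloud Pak for Data model.
ABBREVIATIONS = {
    "TRANS": ["TRANSACTION", "TRANSFER"],
    "RTN": ["RETURN", "ROUTINE"],
    "CD": ["CODE", "CARD"],
}

def is_subsequence(abbr: str, word: str) -> bool:
    """True if the abbreviation's letters appear, in order, in the word."""
    letters = iter(word)
    return all(ch in letters for ch in abbr)

def expand(token: str, description: str = "") -> tuple[str, float]:
    """Pick the best expansion for one column-name token.

    Dictionary candidates get a base score; a candidate that also
    appears in the column description is boosted, mimicking the
    'source' feature described above.
    """
    best, best_score = token, 0.0
    for candidate in ABBREVIATIONS.get(token.upper(), []):
        score = 0.5                                 # found in dictionary
        if candidate.lower() in description.lower():
            score += 0.4                            # confirmed by description
        if is_subsequence(token.upper(), candidate):
            score += 0.1                            # letters align in order
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# The TRANS_RTN_CD column from the example above:
desc = "Provides return codes associated with transactions"
for part in "TRANS_RTN_CD".split("_"):
    print(part, "->", expand(part, desc))
# TRANS -> ('TRANSACTION', 1.0); RTN -> ('RETURN', 1.0); CD -> ('CODE', 1.0)
```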
The outcome and business value can be significant: some clients have generated 30,000 terms in less than five minutes, a task that, without AI-based technologies, might have taken a team of experts months at a large organization. One of the most valuable aspects of generating glossaries with the approaches above is that the generated terms enter a draft workflow, where experts can review and approve them before they become part of the enterprise-wide catalog.
Conclusion
Implementing a business glossary is an important endeavor and often the first step in defining a DataOps strategy. Selecting a high-value project can help deliver a high-value business glossary, thanks to the focus and executive sponsorship that come with it. While doing so, consider not just the obvious sources of knowledge but also the hidden knowledge locked in cryptic data models and the dark data produced every day in your business operations. Cloud Pak for Data can help empower data workers to do that as part of a seamless workflow. Feel free to comment on how you plan to include dark data in your DataOps strategy.