Including Dark Data as part of your DataOps Strategy
Think about it… If just 5 percent of your employees did not return to work tomorrow, how much information and knowledge would you lose? How much of your critical business information is stored only in your employees' heads, and how much is part of an enterprise-wide catalog?
Effective communication is a key success factor in any organization, and ambiguity and misunderstanding can have serious consequences. Let's take an example. The query "Q3 Revenue for NA from gold partners" seems relatively clear, but for the people responsible for generating those numbers and for the stakeholders making decisions based on them, it is ambiguous. What date range is Q3 in my company? Which regions make up NA? Who are the gold partners? Which source has the most recent revenue data? In other words, data ambiguity creates bottlenecks for democratizing data, and without democratization it is hard to achieve agility and speed in business.
Need for a Business Glossary
Based on feedback from numerous IBM engagements, we found that:
- knowledge workers spent more than 25% of their time searching for information to do their jobs
- half of the information returned in search results was not useful
- more than 60% said that internal company information and terminology could add significant value to their productivity
These three findings highlight the value of a business glossary, at the company-wide and departmental levels, that provides consistent business definitions for knowledge workers. It is estimated that only one fifth of data is publicly searchable; the rest resides behind corporate firewalls. And apart from the structured and semi-structured data belonging to "systems of record" and "systems of engagement", most of that content is not being leveraged for insights.
What is Dark Data?
Dark data is defined as the information assets organizations collect, process, and store during regular business activities but generally fail to use for other purposes such as analytics, business relationships, and direct monetization.
Figure 1: Snippet of a regulatory document showing the complex nature of the information stored
I would categorize this so-called dark data into:
1. Publicly available content such as government standards and regulations, which are cumbersome to leverage in your business processes
2. Internally produced information assets such as DOUs (Documents of Understanding), MOUs (Memorandums of Understanding), contracts, licenses, and intellectual property materials
3. Logical data models
The goal is to convert this dark data into a well-formed taxonomy of knowledge and ingest it into an enterprise-wide catalog, so that it becomes easily searchable, relates to business processes, and helps solve complex business problems.
AI-powered Glossary Generation on Cloud Pak for Data
Cloud Pak for Data is an AI-powered DataOps platform. It helps provide knowledge workers with the tools and technology to create an information architecture that can be used to infuse AI capabilities into their business processes. Clients can use Cloud Pak for Data to build governed data lakes using IBM catalog and data virtualization technologies. A first step in organizing data from disparate sources could be for a data steward to initiate a process in which machine learning discovers metadata, classifies it, and assigns appropriate business terms to make data assets meaningful. For example, AI can figure out that the column "add1" holds street addresses and assign it the term "US street address". However, many clients do not have a rich set of business glossaries for their department or the domain they support, and creating a glossary is a time-consuming task requiring human expertise that may be hard to find. This is a classic problem where AI comes in handy, and we introduced two new automated techniques to generate glossaries that can be useful to the business.
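To make that concrete, here is a tiny, hypothetical sketch of rule-based data-class detection in Python. The real platform uses trained machine-learning classifiers; the patterns, term names, and sample values below are my own illustrative assumptions, not Cloud Pak for Data APIs.

```python
# Hypothetical sketch: assign a business term to a column by testing
# sampled values against data-class patterns. Illustrative only; the
# platform itself learns these classifications with ML.
import re

DATA_CLASSES = {
    "US street address": re.compile(
        r"^\d+\s+\w+(\s+\w+)*\s+(st|street|ave|avenue|rd|road)\.?$", re.I),
    "US ZIP code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify_column(sample_values):
    """Return a business term if most sampled values match its pattern."""
    for term, pattern in DATA_CLASSES.items():
        hits = sum(bool(pattern.match(v.strip())) for v in sample_values)
        if hits / len(sample_values) >= 0.8:  # simple confidence threshold
            return term
    return None

# The "add1" column from the example above (made-up sample values):
print(classify_column(["120 Main St", "77 Ocean Ave", "9 Elm Road"]))
# -> US street address
```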
Technique 1: Generate a glossary from a PDF
A glossary consists of important terms, their descriptions, and control statements (policies and rules), categorized by subjects that convey the themes of the documents. Take the example of a regulatory document for the CCPA, the California Consumer Privacy Act. It contains paragraphs that reference other articles and sections on different pages, making it difficult for humans to parse and interpret. Some regulations, such as CECL (Current Expected Credit Loss), do not define important concepts but rather explain them using accounting tables. FDA regulations use tables to store important data. What makes the process more challenging is the variation of definitions from regulation to regulation. For example, the definition of the term "third party" can vary as shown below.
Figure 2: Definition of the same term varies from document to document
With natural language processing (NLP) technology, machines have become better at extracting key concepts while maintaining the context of the document. Optical Character Recognition (OCR) technology helps deal with multi-format documents, including those with tables. We use BERT (Bidirectional Encoder Representations from Transformers), which applies an attention mechanism (one that learns contextual relations between words or parts of words) to language modeling. A bidirectionally trained language model can develop a deeper sense of language context and flow than single-direction language models.
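As a small illustration of what bidirectional context buys us, the sketch below pulls contextual vectors for the same word from a standard pre-trained BERT model via the Hugging Face transformers library. The model name and sentences are my own assumptions for demonstration, not part of the Cloud Pak for Data pipeline.

```python
# Sketch: the same word ("bank") receives different contextual vectors
# from BERT depending on the sentence around it. Assumes the Hugging
# Face transformers and torch packages; illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The bank approved the loan application.",  # financial sense
    "They had a picnic on the river bank.",     # geographic sense
]
bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, num_tokens, 768)
    position = inputs.input_ids[0].tolist().index(bank_id)
    vectors.append(hidden[0, position])

# A similarity well below 1.0 shows context changed the representation.
similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"similarity between the two 'bank' vectors: {similarity.item():.3f}")
```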
Figure 3: Multi-step process of extracting knowledge from dark data
Other key innovations in this process include rule extraction, which is essential for actionable outcomes because most of these documents hold a set of policies and rules that drive governance or compliance workflows. Topic modeling is then applied to group the terms, descriptions, and rules into logical categories of relevant topics. Applying LDA (Latent Dirichlet Allocation) produces a list of topics with a weight for each word in a topic, and the top "k" words in each topic are used to represent it. The number of topics generated and the number of words representing each topic (k) depend on the document's vocabulary size. The next step is to extract context-based word embeddings using BERT and choose the word closest to the centroid of the topic as the topic label. The result is a set of neatly extracted terms, descriptions, and rules, as shown below.
Figure 4: Cloud Pak for Data workflow to build knowledge from dark data
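For readers who want to see the mechanics, here is a simplified sketch of the topic-labeling step under stated assumptions: scikit-learn's LatentDirichletAllocation for the topic model and the sentence-transformers package for BERT-family embeddings, with each top word embedded in isolation as a simplification. The document snippets, model name, and parameters are illustrative, not the production pipeline.

```python
# Sketch: label each LDA topic with the top-k word whose embedding is
# closest to the centroid of the topic's word embeddings. Assumes
# scikit-learn and sentence-transformers; everything is illustrative.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

documents = [
    "A consumer may request deletion of personal information collected.",
    "The business shall disclose categories of personal information sold.",
    "Expected credit losses are estimated over the contractual term.",
    "The allowance for credit losses reflects historical loss experience.",
]

# Step 1: LDA yields topics as weighted distributions over words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

# Step 2: embed each topic's top-k words and pick the word closest to
# the centroid of those embeddings as the topic label.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
k = 5
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:k]]
    embeddings = encoder.encode(top_words)        # shape (k, dim)
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    label = top_words[int(distances.argmin())]
    print(f"topic {topic_idx}: label={label!r}, top words={top_words}")
```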
Technique 2: Generate a glossary from logical data models
Ideally, logical data models, used to document the information architecture of an enterprise, represent business constructs with textual nomenclature and map naturally to a glossary of terms. In practice, however, this is not the case in many enterprises: most have data assets made up of piles of tables and columns that represent purely technical artifacts. This technique understands the value hidden in those assets and allows data stewards to discover the metadata as an on-ramp to building a governed data lake. An AI model employs a two-step process to generate a useful business glossary from the data model:
1. Automatically build an abbreviation dictionary based on:
a) Column description
Column name: TRANS_RTN_CD
Column description: Provides return codes associated with transactions assigned to each carrier by the prepaid vendor: 001 and 000
b) Previous term assignments to that column
c) A common abbreviation list - the system maintains a dictionary of common abbreviations in a business domain, which is used to expand abbreviated forms.
2. Intelligently decide the correct expanded form and present a confidence score
The term "Tx" might refer to "Tax", "Transaction", or something else. The machine uses features such as the source (description, previously assigned term, preloaded dictionary), the number of assigned terms, edit distance, domain, and context to pick the right expansion, and it keeps humans in the loop so it can learn from expert acknowledgement.
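Here is a minimal sketch of that expansion step under stated assumptions: a small hand-built common-abbreviation dictionary and a heuristic confidence score that rewards candidates confirmed by the column description and whose letters align with the abbreviation. The real model's features and weights are not shown here; everything below is illustrative.

```python
# Sketch: expand column-name abbreviations and attach a confidence
# score. Dictionary entries, weights, and the scoring heuristic are
# illustrative assumptions, not the actual Cloud Pak for Data model.
ABBREVIATIONS = {
    "TRANS": ["TRANSACTION", "TRANSFER"],
    "RTN": ["RETURN", "ROUTINE"],
    "CD": ["CODE", "CARD"],
}

def is_subsequence(abbr: str, word: str) -> bool:
    """True if the abbreviation's letters appear, in order, in the word."""
    letters = iter(word)
    return all(ch in letters for ch in abbr)

def expand(token: str, description: str = "") -> tuple[str, float]:
    """Pick the best expansion for one column-name token.

    Dictionary candidates get a base score; a candidate that also
    appears in the column description is boosted, mimicking the
    'source' feature described above.
    """
    best, best_score = token, 0.0
    for candidate in ABBREVIATIONS.get(token.upper(), []):
        score = 0.5                                 # found in dictionary
        if candidate.lower() in description.lower():
            score += 0.4                            # confirmed by description
        if is_subsequence(token.upper(), candidate):
            score += 0.1                            # letters align in order
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# The TRANS_RTN_CD column from the example above:
desc = "Provides return codes associated with transactions"
for part in "TRANS_RTN_CD".split("_"):
    print(part, "->", expand(part, desc))
# TRANS -> ('TRANSACTION', 1.0); RTN -> ('RETURN', 1.0); CD -> ('CODE', 1.0)
```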
The outcome and business value can be significant: some clients have generated 30,000 terms in less than five minutes, a task that, without AI-based technologies, might have taken a team of experts months at a large organization. One of the most valuable aspects of generating glossaries with the approaches above is that the generated terms enter a draft workflow, where experts can review and approve them before they become part of the enterprise-wide catalog.
Conclusion
Implementing a business glossary is an important endeavor and often the first step in defining a DataOps strategy. Selecting a high-value project can help deliver a high-value business glossary, thanks to the focus and executive sponsorship that come with it. While doing so, consider not just the obvious sources of knowledge but also the hidden knowledge locked in cryptic data models and the dark data produced every day in your business operations. Cloud Pak for Data can help empower data workers to do that as part of a seamless workflow. Feel free to comment on how you plan to include dark data in your DataOps strategy.