Available But Unusable Data - Part II - Semantic Gaps

At Scribble Data, we are thinking deeply about why decision makers cannot get to data when they need it, even when relevant data is available in their own databases. This question matters because we find that decision makers routinely make high-risk decisions involving products, marketing, and operations with very limited and ambiguous data, and the absolute cost of an incorrect decision, or the opportunity cost of a delayed decision, is significant.

In my previous article, I elaborated on why available data is unusable. There are systems and organizational issues, and hard technical problems. The focus of this article is the core technical problem of semantic gaps - the gap in the meanings of the question and the data.

The key message is that clean solutions are not on the horizon because semantic gaps are fundamentally unsolvable. Highly valuable incremental solutions are, however, possible through careful process and systems thinking.

Available But Unusable Data

As I discussed in my previous article, the problem is not raw access to the data lying in the database but the complexity of the data and of the systems in which it lives. We have to cope with the fact that high-level information is broken up and distributed across various tables and systems, and that the understanding of how the pieces are laid out and what each one means is buried deep inside application code that takes a specialist to understand and extract.

Syntax and Semantics in Systems

Syntax in systems can be thought of as any information that has concrete representation and unambiguous meaning. For example, the number 2. There is structure such as a 32-bit in-memory representation, and precise tests for the value.
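As a small illustration of what "concrete representation" and "precise tests" mean for the number 2, here is a minimal Python sketch. The choice of a little-endian signed 32-bit layout is an assumption made only for this example.

```python
import struct

# Pack the integer 2 as a little-endian signed 32-bit value (layout chosen for illustration).
raw = struct.pack("<i", 2)

# The same value written out as 32 binary digits.
bits = format(2, "032b")

# The representation is fixed, so the test for the value is exact.
value = struct.unpack("<i", raw)[0]
print(raw.hex(), bits, value == 2)
```

Nothing about this value is open to interpretation, which is what separates it from the semantic information discussed next.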

Semantics is everything that is not syntax. For example, you may see the transaction details such as the time and amount, but what is not in the transaction is the conversation that the store manager had with the customer during the execution of the transaction. Perhaps the customer was considering another product until the manager intervened and recommended the product that we saw in the transaction. This information is tacit and not captured anywhere.

Semantics is therefore by definition unbounded. We don’t even “know” that a particular semantic dimension exists until the right context and question emerge. For example, we don’t know that the store manager’s conversation with the customer matters until some sales process or legal question, e.g., fraud, is raised. When such situations arise, we make software modifications to capture this information to whatever extent feasible, i.e., turn semantics into syntax.

There is much that is not in data.

Semantic Gaps Exist Between Questions and Data 

Imagine a simple retail store with a database. The database likely has a table like the one shown below:

Consider a simple question against this database: What is the total revenue last week?

You might have noted that there is no column called Revenue. “Revenue” could refer to pre-tax, post-discount pre-tax, or post-tax amounts. The phrase “last week” has at least four possible definitions, including the week ending today or yesterday, and the week ending with the previous Saturday or Sunday.
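To make the ambiguity of "last week" concrete, the following Python sketch enumerates the four readings mentioned above as date ranges. The function names, and the choice of inclusive seven-day windows, are assumptions made for illustration.

```python
from datetime import date, timedelta

SATURDAY, SUNDAY = 5, 6  # date.weekday() numbering, where Monday == 0


def week_ending(end: date) -> tuple[date, date]:
    """Return the inclusive 7-day range that ends on `end`."""
    return end - timedelta(days=6), end


def most_recent(weekday: int, today: date) -> date:
    """Latest date strictly before `today` that falls on `weekday`."""
    days_back = (today.weekday() - weekday) % 7 or 7
    return today - timedelta(days=days_back)


def last_week_candidates(today: date) -> dict[str, tuple[date, date]]:
    """Enumerate the four readings of "last week" named in the text."""
    return {
        "ending_today": week_ending(today),
        "ending_yesterday": week_ending(today - timedelta(days=1)),
        "ending_previous_saturday": week_ending(most_recent(SATURDAY, today)),
        "ending_previous_sunday": week_ending(most_recent(SUNDAY, today)),
    }


if __name__ == "__main__":
    for name, (start, end) in last_week_candidates(date(2024, 1, 10)).items():
        print(f"{name}: {start} to {end}")
```

Each reading selects a different set of transactions, so the same question can produce four different totals before "revenue" has even been defined.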

If we consider variations of the question such as (a) What is the total revenue last week for tall customers? or (b) What is the marginal revenue last week for tall customers?, we are looking at more dimensions of ambiguity. The semantic gap grows rapidly with the number of variables.
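A rough way to see this growth is to count interpretations. The sketch below takes the three revenue definitions and four "last week" definitions from the text and adds two definitions of "tall". The article does not define "tall", so those thresholds are invented purely for illustration.

```python
from itertools import product

# Candidate definitions for each ambiguous term in the question.
REVENUE = ["pre_tax", "post_discount_pre_tax", "post_tax"]
LAST_WEEK = ["ending_today", "ending_yesterday",
             "ending_previous_saturday", "ending_previous_sunday"]
# "Tall" has no definition in the article; these thresholds are hypothetical.
TALL = ["height_cm >= 180", "height_cm >= 185"]


def interpretations(*dimensions: list[str]) -> list[tuple[str, ...]]:
    """Every combination of one choice per ambiguous dimension."""
    return list(product(*dimensions))


base = interpretations(REVENUE, LAST_WEEK)
with_tall = interpretations(REVENUE, LAST_WEEK, TALL)
print(len(base), len(with_tall))
```

The count is the product of the number of options per term, so each additional ambiguous word multiplies the number of possible answers rather than adding to it.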

What Analysts Mainly Do - Resolve Semantic Gaps

Computing systems only understand syntax. Any semantics have to be turned into syntax before interfacing with them. For example, in order to compute on the database, we need a precise specification of the computation in a database language such as SQL. English doesn't work.

Business/data analysts undertake four distinct activities to get to data:

  1. Resolve ambiguities - identify variables with multiple possible definitions, and select from them in an organization-meaningful way
  2. Build a specification - identify metrics to compute and mathematical functions to apply, and select data filters to apply for relevance
  3. Generate a program - translate the abstract specification into appropriate programs such as Excel formulae, SQL, and Python code
  4. Execute the program - use database interfaces such as the console and BI scripting framework to execute the program

(1) requires a mix of technical and business understanding. (2), (3) and (4) involve programming and computing system details, and are typically done by technical staff.

Of these, (1) and part of (2) are where analysts contribute most heavily. Once the ambiguities are resolved, the rest is relatively mechanical. In order to execute (1) and (2), analysts do a variety of things, such as talking to developers to externalize tacit knowledge about the data layout and to identify the subset of questions that can be answered, talking to business staff to figure out what the user really wants, and making choices about technology paths to execute the specification.
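The four activities listed above can be sketched end to end in a few lines of Python using the standard sqlite3 module. This is a minimal illustration under stated assumptions, not a description of any particular tool: the transactions table, its columns, the sample rows, and the mapping from each revenue definition to a column expression are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical schema and rows; the article's actual table is not reproduced here.
SCHEMA = """
CREATE TABLE transactions (
    txn_date TEXT, amount REAL, discount REAL, tax REAL
)"""
ROWS = [
    ("2024-01-02", 100.0, 10.0, 9.0),
    ("2024-01-05", 50.0, 0.0, 5.0),
    ("2023-12-20", 80.0, 8.0, 7.2),
]

# Step 1: resolve ambiguities by mapping each meaning of "revenue" to an expression.
REVENUE_EXPRESSIONS = {
    "pre_tax": "amount",
    "post_discount_pre_tax": "amount - discount",
    "post_tax": "amount - discount + tax",
}


# Step 2: build a specification (which metric, over which date filter).
@dataclass(frozen=True)
class Spec:
    revenue_definition: str
    start_date: str  # inclusive, ISO format
    end_date: str    # inclusive, ISO format


# Step 3: generate a program, here a parameterized SQL query.
def to_sql(spec: Spec) -> tuple[str, tuple[str, str]]:
    expr = REVENUE_EXPRESSIONS[spec.revenue_definition]
    sql = (f"SELECT COALESCE(SUM({expr}), 0) FROM transactions "
           "WHERE txn_date BETWEEN ? AND ?")
    return sql, (spec.start_date, spec.end_date)


# Step 4: execute the program against the database.
def run(conn: sqlite3.Connection, spec: Spec) -> float:
    sql, params = to_sql(spec)
    return conn.execute(sql, params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", ROWS)

spec = Spec("post_discount_pre_tax", "2024-01-01", "2024-01-07")
print(run(conn, spec))
```

Steps 3 and 4 are only a handful of lines once the Spec exists. Choosing the revenue definition and the date range is where the judgment goes, and none of that judgment is visible in the SQL that finally runs.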

Bridging Semantic Gaps Automatically Is Not Possible

Let us, for a moment, assume that we could bridge all semantic gaps. In that world, our life would be entirely different. We would be able to tell a computer to "plan my vacation" and expect the right thing to happen. The inherent unbounded nature of semantics prevents us from building automatic solutions. It is theoretically not possible.

But all is not lost.

You might have noticed that in the transaction example above, the dimensions of ambiguity are still limited. In such cases, we could potentially use a combination of suitable interfaces, user guidance, intelligent defaults, and mechanisms to recover from mistakes. In effect, the answer is yes, we can provide good enough solutions in many cases.
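One way to picture intelligent defaults combined with user guidance and recovery is a resolver that picks a default for each ambiguous term, tells the user what it assumed, and accepts corrections. The sketch below is a hypothetical illustration in Python. The default choices and the resolve function are assumptions, not a description of an existing product.

```python
from typing import Optional

# Hypothetical defaults; a real deployment would choose these with the business.
DEFAULTS = {
    "revenue": "post_discount_pre_tax",
    "last_week": "ending_previous_sunday",
}
OPTIONS = {
    "revenue": ["pre_tax", "post_discount_pre_tax", "post_tax"],
    "last_week": ["ending_today", "ending_yesterday",
                  "ending_previous_saturday", "ending_previous_sunday"],
}


def resolve(overrides: Optional[dict[str, str]] = None) -> tuple[dict[str, str], list[str]]:
    """Apply defaults, then user overrides, and report every assumption made."""
    overrides = overrides or {}
    choices: dict[str, str] = {}
    notes: list[str] = []
    for term, default in DEFAULTS.items():
        picked = overrides.get(term, default)
        if picked not in OPTIONS[term]:
            raise ValueError(f"{term!r} must be one of {OPTIONS[term]}")
        choices[term] = picked
        if term not in overrides:
            # Guidance: surface the assumption and the alternatives to the user.
            alternatives = [o for o in OPTIONS[term] if o != picked]
            notes.append(f"Assumed {term} = {picked}; alternatives: {alternatives}")
    return choices, notes


# First attempt: defaults only, with the assumptions reported back.
choices, notes = resolve()
print(choices, notes)

# Recovery: the user corrects one assumption and the question is re-resolved.
corrected, remaining = resolve({"last_week": "ending_yesterday"})
print(corrected, remaining)
```

The design choice here is to make every silent assumption visible, so that a wrong default costs the user one correction rather than a wrong decision.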

Semantic Data Access Interfaces are Required

We are human. We don’t think in terms of database schema or SQL. We think in terms of concepts and relationships that are semantics heavy. An interface that allows us to express our thinking as is will improve our ability to relate to data.

Existing business intelligence systems such as Microsoft Power BI and visualization platforms such as Tableau try to give complete control to the end user through their menus and scripting frameworks, and stay away from unbounded aspects such as semantics. This has worked for a long time, and it continues to work. But as the complexity of databases continues to grow, and as the number of people doing data analysis grows rapidly, so does the need for fresh approaches.

Newer frameworks such as ThoughtSpot, and our own AskScribble, are making inroads on providing semantic interfaces. They can’t, by definition, be perfect. But they reduce the barrier to accessing data.

In the next article, I will further expand on what systems that provide semantic data interfaces look like and why.

