Available But Unusable Data - Part II - Semantic Gaps
Venkata Pingali
Scribble Data | AI for Financial Services | Co-Founder & CEO | Hiring!
At Scribble Data we are thinking deeply about why decision makers cannot get to the data when they need it, even when relevant data is available in their own databases. This question matters because we find that decision makers routinely make high-risk decisions involving products, marketing, and operations with very limited and ambiguous data, and the absolute cost of an incorrect decision, or the opportunity cost of a delayed decision, is significant.
In my previous article, I elaborated on why available data is unusable. There are systems and organizational issues, and hard technical problems. The focus of this article is the core technical problem of semantic gaps - the gap between the meaning of the question and the meaning of the data.
The key message is that clean solutions are not on the horizon because semantic gaps are fundamentally unsolvable. Highly valuable incremental solutions are, however, possible through careful process and systems thinking.
Available But Unusable Data
As I discussed in my previous article, the problem is not raw access to the data lying in the database but the complexity of the data and of the systems in which it lives. We have to cope with the fact that high-level information is broken up and distributed across various tables and systems, and the understanding of how it is laid out and what each piece means is buried deep inside application code that takes a specialist to understand and extract.
Syntax and Semantics in Systems
Syntax in systems can be thought of as any information that has concrete representation and unambiguous meaning. For example, the number 2. There is structure such as a 32-bit in-memory representation, and precise tests for the value.
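As a small illustration of what "concrete representation and precise tests" means, here is a sketch using Python's standard struct module. The little-endian byte order is an arbitrary choice for the example, not something the article specifies.

```python
import struct

# Pack the number 2 as a 32-bit signed integer, little-endian (assumed byte order).
encoded = struct.pack("<i", 2)

# Unpack it again; struct.unpack always returns a tuple.
(decoded,) = struct.unpack("<i", encoded)

print(encoded)       # the four bytes that represent the value
print(decoded == 2)  # a precise, unambiguous test for the value
```

Nothing about the bytes or the equality test depends on context, which is what makes this syntax rather than semantics.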
Semantics is everything that is not syntax. For example, you may see the transaction details such as the time and amount, but what is not in the transaction is the conversation that the store manager had with the customer during the execution of the transaction. Perhaps the customer was considering another product until the manager intervened and recommended the product that we saw in the transaction. This information is tacit and not captured anywhere.
Semantics is therefore by definition unbounded. We don’t even “know” that a particular semantic dimension exists until the right context and question emerge. For example, we don’t know that the store manager's conversation with the customer matters until some sales process or legal question, such as suspected fraud, is raised. When such situations arise, we make software modifications to capture this information to whatever extent feasible, i.e., turn semantics into syntax.
There is much that is not in data.
Semantic Gaps Exist Between Questions and Data
Imagine a simple retail store with a database. The database likely has a table like the one shown below:
Consider a simple question against this database: What is the total revenue last week?
You might have noted that there is no column called Revenue. “Revenue” could refer to pre-tax, post-discount pre-tax, or post-tax amounts. The phrase “last week” has at least four possible definitions, including the week ending today or yesterday, and the week ending with the previous Saturday or Sunday.
If we consider variations of the question such as (a) What is the total revenue last week for tall customers? or (b) What is the marginal revenue last week for tall customers?, we are looking at more dimensions of ambiguity. The semantic gap grows rapidly with the number of variables.
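To make the ambiguity concrete, here is a minimal Python sketch that answers the same question under every combination of the three revenue definitions and four "last week" definitions mentioned above. The sample rows, the column layout (list price, discount, tax), and the reading of "pre-tax" as list price before discount are all assumptions made for illustration; they are not the store's actual schema.

```python
from datetime import date, timedelta

# Hypothetical rows: (sale date, list price, discount, tax), in whole currency units.
SALES = [
    (date(2024, 5, 4), 100, 0, 8),
    (date(2024, 5, 6), 200, 20, 14),
    (date(2024, 5, 11), 50, 5, 4),
    (date(2024, 5, 14), 300, 0, 24),
]

# Three candidate meanings of "revenue" (assumed formulas).
REVENUE_DEFINITIONS = {
    "pre-tax": lambda price, discount, tax: price,
    "post-discount pre-tax": lambda price, discount, tax: price - discount,
    "post-tax": lambda price, discount, tax: price - discount + tax,
}


def last_week_windows(today):
    """Return four candidate inclusive (start, end) ranges for 'last week'."""
    # date.weekday(): Monday == 0, ..., Saturday == 5, Sunday == 6.
    # "or 7" picks the Saturday/Sunday strictly before today.
    days_since_sat = (today.weekday() - 5) % 7 or 7
    days_since_sun = (today.weekday() - 6) % 7 or 7
    ends = {
        "week ending today": today,
        "week ending yesterday": today - timedelta(days=1),
        "week ending previous Saturday": today - timedelta(days=days_since_sat),
        "week ending previous Sunday": today - timedelta(days=days_since_sun),
    }
    return {name: (end - timedelta(days=6), end) for name, end in ends.items()}


def all_answers(today):
    """Compute total revenue under every (revenue, week) interpretation."""
    answers = {}
    for week_name, (start, end) in last_week_windows(today).items():
        for rev_name, revenue in REVENUE_DEFINITIONS.items():
            total = sum(
                revenue(price, discount, tax)
                for day, price, discount, tax in SALES
                if start <= day <= end
            )
            answers[(rev_name, week_name)] = total
    return answers


answers = all_answers(date(2024, 5, 15))  # a Wednesday
for key, total in sorted(answers.items()):
    print(key, total)
```

On this toy data the twelve interpretations produce several different totals for what looks like a single question.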
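A rough way to see this growth is to count interpretations: each ambiguous term multiplies the number of possible readings. The sketch below enumerates them with itertools.product. The candidate definitions of "tall" are invented for the example; the article does not define them.

```python
from itertools import product

# Candidate definitions per ambiguous term. The "tall" entries are hypothetical.
CANDIDATES = {
    "revenue": ["pre-tax", "post-discount pre-tax", "post-tax"],
    "last week": [
        "week ending today",
        "week ending yesterday",
        "week ending previous Saturday",
        "week ending previous Sunday",
    ],
    "tall": ["above 180 cm", "above the store average", "top 10% by height"],
}


def interpretations(terms):
    """Enumerate every combination of definitions for the given terms."""
    return list(product(*(CANDIDATES[term] for term in terms)))


print(len(interpretations(["revenue", "last week"])))
print(len(interpretations(["revenue", "last week", "tall"])))
```

Adding one term with three candidate meanings triples the count, so a question with even a handful of loosely defined terms has far more readings than anyone would enumerate by hand.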
What Analysts Mainly Do - Resolve Semantic Gaps
Computing systems only understand syntax. Any semantics have to be turned into syntax before interfacing with them. For example, in order to compute on the database, we need a precise specification of the computation in a database query language such as SQL. English doesn't work.
Business/data analysts undertake four distinct activities to get to data:
- Resolve ambiguities - identify variables with multiple possible definitions, and select from them in an organization-meaningful way
- Build a specification - identify metrics to compute and mathematical functions to apply, and select data filters to apply for relevance
- Generate a program - translate the abstract specification into appropriate programs such as Excel formulae, SQL, and Python code
- Execute the program - use database interfaces such as the console and BI scripting framework to execute the program
(1) requires a mix of technical and business understanding. (2), (3) and (4) involve programming and computing system details, and are typically done by technical staff.
Of these, (1) and part of (2) are where analysts contribute most heavily. Once the ambiguities are resolved, the rest is relatively mechanical. To execute (1) and (2), analysts do a variety of things: talking to developers to externalize tacit knowledge about the data layout and to identify the subset of questions that can be answered, talking to business staff to figure out what the user really wants, and making choices about technology paths to execute the specification.
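The four activities can be sketched end to end in a few lines of Python using the standard sqlite3 module. Everything specific here is an assumption for illustration: the transactions table, its columns (sold_on, list_price, discount, tax), the sample rows, and the analyst's choices in the spec.

```python
import sqlite3

# Whitelisted SQL expressions for each meaning of "revenue" (assumed columns).
REVENUE_SQL = {
    "pre_tax": "list_price",
    "post_discount_pre_tax": "list_price - discount",
    "post_tax": "list_price - discount + tax",
}

# (1) Resolve ambiguities and (2) build a specification: the analyst's choices.
spec = {
    "revenue": "post_discount_pre_tax",
    "start": "2024-05-05",  # week ending previous Saturday, as ISO dates
    "end": "2024-05-11",
}


def build_query(spec):
    """(3) Generate a program: translate the specification into SQL."""
    expr = REVENUE_SQL[spec["revenue"]]
    sql = (
        f"SELECT COALESCE(SUM({expr}), 0) FROM transactions "
        "WHERE sold_on BETWEEN ? AND ?"
    )
    return sql, (spec["start"], spec["end"])


def run(conn, spec):
    """(4) Execute the program against the database."""
    sql, params = build_query(spec)
    return conn.execute(sql, params).fetchone()[0]


# A throwaway in-memory database with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(sold_on TEXT, list_price INTEGER, discount INTEGER, tax INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("2024-05-04", 100, 0, 8),
        ("2024-05-06", 200, 20, 14),
        ("2024-05-11", 50, 5, 4),
        ("2024-05-14", 300, 0, 24),
    ],
)

print(run(conn, spec))
```

Steps (3) and (4) are a few mechanical lines once the spec exists; all of the judgment sits in the choices recorded in spec.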
Bridging Semantic Gaps Automatically Is Not Possible
Let us, for a moment, assume that we could bridge all semantic gaps. In that world, our life would be entirely different. We would be able to tell a computer to "plan my vacation" and expect the right thing to happen. The inherently unbounded nature of semantics prevents us from building fully automatic solutions. It is theoretically not possible.
But all is not lost.
You might have noticed that in the transaction example above, the dimensions of ambiguity are still limited. In such cases, we could potentially use a combination of suitable interfaces, user guidance, intelligent defaults, and mechanisms to recover from mistakes. In effect, we can provide good-enough solutions in many cases.
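One possible shape for such a mechanism, sketched in Python: apply a default for each ambiguous term, tell the user what was assumed, and let them override a choice to recover from a wrong guess. The option lists, the defaults, and the resolve function are hypothetical and do not describe any particular product.

```python
# Known meanings per ambiguous term, and the default picked when the user says nothing.
OPTIONS = {
    "revenue": ["pre-tax", "post-discount pre-tax", "post-tax"],
    "last week": [
        "week ending today",
        "week ending yesterday",
        "week ending previous Saturday",
        "week ending previous Sunday",
    ],
}
DEFAULTS = {
    "revenue": "post-discount pre-tax",
    "last week": "week ending previous Sunday",
}


def resolve(terms, overrides=None):
    """Pick a meaning for each term and report every assumption made."""
    overrides = overrides or {}
    resolved, notes = {}, []
    for term in terms:
        choice = overrides.get(term, DEFAULTS[term])
        if choice not in OPTIONS[term]:
            raise ValueError(f"unknown meaning {choice!r} for {term!r}")
        resolved[term] = choice
        if term not in overrides:
            # User guidance: surface the default and the alternatives.
            others = [option for option in OPTIONS[term] if option != choice]
            notes.append(f"Assumed '{term}' means '{choice}'; alternatives: {others}")
    return resolved, notes


# First pass: defaults only, with the assumptions shown to the user.
resolved, notes = resolve(["revenue", "last week"])
print(resolved)
for note in notes:
    print(note)

# Recovery: the user corrects one assumption and the question is re-resolved.
corrected, remaining_notes = resolve(["revenue", "last week"], {"revenue": "post-tax"})
print(corrected)
```

The defaults will sometimes be wrong; the point of the design is that the assumption is visible and cheap to change.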
Semantic Data Access Interfaces are Required
We are human. We don’t think in terms of database schema or SQL. We think in terms of concepts and relationships that are semantics heavy. An interface that allows us to express our thinking as is will improve our ability to relate to data.
Existing business intelligence systems such as Microsoft Power BI and visualization platforms such as Tableau try to give complete control to the end-user through their menus and scripting frameworks, and stay away from unbounded aspects such as semantics. This has worked for a long time, and still continues to work. But as the complexity of databases continues to grow, and as the number of people doing data analysis rapidly grows, so does the need for fresh approaches.
Newer frameworks such as ThoughtSpot, and our own AskScribble, are making inroads on providing semantic interfaces. They can’t, by definition, be perfect. But they reduce the barrier to accessing data.
In the next article, I will further expand on what systems that provide semantic data interfaces look like and why.