The Challenges of Virtual Knowledge Graphs

Nothing is new under the sun.

The idea of virtual access to data in remote (and often heterogeneous) data stores is nothing new. I remember back in the mid-1990s when RDBMSs all introduced the concept of "remote data access" – aka "virtual tables". This was followed by the notion of "Federated Databases" in the late 1990s, with specialty data store "plugin" architectures for RDBMS engines, ultimately culminating in a discipline called "Enterprise Information Integration (EII)" by the mid-2000s. Many RDBMS vendors marketed and sold a variety of solutions that were supposed to help achieve EII, along with Enterprise Information Management (EIM) and Enterprise Application Integration (EAI), as the trifecta for "Enterprise Information". The goals then (and now) for EII were:

  1. Reduce the deployment costs (storage and licensing) by eliminating duplicate storage of data.
  2. Provide real-time access to data vs. latency introduced by ETL and data movement/replication.
  3. Provide an enterprise-wide, unified view of business data, allowing cross-system queries to be executed without the weeks or months needed to first integrate the systems.

These implementations struggled – and today, it would be hard to find any real-world successful implementations still active.

There is another common phrase I almost used as a heading for this piece – "Those that don't learn from history are doomed to repeat it." There is a bit of hype in some areas about "virtual knowledge graphs" – or virtual graph databases in general – and I suspect some folks will be repeating history. While it is true that technology has advanced a lot in the last 20 years – and we have a new generation of technologists eager to try it out – my personal observation is that a lot of the challenges remain the same. They remain because they are tough challenges.

The challenges

The key reason for the lack of successful implementations is that there are some very real challenges:

  • To perform well, there need to be consistent naming standards and domains across source systems, or a metadata translation dictionary.
  • Data cleansing and consistency of data entry between disparate systems are often lacking. Data cleansing is often a key aspect of ETL – but if you remove ETL, you inherit the data quality issues.
  • Federated query optimization depends on knowledge of data locality and data cardinality. Maintaining this metadata is a constant, ongoing process that needs to be automated.
  • Distributed/federated query processing often requires query fragmentation – rewriting and dividing queries for execution – which is a difficult task beyond simplistic point queries and simple aggregations.
  • The federated database needs to be able to merge very large results from separate sources into a single result set. This consumes a lot of hardware resources.
  • Implementations are complex, fragile and time consuming – which rather negates the promise of avoiding months of integration referenced in point 3 earlier.

As a result, by the late 2000s/early 2010s most EII projects had been abandoned. The general consensus was that data virtualization can work well as a tactical solution within some narrowly scoped use cases, but it suffers from severe performance and complexity problems at any real volume or scale. Some of this will become more evident as we discuss these challenges in the context of virtual graph implementations.

The Virtual Knowledge Graph

Let's consider the three main approaches to any virtual database.

Metadata-only Federated Database

This approach is attractive to customers who have spent the last five years implementing a data lake and don't wish to make yet another copy of the data. In the metadata-only approach, a virtual graph implements a metadata catalog of the business data concepts and where they are stored in the disparate systems. The first attempt at this might be to try to enforce data naming and domain standardization rules – for example, "customer_id" is used universally and is a string field of 20 characters (or something like that). The first problem with this approach is that many software packages are developed by external entities, and they've already defined their schema. Adding to the fun, they may have been developed in a country with a different language. A classic example: SAP Business Suite – a very common ERP implementation. Most of the column names in SAP tables stem from German, such as:

  • MANDT ("Mandant" – tenant/client ID, or more accurately "business organization ID")
  • KUNNR ("Kundennummer" – customer number)
  • VBELN ("Vertriebsbelegnummer" – sales document number or (purchase) order number)

If you are attempting to integrate with another packaged application that uses business_unit, customer_id and purchase_order, you need to define a common schema and then map it to the different source system schemas. This is exacerbated by the plethora of legacy business systems still in use that would be cost prohibitive to re-develop to a new naming standard. There are other issues, such as the lack of consistent domains – or, even when the domains are consistent (e.g. customer id), the values are assigned independently in different systems. Then there is the challenge of keeping the metadata in sync – there are any number of tools and methods for replicating data values, but few that can detect and propagate schema changes.
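
As a purely hypothetical sketch, such a mapping catalog could itself be stored as graph metadata – the labels, table and property names below are illustrative, not any vendor's actual catalog:

// Hypothetical metadata catalog: map source-system columns to canonical fields
CREATE (cid:CanonicalField {name: 'customer_id'})
CREATE (bu:CanonicalField  {name: 'business_unit'})
CREATE (po:CanonicalField  {name: 'purchase_order'})

// SAP sales document header columns mapped to the canonical schema
CREATE (:SourceColumn {system: 'SAP', table: 'VBAK', column: 'KUNNR'})-[:MAPS_TO]->(cid)
CREATE (:SourceColumn {system: 'SAP', table: 'VBAK', column: 'MANDT'})-[:MAPS_TO]->(bu)
CREATE (:SourceColumn {system: 'SAP', table: 'VBAK', column: 'VBELN'})-[:MAPS_TO]->(po)

// A second packaged application that already uses the canonical-style names
CREATE (:SourceColumn {system: 'OrderApp', table: 'orders', column: 'customer_id'})-[:MAPS_TO]->(cid)

A query layer would then consult the MAPS_TO relationships to rewrite a canonical customer_id predicate into KUNNR for the SAP-backed fragment.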

This metadata-focused solution was the approach that many of the mid-2000s EII solutions took as the first real attempt to provide an enterprise solution beyond the one-off "remote data access" tables embedded in application databases. Often this metadata approach was supplemented by caching techniques to improve query response times. However, strict metadata-only solutions did not achieve any real success and, as mentioned, by the 2010s they were no longer major players.

Metadata + Key Attributes/Instance Pointers

Early on in database design, we come across data that is just too voluminous to want to store within the database itself – for example, the contents of webpages or books. As a result, we resort to storing a "pointer" or other information that the application can then use to retrieve the content from the source itself – e.g. the URL.

This applies to virtual knowledge graphs as well. The problem with data integration is that it isn't just the schemas – the values often need to be mapped to corresponding values in other systems. For example, in one system the customer_id might be a GUID while in another it might be a sequential number. As a result, the virtual graph actually needs to implement a physical node to represent the customer entity along with the identifying properties from each source system. Since most graph DBMSs advertising virtual graph capabilities are RDF-based triple stores, this means the graph database has to store:

  • A node with a URI that represents the customer entity.
  • One or more relationships to nodes representing the primary key values in the source systems.
  • Multiple relationships or schema metadata that maps each of the graph schema properties to the corresponding source system property.

More importantly, from a pure knowledge graph perspective, this allows us to create relationships between entities that aren't implemented in the current data model – or between different systems. This capability is often critical to knowledge graph queries.
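
As a hedged sketch of what that storage might look like – shown property-graph style with made-up labels and key values rather than RDF triples:

// One physical node for the customer entity, plus its identifying keys per source system
CREATE (c:Customer {uri: 'urn:example:customer:1001'})
CREATE (c)-[:IDENTIFIED_BY]->(:SourceKey {system: 'CRM', keyName: 'customer_guid', keyValue: 'hypothetical-guid-1001'})
CREATE (c)-[:IDENTIFIED_BY]->(:SourceKey {system: 'SAP', keyName: 'KUNNR', keyValue: '0000012345'})

// A relationship that exists only in the graph, spanning source systems
CREATE (c)-[:ACCOUNT_MANAGED_BY]->(:Employee {uri: 'urn:example:employee:42'})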

Now we've introduced a data consistency issue. While it is idealistic to assume that primary key values do not change, we all know that in real life they can and do change. As a result, we now need a way to keep the primary key values stored in the graph in sync – with whatever latency is tolerable for the applications using the virtual knowledge graph.

One realization from this should be that, in a virtual graph, the infrastructure needed to keep the key attributes synchronized already resolves much of the complexity of synchronizing the rest of the data as well – more on why this is important later.

Metadata + Key Attributes + Common Search Fields

This is probably the most common virtual knowledge graph implementation. The bulk of the knowledge graph is actually stored in the graph DBMS, in that not only are the entities and their identifying attributes stored, but also key common search fields and metrics. For example, customer entities may have the customer's name, customer type, city, state and postal code all stored in the graph-based knowledge graph; similarly, the product SKU, product name, price and other key fields related to the products. However, the customer's street address (which can be quite long and is rarely used for analytics) or the product's detailed description are not stored in the knowledge graph. In a sense, the data in the graph-based knowledge graph has been minimized to just what is necessary to perform the graph query patterns and analysis – without all the extra properties that a full "digital twin" might include.

Beyond extra properties, the detail data stored in other enterprise data warehouses/data marts/data lakes may also not be stored in the knowledge graph. For example, the data lake may contain every purchase a customer made for a specific product going back several decades. The knowledge graph may have a single relationship between the customer node and the product node – which in many cases may simply carry an aggregate of the number of purchases, if the planned analysis would use it as a weighting.
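
A minimal sketch of that shape, assuming hypothetical labels and property names:

// Only the search fields and an aggregate weight live in the graph;
// full purchase history and long descriptive text stay in the source systems
CREATE (c:Customer {customerId: 'C-1001', name: 'Acme Corp', type: 'B2B',
                    city: 'Philadelphia', state: 'PA', postalCode: '19103'})
CREATE (p:Product  {sku: 'SKU-778', name: 'Widget Pro', price: 19.99})
CREATE (c)-[:PURCHASED {purchaseCount: 42}]->(p)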

This last implementation is actually the one we generally recommend at Neo4j when people are implementing a knowledge graph. Early in the modeling class, we stress removing unnecessary properties from the source data – using the phrase "eliminating decoration". The same is true of unnecessary history within the graph database when what matters from an analysis perspective is the weighting achievable via an aggregate value on the relationship between the two entities. As a result, the "virtual" knowledge graph implementations in competitor products often end up nearly identical to a Neo4j implementation with respect to data content.

Federated Query Processing

The graph store is a very minor part of the success of a virtual knowledge graph. The key to its success is how well it handles queries, and whether performance is within reasonable expectations compared to a local knowledge graph within the graph DBMS instance. As a general topic, distributed query processing (which includes federated queries) has been the subject of many PhD theses. As it is an extremely complex topic, we will only consider some basic aspects in the context of a virtual knowledge graph.

Point Queries

The most performant federated query from a virtual graph is a "point query". In this context, we define a "point query" as a query in which the graph database performs the bulk of the query processing and only retrieves a small set of records from the source systems in order to provide return values or final filtering of results. A classic example of this is Stardog's touted "Trillion Edge Knowledge Graph" demonstration, reported by McKnight Consulting Group in June 2021. Consider the following depiction of the data distribution in that report:

Data stores and models for Stardog's Trillion Edge Knowledge Graph

Some assumptions we could make about the above implementation are that the "Product" entity is common to all three data sources and uses a common key attribute.

The 12 queries used for the demonstration were summarized in a table in the report as:

Summary of the "12" queries in Stardog's demonstration of Trillion Edge Knowledge Graph

For the purpose of discussing virtual graphs and data federation, we will ignore the first 4 queries as they do not involve a distributed query – and queries 5 & 6 were not identified. So, this leaves only the last 5 queries. Even then, queries 8-11 could be pushed entirely down to the remote DBMS. Ignoring that, in each case a single Product, Review or Offer was the focus of the query. As a result, the Stardog engine would merely need to perform a "cursor" type query, in which the initial query processing (if any) is done in Stardog and then, for each result row, a query is performed against the federated source with an implicit join. This type of query works extremely well when:

  • The results from the local store are small.
  • For each of the results from the local store, the federated database does not return very many rows to be joined.

A classic example is query 10 – the initial product would be fetched in Stardog, and then the product identifier and SQL join would be submitted to Redshift along with the portions of the WHERE clause needed to filter intermediate results. Overall, one would not expect more than a few rows to be returned.
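
As a hedged, purely conceptual sketch of the cursor pattern (hypothetical labels and properties – not the actual queries from the report), the virtual layer would resolve the first match locally and satisfy the second against the federated source:

// Local step: resolve a single product node in the graph store
MATCH (p:Product {productId: $productId})
// Remote-backed step: for that one key, the engine fetches the matching
// offer rows from the federated source and joins them implicitly
MATCH (p)<-[:OFFER_FOR]-(o:Offer)
WHERE o.validTo >= date()
RETURN p.productId AS productId, o.price AS price, o.validTo AS validTo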

Graph Analytics in a Relational Store

In the above examples, the queries would be extremely simplistic. At most, the remote store would only be performing 1 or 2 join operations. One of the reasons projects consider graph queries is when the full query spans a deep traversal – often tens of different node types/relationships. The more joins and relationships that have to be traversed, the less performant the remote store is likely to be at executing the query vs. a native graph implementation. The problem is made worse by the fact that some of the relationships between the entities might only exist in the graph store. As a result, the deep traversal will be executed in piecemeal fashion, with large intermediate results from the partial traversals needing to be processed in the graph store.

This aspect is one that quite often pushes for the "Metadata + Key Attributes + Common Search Fields" implementation, in which the graph store has a copy of all the key entities spanning all the data sources, reducing any federated query to a point query. The next two sections detail some of the complexities when the federated query is more than a simple point query.

Relocatable Joins/Traversals

The question is – what happens if the results aren't small? The solution is a technique that SQL remote table access implementations termed "relocatable joins". The best way to describe this is to consider the following question:

  • Which features do products of a specific product type have that have a 90th percentile review of 4 stars or higher in the past year, compared to products of a similar type with a 90th percentile review of 3 stars or lower during the same time period?

The key to this is recognizing that the bulk of the data processing will be the aggregation in the remote data source and that the local data will be a small portion. As a result, RDBMSs typically implemented this via a relocatable join, with processing similar to the following steps (the local portion is sketched after the list):

  • Filtering the products locally based on the specific product type.
  • Joining with features to get a product-feature result set.
  • Transferring the product-feature records to the remote data source into a temp table
  • Performing the join and aggregation in the remote data source
  • Returning the final result from the remote data source via a passthrough in the local source
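
A minimal sketch of the local portion (the first three steps), assuming hypothetical labels and properties – the rows returned here are what would be bulk-loaded into the remote temp table so the remote store can perform the join and aggregation:

// Local: filter products by type and expand to features – this small
// product-feature result set is what gets relocated to the remote store
MATCH (t:ProductType {name: $productType})<-[:OF_TYPE]-(p:Product)-[:HAS_FEATURE]->(f:Feature)
RETURN p.productId AS productId, f.featureId AS featureId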

There are two key requirements to this query performing well:

  1. The local graph DBMS understands the data distribution between the data sources – not only in terms of total volume, but specifically based on query predicates.
  2. The local graph DBMS can rewrite the query to leverage the relocatable join in a fashion that is optimal for the remote RDBMS.

The first one requires a fair amount of knowledge in the metadata catalog. It would need to retain not only the schema from the remote DBMS, but also which properties are indexed and the index statistics heuristics for those values. Consider the following hypothetical query planning:

  1. Consider ProductType – estimate a single row.
  2. Consider traversal/join to Product – estimate ~100 products based on average 1:100 ratio.
  3. Consider traversal/join to ProductFeatures – estimate 1000 total records due to average 1:10 ratio of Product:ProductFeatures
  4. Consider Reviews in past year – estimate there are 200000 reviews (out of 1 million) based on index heuristics.
  5. Consider traversal/join from Product to Review – average of 2500 reviews per product for 100 products (estimated) is 250000 records.
  6. Consider aggregation for percentile on reviews – 100 records for 100 Products.

The point is that only when the local graph DBMS understands that it will be attempting a distributed join of 1000 records locally to 250000 records remotely will it be able to make the decision to execute via a relocatable join strategy.

Could it do the same query via the cursor method? Absolutely. But it would need to fetch (arguably) 250000 records from the source. Another optimization might be to realize that it could simply fetch the 90th percentile score for each of the products, which might actually be the fastest approach – but that gets into the query rewrite consideration (next topic).

Some of the EII engines attempted to bypass the cursor solution and relocatable join query optimization by caching significant portions of semi-static data in the EII engine itself. If you are familiar with query execution and join logic, the problem with this approach is that it almost always dictated that the cached tables be the outer tables in joins: the lack of indexes on the cached versions led to slow query times, with repetitive in-memory scans that could quickly exhaust CPU using nested loop joins, introduced other problems with sort-merge joins, and frequently devolved into hash join implementations. Hash joins are probably the best solution for this when there are search predicates on both the cached tables and the intermediate results from remote data access – but they require a lot of memory and CPU as well.

Back on the topic at hand, such costing depends on maintaining the index heuristics in the local DBMS for the remote data. This can be an expensive operation using a simple approach. In a previous life, a product I worked with accomplished this by simply sending a sequence of queries such as:

SELECT count(*) FROM <table>

/* repeat for each column in the index keys */
SELECT <column>, count(*) FROM <table> GROUP BY <column>

/* possibly repeat for each subset of leading index keys */
SELECT <columnList>, count(*) FROM <table> GROUP BY <columnList>

to the remote store to obtain data distributions on the index keys individually and as a set, so it could develop heuristics locally. This effort can be reduced if the remote store can export its statistics and the local store can import them in the desired format. This is commonly supported between two DBMS instances from the same vendor, but rarely between different vendors.

Query Rewrites/Decomposition

The full implementation of a federated query often depends on the ability of the query execution engine to decompose the query, or rewrite portions of it, to push down as much of the work as possible to the remote DBMS engines. This can be looked at in terms of 4 different capabilities:

  1. Query fragmentation based on remote schema
  2. Result set merge (including with aggregation)
  3. SQL (or other remote dialect) capabilities in the remote DBMS (e.g. SQL construct support)
  4. Graph query rewrites

At the simplest level (#1), the federated query engine needs to know to send only the query fragment that the remote schema supports. If the full query were pushed down and the remote schema didn't have one of the node labels, it would either result in a syntax error or zero rows returned (which is arguably worse – at least with an error you know there is a problem).

Going a step further (#2), a common and relatively easy example of query decomposition is in sharded/MPP implementations. Simple aggregations have to be rewritten, such as:

  • Sum() -> sum(sum(shard))
  • Count() -> sum(count(shard))
  • Avg() -> sum(sum(shard))/sum(count(shard))

This is easy for high-level aggregates – but what if the query is grouped by columns that do not include the shard key, e.g. grouping by city+state when the data is sharded by date? In this case, the federated query engine will need to get intermediate aggregate results from each of the nodes and perform the final aggregation locally after merging the rows. To do this, it typically creates a hash key on the grouping columns to speed the merge process. This may require a lot of memory to hold the intermediate results – or, in some cases, spilling to disk. It may also require creating temporary indices on the intermediate results to facilitate merge joins or subsequent aggregation.
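
A minimal sketch of that decomposition, assuming hypothetical Sale nodes with amount, city and state properties sharded by date; the shard fragment and the merge step are shown as two separate Cypher queries purely for illustration:

// Fragment pushed to each shard: partial sums grouped by the non-shard-key columns
MATCH (s:Sale)
RETURN s.city AS city, s.state AS state,
       sum(s.amount) AS partSum, count(s) AS partCount

// Final merge in the federating engine over the union of the shard results,
// bound here as a parameter $shardRows (a list of the rows returned above)
UNWIND $shardRows AS row
RETURN row.city AS city, row.state AS state,
       sum(row.partSum) / toFloat(sum(row.partCount)) AS avgAmount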

I mention this because, in the example above (products with 90th percentile…), rather than doing a cursor for each row, the planner could send the percentile aggregate with an IN-clause listing batches of product IDs and then use the hash key lookup to merge with the other intermediate results. This would reduce the network latency – which could be significant depending on the location of the remote sources.
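
As a hedged sketch of the shape of that batched aggregate – written in Cypher with hypothetical labels and properties; against a SQL source such as Redshift the planner would push an equivalent grouped percentile query instead:

// One round trip per batch of product IDs instead of one per product;
// the per-product rows merge back on productId via the hash key
MATCH (p:Product)<-[:REVIEW_OF]-(r:Review)
WHERE p.productId IN $productIdBatch
  AND r.reviewDate >= date() - duration({years: 1})
RETURN p.productId AS productId,
       percentileCont(r.stars, 0.9) AS p90Stars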

Going a step further (#3), the ability to even push down query fragments can be less straightforward than you might think. At my previous employer, there was a set of ~30+ different criteria that governed pushdown of particular SQL constructs to the remote database based on that DBMS's capabilities. For example: whether the remote store supported [AND, OR, NOT] in WHERE clause construction (most do), support for IS NULL, IN-clauses (vs. decomposing into multiple OR constructs), specific aggregate functions, nested subqueries, derived tables, UNION/UNION ALL, DIFFERENCE – and a variety of other SQL operations, case sensitivity, etc.

Virtual graph queries (#4) bring a nuance that RDBMSs often don't need to consider – the notion of variable length patterns or quantified path patterns. Consider the following Cypher for a hypothetical airline reservation system:

//find all the paths from Philadelphia to Honolulu with a maximum of 3 flight legs
MATCH p=(n:Airport WHERE n.code="PHL")-[:HAS_FLIGHT_TO*1..3]->(m:Airport WHERE m.code="HNL")
RETURN p

Now, assume the airports and flight information are in the remote store and it is an RDBMS. This type of query would need to be rewritten as:

-- direct flights
SELECT f.from_code, NULL AS stop1, NULL AS stop2, f.to_code
  FROM Flights f
  WHERE f.from_code = 'PHL'
    AND f.to_code = 'HNL'

-- add 1 hop flights
UNION ALL
SELECT f1.from_code, f1.to_code AS stop1, NULL AS stop2, f2.to_code
  FROM Flights f1
           JOIN Flights f2 ON f1.to_code = f2.from_code
  WHERE f1.from_code = 'PHL'
    AND f2.to_code = 'HNL'

-- add 2 hop flights
UNION ALL
SELECT f1.from_code, f1.to_code AS stop1, f2.to_code AS stop2, f3.to_code
  FROM Flights f1
           JOIN Flights f2 ON f1.to_code = f2.from_code
           JOIN Flights f3 ON f2.to_code = f3.from_code
  WHERE f1.from_code = 'PHL'
    AND f3.to_code = 'HNL'

If the variable length pattern were longer, this pattern would continue. The above could be pushed down to the remote store as a single query. Consider the following expression, however:

//find all the paths from Philadelphia to Honolulu
MATCH p=(n:Airport WHERE n.code="PHL")-[:HAS_FLIGHT_TO*]->(m:Airport WHERE m.code="HNL")
RETURN p

Now the local DBMS would have to iterate, pushing down successively longer rewrites in series until no more results were returned.

Just my $0.02….

Currently, Neo4j doesn't support heterogeneous database federation – and I'm not sure it would be a high-priority feature to implement. Over the past 4 years, I can count on one hand the number of times potential customers have brought it up – and in every case, they were enamored with the idea of a completely virtual graph to eliminate the increased storage costs and the complexity of data movement into yet another data store. The problem I see is that this approach is just repeating history that I have lived through twice now. Maybe in the future, new approaches and technologies will make it more realistic… but for now, not so much. Until then, I think the best approach is to implement the full knowledge graph in the graph store – and just be extra careful in trimming any "decoration" to minimize data not relevant to the graph analytics.
