Introduction to Knowledge Graph

Data data data - data everywhere

Just as a few years ago people started to talk about Docker, containers, Kubernetes, the cloud and its flexibility, in the last couple of years people have started to talk about data and data-driven decisions. So, just as almost all companies want to run containerized workloads, nowadays all companies want to make their decisions based on data.

What are the main challenges when you work with data? How do you store and structure it on the side that produces the data? How do you ingest it into something bigger? Cleansing? Processing? Consumption patterns? Performance? Security?

Where people really struggle is understanding the data, scaling easily and processing quickly, especially during data transformation. You have tens or hundreds of tables and thousands of tags; all of them somehow relate to each other, and all of them mean something in the context of the business. Moreover, most of them are dependencies for other data sets in the bigger picture, so once the processing of one data set fails, the relevance of another data set may drop significantly. You therefore want to be informed, not only about the failure itself, but also about where exactly in the data lineage the failure happened and what it means for your data pipeline and for the value of your data product.

Let's talk about what we can do about that from a technology standpoint.

What is a Knowledge Graph

By definition, a Knowledge Graph is a semantic network which represents a network of real-world entities – objects, events, situations, or concepts – and illustrates the relationships between them.

What does it really mean?

When data is stored in tabular format, as we have done for decades, you have hundreds of rows and hundreds of columns across tens of tables. By looking at it and trying to understand how all those data types and values construct the end value for which you create and store all of this, you get a strong headache and immediately want to be alone in the universe instead of analyzing what the data means.

By getting and storing data in a graph, you automatically get data elements that are literally connected to each other, e.g.:

[Image: example data represented as a property graph]

[Image: the same data represented as an RDF graph]

What is the difference between these two representations? Actually, it is not only about the representation, it is about a different type of knowledge graph, or better said, a different knowledge model.

Heart of Knowledge Graph

The heart of a knowledge graph is the knowledge model, which is a collection of interlinked descriptions of objects, events and relationships.

Types of Knowledge Model

Property Graph – built on the simple concept of node – edge – node, where nodes represent the who or what, and edges represent the relationship – the how, when, or activity.

[Image: property graph node – edge – node example]

Nowadays it is typically encoded in JSON format, which allows scalable manipulation of the data. The most widely used query languages are Cypher, openCypher, Gremlin and PathQuery.
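To make that concrete, here is a minimal sketch in Python of how such a JSON encoding of nodes and edges could look; the labels and property names (Person, Article, READS, ...) are invented for the example:

```python
import json

# Minimal sketch of a property graph encoded as JSON-like structures.
# Labels and property names (Person, Article, READS, ...) are illustrative only.
nodes = [
    {"id": "n1", "label": "Person", "properties": {"name": "Alice", "age": 34}},
    {"id": "n2", "label": "Article", "properties": {"title": "Introduction to Knowledge Graph"}},
]

edges = [
    # An edge connects two nodes and can carry its own properties.
    {"id": "e1", "label": "READS", "from": "n1", "to": "n2",
     "properties": {"read_on": "2023-01-15"}},
]

print(json.dumps({"nodes": nodes, "edges": edges}, indent=2))
```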


RDF* – RDF* is a version of RDF that has been extended over the years. This knowledge model consists of triples – subject node – predicate – object node – where the triple is the basic unit of data within the graph, instead of a simple collection of arbitrary data structures, some of which can point at other structures; this is why RDF is needed for having a semantic layer. For traversing an RDF graph, the SPARQL language is used. We could also talk about quads here, but for simplicity let's stay with triples.

[Image: RDF* example – subject node, predicate, object node]

As you can see above, the predicates in the "age of" data and "reads article" are connected together into one element and are connected to the subject node and the object node. But unlike in the Property Graph knowledge model, here both elements are separate entities within the model, so they can be connected to any other nodes or predicates within the model and thereby create a real semantic network.
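As a small illustration of building and traversing triples, here is a minimal sketch using the Python rdflib library; the example.org namespace and the predicate names are made up for the example:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")   # hypothetical namespace for the example

g = Graph()
alice = EX["Alice"]
article = EX["IntroToKnowledgeGraph"]

# Each statement is a subject – predicate – object triple.
g.add((alice, RDF.type, EX.Person))
g.add((alice, EX.age, Literal(34)))
g.add((alice, EX.reads, article))
g.add((article, RDF.type, EX.Article))

# SPARQL traverses the RDF graph.
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?article
WHERE {
    ?person a ex:Person ;
            ex:reads ?article .
}
"""
for person, art in g.query(query):
    print(person, "reads", art)
```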

Why we need both

In Property Graphs, attribute values are just strings, not linked to any other nodes in the graph. This means a Property Graph is much simpler than RDF* in terms of the thickness of the data model, therefore it:

  • can be used for fast querying and data traversal, especially in applications where read efficiency is important
  • works well where deep queries that involve sub-graphs are frequent (e.g. lot genealogy, family genealogy, etc.) – see the sketch below
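For instance, a deep lot-genealogy traversal on a property graph can be expressed as a single variable-length query. The sketch below uses the Python neo4j driver with a hypothetical Lot label and CONSUMED_FROM relationship; the connection details are placeholders:

```python
from neo4j import GraphDatabase

# Placeholders – replace with your own connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical model: (:Lot)-[:CONSUMED_FROM]->(:Lot) links a lot to the lots it was produced from.
GENEALOGY_QUERY = """
MATCH path = (l:Lot {lot_id: $lot_id})-[:CONSUMED_FROM*1..10]->(ancestor:Lot)
RETURN path
"""

with driver.session() as session:
    result = session.run(GENEALOGY_QUERY, lot_id="LOT-001")
    for record in result:
        print(record["path"])

driver.close()
```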


Values of RDF* properties can be:

  • literal values (RDF literals)
  • nodes connected to other nodes in the graph – the attribute value of one node can be linked to other nodes, providing additional context for understanding the attribute; treating edge properties as just another node helps contextualize the data

RDF* is much thicker than a Property Graph, but as a benefit it provides interconnection which gives you an overview of all the data in the domain you cover, e.g. history of tasks/events, data lineage, metadata, etc. It can therefore easily be used for further analytics where information about the whole data network is required, since it allows you to create a semantic layer.


A few more comparisons between Property Graph and RDF*:

[Image: Property Graph vs. RDF* comparison]

[Image: Property Graph vs. RDF* comparison, continued]

Semantic Layer

Above we mentioned semantic network and semantic layer.

The semantic layer is an abstract layer which we create in the model by constructing a semantic network there.

What theory says about semantic layer

A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms.

A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.

Something from the real world

The semantic layer is expressed by an ontology.

Ontologies are the backbone of the formal semantics of a knowledge graph. In the data model they are seen as the schema of the graph. The main types of schemas are RDFS and OWL.

What do I get by having a semantic layer?

So what do you get from having a semantic layer present in your data model? You get inferred facts: by knowing the relation of Object A to Object B and the relation of Object B to Object C, you can deduce the relation of Object A to Object C.

Here we can see a simple example of RDFS and the semantic layer present in it:

[Image: simple RDFS example]

The rdfs:range of an rdf:Property declares the class or datatype of the object in a triple whose predicate is that property.

rdfs:subClassOf allows the declaration of hierarchies of classes.

rdf:type is a property used to state that a resource is an instance of a class.
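To see the inference in action, here is a minimal sketch in Python using rdflib together with the owlrl reasoner; the class names are invented for the example:

```python
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")   # hypothetical namespace

g = Graph()
# Schema (semantic layer): every RawMaterialLot is a Lot.
g.add((EX.RawMaterialLot, RDFS.subClassOf, EX.Lot))
# Instance data: LOT_001 is a RawMaterialLot.
g.add((EX.LOT_001, RDF.type, EX.RawMaterialLot))

# Apply RDFS semantics – the reasoner materializes the inferred triples.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

# Inferred fact: LOT_001 is also a Lot, even though this was never stated explicitly.
print((EX.LOT_001, RDF.type, EX.Lot) in g)   # True
```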

More about RDF schema: https://en.wikipedia.org/wiki/RDF_Schema

Someone could say that a Property Graph is not a real Knowledge Graph. I do not agree, but it is true that if you want to build a graph with a semantic layer and a real semantic network in it, you need to use RDF as the knowledge model, as mentioned above.

Why use a graph over the tabular format

Graph approach gives you:


  • Better performance – much faster, even for large datasets
  • Easier scalability – no need to understand all the relations between the tables when the data content needs to be amended
  • Semantic search based on inferred facts (RDF*) – an additional layer for understanding data relationships in a transferable way
  • Relationships between data entities that are understandable to both humans and AI
  • Relationships that can carry attributes as well

How to do it?

To get your data into a Knowledge Graph, you need to perform several steps:

  • Have data available for ingestion
  • Decide which knowledge model suits your business needs
  • Decide on the infrastructure components
  • Get the data into a format which is suitable for the graph
  • Design the model and create the code for the data transformation

Once you have done this, along with the standard activities around infrastructure configuration, you get data present in a knowledge graph, ready for consumption. Let's describe it in a bit more detail:

We decided to test processing into and out of a Knowledge Graph constructed as a property graph for lot track & trace (tracking and tracing what happens with material lots during their lifecycles), and to build it with AWS services.

[Image: PoC pipeline steps with processing times]

When you ingest data into a graph database, you need to do it separately for nodes and for edges, therefore you need separate lists of nodes and edges. This is a very easy operation, both getting the lists and constructing the CSV files. That's what we did in the first step.
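A minimal sketch of that first step in Python could look like the following. The source records and column names are invented for the example, and the CSV headers follow the Amazon Neptune bulk-loader convention, so adjust them to whatever loader your graph database uses:

```python
import csv

# Hypothetical source records: which child lot was produced from which parent lot.
source_rows = [
    {"child_lot": "LOT-002", "parent_lot": "LOT-001", "process": "blending"},
    {"child_lot": "LOT-003", "parent_lot": "LOT-002", "process": "packaging"},
]

# Nodes and edges are loaded separately, so build the two lists up front.
node_ids = {r["child_lot"] for r in source_rows} | {r["parent_lot"] for r in source_rows}

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label"])  # loader-specific header convention
    for lot_id in sorted(node_ids):
        writer.writerow([lot_id, "Lot"])

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label", "process:String"])
    for i, row in enumerate(source_rows):
        writer.writerow([f"e{i}", row["child_lot"], row["parent_lot"], "CONSUMED_FROM", row["process"]])
```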

In the second step we ingested the data into the graph database as a property graph, following the logic that gives us the data model the business needs.

In the third step we consumed data from the graph and, as you can see, it was very fast. The data was consumed and visualized in the form of paths – we will explain paths vs. trees later.

Step 4 was done just to verify how long it takes to get the data back into tabular format – simply said, to confirm that it does not make sense at all (at least from a time perspective).


Let's take a look at a few more numbers

[Image: processing statistics – total records processed, load times, Path vs. Tree retrieval times]

In the picture above we can see similar numbers to the picture before, but here we can also see the total number of records that we processed into the Property Graph, as well as a comparison of data retrieval in the form of Paths vs. Trees.

The total number of records we processed was 46.2 million, and we processed it in approximately 6 minutes as you can see there – 1 minute for generating the CSV files and 5 minutes for generating our data model (2 of which were for nodes). It means that in 6 minutes we took the data from a state in which it is not valuable for the business into a state in which the business can understand the lifecycle and dependencies of all processed lots.

Above we mentioned Paths and Trees. As you can see in the picture, retrieving data in the form of Paths is much faster than retrieving it in the form of Trees. The difference is that if you get the result as Paths, you get a separate JSON for each "connection" between nodes; on the other hand, with Trees you get everything described in one JSON. So for some very specific cases where you need easy manipulation of outputs represented in these JSONs, it might be better to go with Trees, but generally Paths are better and also give you more flexibility.
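To illustrate the difference in output shape (purely conceptual, not tied to any particular database's response format), compare the two structures in this sketch:

```python
import json

# Path-style result: one small JSON object per traversed "connection".
paths = [
    {"nodes": ["LOT-003", "LOT-002"], "edge": "CONSUMED_FROM"},
    {"nodes": ["LOT-002", "LOT-001"], "edge": "CONSUMED_FROM"},
]

# Tree-style result: the whole traversal nested into a single JSON document.
tree = {
    "LOT-003": {
        "CONSUMED_FROM": {
            "LOT-002": {
                "CONSUMED_FROM": {"LOT-001": {}}
            }
        }
    }
}

print(json.dumps(paths, indent=2))
print(json.dumps(tree, indent=2))
```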

This was a PoC, so in a production workload the pipeline would look a little different in terms of what the data source is and what the triggers for each step are, but the point stays the same – getting data into a data model constructed as a Property Graph is much faster than getting it into a data model constructed from tables, and the same holds for data retrieval.

Which use cases are a good fit for a Knowledge Graph?

Manufacturing

  • tracking the lifecycle of a material and its dependencies on other materials


Supply Chain

Supply Chain Optimization

  • Route Optimization – shortest and most cost-effective route between two points
  • Risk Identification and diversity planning – map and assess the value of all major products

Logistics

BoM management


Financial industry, or the generation of 360° views of customers and products

  • combining unstructured data like reviews, articles and legislation with structured data, and connecting data silos into a unified data model through harmonized metadata
  • can be enhanced by NLP to detect key topics and sentiment and to extract additional context; hidden relationships can be discovered to drive new insights for the organization (e.g. for fraud detection)


Data Catalog and Data Lineage

Data Catalog

  • Information about datasets and data sources; may include data samples, links and APIs for access and download, and roles, but mostly contains metadata describing each data asset

Data Lineage

  • Where does data come from?
  • Where does data go?
  • What happens to data along the way?


With the last two we have clearly come back to the points mentioned at the beginning – understanding the data. Understanding the data is not only about the relation of one data entry to another in the data model, but also about the data itself: what this data means in my business domain, for my business (data catalog), and what effect it has on the business/data product if part of the pipeline fails (data lineage).


Data Lineage

[Image: data lineage represented as a graph]

What do we get by storing Data Lineage in the form of a Knowledge Graph?

Lineage data is highly connected data with a huge number of relationships – the complexity of storing and scaling it is smaller when it is stored as a Knowledge Graph.

Traversing this data is much easier in a graph than in tabular format, where multiple tables need to be joined and then processed into a final object for visualization – both complexity and speed of delivery are better with a Knowledge Graph.
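As a sketch of how such a traversal can look in RDF terms (Python with rdflib; the prov:wasDerivedFrom predicate comes from the W3C PROV vocabulary, the dataset names are invented), a full upstream lineage walk becomes a single property-path query instead of a chain of joins:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/data/")          # hypothetical dataset namespace
PROV = Namespace("http://www.w3.org/ns/prov#")      # W3C PROV vocabulary

g = Graph()
# Hypothetical lineage: monthly_report <- aggregated_orders <- raw_orders
g.add((EX.aggregated_orders, PROV.wasDerivedFrom, EX.raw_orders))
g.add((EX.monthly_report, PROV.wasDerivedFrom, EX.aggregated_orders))

# One property-path query walks the whole upstream lineage of the report.
query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?upstream
WHERE {
    <http://example.org/data/monthly_report> prov:wasDerivedFrom+ ?upstream .
}
"""
for (upstream,) in g.query(query):
    print(upstream)   # prints aggregated_orders and raw_orders
```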


Data Catalog

[Image: simple Data Catalog example represented as a graph]

I think the picture above is self-explanatory. It is a very simple extract of what a Data Catalog can look like. Of course, it is much simplified compared to what can be part of a Data Catalog and what can be beneficial for the business or for other IT departments that need to work with data. For instance, it can be enhanced with information about stock_movements in terms of its representation in the business context, where Data Asset X is used, etc.

A Data Catalog is about "one to many" relationships, where every addition of a new element increases the complexity of expressing the relationships exponentially.

With a Knowledge Graph we avoid:

  • the unnecessary need for additional joins and tables
  • the creation of even small silos caused by the impossibility of adding new elements to an existing data model handled in tabular format

Final Thoughts

I hope the text above has given you an idea of what a Knowledge Graph is about, as well as why, where and how it can be beneficial to use one.

The forms of data retrieval and visualization each deserve their own article and were not intended to be described in detail in this one.

Next time we will take a closer look at the differences between RDFS and OWL schemas, and later on we will say something about how all of this can be used in a modern enterprise data architecture.
