Introduction to Knowledge Graph
Data data data - data everywhere
A few years ago people started talking about Docker, containers, Kubernetes, the cloud and its flexibility; in the last couple of years the conversation has shifted to data and data-driven decisions. Just as almost all companies want to run containerized workloads, nowadays all companies want to make their decisions based on data.
What are the main challenges when you work with data? How to store and structure it on the side that produces it? How to ingest it into something bigger? Cleansing? Processing? Consumption patterns? Performance? Security?
Where people really struggle is with understanding the data, scaling easily and processing quickly, especially during data transformation. You have tens or hundreds of tables and thousands of tags; all of them somehow relate to each other, and all of them mean something in the context of the business. Moreover, most of them are dependencies for other data sets in the bigger picture, so once the processing of one data set fails, the relevancy of other data sets may drop significantly. You therefore want to be informed - not only about the failure itself, but also about where exactly in the data lineage the failure happened and what it means for your data pipeline and for the value of your data product.
Let's talk about what we can do with that from a technology standpoint.
What is a Knowledge Graph?
As the terminology says, a Knowledge Graph is a semantic network which represents a network of real-world entities - objects, events, situations, or concepts - and illustrates the relationships between them.
What does it really mean?
When you have data stored in tabular format, as we have done for decades, you have hundreds of rows and hundreds of columns across tens of tables. Looking at it and trying to understand how all those data types and values construct the end value for which you create and store all of this gives you a strong headache, and you immediately want to be alone in the universe instead of analyzing what the data mean.
By getting and storing data in a graph, you automatically get data elements that are literally connected to each other, e.g.:
What is the difference between these two representations? Actually, it is not only about the representation; it is about a different type of knowledge graph, or better said, a different knowledge model.
Heart of Knowledge Graph
The heart of a knowledge graph is the knowledge model, which is a collection of interlinked descriptions of objects, events and relationships.
Types of Knowledge Model
Property Graph - consists of the simple concept of node – edge – node, where nodes represent the who or what and edges represent the relationship - the how, the when, or the activity.
Nowadays it is usually encoded in JSON format, which allows scalable manipulation of the data. The most commonly used query languages are Cypher, openCypher, Gremlin and PathQuery.
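To make the node – edge – node idea concrete, here is a minimal sketch of how one connection could be encoded as JSON-style structures, together with an openCypher query over it. All names (Lot, PRODUCED, the properties) are illustrative, not taken from any particular product.

```python
# A single node - edge - node connection of a property graph,
# encoded as plain JSON-style dictionaries (names are illustrative only).
lot_a = {"id": "lot-001", "label": "Lot", "properties": {"material": "steel", "plant": "PL01"}}
lot_b = {"id": "lot-002", "label": "Lot", "properties": {"material": "gear", "plant": "PL01"}}

produced = {
    "id": "e-1",
    "label": "PRODUCED",          # the relationship: the how/when/activity
    "from": "lot-001",            # source node
    "to": "lot-002",              # target node
    "properties": {"timestamp": "2023-05-12T08:30:00Z"},
}

# The same connection expressed as an openCypher query (illustrative):
query = """
MATCH (a:Lot {id: 'lot-001'})-[p:PRODUCED]->(b:Lot)
RETURN a, p, b
"""
```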
RDF* - RDF* is a version of RDF that has been extended over the years. This knowledge model consists of triples - subject node – predicate – object node - where the triple is the basic unit of data within the graph, instead of a simple collection of arbitrary data structures, some of which can point at other structures; this is why RDF* is needed for having a semantic layer. The SPARQL language is used to traverse an RDF* graph. We could also talk about quads here, but for simplicity let's stay with triples.
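A minimal sketch of the triple idea, assuming the Python rdflib library and a made-up http://example.org/ vocabulary:

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")   # hypothetical vocabulary
g = Graph()

# Each statement is one triple: subject node - predicate - object node
g.add((EX.lot42, RDF.type, EX.MaterialLot))
g.add((EX.lot42, EX.producedBy, EX.machine7))
g.add((EX.machine7, RDF.type, EX.Machine))

# Traversal with SPARQL: which machine produced which lot?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?lot ?machine WHERE {
        ?lot a ex:MaterialLot ;
             ex:producedBy ?machine .
    }
""")
for row in results:
    print(row.lot, row.machine)
```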
As you can see above, predicates such as age or reads article are connected together into one element and connected to the subject node and the object node, but unlike in the Property Graph knowledge model, here both elements are separate entities within the model. They can therefore be connected to any other nodes or predicates within the model and thereby create a real semantic network.
Why we need both
In Property Graphs, attribute values are just strings, not linked to any other nodes in the graph. This means a Property Graph is much simpler than RDF* in terms of how thick the data model is.
Values of RDF* properties, on the other hand, can be literals as well as references to other nodes in the graph.
RDF* is much thicker than a Property Graph, but as a benefit it provides interconnection that gives you an overview of all the data in the domain you cover - e.g. history of tasks/events, data lineage, metadata, etc. It can therefore easily be used for further analytics where information about the whole data network is required, since it allows you to create a semantic layer.
A few more comparisons:
Semantic Layer
Above we mentioned the semantic network and the semantic layer.
The semantic layer is an abstract layer which we create in the model by constructing a semantic network there.
What theory says about semantic layer
A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms.
A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.
Something from the real world
The Semantic Layer is expressed by an ontology.
Ontologies are the backbone of the formal semantics of a knowledge graph. In the data model, they are seen as the schema of the graph. The main types of schemas are RDFS and OWL.
What do I get by having a semantic layer?
So what do you get from having a semantic layer present in your data model? You get inferred facts: by knowing the relation of Object A to Object B and the relation of Object B to Object C, you can deduce the relation of Object A to Object C.
Here we can see a simple example of RDFS and the semantic layer present in it:
The rdfs:range of an rdf:Property declares the class or datatype of the object in a triple whose predicate is that property.
rdfs:subClassOf allows declaration of hierarchies of classes.
rdf:type is a property used to state that a resource is an instance of a class.
More about RDF schema: https://en.wikipedia.org/wiki/RDF_Schema
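As a rough illustration of the three constructs above and the inferred facts they enable, here is a sketch using rdflib and a made-up example.org schema; the "inference" is done explicitly with a SPARQL property path rather than a reasoner.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")   # hypothetical schema
g = Graph()

# Semantic layer (the schema): Engineer is a subclass of Person,
# and the object of worksFor must be an Organization (rdfs:range)
g.add((EX.Engineer, RDFS.subClassOf, EX.Person))
g.add((EX.worksFor, RDFS.range, EX.Organization))

# Instance data: anna is declared only as an Engineer
g.add((EX.anna, RDF.type, EX.Engineer))
g.add((EX.anna, EX.worksFor, EX.acme))

# Inferred fact: anna is also a Person, because Engineer rdfs:subClassOf Person.
# Here the subclass chain is followed explicitly with a SPARQL property path.
q = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?who WHERE {
        ?who rdf:type/rdfs:subClassOf* <http://example.org/Person> .
    }
"""
for row in g.query(q):
    print(row.who)   # -> http://example.org/anna
```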
Someone could say that a Property Graph is not a real Knowledge Graph. I do not agree, but the truth is that if you want to build a graph with a semantic layer and a real semantic network in it, you need to use RDF* as the knowledge model, as mentioned above.
Why use a Graph over the tabular format
The graph approach gives you:
How to do it?
To get your data into a Knowledge Graph, you need to perform several steps:
Once you have done this, along with the standard activities around infrastructure configuration, you get data present in a knowledge graph and ready for consumption. Let's describe it a bit more:
We decided to test processing data into and out of a Knowledge Graph constructed as a property graph for lot track & trace (tracking and tracing what happens with material lots during their lifecycles), and to build it with AWS services.
When you ingest data into a graph database, you need to do it separately for nodes and for edges, so you need to have separate lists of nodes and edges. It is a very easy operation, both getting the lists and getting the CSV files constructed. That's what we did in the first step, as sketched below.
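A minimal sketch of that first step, assuming pandas, a hypothetical lot_movements.csv source table and Neptune-style bulk-load CSV headers (~id, ~label, ~from, ~to); the column names are illustrative.

```python
import pandas as pd

# Hypothetical source table: one row per lot movement
lots = pd.read_csv("lot_movements.csv")   # assumed columns: lot_id, parent_lot_id, material, plant

# Nodes: one row per unique lot
nodes = (lots[["lot_id", "material", "plant"]]
         .drop_duplicates("lot_id")
         .rename(columns={"lot_id": "~id"}))
nodes["~label"] = "Lot"
nodes.to_csv("nodes.csv", index=False)

# Edges: parent lot -> child lot relationships
edges = lots.dropna(subset=["parent_lot_id"]).copy()
edges["~id"] = "e-" + edges["parent_lot_id"].astype(str) + "-" + edges["lot_id"].astype(str)
edges["~from"] = edges["parent_lot_id"]
edges["~to"] = edges["lot_id"]
edges["~label"] = "PRODUCED"
edges[["~id", "~from", "~to", "~label"]].to_csv("edges.csv", index=False)
```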
In the second step we ingested the data into the graph database as a property graph, following the logic that gives us the data model the business needs.
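A sketch of what that ingest could look like, assuming Amazon Neptune as the graph database and its bulk loader HTTP endpoint; the endpoint, bucket and role ARN are placeholders, and the exact loader parameters should be checked against the Neptune documentation.

```python
import requests

# Placeholder cluster endpoint - adjust to your environment
NEPTUNE = "https://my-neptune-cluster.cluster-xxxx.eu-central-1.neptune.amazonaws.com:8182"

payload = {
    "source": "s3://my-bucket/lot-trace/",   # folder with nodes.csv and edges.csv
    "format": "csv",                         # property-graph (Gremlin) bulk-load format
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "eu-central-1",
    "failOnError": "TRUE",
}

# Kick off the bulk load; the response contains a loadId that can be polled for status
resp = requests.post(f"{NEPTUNE}/loader", json=payload)
print(resp.json())
```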
In the third step we consumed data from the graph and, as you can see, it was very fast. The data was consumed and visualized in the form of paths - the details of paths vs. trees are explained later.
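A sketch of the kind of path retrieval we mean, using the gremlin_python client against a hypothetical endpoint; the labels and property keys (Lot, PRODUCED, lot_id) are illustrative.

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical Gremlin endpoint (e.g. a Neptune cluster)
conn = DriverRemoteConnection("wss://my-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# All downstream lots produced from lot L-001, returned as paths
paths = (g.V().has("Lot", "lot_id", "L-001")
          .repeat(__.out("PRODUCED")).emit()
          .path().by("lot_id")
          .toList())
for p in paths:
    print(p)

conn.close()
```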
Step 4 was done just to verify how long it takes to get the data back into tabular format - simply put, to confirm that it does not make sense at all (at least from a time perspective).
Let's take a look at a few more numbers
In the picture above we can see similar numbers to the picture before, but here we can also see the total number of records that we processed into the Property Graph, as well as a comparison of data retrieval in the form of Paths vs. Trees.
The total number of records we processed was 46.2 million, and we processed it in approximately 6 minutes as you can see there - 1 minute for generating the CSV files and 5 minutes for generating our data model (2 of which were for the nodes). That means that in 6 minutes we took data from a state in which it is not valuable for the business into a state in which the business can understand the lifecycle and dependencies of all processed lots.
Above we mentioned Paths and Trees. As you can see in this picture, retrieving data in the form of Paths is much faster than retrieving it in the form of Trees. The difference is that with a Path you get a separate JSON for each connection between nodes, while with a Tree you get everything described in one JSON. So, for some very specific cases where you need easy manipulation of the outputs represented in and manipulated via these JSONs, it might be better to go with Trees, but in general Path is better and also gives you more flexibility.
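To illustrate the difference in output shape (simplified, illustrative structures only):

```python
# Paths: one small JSON document per traversed connection
paths_result = [
    {"path": ["L-001", "L-002"]},
    {"path": ["L-001", "L-002", "L-003"]},
]

# Tree: the whole traversal nested into a single JSON document
tree_result = {
    "L-001": {
        "L-002": {
            "L-003": {}
        }
    }
}
```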
This was a PoC, so a production pipeline would look a little different in terms of what the source of the data is and what triggers each step, but the point stays the same: getting data into a data model constructed as a Property Graph is much faster than constructing a data model from tables, and the same holds for data retrieval.
Which use cases are a good fit for a Knowledge Graph?
Manufacturing
Supply Chain
Supply Chain Optimization
Logistics
BoM management
Financial industry or generation of 360° views of customers and products
Data Catalog and Data Lineage
Data Catalog
Data Lineage
With the last two we have come back to the point mentioned at the beginning - understanding of data. Understanding of data is not only about how one data entry relates to another in the data model, but also about understanding the data itself: what these data mean in my business domain and for my business (data catalog), and what effect a failure in part of the pipeline has on the business or data product (data lineage).
Data Lineage
What do we get by storing Data Lineage in the form of a Knowledge Graph?
Lineage data is highly connected data with a huge number of relationships - the complexity of storing and scaling it is smaller when it is stored as a Knowledge Graph.
Traversing this data is much easier in a graph than in tabular format, where multiple tables need to be joined and then processed into the final object for visualization - both complexity and speed of delivery are better with a Knowledge Graph.
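A sketch of such a traversal as an openCypher query, assuming a hypothetical lineage model with Dataset nodes and FEEDS relationships, run against an openCypher-compatible HTTP endpoint (such as Neptune's /openCypher API):

```python
import requests

# Which data products are affected when stock_movements_raw fails?
impact_query = """
MATCH (failed:Dataset {name: 'stock_movements_raw'})-[:FEEDS*1..]->(downstream:Dataset)
RETURN DISTINCT downstream.name AS affected
"""

resp = requests.post(
    "https://my-neptune-endpoint:8182/openCypher",   # placeholder endpoint
    data={"query": impact_query},
)
print(resp.json())
```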
Data Catalog
I think the picture above is self-explanatory. It is a very simple extract of what a Data Catalog can look like. Of course, it is much simplified compared to what can be part of a Data Catalog and what can be beneficial for the business or for other IT departments that need to work with the data. For instance, it can be enhanced with information about stock_movements in terms of its representation in the business context, where Data Asset X is used, and so on.
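A tiny sketch of how such catalog information could be expressed as triples, with a made-up catalog vocabulary around the stock_movements example:

```python
from rdflib import Graph, Namespace, Literal, RDFS

CAT = Namespace("http://example.org/catalog/")   # hypothetical catalog vocabulary
g = Graph()

# The physical asset, its business meaning and where it is used
g.add((CAT.stock_movements, RDFS.label, Literal("Stock Movements")))
g.add((CAT.stock_movements, CAT.representsBusinessConcept, CAT.InventoryTransfer))
g.add((CAT.stock_movements, CAT.usedBy, CAT.supply_chain_dashboard))

# Where is this data asset used?
for _, _, consumer in g.triples((CAT.stock_movements, CAT.usedBy, None)):
    print(consumer)
```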
A Data Catalog is about "one to many" relationships, where every addition of a new element increases the complexity of expressing the relationships exponentially.
By using a Knowledge Graph we avoid:
Final Thoughts
I hope the text above has given you an idea of what a Knowledge Graph is about, as well as why, where and how it can be beneficial to use one.
Forms of data retrieval and visualization each deserve an article of their own, and they were not intended to be described in detail in this one.
Next time we will take a closer look at the differences between RDFS and OWL schemas, and later on we will say something about how all of it can be used in a modern enterprise data architecture.