A Dictionary of Graph Terms

Kurt Cagle

Editor In Chief @ The Cagle Report | Ontologist | Author | Iconoclast

Published: December 15, 2019

In the process of working through some documentation for my new company, I put together a general list of terms and phrases used frequently in the Semantics and Property Graph space. It is far from comprehensive, but if you've ever read my or other writers' articles, you may have run across terms that were unfamiliar. I'm publishing this here, with the intent of periodically updating it as time and resources permit.

Node. The term node derives from the Latin word for knot, as might be applicable in a net. Mathematicians liken nodes to points, and as such, it's useful to think of them as points of interest in what is typically called graph theory. From a graph standpoint, nodes are usually considered entities, states or typed values. In RDF, nodes are typically associated with entities or concepts and are called subjects or objects.

Node name. A node can have a unique name, but it can also have a type name. In XML or JSON, the node identifier is given as a generic identifier, while the type is the node name. In RDF, the node has an identifier, but also typically has a type relationship to another node that identifies classes.

Edge. An edge is the line that connects two or more nodes. Typically edges may also have some kind of identifier associated with them. A node can exist without edges, but an edge always has at least two nodes. In RDF, edges are typically associated with a single relationship or attribute and are called predicates. In a property graph, edges usually have a similar association, but

Graph. A graph consists of a mesh of nodes connected by edges. A molecule is a graph (and RDF graphs and molecules actually have a lot in common). A relational, Codd-based database which consists of tables, rows, and columns is a graph. So is an XML or JSON document. In general, there are very few data structures that cannot be represented as a graph.

Directed Graph. In a directed graph each edge has a preferred traversal direction. In a structured document such as XML or JSON, the direction of the graph is from a root to each leaf (a node that has an inbound edge but no outbound edges). In an RDF graph, the direction is generally from a subject (the thing being described) to the object (the related entity or attribute).
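As a minimal sketch of the node, edge, and directed graph entries above, the Python below stores a directed graph as a list of (source, label, target) edges and picks out the leaves. The node names and the helper functions `nodes` and `leaves` are hypothetical, chosen only for illustration.

```python
# Hypothetical directed graph: each edge is (source, label, target).
edges = [
    ("book", "hasChapter", "chapter1"),
    ("book", "hasChapter", "chapter2"),
    ("chapter1", "hasTitle", "Introduction"),
]

def nodes(edges):
    """Collect every node that appears at either end of an edge."""
    found = set()
    for source, _label, target in edges:
        found.add(source)
        found.add(target)
    return found

def leaves(edges):
    """Leaves have at least one inbound edge and no outbound edges."""
    sources = {source for source, _label, _target in edges}
    targets = {target for _source, _label, target in edges}
    return targets - sources

print(sorted(nodes(edges)))
print(sorted(leaves(edges)))
```

Here `book` plays the role of the root, and the traversal direction runs from it toward the leaves.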

Cyclic / Acyclic Graph. In a cyclic graph, it is possible to traverse a path that will result in a contained loop, while in an acyclic graph, no such loop traversals are possible. Relational databases are constrained by convention to be acyclic, and (so long as there is no possibility for linkages between documents) document graphs are normally considered acyclic as well. In RDF it is possible to have cycles, but the database generally warns when such cycles occur, as they can have a significant negative impact upon performance. Property graphs are cyclic.

Directed Acyclic Graphs / Directed Cyclic Graphs. A directed acyclic graph (or DAG) is a graph that has both a specific traversal direction and no loops. Trees and hierarchies are the most familiar examples. As a general rule of thumb, most taxonomy systems, as well as XML and JSON structures, are DAGs. RDF and Codd model relational databases may be cyclic, but in general, because they are directed, most query algorithms will never visit the same node twice.
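One common way to tell a cyclic directed graph from an acyclic one is a depth-first search that watches for an edge leading back to a node on the current path. The sketch below assumes the graph is given as a dictionary mapping each node to its outbound neighbors; `has_cycle` and the sample graphs are hypothetical names used only for illustration.

```python
def has_cycle(adjacency):
    """Return True if the directed graph contains a loop (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GRAY
        for neighbor in adjacency.get(node, ()):
            color = state.get(neighbor, WHITE)
            if color == GRAY:  # edge back to a node on the current path
                return True
            if color == WHITE and visit(neighbor):
                return True
        state[node] = BLACK
        return False

    return any(
        state.get(node, WHITE) == WHITE and visit(node) for node in adjacency
    )

tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}  # acyclic, not a tree
loop = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(has_cycle(tree), has_cycle(diamond), has_cycle(loop))
```

The recursion is fine for small examples; a very deep graph would need an explicit stack instead.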

Connected / Disconnected Graphs. In a connected graph, a traversal algorithm should be able to traverse all of the nodes and edges in the system. A disconnected graph, on the other hand, will usually consist of islands of connectedness, with some nodes not traversable if started from certain other nodes. Fully connected graphs are considered to have referential integrity and are usually said to follow the closed world assumption (cf. Open / Closed World Assumption, below). A semantic graph is disconnected, and by extension does not assume referential integrity.
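The "islands of connectedness" can be found by grouping nodes that are reachable from one another. The sketch below ignores edge direction and assumes edges are given as (source, target) pairs; `components` and the sample data are hypothetical.

```python
from collections import defaultdict

def components(edges):
    """Group nodes into islands, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen, islands = set(), []
    for start in neighbors:
        if start in seen:
            continue
        island, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            stack.extend(neighbors[node] - island)
        seen |= island
        islands.append(island)
    return islands

sample = [("a", "b"), ("b", "c"), ("x", "y")]
print(components(sample))
```

A graph is connected in this sense when the function returns exactly one island.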

Closed / Open World Assumption. In the closed world assumption, all graphs are connected. This implies referential integrity - every foreign key has a corresponding primary key in the system and vice versa. Everything within the database is known. In the open-world assumption, on the other hand, graphs are disconnected - you may have a foreign key reference to an entity that does not yet (and may never) exist. You may not know what you don't know. OWA systems have some peculiar traits, but on the flip-side, there are no stored null values in a semantic graph.
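The practical difference shows up when a query asks about something the store has never been told. In the sketch below (a toy fact set and a hypothetical `holds` function, not any particular database's API), the closed-world answer is "false" while the open-world answer is "unknown", represented here as `None`.

```python
facts = {("Jane", "worksFor", "Acme")}

def holds(statement, facts, closed_world):
    """Answer True, False, or None (unknown) for a statement."""
    if statement in facts:
        return True
    # Closed world: anything not recorded is treated as false.
    # Open world: anything not recorded is simply unknown.
    return False if closed_world else None

question = ("Jane", "worksFor", "Initech")
print(holds(question, facts, closed_world=True))
print(holds(question, facts, closed_world=False))
```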

Resource Description Framework (RDF). Initiated in 1999, and first publicized in an article in Scientific American by Tim Berners-Lee (TBL) and cowriters in 2001, the Resource Description Framework is an abstract language used to describe assertion vectors: node-edge-node constructs that mimic a formal logical system or FLS (as first laid out by Whitehead and Russell in the Principia Mathematica, 1910-13). The original RDF specification was small, though it was extended with a first-pass schema language (RDFS) produced in 1999. RDF by itself is not a language but rather a metalanguage for describing how to build assertions, with its physical serializations now running in the neighborhood of twenty different representations, including several in JSON.

Semantics. The term semantics generally applies to graph technologies that make use of the RDF stack. It comes from the Greek word for meaning and was first used by Tim Berners-Lee to describe the Semantic Web.

OWL. The Web Ontology Language is a formal logical system (FLS) built using the RDF framework. It provides a set of constraint entities and models that allow for deep inferencing, though at the expense of a fair amount of complexity. The language was upgraded to OWL 2 in 2009 (with a second edition in 2012), and OWL 2 is now considered the canonical version.

Ontology. An ontology is, as described by technologist Tom Gruber, a specification of a conceptualization. It is a formal description of the relationships between types of entities and the constraints that define and circumscribe them. An ontology is often referred to as a semantic data model and is used in research (and increasingly in business) to describe the concepts and relationships that are pertinent to those domains. More generally, an ontology can be thought of as one way of establishing the grammar and vocabulary of a domain language.

Federated Data. While the notion has been around for a while, federated data is data that can be cleanly exchanged and queried across multiple distributed data systems. RDF data especially is well suited to this ability to query and update across multiple systems, making use of data from one system to better create and analyze new information in other semantic systems. The same idea can apply to non-semantic data, but the poorer data models and complexity of RDBMS products, in particular, make integration there a bigger issue.

Linked Data. A term coined by Tim Berners-Lee in 2006, building on the ideas of the 2001 Scientific American article, the idea is that semantic data is fundamentally tied to the notion of linking, used so successfully on the web. Ontologies can be linked together, as well, so that you can build up working ontologies from a solid established core rather than building things out from first principles. Linked data applies to both of these - the connections of ontologies and the (soft and dynamic) connections of linked federated data repositories.

Semantic Graph. A semantic graph is one that is built on the foundation of RDF or related technologies (such as GraphQL, which can query a variety of graphs for simple construction and mutation). Most of these use the subject/predicate/object assertion paradigm for building out logical tuples in specially indexed databases known as triple stores or graph stores.

Triple / Tuple. A triple is a sequence of three values - a subject, predicate and object - that define an assertion, where each of these is either an identifier for a specific resource or a literal value. Triples are usually indexed so that if you have multiple identical triples, they only appear in the index once. Many triple stores actually keep track of additional values, such as a graph identifier differentiating a triple in one collection from a triple in another, as well as reification values that identify the triple as an entity in its own right. As such the term tuple has become more common to describe these assertions.
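A rough way to picture this is a set of tuples: adding the same triple twice leaves a single entry, while an extra graph identifier turns the triple into a distinct four-part tuple. The `Quad` type and the sample values below are hypothetical and do not reflect any particular triple store's internal layout.

```python
from typing import NamedTuple, Optional

class Quad(NamedTuple):
    """A triple plus an optional graph identifier."""
    subject: str
    predicate: str
    object: str
    graph: Optional[str] = None

store = set()
store.add(Quad("Jane", "knows", "John"))
store.add(Quad("Jane", "knows", "John"))            # identical, stored once
store.add(Quad("Jane", "knows", "John", "graphA"))  # same triple, other graph

print(len(store))
```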

Triple Store / Tuple Store. These are a specialized kind of database optimized for working with triples/tuples. They usually have an RDF compliant query language (typically SPARQL) and have ways of managing various collections of tuples. These are also frequently referred to as Semantic Stores or RDF Stores to differentiate them from property graph stores.

Turtle / TriG. Short for Terse RDF Triple Language, Turtle is a compact representation of tuples that is frequently used for ingestion into and output from triple stores. SPARQL syntax is a generalization of the Turtle syntax. TriG is a form of Turtle with support for named graphs.

SPARQL. This was one of the first languages developed to query RDF graphs, becoming a W3C Recommendation in 2008 and then being revised as SPARQL 1.1 in 2013. It focuses primarily on a selection and join model not dissimilar to how SQL works and should be thought of as the logical successor of SQL in the graph space. It is likely that it will be revised to a 2.0 version sometime in the early 2020s.
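The selection and join model can be illustrated without a real SPARQL engine. The sketch below is a toy pattern matcher, not SPARQL itself: terms beginning with `?` are treated as variables, and each additional pattern is joined against the bindings found so far. The function `match`, the predicate names, and the data are all hypothetical.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern (a join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)

triples = [
    ("John", "parent", "Ralph"),
    ("Jane", "parent", "Ralph"),
    ("Ralph", "livesIn", "Seattle"),
]

# Roughly: SELECT ?child ?city WHERE { ?child :parent ?p . ?p :livesIn ?city }
query = [("?child", "parent", "?p"), ("?p", "livesIn", "?city")]
results = list(match(query, triples))
for row in results:
    print(row["?child"], row["?city"])
```

The shared variable `?p` is what performs the join, much as a shared key column would in SQL.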

SPARQL Update. A data manipulation language for SPARQL that makes it possible to use rule-based queries to add, modify or delete tuples from the application graph. It is seeing increasing use after its release in 2013 and is beginning to replace proprietary update solutions.

SHACL. The Shapes Constraint Language, or SHACL, is an attempt to introduce a schema model that more closely aligns with the schema models employed by XML and JSON (and that are more familiar to most commercial enterprises), rather than the more complex OWL models that are used most heavily in academia and research. SHACL takes advantage of SPARQL to perform query-oriented validation of documents rather than consistency-based validation, and has been extended to provide a reporting language for when a shape (a pattern of nodes and edges) fails to validate. It is also used increasingly for purposes of creating dynamically generated user interfaces.
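The idea of validating a node against a shape and getting a report back can be sketched in plain Python. This is only loosely modeled on SHACL: it is not SHACL syntax, and the shape here is reduced to a list of required properties. The `validate` function, the report keys, and the data are assumptions made for illustration.

```python
def validate(node, triples, required):
    """Return one report entry per required property the node lacks."""
    present = {
        predicate for subject, predicate, _obj in triples if subject == node
    }
    return [
        {"focusNode": node, "path": path, "message": f"missing {path}"}
        for path in required
        if path not in present
    ]

triples = [
    ("jane", "name", "Jane Doe"),
    ("jane", "birthDate", "1980-04-02"),
    ("john", "name", "John Doe"),
]
person_shape = ["name", "birthDate"]  # hypothetical shape: required properties

print(validate("jane", triples, person_shape))
print(validate("john", triples, person_shape))
```

An empty report means the node conforms to the shape; each entry otherwise names the node and the property that failed.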

Knowledge Graph. A knowledge graph is (usually) a semantic graph that is specifically designed to hold interconnected reference data, publishing and annotational content, and a core low-level operational ontology to facilitate search, navigation, and discovery over data processing. Knowledge Graphs can also function as data catalogs or digital asset management systems. The key distinction with knowledge graphs is that they generally focus on categorizational metadata, rather than numeric data, and they are intended at least somewhat to be human-understandable, searchable and navigable.

Semantic Data Catalog. This is a semantic graph that is intended to identify where the information in a given organization resides, how that information is structured, and what governance that data is under. In general, a knowledge graph concentrates on curatable information, while an SDC focuses on finding data assets.

Semantic Data Hub. This may be a semantic graph or property graph, and serves to provide a way to transform data from one ontology (or data schema) to another. Most such data hubs are not as focused on human use as they are on providing a bridge between various system representations. They usually work in conjunction with a semantic data catalog, with the catalog identifying the resources to work with and the data hub then taking the appropriate transformations from the catalog to convert to a common canonical model and out again, in order to minimize combinatoric explosion. It should be pointed out that any transformation is likely to be lossy. This is an area where machine learning and even more sophisticated AI will likely play a bigger part in the future.

Semantic Exchange. The use of knowledge graphs to facilitate the exchange of information, products, or services, either freely given or for a price. Exchanges are only lately becoming feasible, given performance and design considerations of tuple stores, but generally rely upon the rich interplay of metadata surrounding buyers, sellers, products, and jurisdictions.

Semantic Compliance Systems. Almost the flip-side of exchanges, semantic compliance systems keep track of regulatory requirements in different jurisdictions pertaining to a given domain, and can be used to test whether a given resource is in or out of compliance given that framework (as well as determining what steps are needed to make the resource compliant).

Smart Contracts. Smart contracts bridge the middle ground between exchanges and compliance systems, providing both the relevant legal framework needed to make the contract compliant in a given system and the ability to manage the contract through a set of rules (possibly expressed in SHACL) that initiate actions when the relevant conditions are reached.

Semantic Immutability. The idea that once an entity is created, it exists over a period of time, and changes should take place as the creation of new relationships with new entities rather than the modification of existing ones. The jobs that a person has, for instance, are immutable - they can be superseded (which is a back pointer from a new entity), but the entry itself, once created, cannot be changed. This notion is also known as concurrency, and while it plays a big part in designing long term systems, semantic immutability is key to transactional systems as well.

Distributed Ledger. A distributed ledger is a recording of transactions that have been verified and sealed cryptographically. Semantic systems can be used for building distributed ledgers (so long as a mechanism for surety through immutability and verifiability is present) or can support transactions with other distributed ledgers, such as blockchain, by providing external metadata and managing cryptographic keys.

Reification. Assertions (or tuples) can be thought of as resources in their own right. Some assertions may be qualified (Jane wins the lottery, for instance, may be given a probability of .00001242 %), or may have an authority associated with them (John says that Jane won the lottery). When an assertion is about another assertion, the referenced assertion is said to be reified. The primary distinction between a property graph and a semantic graph is that in a property graph the edge (or relationship) can carry attributes, while in a semantic graph, a reification node has to be set up that points to the individual components of a tuple. A property graph is somewhat more optimized for accessing such relationships (SPARQL can access them, but it's cumbersome). It is likely that with a few relatively simple changes to RDF and/or SPARQL, models can be made that unify the two types of graphs.
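The two styles can be put side by side. In the sketch below, the RDF-style version uses a statement node (`stmt1`) that points at the subject, predicate and object through the standard `rdf:subject`, `rdf:predicate` and `rdf:object` properties, while the property-graph version hangs the same attribute directly on the edge. The `assertedBy` property, the identifiers, and the `to_edge` helper are hypothetical.

```python
# RDF style: a statement node points at the parts of the triple it describes.
reified = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "Jane"),
    ("stmt1", "rdf:predicate", "wins"),
    ("stmt1", "rdf:object", "Lottery"),
    ("stmt1", "assertedBy", "John"),
]

def to_edge(statement_id, triples):
    """Fold a reified statement into a property-graph style edge."""
    parts = {p: o for s, p, o in triples if s == statement_id}
    core = {"rdf:type", "rdf:subject", "rdf:predicate", "rdf:object"}
    return {
        "source": parts["rdf:subject"],
        "label": parts["rdf:predicate"],
        "target": parts["rdf:object"],
        # Everything else becomes an attribute carried by the edge itself.
        "properties": {p: o for p, o in parts.items() if p not in core},
    }

edge = to_edge("stmt1", reified)
print(edge)
```

Five triples collapse into one edge with one attribute, which is the convenience the property graph offers for this kind of data.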

Property Graphs. A property graph is a more generalized form of a graph that incorporates a reification layer upon edges and does not absolutely require directedness. It is possible to model a semantic graph upon a property graph and vice versa, so the primary difference between the two tends to be the degree to which metadata is optimized and which query languages are involved, with property graphs lending themselves better to calculation and machine learning while semantic graphs focus primarily upon language structures.

Gremlin / TinkerPop. TinkerPop is an Apache-based framework for performing graph-based processing, while Gremlin is the language that TinkerPop uses to express the actual traversals. Gremlin is available in BlazeGraph, Amazon Neptune and OrientDB.

GraphQL. A language designed by Facebook for querying document and semantic databases in order to generate JSON output (as well as limited mutation or update capabilities). GraphQL support is on the roadmap for Gracie.

NodeJS. This is a JavaScript runtime, built on Google's V8 engine, that is rapidly replacing older generation PHP as a web application and integration layer. It can be run as a server or application, though it is also used with Angular, Vue, React, Aurelia, etc. to compile web client applications. There are a number of RDF libraries available for NodeJS, and many triple stores also provide some form of node interface.

Jena / Fuseki. Jena was the first open-source graph database, built for use with Apache. While it can be used as a reasonably good basic tuple store, its primary benefit is that it is frequently used as a platform for building proof of concept

Rule / Rule Chain. A rule consists of a specific SPARQL query or SHACL shape in which an inbound or requested outbound RDF record may match, and which, if validated, will generate a report. The report, in turn, will be tied to a specific action or sequence of actions. Rules have associated priorities, with higher priority rules superseding lower priority ones. Rules may absorb test entities, terminating the chain, or may re-emit the test entity for the next lower priority rule in the sequence.
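The priority and absorb/re-emit behavior can be sketched as a simple loop. In the example below each rule is a plain dictionary with a Python predicate standing in for the SPARQL query or SHACL shape; `run_chain`, the rule names, and the record format are hypothetical.

```python
def run_chain(rules, record):
    """Apply rules from highest to lowest priority.

    A matching rule adds its name to the report list. If it absorbs the
    record the chain stops; otherwise the record passes to the next rule.
    """
    reports = []
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if rule["matches"](record):
            reports.append(rule["name"])
            if rule["absorbs"]:
                break
    return reports

rules = [
    {"name": "log", "priority": 1, "matches": lambda r: True, "absorbs": False},
    {"name": "reject-empty", "priority": 10, "matches": lambda r: not r, "absorbs": True},
    {"name": "audit", "priority": 5, "matches": lambda r: True, "absorbs": False},
]

print(run_chain(rules, {}))
print(run_chain(rules, {"id": 1}))
```

In a fuller system each report would then trigger its associated action or sequence of actions.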

Inference Rule. The use of a pattern of assertions to explicitly create new information (or clarify implicit knowledge). For instance, "if two people share the same parents then they are siblings" is an inference rule. If John and Jane both have Ralph and Mary as parents, then, by the above inference rule, John and Jane are siblings.
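The sibling rule can be written out directly. The sketch below reads "share the same parents" as "have identical sets of parents"; the predicate names `hasParent` and `hasSibling` and the function `infer_siblings` are hypothetical.

```python
from itertools import combinations

def infer_siblings(triples):
    """Create hasSibling triples for people with the same set of parents."""
    parents = {}
    for child, predicate, parent in triples:
        if predicate == "hasParent":
            parents.setdefault(child, set()).add(parent)

    inferred = set()
    for a, b in combinations(sorted(parents), 2):
        if parents[a] == parents[b]:
            inferred.add((a, "hasSibling", b))
            inferred.add((b, "hasSibling", a))
    return inferred

family = [
    ("John", "hasParent", "Ralph"),
    ("John", "hasParent", "Mary"),
    ("Jane", "hasParent", "Ralph"),
    ("Jane", "hasParent", "Mary"),
    ("Bob", "hasParent", "Ralph"),
]

for triple in sorted(infer_siblings(family)):
    print(triple)
```

Bob shares only one parent with John and Jane, so under this reading of the rule no sibling triple is created for him.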

Inference Engine. An inference engine is a triple store or similar semantic database in which, once an inference rule is created, the inferred triples are automatically surfaced. Newer semantic graph systems disable inferencing by default in favor of explicit queries and asynchronous invocation of "realized" triples. The former process is frequently referred to as forward chaining, while the latter is more often called backwards chaining.

Inferred Triples. Inferred triples are those triples that derive from an inference rule rather than those that come from a priori assertions from external sources. Often, such inferred triples provide a convenient index for more complex relationships. For instance, a company may have multiple CEOs over the course of its existence, but only one of these will be the CEO at the current time. An inference rule could be written that looks at all employees with the role of CEO, orders these by the reversed time that they began their tenure, then removes the existing singleton "has CEO" triple and creates a new one based on the newer date. This not only keeps the information up to date automatically, but also asserts a simple relationship that represents a much more complex one.
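The CEO example can be sketched as a function that rebuilds the single "has CEO" triple from the full tenure history. The record layout (company, person, ISO start date), the `hasCEO` predicate, and the names are assumptions for illustration; ISO dates are used so that string comparison matches chronological order.

```python
tenures = [
    ("Acme", "Alice", "2001-03-01"),
    ("Acme", "Bob", "2015-06-15"),
    ("Acme", "Carol", "2010-01-10"),
]

def current_ceo_triples(tenures):
    """Derive one hasCEO triple per company from the latest start date."""
    latest = {}
    for company, person, start in tenures:
        if company not in latest or start > latest[company][1]:
            latest[company] = (person, start)
    return {
        (company, "hasCEO", person)
        for company, (person, _start) in latest.items()
    }

print(current_ceo_triples(tenures))
```

Rerunning the function after a new tenure record arrives replaces the old inferred triple with the new one.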

Assertion. In semantic terms, an assertion is a statement (or set of statements) that an authority is treating as being true for purposes of discussion. Note that such assertions may not in fact describe empirical reality. An assertion would be a set of statements such as "A unicorn is a horse with a single horn emerging from its forehead. Binky is a horse with a single horn emerging from its forehead, therefore, Binky is a unicorn." If you included the expression "(hypothetical)" before unicorn and Binky, the assertion would be true regardless of the fact that unicorns don't exist in reality (arguably - which is the distinction between a rationalist and an empiricist).

Kurt Cagle is the CTO of the Semantic Data Group, and writes frequently on LinkedIn and elsewhere on semantics, graphs, and metadata management issues (among other things). He can be reached at [email protected].

Anirban Bhattacharjee

Knowledge Graph enthusiast with a specific interest in capturing semantics.

5 years ago

Exceptional post!

Fabian Pascal

Editor & Publisher DATABASE DEBUNKINGS, Data and Relational Fundamentalist, Consultant, Analyst, Author, Educator, Speaker

5 years ago

Semantics and graphs are at two levels of representation. Here's some important data fundamentals little known and understood in the industry: https://dbdebunk.blogspot.com/p/terminology.html

Anthony Bufort

Senior Software Engineer @ AJB Consulting, my XML-tech consultancy. Acrylic Artist @ Bufortistry Studio Arts.

5 years ago

Thanks very much for this, Kurt. Very useful.

Irene Polikoff

5 years ago

For edges, I would say something like: In RDF, edges do not uniquely identify a node-edge-node statement. Both the 'John-parent-Ralph' and 'Jane-parent-Ralph' statements share the same edge. Edges in RDF are called predicates. In a property graph, each edge is unique. This means that the 'parent' edge connecting John with Ralph and the 'parent' edge connecting Jane with Ralph will each have their own unique ID. They will both have the same name "parent" to help identify that they represent the same relationship. I would also add for nodes: In RDF, typed values (literals) are nodes that can appear only at the end of a relationship, as objects. In a property graph, typed values are not considered to be nodes and a link between a node and a typed value is not considered to be an edge. Typed values are stored as property-value pairs associated with nodes or edges.

Michael Pool

Semantic Technology Leader

5 years ago

I have stopped referring to ontologies as semantic data models because one's ontology may function for other reasons than pure data modeling, e.g., one may use it for disambiguation in NLP, framework for a reasoning system.
