Network Graph Visualizations with DOT
I've fallen a bit behind on my newsletters of late, as I've taken over the helm of Data Science Central and have been focusing most of my efforts there. The following is an article that I've had in mind to write for a while, the first in what I hope to be a series on graph visualization techniques. Enjoy!
Network graphs play a large part in both computing and data science, and they are essential for working with (and visualizing) both semantic graphs and property graphs. Nearly thirty years ago, AT&T produced a set of libraries called graphviz which were designed to generate various types of printed output. Over the years, the library has been adapted for different platforms and roles, and today is still one of the most widely used network graph visualization tools around.
One of the most common tools associated with graphviz is DOT. DOT is a declarative language that lets users specify the components of a network graph, including the nodes (or dots) that typically represent entities, along with the directed edges that specify relationships or attributes. DOT has other uses as well, but the language is powerful enough to create graphs with hundreds or even thousands of nodes, and it serves as the foundation for other visualization libraries when moving beyond that scale.
At its core, DOT usually specifies directed graphs, or graphs in which each edge has a specific direction of traversal. You can experiment with DOT using the GraphViz Online viewer. One of the simplest such graphs consists of two nodes joined by a single arrow, and the DOT file that describes it is about as simple:
digraph G1 {
    JaneDoe;
    Programmer;
    JaneDoe -> Programmer;
}
The digraph keyword indicates that this is a directed graph (one of a number of different kinds of graphs), with G1 being the identifier of the graph. Graphs can be used within other graphs, and as such can also be named. Identifiers can either be alphanumeric sequences (which may include the underscore character) or strings delimited with quotation marks. These identifiers serve as labels if no labels are otherwise given.
Thus, in this example, the graph identifies two nodes (JaneDoe and Programmer) and one edge (JaneDoe -> Programmer). The edge creates a relationship between the two nodes: it is written with the right-pointing -> operator and drawn as an arrow-terminated line from the first node to the second.
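The identifier rules described above lend themselves to a small helper when node names come from data rather than being typed by hand. The following is a minimal Python sketch, not part of Graphviz: the function name dot_id is hypothetical, and it deliberately ignores the numeral and HTML-string forms of identifiers that DOT also accepts.

```python
import re

# Unquoted DOT identifiers: letters, digits and underscores, not starting with a digit.
_PLAIN_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# DOT keywords are case-insensitive, so they need quoting when used as names.
_KEYWORDS = {"node", "edge", "graph", "digraph", "subgraph", "strict"}


def dot_id(name: str) -> str:
    """Return `name` as a DOT identifier, adding quotation marks only when needed."""
    if _PLAIN_ID.fullmatch(name) and name.lower() not in _KEYWORDS:
        return name
    # Inside a quoted DOT string, an embedded double quote is written as \"
    return '"' + name.replace('"', '\\"') + '"'


print(dot_id("JaneDoe"))   # usable as-is
print(dot_id("Jane Doe"))  # contains a space, so it gets quoted
```

Quoting only when necessary keeps generated files close to what you would write by hand, which makes them easier to read and to diff.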
If you have a sequence such as JaneDoe -> Programmer -> Javascript, this creates the nodes JaneDoe, Programmer and Javascript if they don't already exist, with arrows between them:
digraph G2 {
    JaneDoe -> Programmer -> Javascript;
}
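To make the chain shorthand concrete, here is a rough Python sketch that rewrites a chain as the pairwise edge statements it stands for. The helper name expand_chain is hypothetical, and the sketch assumes plain, unquoted identifiers with no attribute lists.

```python
def expand_chain(chain: str) -> list[str]:
    """Split 'A -> B -> C' into the pairwise edge statements it is shorthand for."""
    nodes = [part.strip() for part in chain.split("->")]
    # Pair each node with its successor: (A, B), (B, C), ...
    return [f"{src} -> {dst};" for src, dst in zip(nodes, nodes[1:])]


for statement in expand_chain("JaneDoe -> Programmer -> Javascript"):
    print(statement)
```

Either form should produce the same picture; the chained version is simply less typing.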
For nodes with single-word names, the bare identifiers are probably sufficient, but you can also "decorate" each node and edge using attributes and labels. Suppose, for instance, that you wanted to include a class indicator with each node and a named relationship on each edge. This is where labels come in handy:
digraph G3 {
    JaneDoe [label="<Person>\nJane Doe"];
    Programmer [label="<Profession>\nProgrammer"];
    Javascript [label="<Language>\nJavascript"];
    JaneDoe -> Programmer [label=" has profession"];
    Programmer -> Javascript [label=" has language"];
}
One of the great things about DOT is that you generally do not need to worry about layout, fitting text to different shapes, or managing overlap. In the example for G3, line breaks can be added in with the "\n" newline character, letting you create multiline labels. Similarly, the DOT algorithms will place the edge labels to avoid (as much as possible) overlapping with either shapes or other edges.
The attributes section (indicated by square brackets) can also include other formatting metadata. For instance, suppose that you wanted each class to have its own color, the text to be in Arial (or Helvetica), and the text to be white on dark backgrounds. For this you can use additional DOT attributes that correspond (roughly) to basic CSS properties:
digraph G4 {
    edge [fontcolor=black fontsize="10" fontname=Arial];
    node [style="filled" fontcolor=white fontsize="12" fontname=Arial];
    JaneDoe [label="<Person>\nJane Doe" fillcolor="blue"];
    Programmer [label="<Profession>\nProgrammer" fillcolor="darkGreen"];
    Javascript [label="<Language>\nJavascript" fillcolor="green" fontcolor="black"];
    JaneDoe -> Programmer [label=" has profession"];
    Programmer -> Javascript [label=" has language"];
}
In G4, notice the use of the edge and node statements. These provide a way to create a common description that every edge or node in the graph uses. The edge statement, for example, sets the font color (black), font size (10pt) and font name (Arial). The node statement sets the default text color (white) and also indicates that each shape is filled. Each node entry can then override certain values: the font color for Javascript, for instance, is set to black because the green fill is light enough that white text would be difficult to read.
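One way to think about defaults and overrides is as a dictionary merge in which the per-node settings win. The Python sketch below models that idea for the Javascript node in G4; it illustrates the behavior rather than how Graphviz implements it. The helper names effective_attrs and format_attrs are hypothetical, and values are assumed to contain no quotation marks.

```python
def effective_attrs(defaults: dict, overrides: dict) -> dict:
    """Per-node settings win over graph-wide defaults; everything else is inherited."""
    return {**defaults, **overrides}


def format_attrs(attrs: dict) -> str:
    """Render a dict as a DOT attribute list such as [style="filled" fontsize="12"]."""
    return "[" + " ".join(f'{key}="{value}"' for key, value in attrs.items()) + "]"


# The node defaults from G4, followed by the overrides on the Javascript node.
node_defaults = {"style": "filled", "fontcolor": "white", "fontsize": "12", "fontname": "Arial"}
javascript = effective_attrs(node_defaults, {"fillcolor": "green", "fontcolor": "black"})

print(format_attrs(javascript))
```

The merged result keeps the shared style, font size and font name while picking up the node's own fill color and the black text.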
With DOT, you can also create subgraphs, which can both have associated traits and can serve to aggregate groups of content and apply node and edge characteristics at different scopes:
digraph G5 {
    edge [fontcolor=black fontsize="10" fontname=Arial];

    // Reference nodes: entities grouped by class, each class with its own fill color
    subgraph referenceNodes {
        node [style="filled" fontcolor=white fontsize="12" fontname=Arial];
        subgraph person {
            node [fillcolor="blue"];
            person_JaneDoe [label="<person>\nJane Doe"];
            person_WendyDoe [label="<person>\nWendy Doe"];
            person_StacyDoe [label="<person>\nStacy Doe"];
        }
        subgraph profession {
            node [fillcolor="darkGreen"];
            profession_Programmer [label="<profession>\nProgrammer"];
            profession_Writer [label="<profession>\nWriter"];
        }
        subgraph vehicle {
            node [fillcolor="purple"];
            vehicle_Subaru [label="<vehicle>\nSubaru"];
            vehicle_Ford [label="<vehicle>\nFord"];
        }
    }

    // Data nodes: atomic values drawn as yellow boxes, one node per occurrence
    subgraph dataNodes {
        node [shape=box style="filled" fillcolor=yellow fontcolor=black fontsize="12" fontname=Arial];
        value_age24_1 [label="<integer>\n24"];
        value_age24_2 [label="<integer>\n24"];
        value_age21_1 [label="<integer>\n21"];
    }

    // Relationships, grouped by the person they start from
    subgraph JaneDoe {
        person_JaneDoe -> person_WendyDoe [label=" has sister" dir=both];
        person_JaneDoe -> person_StacyDoe [label=" has sister" dir=both];
        person_JaneDoe -> profession_Programmer [label=" is a"];
        person_JaneDoe -> vehicle_Subaru [label=" owns a"];
        person_JaneDoe -> value_age24_1 [label="has age"];
    }
    subgraph WendyDoe {
        person_WendyDoe -> profession_Writer [label=" is a"];
        person_WendyDoe -> vehicle_Subaru [label=" owns a"];
        person_WendyDoe -> value_age24_2 [label="has age"];
        person_WendyDoe -> person_StacyDoe [label=" has sister" dir=both];
    }
    subgraph StacyDoe {
        person_StacyDoe -> profession_Writer [label="is a"];
        person_StacyDoe -> vehicle_Ford [label="owns a"];
        person_StacyDoe -> value_age21_1 [label="has age"];
    }
}
Here we have the tale of three sisters, Jane, Wendy, and Stacy Doe. Wendy and Stacy are both writers, while Jane is a programmer. Wendy and Jane drive Subarus, while Stacy drives a Ford. Jane and Wendy are both 24, Stacy is 21.
The subgraphs in G5 serve to create logical groupings based upon common type. The has sister relationship is noteworthy because of its symmetry: there are arrowheads on both ends, set by the dir=both attribute. The has age relationship is also noteworthy in that an ellipse (the default shape) points to a box, as specified via the shape attribute. There are roughly thirty shapes in the core DOT specification, although different implementations may offer more or fewer. Also notice that atomic values (in the rectangles) are defined with an identifier that is unique per box (e.g., value_age24_1 vs. value_age24_2). Without this distinction, the arrows for age 24 would all point to the same box. (Proponents of semantics might argue that they should point to the same box, but this can make for confusing graphs if you have a large number of atomic values.)
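If you generate graphs like G5 from data, the per-box identifiers can be handed out by a small counter. This is a sketch only: the class name ValueNodeIds is hypothetical, the value_<property><value>_<n> pattern simply mirrors the naming used in G5, and values are assumed to be safe to embed in an unquoted identifier.

```python
from collections import defaultdict


class ValueNodeIds:
    """Hands out a fresh node identifier for every occurrence of an atomic value."""

    def __init__(self) -> None:
        # How many times each (property, value) combination has been seen so far.
        self._counts = defaultdict(int)

    def next_id(self, prop: str, value) -> str:
        key = f"value_{prop}{value}"
        self._counts[key] += 1
        return f"{key}_{self._counts[key]}"


ids = ValueNodeIds()
print(ids.next_id("age", 24))  # first box for age 24
print(ids.next_id("age", 24))  # a second, separate box for the same value
print(ids.next_id("age", 21))
```

Dropping the counter suffix would give the shared-box behavior that the semantic purists prefer, so the choice can be made in one place.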
Note also that, unless otherwise specified, DOT will usually create curved edges when you have topologies that don't fit neatly into rectangular arrays. Out of the box, DOT attempts to optimize for tension and legibility, though it is possible to change how nodes and edges are laid out; that is fodder for a more advanced article. Other capabilities not covered here (and likely to be covered in a subsequent article) include the use of images, a deeper look at how to utilize shapes, and the incorporation of HTML content.
DOT (via the command-line GraphViz tools or through Node.js and related libraries) is capable of producing PNG and SVG versions of graphics, and some implementations include support for TIFF, JPEG, PostScript and other outputs. DOT is (with a few exceptions) consumable by both the vis.js and d3.js JavaScript libraries, among others. You can also find up-to-date documentation on both GraphViz and DOT on the GraphViz Documentation page.
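For the command-line route, rendering is typically a single call to the dot executable, with -T selecting the output format and -o naming the output file. The Python sketch below wraps that call. It assumes Graphviz is installed and on your PATH; the function names dot_command and render, and the file name used in the usage note, are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def dot_command(source_path: str, output_format: str = "svg") -> list[str]:
    """Build the command line that renders a DOT file to the given format."""
    # Write the output next to the source, swapping the extension (g5.dot -> g5.svg).
    output_path = Path(source_path).with_suffix(f".{output_format}")
    return ["dot", f"-T{output_format}", source_path, "-o", str(output_path)]


def render(source_path: str, output_format: str = "svg") -> str:
    """Run dot on the file and return the path of the generated image."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' executable not found on PATH")
    command = dot_command(source_path, output_format)
    subprocess.run(command, check=True)  # raises CalledProcessError if dot fails
    return command[-1]
```

Calling render("g5.dot", "png") would then write g5.png alongside the source file, provided the file exists and your Graphviz build supports that format.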
It should also be pointed out that, as you gain proficiency with DOT, patterns for utilizing it to build different types of network graphs programmatically should become more and more evident. These will be explored in a subsequent article as well. Regardless, DOT can be used to build not only network graphs but flowcharts and state diagrams as well, making it a useful tool for any data scientist or programmer who would prefer not to reinvent the wheel for visualizations.
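As a small taste of the programmatic approach, here is a Python sketch that turns a list of (subject, predicate, object) triples into DOT text. It is one possible pattern rather than an established recipe: the function names quote and triples_to_dot are hypothetical, every name is quoted for simplicity, and the graph name is assumed to be a plain identifier.

```python
def quote(text: str) -> str:
    """Wrap text in double quotes, escaping any embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def triples_to_dot(graph_name: str, triples) -> str:
    """Emit one labeled edge per (subject, predicate, object) triple."""
    lines = [f"digraph {graph_name} {{"]
    for subject, predicate, obj in triples:
        # Nodes are created implicitly the first time an edge mentions them.
        lines.append(f"    {quote(subject)} -> {quote(obj)} [label={quote(predicate)}];")
    lines.append("}")
    return "\n".join(lines)


triples = [
    ("JaneDoe", "has profession", "Programmer"),
    ("Programmer", "has language", "Javascript"),
]
print(triples_to_dot("G2", triples))
```

The printed text can be pasted into any DOT viewer or saved to a file for rendering.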
Article first appeared on Data Science Central.
Kurt Cagle is the Community Editor for Data Science Central and the author of The Cagle Report. He lives in Issaquah, Washington, with his family and cat.