Data Visualization: Simplifying complex tree diagrams via grammar induction
A set of approaches for summarizing complex trees into a more easily presentable and readable form.
Background
You have some important data and you’re eager to share it with thinkers and decision-makers. Like a lot of important data, your data is hierarchical, structured like a tree — maybe it’s nested investment portfolios, or supply chain information.
It would be nice if this data could be easily presented on a slide, but the tree is often too big and awkward, with too many nodes; few people can get much usable information from looking at a tree diagram that might have hundreds of tiny nodes. So, you need to simplify that tree data in order to produce a good visualization.
This is a very common scenario.
You could create a hand-made infographic, of course, but that introduces bias and opinion and it doesn’t scale. The challenge, then, is to algorithmically reduce the complexity of the tree, such that it can be put on a slide and understood quickly, while retaining as much as possible of its structure.
This is a difficult and open-ended data visualization problem, one that resides on the very boundary of art and science, which is always an interesting place to work. Happily there are several very effective algorithmic approaches for us to consider. Something I find especially interesting is that grammar induction algorithms, usually associated with data compression, can be extremely useful in this area of data visualization.
In this article, I'll:

- introduce some suitably awkward real-world example tree data;
- walk through several simple simplification algorithms (node filtering, depth limiting, breadth limiting, and topology-preserving simplification), and show how they can be stacked;
- show how grammar induction, usually associated with data compression, can simplify trees in ways the simple algorithms can't.
As ever, code is available on request.
Example data
Our example data for this article needs to be some tree-like data from the real world that is:

- complex enough that it genuinely needs simplifying before it can go on a slide;
- boring.
Why should the example data be boring? Well, if the data is interesting, we’ll be tempted to use our knowledge of the meaning of the data to simplify the tree — whereas for a purely algorithmic approach we can’t do that.
Our example tree, then, will be ugly and boring; it is a somewhat anonymized subset of the parking revenue of a local government unit in the UK. Each node represents some grouping of parking revenue sources, with some amount of total revenue (represented by node size). The full tree is like this:
Fig. 1: Original tree data; difficult to put on a slide
As it stands, this tree is too complex to put on a slide and show to decision-makers. We will now look at algorithmic approaches to simplifying it.
Simple algorithms
First, I’d like to look at some of the more straightforward — but still useful — algorithms we can employ.
Node-filtering algorithms
Perhaps the most naive approach to tree simplification is to simply filter out some of the nodes based on a rule. Let’s see what our parking revenues tree looks like when we filter out all nodes with a revenue lower than a particular threshold.
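As a minimal sketch, such a filter might look like the following. The `Node` dataclass is a hypothetical stand-in for whatever structure holds the real data; the article's own code isn't reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    revenue: float
    children: list["Node"] = field(default_factory=list)

def filter_nodes(node: Node, threshold: float) -> Optional[Node]:
    """Drop every node (and, with it, its whole subtree) whose revenue
    falls below the threshold."""
    if node.revenue < threshold:
        return None
    kept = (filter_nodes(c, threshold) for c in node.children)
    return Node(node.name, node.revenue, [c for c in kept if c is not None])
```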
Fig. 2: Node filtering applied
This isn’t very useful at all. It loses lots of important structure; the fact that apparently parking is divided up into the three towns of Framley, Wickton and Alderley is lost, and worse yet, entire categories of revenue have vanished with no warning to the decision-maker that they were ever there. Perhaps if we set a more generous threshold and remove only the very small nodes, the result will be better.
Fig. 3: Node filtering with a lower threshold
This is a very poor result, demonstrating a common problem with algorithmic data simplification — what’s kept is not always what’s important. The simplified tree has the same problems as our first attempt, but now it retains strange aspects of the original tree — such as those long strings of one-child nodes — that probably obscure the information content even further.
Node-filtering algorithms are useful when simplifying very large graphs (see this paper for an example), but they aren’t necessarily a good fit for visually summing up a (comparatively) small tree.
Depth-limiting algorithms
Another basic approach to tree simplification is to reduce the depth of the tree, such that no node can be more than n nodes from the root. In theory, we’d expect this to often yield fairly good results, because it should preserve nodes that represent large aggregations of data, and remove their less-important descendants. Let’s try limiting the depth to 5 nodes.
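A sketch of the idea, reusing the hypothetical `Node` dataclass from the filtering example:

```python
def limit_depth(node: Node, max_depth: int) -> Node:
    """Truncate the tree so that no node is more than max_depth levels
    from the root (the root itself counts as level 1)."""
    if max_depth <= 1:
        return Node(node.name, node.revenue)  # drop all descendants
    return Node(node.name, node.revenue,
                [limit_depth(c, max_depth - 1) for c in node.children])
```

For the figure below, the call would be something like `limit_depth(tree, 5)`, where `tree` is the root of our parking-revenue data.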
Fig. 4: Tree depth reduced to 5
This isn’t quite as awful as the simple node-filtering algorithm but it has major flaws. For one thing, it compresses the tree so that all aspects of it appear to have the same depth — hiding the fact that apparently some types of parking revenue are more complex than others. Another problem is that it gives the illusion that ‘Framley’ is in some way the same ‘kind’ of node as ‘outstanding’, by presenting them both as leaf nodes. It also retains lots of detail under ‘Fines’, just because ‘Fines’ happened to already have a lot of nodes at depth 5.
Note that there are several less naive ways to limit tree depth; for example, the removed subtrees can be collapsed into summary nodes rather than deleted outright, or the depth limit can vary per branch according to its weight.
Breadth-limiting algorithms
Now we try a somewhat more computationally challenging set of approaches. There are various breadth-limiting algorithms, which generally rely on two steps:

1. rank each node’s children by some measure of importance;
2. combine the least important siblings into a single summary node, until the breadth limit is satisfied.
For example, let’s try a breadth-limiting approach in which the sibling nodes representing the least revenue are combined into an ‘Other’ node, until no node has more than two siblings.
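A sketch of that approach, again using the hypothetical `Node` structure from earlier. Here the merged subtrees are flattened into the ‘Other’ node; keeping them as its children would be an equally valid design choice.

```python
def limit_breadth(node: Node, max_children: int) -> Node:
    """Merge the lowest-revenue siblings into a single 'Other' node until
    no node has more than max_children children."""
    kids = sorted((limit_breadth(c, max_children) for c in node.children),
                  key=lambda c: c.revenue, reverse=True)
    if len(kids) > max_children:
        kept, merged = kids[:max_children - 1], kids[max_children - 1:]
        # the 'Other' node absorbs the merged revenue; their details are dropped
        kids = kept + [Node("Other", sum(c.revenue for c in merged))]
    return Node(node.name, node.revenue, kids)
```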
Fig. 5: Breadth limiting algorithm restricts maximum number of children a node can have
That worked much, much better than the filtering and depth-limiting approaches! For the first time, something of the shape of the tree has been preserved (never mind that the visual spacing isn’t very good — that’s just a layout issue). Perhaps the result would be even better if we set harsher parameters, forcing more nodes to be combined into ‘Other’.
Fig. 6: Breadth-limiting algorithm with a threshold of just 2 children per parent
Well, never mind. But the breadth-limiting algorithm with moderate settings was certainly effective.
Topology-preserving algorithms
When simplifying any graph, it’s possible to preserve the graph’s overall topology by removing any order-2 nodes — i.e. nodes that have exactly 2 connections. That’s only one kind of topology-preserving simplification algorithm, but it’s a powerful one. When applied to trees, it comes down to removing all nodes that have exactly 1 child node, preserving the child node and joining it to its grandparent.
This is called ‘singleton removal’; it’s a very simple and widely useful algorithm. A sketch is below, followed by the result on our tree.
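A minimal sketch, once more assuming the hypothetical `Node` dataclass from the earlier examples:

```python
def remove_singletons(node: Node) -> Node:
    """Splice out any node that has exactly one child, joining that child
    directly to its grandparent; the root itself is always kept."""
    spliced = []
    for child in node.children:
        child = remove_singletons(child)
        while len(child.children) == 1:   # skip over chains of singletons
            child = child.children[0]
        spliced.append(child)
    return Node(node.name, node.revenue, spliced)
```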
Fig. 7: Singleton removal
This hasn’t affected most of the tree but it’s really helped with those long ‘strings’ that were there in the original tree and which seemed to convey very little useful information. By itself it hasn’t produced a useful tree, but perhaps it will shine in combination with other algorithms.
Stacking algorithms
All tree simplification algorithms can be stacked — applied sequentially, hopefully improving the visualization each time. The order in which they are applied is extremely important. Applying successive simplifications to a tree is rather like applying successive lossy compression algorithms to an image; the details of how it’s done can determine whether the useful information is kept and the noise thrown away… or vice versa.
In this example, the tree has been subjected first to a minimum-revenue filter on each node, and then to a depth-limiting step:
Fig. 8: Applying node filtering followed by depth limiting
This isn’t perfect but it’s much better than what either approach produced by itself. Some algorithms (such as singleton removal) are very often best performed as a final step; some work better as a first step.
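In code, stacking is just function composition over the tree. A sketch, reusing the hypothetical helpers from the earlier examples (the threshold and depth values here are illustrative, not the ones used for the figures):

```python
from functools import reduce

def stack(root: Node, *steps) -> Node:
    """Apply a sequence of simplification passes, in order."""
    return reduce(lambda t, step: step(t), steps, root)

# the pipeline for Fig. 8: filter first, then limit depth
# (assumes the root itself survives the filter)
simplified = stack(tree,
                   lambda t: filter_nodes(t, threshold=1000.0),
                   lambda t: limit_depth(t, max_depth=5))
```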
Complex algorithms
We have looked at various fairly simple and intuitive approaches to tree simplification and seen how they can be combined. These approaches are useful but one thing they have in common, which limits their power, is that they do not take the whole tree into account. Each one looks at a specific node, or at a set of siblings, and decides whether to perform a node removal or a node combining operation. We could achieve much better results, even on our dull parking revenue tree, by using algorithms that take the whole structure of the tree into account.
Grammar induction algorithms are one family of algorithms that can achieve this.
What is grammar induction?
Grammar induction is the process of deriving a grammar from a set of observations such that the grammar can then, in turn, produce the observations.
If this doesn’t sound useful, consider that it’s what you do when you compress a file; the algorithm in your compression program is generating a grammar (in some sense) which can then be applied to some small amount of starting data to re-create the original file.
Grammar induction also has many other uses. The grammar can be interpreted as a catalog of what the important (i.e. frequently recurring) parts of the input are; this can allow us to distinguish more-random or more-repetitive areas of data in large unstructured datasets, which has lots of uses in forensics, telecoms, and the military.
Grammar induction over sequential data: LZW
Perhaps the best-known and most widely used grammar induction algorithm is Lempel-Ziv-Welch, aka LZW, the algorithm used in?.gif files. LZW achieves speed by processing a list of input symbols sequentially, building up a dictionary of commonly-encountered symbol sequences as it goes. This is a very fast way to compress a file, but not that relevant for data visualization, as there’s nothing particularly sequential about out tree data.
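To give a feel for the sequential dictionary-building, here is a toy LZW compressor (byte-oriented, and without the variable-width output codes a real GIF encoder uses):

```python
def lzw_compress(data: str) -> list[int]:
    """Toy LZW: emit a code for each longest-known sequence, learning a
    new dictionary entry at every step."""
    codes = {chr(i): i for i in range(256)}       # seed with single symbols
    out: list[int] = []
    current = ""
    for symbol in data:
        if current + symbol in codes:
            current += symbol                      # extend the match
        else:
            out.append(codes[current])             # emit longest known match
            codes[current + symbol] = len(codes)   # learn the new sequence
            current = symbol
    if current:
        out.append(codes[current])
    return out

print(lzw_compress("abababab"))   # repeated pairs compress to a few codes
```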
Grammar induction over random-access data: Sequitur
Sequitur is a simple yet brilliant algorithm that operates over a list of input symbols which need not be accessed sequentially. As a result, Sequitur and its derivatives are able to build a tree-like grammar rather than a dictionary; the nodes in this tree represent commonly recurring patterns in the input. Sequitur is dramatically useful for interpreting lists of symbols about which nothing is known.
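To give the flavour, here is a simplified, Re-Pair-style illustration that repeatedly replaces the most frequent adjacent pair of symbols with a fresh rule. (Real Sequitur does this online, maintaining a digram-uniqueness invariant as each symbol arrives, so treat this only as a sketch of the idea.)

```python
from collections import Counter

def induce_grammar(symbols: list[str]) -> tuple[list[str], dict]:
    """Repeatedly replace the most frequent adjacent pair with a new rule,
    building a hierarchical grammar of recurring patterns."""
    rules: dict[str, tuple[str, str]] = {}
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                                  # nothing recurs any more
        pair = pairs.most_common(1)[0][0]
        rule = f"R{len(rules)}"
        rules[rule] = pair
        out, i = [], 0
        while i < len(symbols):                    # left-to-right replacement
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(rule)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules

seq, rules = induce_grammar(list("abcabcabc"))
print(seq, rules)   # the repeated 'abc' pattern becomes a hierarchy of rules
```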
Grammar induction over a tree
In our case, though, we start with a tree, so much of the work Sequitur does is not useful to us. In fact, by starting with a tree, we’ve been handed a grammar on a plate; it’s just not a very useful grammar, since the grammar tree has one node for every node in the original tree. But we can process that grammar in several ways to shorten it — thus producing a simpler tree that has many of the properties of the original tree. One thing we can do is to identify the important nodes in that grammar: the nodes that would be applied most if we were actually using the grammar to reproduce the original tree. To do that, we’ll be using an algorithm of this form:

1. compute a signature for each subtree, deliberately blurring the details we are willing to lose;
2. count how often each signature recurs across the tree;
3. replace the subtrees whose signatures recur often enough with summary nodes.
A huge amount of parameterization is possible here.
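Here is one possible shape for such an algorithm. Everything in it is a sketch under assumptions: the `Node` dataclass is the same hypothetical structure as before, the `signature` function simply captures subtree shape while ignoring names and amounts, and `min_repeats` is an arbitrary illustrative threshold.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    revenue: float
    children: list["Node"] = field(default_factory=list)

def signature(node: Node) -> tuple:
    """A structural signature: the shape of the subtree, with names and
    amounts blurred away so that 'similar' subtrees compare equal."""
    return tuple(sorted(signature(c) for c in node.children))

def count_signatures(node: Node, counts: Counter) -> Counter:
    """Tally how often each subtree signature occurs across the tree."""
    counts[signature(node)] += 1
    for c in node.children:
        count_signatures(c, counts)
    return counts

def summarize(node: Node, counts: Counter, min_repeats: int = 3) -> Node:
    """Collapse any subtree whose signature recurs at least min_repeats
    times into a flat placeholder (named with a trailing '…', as in the
    figures)."""
    if node.children and counts[signature(node)] >= min_repeats:
        return Node(node.name + "…", node.revenue)   # details dropped
    return Node(node.name, node.revenue,
                [summarize(c, counts, min_repeats) for c in node.children])

# usage, where 'tree' is the root of the parking-revenue data:
# simplified = summarize(tree, count_signatures(tree, Counter()))
```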
(Note to lazy and curious readers — yes, it would also be possible to serialize the tree as a string, run Sequitur on it, and then further process the Sequitur-generated tree, which for large real-world use cases will usually be much smaller than the original tree. This approach has problems, though, in many real-world use cases, because Sequitur can’t easily do a ‘fuzzy match’, whereas in real life we’d often want to identify close but not completely similar subtrees.)
Applying grammar induction to our example tree
For our simple example, we’ll use straightforward choices of functions and parameters.
Summary nodes are given names ending in ‘…’, so you can see which ones they are. If we were bothering with proper layout, we’d do something more complicated.
The result is as follows:
Fig. 9: The tree after a single grammar induction step
It hasn’t done anything about those long strings of singleton nodes, but that’s not a problem as we already have a solution for those.
What the grammar function has successfully done is recognize that the various roads and zones are all the ‘same kind of thing’ and replace them with placeholders, ignoring the frequently-repeated details about residents, trade, machines and so on. As a result, the tree has fewer nodes, but the important structures are mostly retained.
Essentially, we’ve applied a level of grammar induction based compression to the tree but then (because our aim is to visualize a simplified tree, not to re-create the original tree) we’ve thrown out the grammar itself, i.e. the tree of subtree summaries.
Simplifying the example tree
Now let’s bring all the techniques together, stacking our simple grammar induction algorithm with some of the other algorithms we’ve discussed. Here’s what I think is a pretty good simplification of the tree, using grammar induction and singleton removal:
Fig. 11: Grammar induction plus singleton removal simplifies the tree quite well
And here’s another interpretation, drastically reducing the node count with a combination of grammar induction, singleton removal, and breadth reduction, to produce a greatly simplified tree that still has much of the shape of the original:
Fig. 12: Adding node filtering simplifies the tree still further
It’s not perfect — and the visuals could certainly do with improvement — but it’s an awful lot more suitable for presenting on a slide than our original tree.
Next Steps
I hope you enjoyed those algorithms. Of course, algorithms aren’t everything; there are many important techniques which you’d use alongside purely topological algorithms and which this article has not touched on, such as good layout and labeling.
Nevertheless, in this article, I’ve presented a catalog of tree simplification approaches, all of which are fully automatic and can be used together to reduce the node count of a tree while preserving its shape, thus helping to visualize the data and make decisions.
I’ve also tried to demonstrate how compression algorithms are one of the many powerful techniques that can help summarize and visualize complex data. Over the last decade, humankind has collected petabyte after petabyte of complex graph-like data, and yet our exploration of techniques for visualizing and understanding that data is only just beginning.
Thank you for reading.