Excel: The perfect DAG of a Data Science Tool
Physicists love computing in a way that is very different from the way programmers do.
We love computers to get results... numbers that illuminate our theories. For the person whose brain speaks physics, the computer is like a telescope. It points at truth.
Programmers, on the other hand, love computers for the power of telling the machine what to do. It is the joy of programming that they love most.
It is all fine. Takes all kinds to make the world go around. However, there are occasions in life where the difference in emphasis can matter. Specifically, are the numbers right?
To a physicist, a good solution is one that works, sells, AND gives the right numbers.
Having spent a lot of time in R&D labs in my early career (ANU, University of Bristol, University of Melbourne, University of Queensland, Defence Aeronautical and Maritime Laboratories), I find that the idea of readily available computing power comes as second nature.
I well remember running physics codes on Digital Equipment Corporation PDP-11 big iron and Harris 300 minicomputers when I graduated in physics and started work on seismic event modelling at the Research School of Earth Sciences all the way back in 1985.
You will struggle to find a physicist working in quantum mechanics who is not an avid user of high-end numerical simulations running on massively parallel processing facilities.
Getting work done is all about the number of math operations per second, in a particular paradigm that is known as message passing based parallel computing.
That is not the only one used by physicists but it is the main route to high performance.
Now let us segue to a different phase of my career.
In 1996 I quit the scientific community, not because I did not love R&D, but simply because there was no career path visible and therefore no economic security.
Those who graduated PhD around the period 1985 through 1995 will likely relate strongly to that statement. Many of my fellow physicists went straight into finance.
Since we were numerate, and well-versed in coding, everybody I knew was soon an expert in how to use Excel, the popular spreadsheet system, and so-called SQL Databases for the storage, search and retrieval of the kind of data found in finance - tables of numbers.
A little story... a good friend at that time was a recent PhD graduate in Numerical General Relativity. He could model black holes and stuff... Talented fellow that he is, he was very soon running major investment teams and now runs his own hedge fund in California.
People with mathematics, physics and computer science often wind up in finance.
What about the computing paradigm?
Interestingly enough, it is different from the dominant paradigm of physics, in so far as the typical calculations do not use message passing, but rather something called functional programming, for things like Excel, and relational algebra for SQL.
No decent mathematician ever raised a sweat coding for such systems. They are pretty near pure mathematics in terms of how you think about a problem.
Of course, you will hear many folks complain about Excel, particularly today when those who come from a programming background fail to comprehend the model.
However, at this time, in the age of Big Data, we are now rapidly hurtling backwards in time to the period around the late 1980s and early 1990s when I first launched on this tale.
You see... contemporary coders are now struggling with how to do parallel programming well and they have latched onto a particular model called a Directed Acyclic Graph (DAG).
For the benefit of Australian and Kiwi readers I include a pretty nice picture of a DAG.
(It has nothing to do with Sheep)
The way to understand a DAG is to think of each circle as containing a piece of data, just like a number in a cell in an Excel spreadsheet. Then that cell might be connected with a different cell. If the arrow points from cell A to cell B, then that is the same as when you have an Excel spreadsheet with a cell named "A" and a formula inside cell "B" that refers to cell "A". This is all pretty straightforward for those who know Excel.
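To make the picture concrete, here is a minimal sketch in Python of a three-cell sheet. The cell names and the `reads_from` dictionary are hypothetical, invented purely for illustration. Each cell lists the cells its formula refers to, and each arrow runs from the cell being read to the cell that reads it.

```python
# Hypothetical three-cell sheet: each cell maps to the cells its formula reads.
# An empty list means the cell just holds a plain number.
reads_from = {
    "A": [],          # A holds a number
    "B": ["A"],       # the formula in B refers to A
    "C": ["A", "B"],  # the formula in C refers to A and B
}

# One arrow per reference, pointing from the cell read to the cell reading it.
arrows = sorted(
    (source, cell)
    for cell, sources in reads_from.items()
    for source in sources
)
print(arrows)
```

The list of arrows is the whole graph. Nothing else about the sheet is needed to reason about the order of calculation.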
However, it is not straightforward for computer scientists.
In particular, you have to figure out how to pick up the DAG and disentangle what to do first in order to fill in all the cells in the spreadsheet. That can get complicated if you have a big spreadsheet with formulas pointing all over the place.
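The disentangling step has a name, topological sorting, and Python's standard library ships one in `graphlib` (Python 3.9 and later). The sketch below is only an illustration of the idea, not a claim about how Excel works internally; the `sheet` layout and the `recalculate` helper are assumptions made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical sheet: cell -> (cells the formula reads, the formula itself).
sheet = {
    "A": ([], lambda: 2.0),
    "B": (["A"], lambda a: a * 10),
    "C": (["A", "B"], lambda a, b: a + b),
}


def recalculate(sheet):
    """Fill in every cell, visiting each one only after the cells it reads."""
    # TopologicalSorter expects a mapping of node -> nodes it depends on.
    graph = {cell: inputs for cell, (inputs, _) in sheet.items()}
    order = list(TopologicalSorter(graph).static_order())
    values = {}
    for cell in order:
        inputs, formula = sheet[cell]
        values[cell] = formula(*(values[name] for name in inputs))
    return order, values


order, values = recalculate(sheet)
print(order)   # calculation order
print(values)  # filled-in cells
```

With only three cells the order is obvious by eye. With formulas pointing all over a big sheet, it is this sort that tells you what to compute first.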
I see anybody who has spent more than five minutes in finance smile!
Oh yeah. Big spreadsheet. Big Hell :-)
Why is that?
Let's think about what usually goes wrong.
You could run out of memory and hang. Yep. Done that.
You could have a huge number of cells and watch the hamster spin. Yep. Done that.
You could experience the dreaded circular reference. Eeek. Yep. Did it. Told off. Why?
It is the last one we need to zero in on.
If you have cell "A" pointing to cell "B" and that then points back at "A" then you have, right there and then, discovered the true meaning of a cosmological question.
What comes first? A or B. You see. Chicken and Egg. Which comes first?
How does the computer know what you mean with such self-reference?
Well, it doesn't, so Excel will complain at you with the cryptic circular reference.
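The same chicken-and-egg complaint can be reproduced in a few lines of Python. This is a sketch using the standard library's `graphlib`, which raises `CycleError` when no valid order exists; the helper name `find_circular_reference` is hypothetical, and nothing here is meant to describe Excel's actual implementation.

```python
from graphlib import CycleError, TopologicalSorter


def find_circular_reference(reads_from):
    """Return the cells forming a cycle, or None if the graph is acyclic."""
    try:
        list(TopologicalSorter(reads_from).static_order())
    except CycleError as err:
        # The second argument of CycleError is the list of nodes in the cycle,
        # with the first and last entries being the same node.
        return err.args[1]
    return None


# A reads B and B reads A: chicken and egg.
circular = find_circular_reference({"A": ["B"], "B": ["A"]})
# A is a plain number and B reads A: no problem.
clean = find_circular_reference({"A": [], "B": ["A"]})
print(circular)
print(clean)
```

The sorter cannot pick a first cell in the circular case, so it refuses and reports the loop, which is all the circular reference message is doing.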
This is called division of labour.
You wrote the spreadsheet. You complain at Excel when it is slow.
You wrote the spreadsheet. Excel complains at you when you don't say what you mean.
It all makes sense.
Incidentally, this is why Excel won't allow a function called from a cell to run a so-called subroutine that changes other cells. That one always mystifies programmers who are not mathematicians.
However, trust me when I say that the Microsoft engineers who cooked up that rule were spot on in how they set up both that rule and the circular reference rule.
Perhaps younger people will find this difficult to understand. When the early spreadsheets such as Lotus 1-2-3 and then Excel first came out they were truly revolutionary. Furthermore, the value created in business was so huge that money just flew in the door up in Redmond.
Microsoft at that time had the best engineers of their day and it may surprise you to hear that I think they did a thoroughly excellent job in defining the semantics of Excel.
The reason has to do with our friend The DAG.
What the Excel rules ensure is that no little arrow in that DAG loops back on itself. The graph, as mathematicians call it, is strictly acyclic.
When you do have an arrow looping back on itself that is cyclic.
If you habitually code cyclic computational graphs without knowing that you have, do not check your computational results, and don't think about the consequences, you get chaos.
Physicists know that real well, which is why they do really big computations using the message passing method where they are forced to stop and think before pressing F9.
Quantitative finance types know it real well because they think in spreadsheets.
Mathematicians don't need to know it because if you tell them the problem they can figure it out for themselves fairly quickly.
Programmers? Maybe... the jury is out right now in the World of Big Data.
Programmers really should know this stuff, since they invented most of the terminology and use it to beat children with in first year computer science class.
However, on the evidence of some of the stuff going on in Big Data land, it is quite possible that the present generation of programmers have forgotten the cardinal rules of Excel.
That is very understandable.
When you talk to young programmers today about Excel they are pretty much universal in their derision of the tool and the platform. I daresay many have never used it.
That is a shame.
Excel is a perfect DAG of a Tool for Data Analysis ... and that is a fact!
Little secret. Pssst... some other tools out there are not.
They are not DAGs and that's a bad thing.
Bugger for them.
Let's give Seymour Cray the last word on this topic.
KnowRisk Consulting
6 years ago: Does your DAG bite?
No Title at The Company of Man Retired Pathologist
6 years ago: One chunk of simple data from a complex hunk appears to be the easiest to understand, to manipulate and to respond to. Take the data chunk 650; if this is your blood glucose measurement then your Krebs Cycle could be in mortal peril.
Machine Learning, Hyperspectral Imaging, REE Mineral Exploration Consultant, Real Estate Investor
6 years ago: "All this has happened before and all this will happen again" - Battlestar Galactica
Computational Biologist at Enveda Biosciences
6 years ago: My first programming language was Fortran on CDC 7000-series mainframes back in the 70s. With a million words of core memory. Had to code for efficiency when the huge computer in its special room had MUCH less memory and CPU performance than the smartphone in my hand right now.
Senior Solutions Architect, Strategic Designer
6 years ago: I’ve been trying all day to find a word to express how Excel does not align with the concept of science. Failed miserably. Excel, the most used Data Manipulation tool in the world...