Excel: The perfect DAG of a Data Science Tool

Physicists love computing in a way that is very different from Programmers.

We love computers to get results... numbers that illuminate our theories. For the person whose brain speaks physics, the computer is like a telescope. It points at truth.

Programmers, on the other hand, love computers for the power they bring to tell the computer what to do. It is the joy of programming that they love most.

It is all fine. Takes all kinds to make the world go around. However, there are occasions in life where the difference in emphasis can matter. Specifically, are the numbers right?

To a physicist, a good solution is one that works, sells, AND gives the right numbers.

Having spent a lot of time in R&D labs in my early career (ANU, University of Bristol, University of Melbourne, University of Queensland, Defence Aeronautical and Maritime Laboratories), the idea of readily available computing power comes as second nature.

I well remember running physics codes on Digital Equipment Corporation PDP-11 big iron and Harris 300 minicomputers when I graduated in physics and started work on seismic event modelling at the Research School of Earth Sciences all the way back in 1985.

You will not find a physicist working in quantum mechanics who is not an avid user of high-end numerical simulations running on massively parallel processing facilities.

Getting work done is all about the number of math operations per second, in a particular paradigm that is known as message passing based parallel computing.

That is not the only one used by physicists but it is the main route to high performance.

Now let us segue to a different phase of my career.

In 1996 I quit the scientific community, not because I did not love R&D, but simply because there was no career path visible and therefore no economic security.

Those who graduated with a PhD in the period 1985 through 1995 will likely relate strongly to that statement. Many of my fellow physicists went straight into finance.

Since we were numerate, and well-versed in coding, everybody I knew was soon an expert in how to use Excel, the popular spreadsheet system, and so-called SQL Databases for the storage, search and retrieval of the kind of data found in finance - tables of numbers.

A little story... a good friend at that time was a recent PhD graduate in Numerical General Relativity. He could model black holes and stuff... Talented fellow that he is, he was very soon running major investment teams and now runs his own hedge fund in California.

People with backgrounds in mathematics, physics and computer science often wind up in finance.

What about the computing paradigm?

Interestingly enough, it is different from the dominant paradigm of physics, insofar as the typical calculations do not use message passing. Instead they use something called functional programming, in the case of Excel, and relational algebra, in the case of SQL.

No decent mathematician ever raised a sweat coding for such systems. They are pretty near pure mathematics in terms of how you think about a problem.

Of course, you will hear many folks complain about Excel, particularly today when those who come from a programming background fail to comprehend the model.

However, at this time, in the age of Big Data, we are now rapidly hurtling backwards in time to the period around the late 1980s and early 1990s when I first launched on this tale.

You see... contemporary coders are now struggling with how to do parallel programming well and they have latched onto a particular model called a Directed Acyclic Graph (DAG).

For the benefit of Australian and Kiwi readers I include a pretty nice picture of a DAG.

(It has nothing to do with Sheep)

The way to understand a DAG is to think of each circle as containing a piece of data, just like a number in a cell in an Excel spreadsheet. That cell might then be connected with a different cell. If the arrow points from cell A to cell B, then that is the same as when you have an Excel spreadsheet with a cell named "A" and a formula inside cell "B" that refers to cell "A". This is all pretty straightforward for those who know Excel.
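To make the picture concrete, here is a minimal Python sketch of a tiny sheet written down as a graph. The three cells and their formulas are made up for illustration, and this is only one way to record the arrows, not a description of how Excel stores them internally.

```python
# Hypothetical three-cell sheet: each cell maps to the cells its formula reads.
# A holds a plain number; B = A * 2; C = A + B.
precedents = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
}

def edges(precedents):
    """Return the DAG arrows as (source, target) pairs: data flows source -> target."""
    return [(src, cell) for cell, srcs in precedents.items() for src in srcs]

arrows = edges(precedents)
for src, dst in arrows:
    print(f"{src} -> {dst}")
```

Each printed arrow runs from the cell being read to the cell whose formula reads it, which matches the A-to-B convention described above.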

However, it is not straightforward for computer scientists.

In particular, you have to figure out how to pick up the DAG and disentangle what to do first in order to fill in all the cells in the spreadsheet. That can get complicated if you have a big spreadsheet with formulas pointing all over the place.
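Computer scientists call this disentangling step a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to work out an order and then fill in the cells. The `sheet` layout and the `recalculate` helper are hypothetical names invented for this illustration, and nothing here claims to be how Excel's own recalculation engine works.

```python
from graphlib import TopologicalSorter

# Hypothetical sheet: cell -> (cells it reads, function computing its value).
sheet = {
    "A": ((), lambda: 10),
    "B": (("A",), lambda a: a * 2),
    "C": (("A", "B"), lambda a, b: a + b),
}

def recalculate(sheet):
    """Evaluate every cell only after all the cells it depends on."""
    # graphlib expects a mapping of node -> set of predecessors.
    graph = {cell: set(deps) for cell, (deps, _) in sheet.items()}
    values = {}
    for cell in TopologicalSorter(graph).static_order():
        deps, formula = sheet[cell]
        values[cell] = formula(*(values[d] for d in deps))
    return values

values = recalculate(sheet)
print(values)  # {'A': 10, 'B': 20, 'C': 30}
```

The point is that the order falls out of the arrows alone. Nobody has to say "do A first"; the graph already says it.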

I see anybody who has spent more than five minutes in finance smile!

Oh yeah. Big spreadsheet. Big Hell :-)

Why is that?

Let's think about what usually goes wrong.

You could run out of memory and hang. Yep. Done that.

You could have a huge number of cells and watch the hamster spin. Yep. Done that.

You could experience the dreaded circular reference. Eeek. Yep. Did it. Told off. Why?

It is the last one we need to zero in on.

If you have cell "A" pointing to cell "B" and that then points back at "A" then you have, right there and then, discovered the true meaning of a cosmological question.

What comes first? A or B. You see. Chicken and Egg. Which comes first?

How does the computer know what you mean with such self-reference?

Well, it doesn't, so Excel will complain at you with the cryptic circular reference warning.
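The same complaint can be reproduced in a few lines. This is a small sketch, again using Python's standard `graphlib`, in which `find_circular_reference` is a hypothetical helper written for this post. It reports the loop when cell "A" reads "B" and "B" reads "A", and stays quiet when the arrows only run one way.

```python
from graphlib import TopologicalSorter, CycleError

def find_circular_reference(precedents):
    """Return the cells forming a cycle, or None if the graph is acyclic."""
    try:
        # prepare() raises CycleError if any arrow loops back on itself.
        TopologicalSorter(precedents).prepare()
    except CycleError as err:
        # The second element of args lists the nodes of the detected cycle,
        # with the first node repeated at the end.
        return err.args[1]
    return None

circular = find_circular_reference({"A": ["B"], "B": ["A"]})
clean = find_circular_reference({"A": [], "B": ["A"]})

print("circular sheet:", circular)
print("clean sheet:", clean)
```

There is no evaluation order to fall back on in the first case, so refusing to compute is the only honest answer.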

This is called division of labour.

You wrote the spreadsheet. You complain at Excel when it is slow.

You wrote the spreadsheet. Excel complains at you when you don't say what you mean.

It all makes sense.

Incidentally, this is why Excel won't allow you to put a so-called subroutine call within the body of a function. That one always mystifies programmers who are not mathematicians.

However, trust me when I say that the Microsoft engineers who cooked up that rule were spot on in how they set up both that rule and the circular reference rule.

Perhaps younger people will find this difficult to understand. When Lotus 1-2-3, one of the early spreadsheets, and Excel first came out they were truly revolutionary. Furthermore, the value created in business was so huge that money just flew in the door up in Redmond.

Microsoft at that time had the best engineers of their day and it may surprise you to hear that I think they did a thoroughly excellent job in defining the semantics of Excel.

The reason has to do with our friend The DAG.

What the Excel rules ensure is that no little arrow in that DAG loops back on itself. The graph, as mathematicians call it, is strictly acyclic.

When you do have an arrow looping back on itself that is cyclic.

If you habitually code cyclic computational graphs without knowing that you have, do not check your computational results, and do not think about the consequences, you get chaos.

Physicists know that real well, which is why they do really big computations using the message passing method where they are forced to stop and think before pressing F9.

Quantitative finance types know it real well because they think in spreadsheets.

Mathematicians don't need to know it because if you tell them the problem they can figure it out for themselves fairly quickly.

Programmers? Maybe... the jury is out right now in the World of Big Data.

Programmers really should know this stuff, since they invented most of the terminology and use it to beat children with in first year computer science class.

However, on the evidence of some of the stuff going on in Big Data land, it is quite possible that the present generation of programmers has forgotten the cardinal rules of Excel.

That is very understandable.

When you talk to young programmers today about Excel they are pretty much universal in their derision of the tool and the platform. I daresay many have never used it.

That is a shame.

Excel is a perfect DAG of a Tool for Data Analysis ... and that is a fact!

Little secret. Pssst... some other tools out there are not.

They are not DAGs and that's a bad thing.

Bugger for them.

Let's give Seymour Cray the last word on this topic.


Peter Urbani

KnowRisk Consulting

6 years ago

Does your DAG bite

Dr. Curtis J. Tinsley

No Title at The Company of Man Retired Pathologist

6 years ago

One chunk of simple data from a complex hunk appears to be the easiest to understand, to manipulate and to respond. Take the data chunk 650; if this is your blood glucose measurement then your Krebs Cycle could be in mortal peril.

Brian S. Penn, PhD

Machine Learning, Hyperspectral Imaging, REE Mineral Exploration Consultant, Real Estate Investor

6 years ago

"All this has all happened before and all this will happen again" - Battlestar Galactica

Matt Healy

Computational Biologist at Enveda Biosciences

6 years ago

My first programming language was Fortran on CDC 7000-series mainframes back in the 70s. With a million words of core memory. Had to code for efficiency when the huge computer in special room had MUCH less memory and cpu performance than the smartphone in my hand right now.

David Christie

Senior Solutions Architect, Strategic Designer

6 years ago

I’ve been trying all day to find a word to express how Excel does not align with the concept of science. Failed miserably. Excel the most used Data Manipulation tool in the world...
