The Covid Models Are Going to Kill Public Faith in Data Science

When I found out that the new and improved Covid simulation model code was on GitHub, I was excited. I had heard rumors that it was "bad", and I expected to find two or three things that people were nitpicking. Engineers will do that.

What I found though was, well, it actually is horrendous. As in criminally insane, horribly bad, OMG - this is what all the lockdowns were based on!? Keep in mind, this is after people have already tried to cover it up with new refactoring.

If you are a software engineer, just go look at the code; you don't even need to read the rest of this article. The article is mostly for non-engineers. You can call it a rant - I call it evidence for what will most definitely become criminal prosecutions.

Where to start - I don't know.

The issue list, which is actively being censored, has interesting titles:

[Image not available]

I saw a comment that the cyclomatic complexity of one function was literally 666. I'm no conspiracy theorist, but typically a complexity of anything over 5 or 6 is kinda high. A 10 definitely needs to be refactored. 666!? I thought he was just joking.

[Image not available]

Yeah, he wasn't joking... Most of the functions were in the high tens if not hundreds.
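For the non-engineers: cyclomatic complexity is roughly one plus the number of decision points (ifs, loops, and so on) in a function, so it counts how many independent paths you would have to test. Here is a minimal sketch of that counting, written in Python for brevity. It is an approximation under my own simplifying assumptions, it is not the tool that produced the 666 figure, and the `clamp` function is a made-up example.

```python
import ast

# Node types that each add one decision point.
# This is an approximation of McCabe's metric, not a full implementation.
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: 1 + number of decision points in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical example function with two decision points.
SIMPLE = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""

print(cyclomatic_complexity(SIMPLE))  # 3
```

A three-path function like this needs three test cases to cover every path. Now scale that up to a score in the hundreds.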

So I started scrolling through the code, ctrl-d, ctrl-d, ctrl-d, and new horrors kept unfolding.

There is no indentation. There is no spacing. There are a billion flags for god knows what reason. It's very clear people were just cutting and pasting large blocks of code together, and for that reason alone you can't trust anything in it. Thread safety issues, memory leaks and data corruption are all over the place. Then there is the non-determinism that comes with said thread safety issues - the list goes on and on and on.


I could hire a team of monkeys, give them a case of beer each and they'd still probably write better code.


Apparently the original was a single 15,000 line file. No tests. No tests - how would you verify what you wrote is true!??! I'll answer. There is no way in hell anyone could verify this. NO ONE.

One struct has 268 LINES of variable declarations in it - the number of individual variables is way higher, and I didn't waste the time to count them. I can only assume that was one of the 'refactors' and it was probably all global before.

There are plenty of little gems like this:

[Image not available]

There is massive memory corruption spread throughout the code, and there are mutable globals in a program that is inherently multi-threaded. There are large sections of commented-out code in a different language, meaning someone was actively porting it or, worse, they just ripped it out of another program with no frame of reference for what the code actually does. There are uninitialized reads and random number generator bugs. The absolutely horrible performance of the code is in plain view. Look at all those nested matrix multiplications, which in many cases are completely unnecessary. You get very different results given precisely the same inputs - as in it IS BROKEN BY DEFINITION.
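To show what "same inputs, same outputs" looks like in a multi-threaded simulation, here is a minimal sketch in Python. Everything in it is a toy of my own invention (`simulate_region`, `run_model`, the seed arithmetic and the event probability are all assumptions, not anything from the model's code). The point is only the pattern: each unit of work gets its own random generator derived from the seed, there is no shared mutable state, and results are collected in a fixed order.

```python
import random
from concurrent.futures import ThreadPoolExecutor

EVENT_PROBABILITY = 0.1  # arbitrary toy value, not taken from any real model


def simulate_region(seed: int, region_id: int, steps: int = 1000) -> int:
    """Toy stand-in for one unit of model work; returns a count of events."""
    # Each region gets its own generator, derived only from (seed, region_id),
    # so the result cannot depend on which thread happens to run first.
    rng = random.Random(seed * 1_000_003 + region_id)
    return sum(1 for _ in range(steps) if rng.random() < EVENT_PROBABILITY)


def run_model(seed: int, regions: int = 8, workers: int = 4) -> list[int]:
    """Run all regions on a thread pool and return results in region order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(lambda r: simulate_region(seed, r), range(regions)))


first = run_model(seed=42)
second = run_model(seed=42, workers=2)
assert first == second  # same seed gives the same answer at any thread count
```

A check like the last line is also the cheapest regression test there is: if it ever fails, the program has a bug, whatever the science says.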

According to the README this bastion of computer science requires the following to run:

[Image not available]

Some people say they would have fired someone for gross incompetence over this code a long time ago. Well, when I'm hiring, in about 99% of cases I can find some code written by the job applicant. If I found this type of code, we would never bring you in for an interview.

There is an old software engineer joke that you can measure code quality in the number of wtfs/minute. My wife seems to think that the uncontrollable hyena laughter might be a decent measure as well.

[Image not available]

There's also evidence of machine translation - you can see that in the crazy number of gotos, the meaningless variable names and the magic numbers.

On that note - the number of magic numbers in this codebase is absolutely insane! I'm not even talking about the magic numbers that they chose to re-use from other cited papers as inputs. I'm talking about the large number of values that have no reference to anything, so you just have to guess what the hell they mean.
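For contrast, here is a minimal sketch of the alternative to a magic number: a named parameter that carries its unit and its source, and refuses to exist without one. This is Python and entirely hypothetical - the `Parameter` class, the `days_to_steps` helper, the name `example_delay` and the value 3.0 are made up for illustration and do not come from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """A model input that carries its unit and where it came from."""
    name: str
    value: float
    unit: str
    source: str  # citation or rationale; must not be empty

    def __post_init__(self) -> None:
        if not self.source.strip():
            raise ValueError(f"{self.name}: every parameter needs a cited source")


# Made-up value and placeholder citation, for illustration only.
EXAMPLE_DELAY = Parameter(
    "example_delay", 3.0, "days", "placeholder: cite the paper here"
)


def days_to_steps(param: Parameter, steps_per_day: int) -> int:
    """Convert a duration parameter to a whole number of simulation steps."""
    if param.unit != "days":
        raise ValueError(f"{param.name} is in {param.unit}, expected days")
    return round(param.value * steps_per_day)


print(days_to_steps(EXAMPLE_DELAY, steps_per_day=4))  # 12
```

Anyone reading `days_to_steps(EXAMPLE_DELAY, ...)` knows what the number is, what unit it is in and where to go check it. A bare `12` buried in a loop tells you nothing.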

This is an absolute disgrace and a complete perversion of the profession.

I think the larger issue at play here, never mind the economic terrorism that ensued, is that no one is ever going to trust academically produced data modeling again. I think we are looking at a long winter before these models are trusted by the public again. Just to be clear, this is a warning to everyone doing climate research. You need to engage professional engineers early and often, otherwise you will face the same fate.

You don't even have to read the code - reading some of the GitHub issues is damning enough.

The comment section in this ticket alone is completely nuts:

I'm pasting them here as the censors have already been busy at work and will undoubtedly try to clean/delete these comments later. Keep in mind we still don't have access to the original code - these comments were AFTER significant refactoring.

[Four images not available]

An engineer at Google has already done quite the takedown of this codebase.

They chose anonymity and while I completely understand the ostracization that they face for telling the truth, especially at a company like Google, more of us need to stand up against this scientific graft. It reflects extremely poorly on everyone in our industry.

People have been talking about ethics in data and computer science for a long time. Some people think the time has come to embrace some form of a bar exam or engineering license. I'd agree; however, this was a doctor who apparently teamed up with a group of undergrad and grad students over a decade, so I'm not convinced that certification is actually going to help the issue.

Politicians and the media that used this should be held culpable. How many times have you heard some health official (with no medical degree, by the way) or some politician say we will "use facts and science"?

There is no science here! There is no data here! Absolutely none of this "code" can be trusted. Instead of getting on TV and blustering, they could have easily hired an independent firm to verify this mess, and they didn't.

There needs to be a congressional inquiry into this. Anyone in the Beltway that is reading this - PLEASE SHOUT LOUDLY. This is a travesty.

There is going to be a hailstorm of lawsuits over this.
