
The Covid Models Are Going to Kill Public Faith in Data Science

Ian Eyberg

CEO at NanoVMs

Published: May 18, 2020

When I found out that the new and improved covid simulation model code was on GitHub, I was excited: I had heard rumors that it was "bad" and I was expecting to find 2 or 3 things that people were nitpicking. Engineers will do that.

What I found though was, well, it actually is horrendous. As in criminally insane, horribly bad, OMG - this is what all the lockdowns were based on!? Keep in mind, this is after people have already tried to cover it up with new refactoring.

If you are a software engineer, just go look at the code and you don't even need to read the rest of this article. The article is mostly for non-engineers. You can call it a rant - I call it evidence for what will most definitely become criminal prosecutions.

Where to start - I don't know.

The issue list which is actively being censored has interesting titles:

I saw a comment that the cyclomatic complexity of one function was literally 666. I'm no conspiracy theorist but typically a complexity of anything over 5 or 6 is kinda high. 10 definitely needs to be refactored. 666!? I thought he was just joking.

Yeh, he wasn't joking... Most of the functions were in the high tens if not hundreds.
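For non-engineers: cyclomatic complexity is roughly one plus the number of branch points (ifs, loops, boolean operators) in a function. Here is a minimal sketch of the idea in Python. It is only an illustration: the analyzer, the list of branch node types, and the sample function are all hypothetical, real tools differ in the details, and this is not the tool that produced the 666 figure.

```python
import ast

# Node types counted as one extra path each. This list is a simplification
# chosen for the example; real complexity tools count more cases.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a or b or c` has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical sample function, small enough to count by hand.
SAMPLE = """
def classify(x):
    if x < 0:
        return "neg"
    elif x == 0 or x is None:
        return "zero"
    for i in range(x):
        if i % 2:
            continue
    return "pos"
"""

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

Even this small sample function lands at 6 by this count, which gives a sense of how much branching a single function needs to reach the hundreds.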

So I started scrolling through the code, ctrl-d, ctrl-d, ctrl-d, and new horrors kept unfolding.

There is no indentation. There is no spacing. There's a billion flags for god knows what reason. It's very clear people were just cutting/pasting large blocks of code together and for that reason alone you can't trust anything in it. Thread safety issues, memory leaks and data corruption are all over the place. Then there is non-determinism with said thread safety issues - the list goes on and on and on.

I could hire a team of monkeys, give them a case of beer each and they'd still probably write better code.

Apparently the original was a single 15,000 line file. No tests. No tests - how would you verify what you wrote is true!??! I'll answer. There is no way in hell anyone could verify this. NO ONE.

One struct has 268 individual LINES OF VARIABLES in it - as in, it actually declares way more individual variables than that - I didn't waste the time to count. I can only assume that was one of the 'refactors' and it was probably all global before.

There are plenty of little gems like this:

There is massive memory corruption spread throughout the code, and mutable globals in a program that is inherently multi-threaded. There are large sections of commented-out code in a different language, meaning someone (or several someones) was actively porting it or, worse, they just ripped it out of another program with no frame of reference for what the code actually does. There are uninitialized reads and random number generator bugs. The absolutely horrible performance of the code is in plain view: look at all those nested matrix multiplications, which in many cases are completely unnecessary. You get very different results given precisely the same inputs - as in, it IS BROKEN BY DEFINITION.
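To make the "same inputs, same outputs" point concrete, here is a minimal sketch in Python of what a reproducibility check can look like. Everything in it is an assumption for illustration: `toy_epidemic` is a made-up toy model with made-up parameters and has nothing to do with the actual simulation code. The only idea it shows is that a stochastic model with a private, seeded random number generator should return identical results on repeated runs.

```python
import random
from typing import List


def toy_epidemic(seed: int, days: int = 30, population: int = 1000,
                 p_transmit: float = 0.3) -> List[int]:
    """Hypothetical toy model: returns the cumulative infected count per day."""
    # A private RNG seeded from the input, so there is no shared global
    # state that another thread or function could disturb.
    rng = random.Random(seed)
    infected = 1
    history = []
    for _ in range(days):
        susceptible = population - infected
        # Each infected person gets one chance per day to infect someone;
        # the chance shrinks as the susceptible pool shrinks.
        new_cases = sum(
            1 for _ in range(infected)
            if rng.random() < p_transmit * susceptible / population
        )
        infected = min(population, infected + new_cases)
        history.append(infected)
    return history


def is_reproducible(seed: int) -> bool:
    """Run the model twice with the same inputs and compare the outputs."""
    return toy_epidemic(seed) == toy_epidemic(seed)


if __name__ == "__main__":
    print(is_reproducible(42))
```

A check like this is the cheapest possible regression test: if it ever fails, some hidden state (a shared global, an unseeded generator, a data race) is leaking into the result.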

According to the README this bastion of computer science requires the following to run:

Some people say they would have fired someone for gross incompetence on this code a long time ago. Well, when I'm hiring, in about 99% of the cases, I can find some code written by the job applicant. If I found this type of code, we would never bring you in for an interview.

There is an old software engineer joke that you can measure code quality in the number of wtfs/minute. My wife seems to think that the uncontrollable hyena laughter might be a decent measure as well.

There's also evidence of machine translation - you can see that with the crazy amount of gotos, meaningless variable names and the magic numbers.

On that note - the number of magic numbers in this codebase is absolutely insane! I'm not even talking about the magic numbers that they chose to re-use from other cited papers as inputs. I'm talking about the large amount of numbers that have no references to anything and you just have to guess what the hell they mean.
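For non-engineers, a small sketch in Python of the difference between a magic number and a named, documented constant. The function names and both values (0.25 and 14) are invented for this example and are not taken from the codebase or from any paper.

```python
# Before: the reader has to guess what 0.25 and 14 stand for.
def expected_infections_magic(contacts: int, day: int) -> float:
    return contacts * 0.25 if day < 14 else 0.0


# After: the same arithmetic, but every number has a name and a place
# to record where it came from. Both values are made up for this example.
TRANSMISSION_PROB_PER_CONTACT = 0.25  # hypothetical; cite the source here
INFECTIOUS_PERIOD_DAYS = 14           # hypothetical; cite the source here


def expected_infections(contacts: int, day: int) -> float:
    """Expected new infections caused by one person on a given day of illness."""
    if day >= INFECTIOUS_PERIOD_DAYS:
        return 0.0
    return contacts * TRANSMISSION_PROB_PER_CONTACT


if __name__ == "__main__":
    print(expected_infections(contacts=8, day=3))
```

The two functions compute the same thing; the second one can be reviewed, questioned, and checked against a reference, and the first cannot.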

This is an absolute disgrace and a complete perversion of the profession.

I think the larger issue at play here, never mind the economic terrorism that ensued, is that no one is ever going to trust academically produced data modeling ever again. I think we are looking at a long winter before the public trusts these models again. Just to be clear, this is a warning to everyone doing climate research. You need to engage professional engineers early and often, otherwise you will face the same fate.

You don't even have to read the code - reading some of the GitHub issues is damning enough.

The comment section in this ticket alone is completely nuts:

I'm pasting them here as the censors have already been busy at work and will undoubtedly try to clean/delete these comments later. Keep in mind we still don't have access to the original code - these comments were AFTER significant refactoring.

An engineer at Google has already done quite the takedown of this codebase.

They chose anonymity and while I completely understand the ostracization that they face for telling the truth, especially at a company like Google, more of us need to stand up against this scientific graft. It reflects extremely poorly on everyone in our industry.

People have been talking about ethics in data and computer science for a long time. Some people think the time has come that we probably need to embrace some form of a bar exam or engineering license. I'd agree; however, this was a doctor who apparently teamed up with a group of undergrad and grad students over a decade, so I'm not convinced that certification is actually going to help the issue.

Politicians and the media that used this should be held culpable. How many times have you heard some health official (with no medical degree by the way) or some politician say we will "use facts and science"?

There is no science here! There is no data here! Absolutely none of this "code" can be trusted. Instead of getting on TV and blustering, they could have easily hired an independent firm to verify this mess, and they didn't.

There needs to be a congressional inquiry into this. Anyone in the Beltway that is reading this - PLEASE SHOUT LOUDLY. This is a travesty.

There is going to be a hailstorm of lawsuits over this.
