Cowboy Science
Bad Data
It was 1986. My first 'Stability' meeting.
It started pleasantly enough. There were three gorillas-in-suits in the room - the Heads of Stability, Formulation and Analytical. The Head of Stability chaired the meeting, introduced me to everyone and welcomed me to the group. So far, so good.
Then we turned to the first item on the agenda: the review of stability data from a well-known cardiotonic. The Head of Stability shared a chart showing the degradation of the drug and increases in impurity over time. This is not surprising. Stability studies expose drugs to harsh environmental conditions - storage at high temperatures and high humidity - in order to accelerate degradation. This information is then used to determine the shelf-life of your medicines under 'normal' conditions. He pointed to the fresh six-month data. The values were high: the drug was degrading more quickly than expected. The shelf-life might be shorter than hoped.
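For readers who have never sat through one of these reviews, a rough sketch of the arithmetic may help. It is not the calculation used in that meeting - the numbers below are invented - but it follows the standard ICH Q1E idea: fit a regression line to the assay results over time and read off the shelf-life where the lower confidence bound crosses the specification limit.

```python
# A minimal sketch of shelf-life estimation from stability data (invented numbers).
# Fit a straight line to assay (% of label claim) versus time, then take the
# shelf-life as the point where the one-sided 95% lower confidence bound for the
# mean crosses the specification limit - the approach outlined in ICH Q1E.
import numpy as np
from scipy import stats

months = np.array([0, 3, 6, 9, 12, 18])
assay = np.array([100.1, 99.2, 98.0, 97.3, 96.1, 94.4])   # % of label claim
spec_limit = 90.0                                          # lower specification

slope, intercept, *_ = stats.linregress(months, assay)
n = len(months)
resid = assay - (intercept + slope * months)
s = np.sqrt(np.sum(resid**2) / (n - 2))
t_crit = stats.t.ppf(0.95, n - 2)                          # one-sided 95%

def lower_bound(t):
    """Lower one-sided confidence bound for the mean assay at time t."""
    se = s * np.sqrt(1 / n + (t - months.mean())**2 / np.sum((months - months.mean())**2))
    return intercept + slope * t - t_crit * se

# Shelf-life: first time at which the lower bound dips below the spec limit.
grid = np.linspace(0, 60, 1201)
shelf_life = grid[np.argmax(lower_bound(grid) < spec_limit)]
print(f"Estimated shelf-life: {shelf_life:.1f} months")
```

The fight in the meeting room is, in effect, a fight about the scatter around that line: the more analytical variability you are willing to blame, the longer the shelf-life you can defend.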
At this point, the proverbial hit the fan.
With a roar, the Head of Formulation leapt to his feet. He strode up to the screen, glowering at the Head of Analytical.
The Head of Analytical returned the threat-stare and stalked to the other side of the room.
The pair postured, panting slightly and swaying from one foot to the other, on opposite sides of the room.
The Head of Formulation was responsible for producing a stable formulation for the drug.
The Head of Analytical was the one charged with telling him his formulation was not stable.
Metaphorically the two adversaries locked horns in mortal combat.
Meanwhile, everyone else in the room studiously avoided eye contact - staring at the table, checking diaries, or flicking through notes. Keeping their heads down.
This was all new to me.
Suddenly, the Head of Stability broke the deadlock.
He seized on the presence of the new boy. Me.
What did I think?
I looked at the chart.
"I'm new to this" I said. Everyone nodded with agreement. This was common ground - something on which we could all agree.
"But it looks to me like we're seeing some degradation over time?"
I swear someone groaned. The Head of Formulation glowered at me.
"Though there seems to be a fair amount of analytical variability?"
I heard a sharp intake of breath from someone. The Head of Analytical glowered at me.
Brilliant.
My first day and I had made enemies of them both. Way to go.
"What do you normally do when this happens?" I asked.
And that broke the deadlock.
Everyone was able to contribute. They explained to the new boy what they would need to do next. Obviously, there was a question about the analytical method and whether the rise was simply down to analytical variability. They came up with a million reasons why the new data points might not be real. They talked at length about the likely measurement error. And they came up with a plan to check the data.
I discovered that day that unwelcome news is often delivered in the form of data.
And I learned that people will happily invoke 'bad data' to ignore unwelcome news.
When graduating from Apes in Lab Coats to Gorillas in Suits, scientists take with them a critical mindset that permits them to question everything. They have spent their lives defending things - a thesis, a hypothesis, a position, a viewpoint. At meetings and conferences, they get to practise taking data apart, demolishing the work of others, and presenting alternative interpretations of data.
Data that sit well with their preconceptions are given a free pass. Those that don't are torn apart. It is one form of confirmation bias. So we should not be that surprised to see it clouding their judgement as Gorillas in Suits.
This is not a problem as long as they do not forget this.
But they forget this.
The Bias Blind Spot means that while they may be quick to spot this form of bias in others, they often miss it when challenged about their own behaviour.
Executives are often skilled in challenging unwelcome data. Quick to write off data as 'bad data'.
And this is easily done.
Because data are derived from measurements and measurement is complicated.
Measurement Theory
My journey to the Dark Side began when I chose Psychology as a Minor Option. Until then I was heading for a straight science career. I remember the turning point well. My old professor, the late Richard Bambridge, shambled into the psychology 'laboratory'. He wore his trademark three-piece tweed suit (but with trousers swapped out for jeans) and his 'National Health' spectacles. These were balanced precariously on the bridge of his nose, and carried a running repair with the arms paper-clipped to the frames. From Richard I learned that if you want to learn about measurement theory, then talk to a psychologist.
The softer sciences like psychology can teach us a thing or two about measurement rigour. But for some reason those of us working in the hard sciences assume that our measures are more concrete. Less subject to variability. That the need to understand measurement error is somehow less important when measuring gravitational waves or gene expression.
One of the problems is that we think we know what measurement is.
From an early age we are introduced to simple measurement systems like rulers.
We all know what rulers look like. In school we get to play with rulers. We can measure the length of an object with ease. I can compare my ruler to your ruler: they are about the same length. We learn that this is not an accident. We both know our rulers have been calibrated against a standard length. I can measure the length of a stick, and I get roughly the same value as you do.
But what if the object is a banana?
Well, then it gets a bit more complicated.
Or a worm?
OK, we are going to need some ground rules.
Where does the worm start? Where does it end?
What if the object we are measuring is large?
Like the distance from here to Cairo? We're going to need a bigger ruler.
And what if we are trying to measure something that isn't observable?
"Measure what is measurable, and make measurable what is not so." (Galileo Galilei)
Gene Expression
Take gene expression for example. Gene expression refers to the process by which information from a gene is used to synthesize a functional gene product, usually a protein. Measuring gene expression typically involves quantifying the mRNA (messenger RNA) produced during transcription, which indirectly reflects how active a gene is.
Two things:
Gene Expression is Not Directly Observable. We can't directly observe the number of gene transcripts or proteins being produced in real-time. Instead, we infer gene expression levels through measurements of molecular intermediates like mRNA.
Multiple Biological Processes are Involved. Gene expression involves transcription (DNA to RNA), RNA splicing, translation (RNA to protein), and sometimes post-translational modifications. Measuring only one part, like mRNA levels, provides an indirect estimate of the overall expression rate.
Currently, there are four main ways of measuring gene expression: Northern blotting, RT-qPCR, microarrays, and RNA sequencing.
Note that gene expression cannot be measured directly: we measure either mRNA levels (from transcription) or protein levels (from translation), both byproducts of gene activity, and infer the rate of gene expression from these downstream measurements. The process is complex, involving multiple biochemical steps - RNA extraction, amplification, hybridization, or sequencing - each requiring precise control and advanced technology. Gene expression can vary with conditions, time points, and cell types, adding to the complexity of accurately measuring and interpreting expression levels. There is sampling variability - two tests from the same individual give different results. There is biological variability - two tests from different individuals give even more different results.
Our 'gene expression' levels must be normalized against reference genes, or housekeeping genes, to account for variability in RNA quality and quantity. There is noise and variability in the data. And processes like mRNA degradation or translation efficiency further complicate the relationship between mRNA levels and protein expression, adding another layer of indirect inference.
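To make the normalization point concrete, here is a minimal sketch - my illustration, not from the article - of the widely used delta-delta-Ct calculation for RT-qPCR, in which the target gene's signal is expressed relative to a housekeeping gene before treated and control samples are compared. The Ct values are invented.

```python
# A minimal sketch of one common normalization: the delta-delta-Ct method for
# RT-qPCR. Expression of a target gene is never read directly; it is inferred
# from amplification cycle thresholds (Ct) and normalized against a housekeeping
# gene to absorb differences in RNA quality and quantity.
# All Ct values below are invented for illustration.

def fold_change(ct_target_treated, ct_ref_treated,
                ct_target_control, ct_ref_control):
    """Relative expression (treated vs control) by the 2^(-ddCt) method."""
    d_ct_treated = ct_target_treated - ct_ref_treated   # normalize to housekeeping gene
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2 ** (-dd_ct)

# Example: the target amplifies two cycles earlier in treated samples, relative
# to the reference gene, implying roughly a four-fold increase in expression.
print(fold_change(ct_target_treated=24.0, ct_ref_treated=18.0,
                  ct_target_control=26.0, ct_ref_control=18.0))  # -> 4.0
```

Every quantity in that calculation is an inference stacked on another inference, which is exactly where the trouble starts.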
Bottom line? It's complicated.
As a result, gene expression studies have encountered several cases where biomarkers thought to be important were later discovered to be artefacts, often due to the mismanagement of batch effects or other sample processing issues.
You do know that most of your putative biomarkers are not real, don't you?
One of the most infamous cases is the Duke University scandal, in which gene expression data were used to develop personalized chemotherapy treatments for lung cancer. These gene signatures were later found to be severely flawed due to batch effects and improper statistical handling. The errors led to false claims of biomarkers that could predict cancer outcomes. The study was ultimately retracted, and clinical trials based on this work were halted. The case prompted major institutional reviews, as well as significant discussion about the reproducibility of high-throughput biological data.
The discovery of the HER2 gene as a critical biomarker in breast cancer led to the development of targeted therapies such as trastuzumab (Herceptin). However, studies revealed that improper tissue processing, variability in antibody quality, and differences in testing protocols between labs led to the misclassification of HER2 statuses. False-positive or false-negative HER2 results resulted in patients either being denied effective therapy or receiving unnecessary treatments. This prompted the development of standardized HER2 testing protocols and raised awareness about the effects of technical artefacts on biomarker accuracy.
Many of the promising diagnostic biomarkers reported for various cancers have turned out to be artefacts. One famous example involved the study of protein biomarkers for ovarian cancer. The team initially reported promising diagnostic biomarkers; however, it was later discovered that these results were primarily driven by batch effects in the mass spectrometry data rather than real biological differences.
These examples highlight the critical need for rigorous control of technical artefacts and batch effects when analyzing high-dimensional biological data for biomarker discovery. Without such precautions, results can be misleading, potentially leading to incorrect conclusions and clinical applications.
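A toy simulation makes the batch-effect trap easy to see. The sketch below is mine, with made-up numbers: it generates a 'biomarker' with no real biological difference between cases and controls, adds a small technical shift because the cases happen to have been processed in a separate batch, and watches an ordinary t-test declare the result significant.

```python
# A toy simulation (not from the article) of how a batch effect can masquerade
# as a biomarker: cases and controls are processed in different batches, the
# batches differ by a small technical shift, and a perfectly null "biomarker"
# comes out looking highly significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50

# True biology: no difference at all between cases and controls.
controls = rng.normal(loc=10.0, scale=1.0, size=n)
cases = rng.normal(loc=10.0, scale=1.0, size=n)

# Technical reality: the cases were run in batch 2, which reads 0.8 units higher.
cases_measured = cases + 0.8

t_stat, p_value = stats.ttest_ind(cases_measured, controls)
print(f"p-value with confounded batches: {p_value:.2g}")   # typically well below 0.05

# Had each batch contained a balanced mix of cases and controls, the same shift
# would have cancelled out instead of posing as biology.
```

The fix is not cleverer statistics after the fact; it is designing the study so that batch and biology are not confounded in the first place.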
Pretty Lights
So, Dennis, tell me. Why are you droning on about this?
Well, while all this may be as plain as the nose on your face to your jobbing Ape in a Lab Coat, it's well below the pay grade of your average Gorilla in a Suit. But on the basis of these data, these measurements, they have to advance or terminate a project, divert R&D resources to pursue a White Rabbit, or terminate a potentially life-saving line of enquiry.
And that is why the detail is important.
It is also why discovery research tends to follow a familiar pattern.
Firstly, a new way of measuring things comes along. On the basis of that, several novel 'discoveries' are identified. They are patented and/or published. This generates excitement, publications, kudos, and further funding. Pretty Lights. Everyone switches to the new method. Further novel 'discoveries' are identified. They may not be big on detail, but the horizon-scanning Gorillas in Suits see what is happening. It becomes impossible to support grant proposals or funding requests without adopting the new method.
The world of science is swept aside by the latest, next, big, 'Next Big Thing'.
Meanwhile, in the background, the Apes in Lab Coats are beginning to express uncertainty about the certainties generated by the new technology. The statisticians evaluate the novel methods developed to handle these data, only to learn that they don't work. That the operating characteristics make them at best useless or, even worse, worse than useless.
Nobody listens.
Fast-forward ten years. Everyone accepts that the method was pants. We can either spend our careers correcting the last ten years of junk science to find out what is really happening. Or we can latch on to the latest way of measuring things and start the whole cycle again.
There is no glory in the first of these.
Most choose to follow the next, big, 'Next Big Thing'.
I have said it before.
This is not the rigour we expect of World Class Science.
It isn't Junk Science.
It isn't Cargo Cult Science.
This is Cowboy Science.
Cowboy Science is characterized by pro-innovation bias. Scientists may be so eager to explore new technologies, techniques, or hypotheses that they bypass critical scrutiny and robust testing in favour of quick results. They may focus on publishing exciting findings or riding the wave of novelty, even if the science isn’t fully baked or if they’re "hanging on" to an unstable solution.
Cowboy Science embraces other forms of bias:
"In Cowboy Science, no project ever truly fails—it's just a success that hasn’t been redefined yet."
In Cowboy Science, instead of acknowledging failure, scientists reframe unexpected results or incomplete findings to maintain the illusion of success, keeping the project alive through rationalization.
Definitions
World Class Science /noun/
Scientific practice characterized by world class scientific behaviours: the balanced evaluation of data, merciless examination of alternative explanations, scientific rigour in experimental studies, and a profound understanding of variability and experimental design, leading to reproducible scientific findings.
Junk Science /noun/
Untested or unproven theories presented as scientific fact.
Cargo Cult Science /noun/
Practices which have the appearance of being scientific, but do not actually follow the scientific method.
Cowboy Science /noun/
Definition: A reckless or over-enthusiastic approach to scientific research, characterized by jumping onto untested ideas or methods without sufficient scrutiny or validation, and later rationalizing unexpected results as if they were intended. The term criticizes pro-innovation bias where novelty is prioritized over accuracy and rigorous methodology, leading to hasty conclusions or the defence of unstable solutions.
Example in a sentence: "In their rush to publish novel findings, the research team engaged in cowboy science, embracing a high-risk method and adjusting their narrative to fit wherever the data led them."
Etymology: The term draws an analogy to the pioneers of the western frontier. A cowboy in a rodeo jumps on a wild horse, is thrown around unpredictably, yet claims he intended to land exactly where he did. In science, this mirrors researchers grappling with unanticipated data but defending their conclusions as deliberate.
Usage: Typically used in a critical context to describe scientific practices that exhibit overconfidence in innovation without thorough validation or due consideration of the unpredictability in experimental results.
Comments
Product Manager, PhD, Statistics & AI Implementation | Design of Experiments | Digitalization | Machine Learning:
Wonderful writing as always! I see this pattern of behavior in many early R&D projects in industry. Something that often takes place in these projects is co-development of the measurement method and the process. That is, a measurement method is designed, built, tested and adjusted using *unknown* samples. This is super dangerous to do, because it introduces a huge bias towards believing the measurement when it says what you want to hear, and tampering with it when it gives you bad news. One quick way to start bringing this to light is to ask to be shown the calibration curve for the measurement method. This can result in you being shown many "interesting" plots, whether it be no replicates of measurements, calibration on high or low signals only, an R^2 of 1.0 (because they fiddled with the values), or the simple admission that no calibration exists... All of the above, a symptom of being in a bit of a rush. Cowboy Science indeed.
QC Manager - Data and Analytics at Aker BioMarine:
I enjoyed reading this, and I'm also responsible for managing my employer's stability program. Jumping to conclusions is easy. We have to constantly dare to try to disprove ourselves until the truth emerges. Luckily, I had a very open-minded PhD supervisor supporting, urging, pushing for critical thinking on every level, even at the risk of disproving one's own publications. That space, as valuable as it is, is only rarely provided. Unfortunate.