Medical Machine Learning's Sharing Problem

What is causing the lack of data and detail sharing in Medical Machine Learning, and what can be done about it?

As I was reading the paper "SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation", something stood out to me. Here is an excerpt from the paper:

According to the analyzed medical image segmentation studies in [33], 30% have used private datasets. As a result, the studies are not reproducible. Researchers must keep datasets private due to medical data sharing restrictions.

This seemed like a pretty big deal to me. As someone involved in Machine Learning research, the fact that 30% of those datasets are private is worrying. When results are not reproducible, it causes real problems. Not only does verifying the claims in the paper become impossible, it also restricts new research in that area. Machine Learning progresses incrementally: people often improve upon previous research by analyzing the procedure and tweaking steps or adding protocols, such as swapping data augmentation pipelines or trying different network architectures. When a dataset is not available to the public, other researchers can't analyze the data and results in depth, which leaves them unable to contribute. And of course, not sharing your dataset means outside researchers won't be able to catch nuances (such as a biased dataset) that you might miss.
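
To make this concrete, here is a minimal sketch of the kind of incremental tweak a follow-up study might make: swapping one data augmentation pipeline for another. The specific transforms are purely illustrative and not taken from any particular paper, and the comparison is only meaningful if both pipelines can be run on the same dataset the original authors used.

```python
# A minimal sketch of an incremental tweak to prior work: swapping the
# data augmentation pipeline. The specific transforms are illustrative
# and not taken from any particular paper.
from torchvision import transforms

# Augmentation pipeline a hypothetical original study might have used
baseline_augmentation = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# A follow-up experiment could try a stronger pipeline instead
stronger_augmentation = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Comparing the two fairly requires training on the *same* dataset the
# original authors used, which is exactly what private datasets prevent.
```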

This has become such a huge issue that Wikipedia has an entire page dedicated to it, aptly named the Replication Crisis. In fact, there is a whole subgroup of (meta)scientists trying to solve it. In this article, I will talk about some of the ways we can tackle this issue in the Medical Machine Learning space and in AI in general, touching upon some of the issues AI research faces when it comes to replication. If you are someone involved in these areas, I would love to hear your thoughts. Please let me know, either in the comments or by reaching out to me directly.

What types of issues do we currently have with replication?

The following is a (non-exhaustive) list of the kinds of issues that make replication difficult.

  • Private Datasets (and other withheld details): Regulations, confidentiality requirements, or other constraints can prevent a group of researchers from sharing the dataset they used in their research. I would also group into this category practices such as not sharing your code or some other detail of the procedure.
  • Costs: In some cases, it is simply impractical to replicate another group's findings because of cost. Imagine the GPT-3 team came up with some findings using their model, and they were kind enough to put the code and all the data they used online. There is still no way I could replicate their findings on my puny four-year-old Dell laptop.
  • Misaligned Incentives: In certain cases, researchers aren't really incentivized to share everything. This often happens with private-sector researchers, who share their findings without giving the details required to replicate the results. To refer to the somewhat notable case of Google Health's breast cancer research: the team wasn't primarily interested in publishing reproducible research. They were more interested in letting the world know that they had cool tech, and they published few details in their journal entry. Suffice it to say, scientists were not happy.
  • Lack of interest: Similar to the last point, there just isn't much reward in replicating someone else's research. This means researchers often don't replicate findings even when they have the means to, and in some cases the work is delegated to students. That is a lot of missed opportunity: an expert replicating a study might catch things that a student just getting into ML would miss.

So why is this an issue?

When studies can't be replicated, it can cause all kinds of issues. For example, a research team might be using a dataset that is biased in some way. They publish their results using that dataset, the details are kept private, and all is merry until the solution is introduced to the real world, where the failure suddenly surfaces. Think I'm making it up? Think of Apple's face recognition failing to recognize Asian faces, or this example of an AI mistaking a referee's bald head for a football. Yannic Kilcher does a fantastic job of breaking down how this could be harmful. Had the research procedure been made public, somebody could have caught the dataset biases.

Such issues happen more frequently than you'd think, and having open-source, reproducible research is a good way to combat them.
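
To illustrate, here is a minimal sketch of the kind of sanity check that becomes possible once a dataset is public: inspecting how the data is distributed across labels and a demographic attribute. The file name and column names are hypothetical.

```python
# A minimal sketch of a bias check on a released dataset. The file name
# and column names are hypothetical, purely for illustration.
import pandas as pd

metadata = pd.read_csv("dataset_metadata.csv")  # hypothetical metadata file

# How balanced are the target labels?
print(metadata["label"].value_counts(normalize=True))

# Is any demographic group badly under-represented?
shares = metadata["ethnicity"].value_counts(normalize=True)
print(shares)

# Flag groups below an (arbitrary) 5% share of the data
print("Under-represented groups:", list(shares[shares < 0.05].index))
```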

What can be done?

Here are some of the ideas I liked while researching this topic. If you have any thoughts on them, or ideas of your own, be sure to share them.

Centralization of regulations and procedures

We need an international standard for data sharing and reproducibility: an international body that can set rules and regulations to enable data sharing while protecting the privacy of patients. Having an internationally agreed-upon standard would let researchers share and use datasets without worrying about violating privacy requirements. A team in India would be able to use Norwegian datasets to replicate and improve upon findings without either party having to worry about the red tape on their side.
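
As a rough illustration of what such a standard could mandate, here is a minimal sketch of a de-identification step before a medical dataset is released. The column names are hypothetical, and real de-identification (for example, under HIPAA's Safe Harbor rules) involves far more than dropping a few columns.

```python
# A minimal sketch of de-identifying a medical dataset before sharing.
# Column names are hypothetical; real de-identification involves much
# more than dropping a few columns.
import pandas as pd

DIRECT_IDENTIFIERS = ["patient_name", "patient_id", "date_of_birth", "address"]

records = pd.read_csv("patient_records.csv")  # hypothetical raw export
released = records.drop(columns=DIRECT_IDENTIFIERS)

# Coarsen quasi-identifiers: replace exact age with a 10-year band
released["age_band"] = (released.pop("age") // 10 * 10).astype(str) + "s"

released.to_csv("patient_records_deidentified.csv", index=False)
```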

Having international standards for reproducibility will also give teams a clear understanding of the details they should provide in their work. Some people are pushing for this already. Joelle Pineau of Facebook AI (and McGill University) introduced a fantastic checklist for Machine Learning reproducibility; check it out right here. If you're interested in reading more about this, check out Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program). Another great initiative is the "Papers with Code" project, set up by AI researcher Robert Stojnic when he was at the University of Cambridge. Such initiatives are boosting the reproducibility of studies: if you read the report, you will see that after the checklist was shared, the share of NeurIPS submissions that included code rose from below 50% to about 75%.
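
One low-effort way to act on such a checklist is to record machine-readable metadata (seeds, library versions, a dataset hash, hyperparameters) alongside every result. Here is a minimal sketch; the fields are my own choice rather than the official checklist items, and the file names are hypothetical.

```python
# A minimal sketch of logging the details a reproducibility checklist
# asks for. The fields are my own choice, not official checklist items.
import hashlib
import json
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

def file_sha256(path: str) -> str:
    """Hash the dataset file so readers can verify they have the same data."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

run_metadata = {
    "seed": SEED,
    "torch_version": torch.__version__,
    "numpy_version": np.__version__,
    "dataset_sha256": file_sha256("train_set.bin"),  # hypothetical dataset file
    "hyperparameters": {"lr": 1e-3, "batch_size": 32, "epochs": 50},
}

with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```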

Independent Auditors

One issue private entities face is proprietary tech. For example, they might use external tools. As somebody who developed a tool that helps researchers and engineers in their Machine Learning work, I wouldn't want all my code to be public, since that would mean people no longer need my tool. Private research entities have a lot of IP to protect, so they might not be willing to share enough details about a project to allow it to be reproduced.

To address this, I believe we need an established group of auditors. Private organizations would allow them to check results, replicate the work, and go over the details, with the auditors acting as representatives of the general AI research community. This way, companies can protect their confidential IP while reducing the harm of keeping their research details private.
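
In practice, part of an audit could be as simple as re-running the released pipeline and checking the reported numbers within a tolerance. Here is a minimal sketch; the metric names, values, and tolerance are all made up for illustration.

```python
# A minimal sketch of an audit check: compare an organization's claimed
# metrics against an auditor's own re-run. All numbers are made up.
CLAIMED_METRICS = {"dice_score": 0.91, "sensitivity": 0.88}
TOLERANCE = 0.02  # allowed absolute deviation, chosen arbitrarily

def audit(reproduced: dict) -> bool:
    """Return True if every reproduced metric is within tolerance of its claim."""
    ok = True
    for name, claimed in CLAIMED_METRICS.items():
        observed = reproduced[name]
        passed = abs(observed - claimed) <= TOLERANCE
        print(f"{'OK' if passed else 'FAIL'}: {name} "
              f"claimed {claimed}, reproduced {observed}")
        ok = ok and passed
    return ok

# Numbers an auditor might get from their own re-run (hypothetical)
audit({"dice_score": 0.903, "sensitivity": 0.874})
```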

Better Incentives for Reproduction

Lastly, I believe we need more incentives for replicating other people's research. This is something that is already being pushed for: the Reproducibility Challenge is a great example, since it incentivizes high-quality replication. With more such incentives, we can encourage more researchers to replicate existing research, verify findings, and ultimately identify future improvements in the process.

We also need to improve incentives for people who share reproducible research. If research teams have a reason to share their full procedures and details (instead of just their results), reproducibility will naturally improve.

An interesting idea I came across came from Dr. Benjamin Haibe-Kains of the University of Toronto. Haibe-Kains would like to see journals split what they publish into separate branches: reproducible studies in one, tech showcases in the other. This would allow us to distinguish between the two kinds of studies. Instead of Google piggybacking off prestigious journals to show off its work, it would have to either publish more details or share its results in the showcase branch.

Closing

Hopefully, this article sparks your interest in the Replication Crisis. As you can see, it is important for Machine Learning research that published work be transparent, so that people can replicate studies to verify findings, identify areas of improvement, and ultimately add to the discourse in the field.

This article is by no means the final say on the topic of replication and detail sharing; it was meant as an introduction. I would suggest looking further into the topic. As more and more research is done by giant tech companies and private entities, this area will only become more important. Make sure to learn about it, keep up with the developments, and be sure to share anything interesting with me :). The beauty of the internet is that we can all learn from each other.

Reach Out to Me

If this article got you interested in reaching out to me, then this section is for you. You can reach out to me on any of the platforms below, or check out any of my other content. If you’d like to discuss tutoring, message me on LinkedIn, IG, or Twitter. I help people with Machine Learning, AI, Math, Computer Science, and Coding Interviews.

If you’d like to support my work, use my free Robinhood referral link. We both get a free stock, and there is no risk to you, so not using it is just losing free money.

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

My Substack: https://devanshacc.substack.com/

Live conversations at twitch here: https://rb.gy/zlhk9y

Get a free stock on Robinhood: https://join.robinhood.com/fnud75
