Could GPT-3 pass the Spanish Medical exam?

Could GPT-3 pass the MIR (the Spanish medical exam for new doctors)?

It's a question all of us in AI and medicine have asked ourselves at one time or another. So I tried it.

As you know, this past Saturday the candidates for public resident physician positions (MIR, for Médico Interno Residente) in the national health system had to sit the dreaded MIR exam. It has 200 questions on any medical subject a recent graduate is expected to know. Now that GPT-3 is so fashionable thanks to ChatGPT, the question everyone asked me was: would it be able to answer these questions correctly?

I took the 14 neurology questions that Mariano Ruiz posted on Twitter yesterday. Mariano is a neurologist and teaches at an academy that prepares candidates for this exam. The tweet is here:

For the test I used the official questions that the Ministry of Health has already published.

I translated them into English with a machine translation tool.

Then I pasted each question into the GPT-3 Playground with a very simple prompt: "Select the right answer from the above medical test:". It involves very little prompt engineering, and I am sure it could be made much more elaborate.
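The procedure above (translated question, then the one-line instruction) and the scoring that follows can be sketched in Python. This is a minimal, hypothetical illustration, not the tooling used for this test: the names `Question`, `build_prompt` and `accuracy` are invented here, numbering the options as "1.", "2.", … is an assumption, and no model API is called. The assembled text would still be pasted into the Playground by hand.

```python
from dataclasses import dataclass

# Instruction placed after each question, as quoted in the post.
INSTRUCTION = "Select the right answer from the above medical test:"


@dataclass
class Question:
    """Hypothetical container for one translated multiple-choice question."""
    stem: str           # question text, already machine-translated to English
    options: list[str]  # answer options in exam order
    correct: int        # 1-based number of the official answer


def build_prompt(q: Question) -> str:
    """Question stem, then numbered options, then the instruction on its own line."""
    lines = [q.stem]
    lines += [f"{i}. {opt}" for i, opt in enumerate(q.options, start=1)]
    lines.append(INSTRUCTION)
    return "\n".join(lines)


def accuracy(predicted: list[int], questions: list[Question]) -> float:
    """Fraction of questions whose predicted option number matches the official key."""
    if len(predicted) != len(questions):
        raise ValueError("expected exactly one prediction per question")
    hits = sum(p == q.correct for p, q in zip(predicted, questions))
    return hits / len(questions)


if __name__ == "__main__":
    # Placeholder content only; this is not a real MIR question.
    demo = Question("<question text>", ["<option 1>", "<option 2>", "<option 3>"], correct=2)
    print(build_prompt(demo))
```

Recording the option GPT-3 picked for each question and passing that list to `accuracy` turns the hand count into a single number: 11 hits out of 14 gives 11/14, about 0.79.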

The surprise is that GPT-3 gave the correct answer to 11 of the 14 questions (about 79%). It may seem a modest result, but to me it is a promising one. Let's see why:

  • MIR questions are not linguistically simple. Some are phrased in the negative, so you have to look for the option that is false.
  • For example, a question on epilepsy calls epileptic seizures "comitial seizures", a very old term that, as I learned from Dr. Angel Aledo, dates back to the assemblies the Romans held to vote. This is unusual terminology for a language model, but GPT-3 seems quite resilient to these old terms.
  • Some questions come with images. I skipped that information and used only the text.
  • In questions where you have to complete something, for example suggest a diagnostic test, GPT-3 seems to perform very well. No surprise: the G in GPT stands for generative. The model does not understand concepts; it predicts the next word in a sequence, so when what we ask of it is essentially a prediction, it is more likely to answer well.
  • I translated the questions into English. Why? Although GPT-3 works reasonably well in Spanish, it is more accurate in English, at least for medical use. That makes sense, since its training data is mostly in English.

GPT-3 still has some way to go before it performs at the level of a qualified professional, but this high number of hits is a sign that something is changing. Yesterday I read that doctors know the most about medicine when they sit the MIR (knowing in the sense of having accumulated concepts). Clinical practice then adds experience, which is fundamental. What I wonder is whether it makes sense to keep teaching medicine the way it was taught 100 years ago, with a system that rewards memorizing concepts.

From my experience with medicine, and from having many friends who are doctors, I have seen that the best doctors are those with skills they did not learn in school: empathy, the ability to recognize what they don't know, and a constant desire to learn new ways of practicing medicine.

Think about this: how many medical errors happen every day around the world that could be avoided if doctors had better tools to help them make decisions? I still don't understand how doctors make diagnostic and treatment decisions every day without any help. Would you fly with a pilot who used no instruments and flew the plane with only a map, a compass and his good eyesight?

GPT-3 and other large language models are proving to be fabulous tools for bringing us face to face with our own inconsistencies.

If you want to know more about the work we are doing at Foundation 29 with large language models for diagnosis, join the newsletter at DxGPT.

Excellent contribution, Julian Isla. I completely agree that doctors need to rely on computer tools like these to be more accurate in their diagnoses.

Miguel Rodríguez Rubio, MD

Bridging science and purpose to transform complex data into clear, actionable insights that improve patient care, empower healthcare decisions & deliver meaningful outcomes

These are promising results for GPT-3 but, to me, the key takeaway is that the MIR exam is not a very good way of assessing who is a good doctor and who isn’t. As you’ve mentioned in your post, there are many skills it doesn’t evaluate that are critical in being a good doctor (e.g. emotional skills, practical skills or critical thinking…).

Miguel Raimundo Arana Martin

Leading diverse & multinational teams to get results by executing on a robust, initiative-led approach

Amen! And we are still scratching the surface of what is possible. As with all new things, a mindset of acceptance is fundamental. Reading the article, I remembered when doctors used not to wash their hands and got angry at leading colleagues who suggested doing so... Thanks, Julian, for pushing the boundaries.
