Medical researchers at the Icahn School of Medicine at Mount Sinai recently conducted a study of artificial intelligence (AI) chatbots, examining how they might be applied in evidence-based medicine.
The Experiment: The Mount Sinai team experimented with several off-the-shelf, consumer-facing large language models (LLMs), including ChatGPT 3.5 and 4, Gemini Pro, LLaMA v2, and Mixtral-8x7B. The models were given role-setting prompts such as "you are a medical professor" and asked to follow evidence-based medicine (EBM) protocols to suggest treatments for test cases.
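The article does not reproduce the study's actual prompts; as a rough sketch, a role-primed query of the kind described might look like the following, using the OpenAI Python client (the instruction wording and test case here are invented for illustration):

```python
# Illustrative sketch only: the study's real prompts and test cases are
# not reproduced here. Requires the openai package and an API key
# (read from the OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # Role-setting prompt, per the study's description.
        {
            "role": "system",
            "content": "You are a medical professor. Follow evidence-based "
                       "medicine (EBM) protocols to suggest treatment.",
        },
        # A hypothetical test case, not one of the study's.
        {
            "role": "user",
            "content": "A 58-year-old patient presents with chest pain "
                       "radiating to the left arm. Suggest next steps "
                       "per EBM guidelines.",
        },
    ],
)

print(response.choices[0].message.content)
```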
AI Performance: ChatGPT 4 emerged as the most successful model, achieving 74% accuracy across all test cases and surpassing ChatGPT 3.5 by roughly 10%. The researchers concluded that LLMs can act as autonomous practitioners of evidence-based medicine, able to interact with real-world healthcare systems:
"LLMs can be made to function as autonomous practitioners of evidence-based medicine. Their ability to utilize tooling can be harnessed to interact with the infrastructure of a real-world healthcare system and perform the tasks of patient management in a guideline-directed manner."
Autonomous Medicine: EBM relies on evidence from past cases to guide treatment decisions in similar cases. The researchers highlighted that clinicians routinely contend with more information than they can track:
"Clinicians often face the challenge of information overload with the sheer number of possible interactions and treatment paths exceeding what they can feasibly manage or keep track of."
LLMs as Versatile Tools: The study suggests that LLMs can take on tasks typically handled by human medical experts, such as ordering and interpreting investigations or issuing alarms, freeing human professionals to focus on physical care.
"LLMs are versatile tools capable of understanding clinical context and generating possible downstream actions."
Current Limitations: However, the researchers' claims may be colored by their own view of LLMs; the paper asserts twice that "We demonstrate that the capacity of LLMs to reason is a profound ability." There is no consensus among computer scientists that LLMs genuinely reason, or that artificial general intelligence is achievable in the near future.
The paper neither defines artificial general intelligence nor addresses the ethical concerns of integrating unpredictable automated systems into clinical workflows. It does acknowledge the risk of LLMs occasionally fabricating plausible-sounding but false information, a failure mode known as "hallucination."
While AI chatbots show promise in evidence-based medicine, ongoing research is essential to address their limitations and ensure their safe and effective integration into healthcare systems.