AI startup Mendel and the University of Massachusetts Amherst (UMass Amherst) have jointly published a study detecting hallucinations in AI-generated medical summaries.
The study evaluated medical summaries generated by two large language models (LLMs), GPT-4o and Llama-3. It classifies hallucinations into five categories based on where they occur in the structure of medical notes – patient information; patient history; symptoms, diagnosis and surgical procedures; medicine-related instructions; and follow-up.
The study found that, when summarising, AI models can “generate content that is incorrect or too general according to information in the source clinical notes”, an error type known as faithfulness hallucination.
AI hallucinations are a well-documented phenomenon. Google’s use of AI in its search engine has produced some absurd responses, such as recommending “eating one small rock per day” and “adding non-toxic glue to pizza to stop it from sticking”. In medical summaries, however, such hallucinations can undermine the reliability and accuracy of medical records.
The pilot study prompted GPT-4o and Llama-3 to create 500-word summaries of 50 detailed medical notes. The researchers found that GPT-4o produced 21 summaries with incorrect information and 50 summaries with generalised information, while Llama-3 produced 19 and 47, respectively. They noted that Llama-3 tended to report details “as is” in its summaries, whilst GPT-4o made “bold, two-step reasoning statements” that can lead to hallucinations.
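For illustration only – this is not the study’s actual protocol or code – tallying clinician labels per model could look something like the following sketch, in which the models, labels and entries are hypothetical placeholders:

```python
# Hypothetical sketch: counting, per model, how many reviewed summaries were
# flagged as incorrect, generalised or faithful. Labels would in practice come
# from clinician review of each AI-generated summary against the source note.
from collections import Counter

# Each entry: (model, label) for one reviewed summary (placeholder data).
reviews = [
    ("GPT-4o", "incorrect"),
    ("GPT-4o", "generalised"),
    ("Llama-3", "faithful"),
    ("Llama-3", "generalised"),
    # ... one entry per reviewed summary
]

counts = Counter(reviews)
for (model, label), n in sorted(counts.items()):
    print(f"{model}: {n} summaries labelled '{label}'")
```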
The use of AI in healthcare has been increasing in recent years; GlobalData expects global revenue for AI platforms across healthcare to reach an estimated $18.8bn by 2027. There have also been calls to integrate AI with electronic health records to support clinical decision-making.
The UMass Amherst and Mendel study highlights the need for a hallucination detection system to boost the reliability and accuracy of AI-generated summaries. The researchers found that it took a well-trained clinician 92 minutes on average to label an AI-generated summary, making manual review expensive at scale. To overcome this, the research team employed Mendel’s Hypercube system to detect hallucinations.
The study also found that, while Hypercube tended to overestimate the number of hallucinations, it detected hallucinations that human experts would otherwise miss. The research team proposed using the Hypercube system as “an initial hallucination detection step, which can then be integrated with human expert review to enhance overall detection accuracy”.
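As a rough sketch of the two-stage workflow the researchers propose – an automated detector flags candidate hallucinations first and a clinician then confirms or dismisses them – the following is illustrative only; the function names are placeholders and do not reflect Mendel’s actual Hypercube API:

```python
# Hypothetical two-stage review: automated flagging followed by human review.
from typing import List

def automated_detector(summary: str, source_note: str) -> List[str]:
    """Placeholder for an automated hallucination detector.

    Returns statements in the summary not supported by the source note.
    A real system would use far more sophisticated matching than substring checks.
    """
    return [s for s in summary.split(". ") if s and s not in source_note]

def review_pipeline(summary: str, source_note: str) -> List[str]:
    """Route automatically flagged statements to human expert review."""
    candidates = automated_detector(summary, source_note)
    # A clinician would confirm or dismiss each candidate; here we just return them.
    return candidates

flags = review_pipeline("Patient denies chest pain. Patient is on warfarin.",
                        "Patient denies chest pain.")
print(flags)  # ['Patient is on warfarin.']
```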