Daily Newsletter

07 August 2024

Hallucinations in AI-generated medical summaries remain a grave concern

A study by Mendel and UMass Amherst identifies different types of hallucinations in AI-summarised medical records and highlights the need for robust detection.

Phalguni Deswal August 07 2024

AI startup Mendel and the University of Massachusetts Amherst (UMass Amherst) have jointly published a study detecting hallucinations in AI-generated medical summaries.

The study evaluated medical summaries generated by two large language models (LLMs), GPT-4o and Llama-3. It groups hallucinations into five categories based on where they occur in the structure of medical notes: patient information; patient history; symptoms, diagnosis and surgical procedures; medicine-related instructions; and follow-up.
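For readers who want to picture how such a taxonomy might be applied in practice, the sketch below shows one possible labelling schema. It is purely illustrative: the class and field names are assumptions for this article, not the study's own annotation scheme or any Mendel code.

```python
from dataclasses import dataclass
from enum import Enum


class NoteSection(Enum):
    """The five note sections in which the study categorises hallucinations."""
    PATIENT_INFORMATION = "patient information"
    PATIENT_HISTORY = "patient history"
    SYMPTOMS_DIAGNOSIS_PROCEDURES = "symptoms, diagnosis, surgical procedures"
    MEDICINE_INSTRUCTIONS = "medicine-related instructions"
    FOLLOW_UP = "follow-up"


class HallucinationType(Enum):
    """The two faithfulness failures reported: incorrect or overly general content."""
    INCORRECT = "incorrect information"
    TOO_GENERAL = "generalised information"


@dataclass
class HallucinationLabel:
    """One annotation attached to a span of an AI-generated summary (illustrative only)."""
    section: NoteSection       # where in the note structure the error falls
    kind: HallucinationType    # incorrect vs. too general
    summary_span: str          # the offending text in the generated summary
    source_evidence: str       # the relevant text, if any, in the source clinical note
```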

The study found that AI models producing summaries can “generate content that is incorrect or too general according to information in the source clinical notes”, a failure referred to as faithfulness hallucination.

AI hallucinations are a well-documented phenomenon. Google’s use of AI in its search engine has produced some absurd responses, such as recommending “eating one small rock per day” and “adding non-toxic glue to pizza to stop it from sticking”. In medical summaries, however, such hallucinations can undermine the reliability and accuracy of the records.

The pilot study prompted GPT-4o and Llama-3 to create 500-word summaries of 50 detailed medical notes. The researchers found that GPT-4o produced 21 summaries with incorrect information and 50 summaries with generalised information, while Llama-3 produced 19 and 47, respectively. They noted that Llama-3 tended to report details “as is” in its summaries, whilst GPT-4o made “bold, two-step reasoning statements” that can lead to hallucinations.

The use of AI in healthcare has been increasing in recent years; GlobalData expects global revenue for AI platforms across healthcare to reach an estimated $18.8bn by 2027. There have also been calls to integrate AI with electronic health records to support clinical decision-making.

The UMass Amherst and Mendel study highlights the need for a hallucination detection system to boost the reliability and accuracy of AI-generated summaries. The researchers found that labelling a single AI-generated summary took a well-trained clinician 92 minutes on average, making manual review expensive at scale. To overcome this, the team employed Mendel’s Hypercube system to detect hallucinations.

The study also found that while Hypercube tended to overestimate the number of hallucinations, it detected hallucinations that would otherwise be missed by human experts. The research team proposed using the Hypercube system as “an initial hallucination detection step, which can then be integrated with human expert review to enhance overall detection accuracy”.
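As a rough illustration of that proposed workflow, a two-stage pipeline might be organised as in the sketch below. The study does not publish Hypercube’s interface, so the function and type names here are stand-ins chosen for this article, not Mendel’s actual API.

```python
from typing import Callable, List

# Hypothetical stand-ins for the two stages described in the study:
# (1) an automated detector that flags candidate hallucinations (may over-flag), and
# (2) a human expert review step that confirms or rejects the flagged spans.
AutomatedDetector = Callable[[str, str], List[str]]        # (clinical_note, summary) -> candidate flags
ExpertReview = Callable[[str, str, List[str]], List[str]]  # adjudicates the candidate flags


def two_stage_detection(
    clinical_note: str,
    summary: str,
    detector: AutomatedDetector,
    reviewer: ExpertReview,
) -> List[str]:
    """Run automated detection first, then limit expensive expert review
    to the spans the detector flagged."""
    candidates = detector(clinical_note, summary)
    return reviewer(clinical_note, summary, candidates)
```

The design point is simply that the automated pass trades precision for recall, and the clinician’s 92 minutes of labelling effort is spent only on the flagged spans rather than the whole summary.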
