Benchmarking Tool to Assess Medical Large Language Models

A team from Google Research, the US National Library of Medicine, and DeepMind has developed a benchmark for assessing the clinical knowledge of medical large language models (LLMs) and applied it to two such models. As they report in Nature, the researchers developed a tool dubbed MultiMedQA that combines six existing medical knowledge datasets with a seventh, newly assembled dataset of commonly searched health questions. Using MultiMedQA, the team assessed the Pathways Language Model (PaLM) and the related Flan-PaLM, finding that Flan-PaLM answered US medical licensing-style questions with 67.6 percent accuracy, which they noted exceeded previous approaches. However, Flan-PaLM struggled to provide long-form answers to consumer medical questions. The team then adapted the model using instruction prompt tuning to create Med-PaLM, whose answers a panel of clinicians judged to be in line with the scientific consensus nearly 93 percent of the time, compared with about 62 percent of the time for Flan-PaLM. "Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications," the researchers write.