To stay current and push their fields forward, scientists must keep thousands of published studies at their fingertips. Large language models (LLMs) show promise as a tool for exploring this vast scientific literature, but are they trustworthy when it comes to providing complete and scientifically accurate answers to complex questions in specialized fields?
To find out, Cornell physicists and Google researchers engaged a panel of 12 human experts to test the ability of six LLM systems – ChatGPT, Claude and others – to understand scientific literature at the level of a specialist, using the field of high-temperature cuprates, a class of superconducting materials, as a test case. Some systems performed better than others, they found. The study also…