Roupen Odabashian, Hematology/Oncology Fellow at the Karmanos Cancer Institute, posted on LinkedIn:
“New Study at ASCO2025.
Can large language models like GPT-4 and Claude Opus reason like oncologists?
The way we’re currently evaluating large language models—with those shiny journal titles touting multiple-choice exam benchmarks for accuracy—is just horribly WRONG.
Would you trust a fresh-out-of-med-school doctor to treat your cancer based solely on passing a multiple-choice test, without any real-world experience handling complex cases with multiple, difficult treatment options?
In our study at ASCO2025, we assessed large language models using multiple-choice questions, but we focused on their clinical reasoning, not just their accuracy. And the results? Shocking.
We benchmarked the clinical reasoning of AI models using 273 breast oncology multiple-choice questions from the ASCO QBank.
Key findings: GPT-4 and Claude Opus both started with high accuracy (81.3% and 79.5%, respectively).
After applying chain-of-thought prompting to simulate stepwise reasoning: Claude’s performance improved to 86.4%. GPT-4’s accuracy slightly declined to 80.95%.
That’s where we looked at their clinical reasoning! Common AI errors included:
- Reliance on outdated guidelines
- Misinterpretation of clinical trial data
- Lack of individualized/multidisciplinary care reasoning
Conclusion: LLMs are promising tools, but still fall short in nuanced, real-world oncology decision-making. Human supervision remains essential.
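The evaluation the post describes — scoring each multiple-choice item once with a direct-answer prompt and once with a chain-of-thought prompt, then comparing accuracy — can be sketched in a few lines. The snippet below is an illustrative outline only, not the study’s actual pipeline: the prompt wording, the GPT-4 model choice, the question schema, the answer-letter parser, and the `load_qbank` loader are all assumptions.

```python
# Illustrative sketch of an MCQ benchmark run with and without chain-of-thought
# prompting. Prompts, question schema, and the naive answer parser are
# assumptions for demonstration; the ASCO QBank items are not redistributable.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIRECT = "Answer with the letter of the single best option only."
COT = ("Reason step by step about the patient scenario, current guidelines, "
       "and relevant trial data, then end with 'Final answer: <letter>'.")

def ask(stem: str, options: dict[str, str], instruction: str) -> str:
    """Query the model on one item and return the option letter it picked."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": prompt},
        ],
    )
    letters = re.findall(r"\b([A-E])\b", resp.choices[0].message.content)
    return letters[-1] if letters else ""  # naive: take the last standalone letter

def accuracy(items: list[dict], instruction: str) -> float:
    """Fraction of items where the parsed letter matches the answer key."""
    hits = sum(ask(q["stem"], q["options"], instruction) == q["answer"] for q in items)
    return hits / len(items)

# items = load_qbank()  # hypothetical loader for the 273 breast oncology MCQs
# print(f"direct: {accuracy(items, DIRECT):.1%}  CoT: {accuracy(items, COT):.1%}")
```

Holding temperature at 0 and changing only the system instruction isolates the effect of the prompting strategy on measured accuracy.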