From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization

C. G. Belem, P. Pezeshkpour, H. Iso, S. Maekawa, N. Bhutani, E. Hruschka

Findings of the Annual Conference of the North American Chapter of the ACL (NAACL 2025), 2025

Figure: An illustrative example of hallucination in multi-document summarization: the LLM summarizes information not shared across documents, raising concerns about trustworthiness in MDS settings.

Abstract

Although many studies have investigated and reduced hallucinations in large language models (LLMs) for single-document tasks, hallucination in multi-document summarization (MDS) tasks remains largely unexplored. We create two novel benchmarks to investigate how hallucinations manifest when LLMs summarize topic-specific information across multiple documents. Our findings reveal that up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries. When tasked with summarizing non-existent information, GPT-3.5-turbo and GPT-4o generated summaries approximately 79% and 44% of the time, respectively. We identify that most errors stem from either failing to follow instructions or producing overly generic insights. While simple post-hoc baselines show only moderate effectiveness in reducing hallucinations, our work underscores the necessity for more systematic mitigation approaches in multi-document summarization tasks.