LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Highlights
- ~24,000 instances from 1,000 real U.S. judicial opinions
- Even the strongest models score below 7/100 on citation retrieval and completion
- Over 94% of the 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks
Abstract
Legal AI systems tend to generate plausible-sounding but fabricated case citations. We introduce LegalCiteBench, a benchmark of roughly 24,000 evaluation instances derived from 1,000 authentic U.S. judicial opinions, spanning five citation-focused tasks: retrieval, completion, error detection, case matching, and verification. Exact citation recovery remains highly challenging in the closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion, and over 94% of the 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks. Model scale and legal pretraining offer minimal improvement, while uncertainty prompts reduce, but do not eliminate, confident hallucination of legal authorities.
arXiv:2605.10186