AI Safety & Evaluation · arXiv preprint · May 11, 2026

LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Sijia Chen

Hang Yin

Shunfan Zhou


Highlights

~24,000 instances from 1,000 real U.S. judicial opinions
Strongest models score below 7/100 on citation retrieval
Over 94% of the 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks

Abstract

Legal AI systems tend to generate plausible-sounding but fabricated case citations. We introduce LegalCiteBench, roughly 24,000 evaluation instances derived from 1,000 authentic U.S. judicial opinions, spanning five citation-focused tasks: retrieval, completion, error detection, case matching, and verification. Exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion, and over 94% of 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks. Model scale and legal pretraining offer minimal improvement, while uncertainty prompts reduce — but do not eliminate — confident hallucination of legal authorities.

arXiv:2605.10186