AI Safety & Evaluation · arXiv preprint · May 11, 2026

LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Highlights

  • ~24,000 instances from 1,000 real U.S. judicial opinions
  • Strongest models score below 7/100 on citation retrieval and completion
  • Over 94% of the 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks

Abstract

Legal AI systems tend to generate plausible-sounding but fabricated case citations. We introduce LegalCiteBench, roughly 24,000 evaluation instances derived from 1,000 authentic U.S. judicial opinions, spanning five citation-focused tasks: retrieval, completion, error detection, case matching, and verification. Exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion, and over 94% of 20 evaluated models show high misleading-answer rates on retrieval-heavy tasks. Model scale and legal pretraining offer minimal improvement, while uncertainty prompts reduce — but do not eliminate — confident hallucination of legal authorities.

arXiv:2605.10186
