LegalCiteBench @ ICML 2026: Legal AI Needs Reliable Citation Evaluation

Jun 25, 2026 · 5 min read

"LegalCiteBench: Evaluating Citation Reliability in Legal Language Models" was accepted to the AI4Law @ ICML 2026 Workshop. The paper studies a failure mode that matters in real legal work: whether an AI system can name the right case and cite it correctly.

Authors: Sijia Chen, Hang Yin, Shunfan Zhou. On X: @bgmshana and @zhou49.

Paper: https://arxiv.org/abs/2605.10186

Workshop: https://sites.google.com/view/ai4law-icml2026

LegalCiteBench summary card.

The short version

LegalCiteBench tests citation reliability in the place where legal AI can fail quietly: exact authority. The benchmark asks models to retrieve, complete, verify, and correct real case citations from U.S. judicial opinions.

  • Scale: 23,646 evaluation instances built from 1,000 root judicial opinions.
  • Generation result: the best model scores 6.80/100 on citation retrieval and 6.35/100 on citation completion.
  • Verification result: the best model reaches 75.59 on citation error detection and 96.07 on case verification/correction.
  • Risk signal: 20 of 21 models have a Misleading Answer Rate (MAR) above 94%.

Why this benchmark matters

Many legal AI benchmarks measure reasoning, legal QA, or contract understanding. LegalCiteBench focuses on a narrower problem with real product risk: citation reliability. The benchmark asks whether a model can recover, complete, check, and correct case citations when it has to work from its internal knowledge.

This is a hard problem because legal citations are compact identifiers for authority. A citation can look completely plausible while pointing to the wrong case, the wrong reporter, or the wrong page. In a legal workflow, that kind of mistake can quietly contaminate research, briefs, and client advice.

How LegalCiteBench is built

LegalCiteBench is built from real U.S. judicial opinions in the Case Law Access Project. The benchmark starts with 1,000 root cases from the last decade and produces 23,646 evaluation instances. LLMs help write structured prompts and construction notes, while the ground truth stays anchored in citation lists and case metadata from the opinions.

Task-family scores across 21 models.

The dataset covers five task families: citation retrieval, citation completion, citation error detection, case matching, and case verification/correction.

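The task families form a spectrum, labeled Cat1 through Cat4 in the paper. Cat1 (citation retrieval) and Cat2 (citation completion) ask the model to produce missing legal authority. Cat3 asks the model to find citation errors. Cat4 asks the model to match case references and to verify and correct them; the verification/correction task is labeled Cat4-2. The paper's main contrast is generation versus verification.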

That contrast is why the chart matters. Low Cat1 and Cat2 scores show that exact citation recall is weak. Higher Cat3 and Cat4-2 scores show that models are more useful when the system gives them something to check.

Result 1: closed-book citation generation is still weak

The paper evaluates 21 models across closed-source APIs, open models, general-purpose models, and legal-domain models. On the two generation-heavy tasks, the best score is 6.80/100 for citation retrieval and 6.35/100 for citation completion. Every model stays below 7/100 on both.

Misleading Answer Rate by model, adapted from the paper Figure 3.

Misleading Answer Rate (MAR) turns the failure into an operational metric. A high MAR means the model tends to give a concrete answer instead of admitting uncertainty. For legal workflows, that is the uncomfortable part: the output can look usable before anyone checks the citation.

The gap between fluent output and exact recall is the main result. A model can write fluent legal analysis and still fail at exact authority recovery. LegalCiteBench separates those two abilities by asking for the specific case name, volume, reporter, page, and related citation details.
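To make "exact authority" concrete, here is a minimal sketch of a field-level comparison between a predicted citation and a reference citation. The `Citation` class and `mismatched_fields` helper are hypothetical names introduced for illustration; this is not the paper's scoring code, only a way to see how a citation can look right while one field is wrong.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Citation:
    """Illustrative citation record (hypothetical, not the benchmark's schema)."""
    case_name: str
    volume: int
    reporter: str
    page: int


def mismatched_fields(predicted: Citation, gold: Citation) -> list[str]:
    """Return the names of the fields where the prediction differs from the reference."""
    return [
        f.name
        for f in fields(Citation)
        if getattr(predicted, f.name) != getattr(gold, f.name)
    ]


gold = Citation("Miranda v. Arizona", 384, "U.S.", 436)
# Same case name, volume, and reporter, but the page digits are transposed.
predicted = Citation("Miranda v. Arizona", 384, "U.S.", 463)

print(mismatched_fields(predicted, gold))  # ['page']
```

A reader skimming the predicted citation would likely accept it, which is why a field-by-field check against a trusted source is more useful than a plausibility check.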

Result 2: verification is much more reliable

The same models do better when the task becomes verification. The best score reaches 75.59 on citation error detection and 96.07 on case verification and correction. This is a useful distinction for product design: checking a candidate citation is much easier than generating the right authority from memory.

The practical pattern is clear: use retrieval and legal databases to produce candidate authorities, then use models as auditors, verifiers, and explanation engines. In legal AI, citation handling should be built like a validation pipeline, not a memory test.

Result 3: Misleading Answer Rate is high

The paper introduces Misleading Answer Rate, or MAR, to capture a specific failure: a model gives a concrete citation or case answer while scoring poorly on retrieval-heavy tasks. MAR matters because confident wrong answers are more dangerous than visible uncertainty.
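As a rough illustration of how a rate like this could be computed, the sketch below counts how often a low-scoring response still commits to a concrete answer. This post does not give the paper's exact formula, so the `Response` record, the `low_score_threshold` parameter, and its default value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model response (hypothetical record for illustration)."""
    abstained: bool  # True if the model declined to give a citation
    score: float     # task score on a 0-100 scale


def misleading_answer_rate(responses, low_score_threshold=50.0):
    """Percent of low-scoring responses that still gave a concrete answer.

    The threshold is an assumed cutoff, not a value taken from the paper.
    """
    low = [r for r in responses if r.score < low_score_threshold]
    if not low:
        return 0.0
    concrete = sum(1 for r in low if not r.abstained)
    return 100.0 * concrete / len(low)


responses = [
    Response(abstained=False, score=0.0),   # confident and wrong
    Response(abstained=False, score=10.0),  # confident and mostly wrong
    Response(abstained=True, score=0.0),    # admitted uncertainty
    Response(abstained=False, score=95.0),  # correct, excluded from the rate
]

print(round(misleading_answer_rate(responses), 1))  # 66.7
```

Under this reading, a model lowers its rate by abstaining when it does not know the answer, not by answering more often.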

Prompt-only abstention comparison.

An abstention instruction helps only when the model actually follows it. The benchmark shows that a prompt can make some models refuse more often. The missing citation knowledge still has to come from retrieval, source checking, and citation normalization outside the model.

The result is blunt: 20 of 21 models have a MAR above 94%. The model with the lowest MAR, GPT-5-mini, is still at 89.24%. In this benchmark, most models answer with something specific even when they are likely wrong.

Prompt-only abstention helps, with a ceiling

The paper also tests a lightweight abstention instruction: stop guessing when the exact citation cannot be verified. The effect varies by model. Qwen3-14B drops from 89.8% MAR to 69.9%; Llama-3.1-8B-Instruct drops from 100.0% to 62.7%; Mistral-7B-Instruct moves from 100.0% to 98.3%.
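The size of the effect is easier to compare as a drop in percentage points. The short sketch below does that arithmetic on the three before/after pairs quoted above; the variable and function names are illustrative only.

```python
# MAR before and after the abstention instruction, as quoted in this post (percent).
mar_before_after = {
    "Qwen3-14B": (89.8, 69.9),
    "Llama-3.1-8B-Instruct": (100.0, 62.7),
    "Mistral-7B-Instruct": (100.0, 98.3),
}


def mar_drop(before: float, after: float) -> float:
    """Absolute drop in MAR, in percentage points, rounded to one decimal."""
    return round(before - after, 1)


drops = {model: mar_drop(b, a) for model, (b, a) in mar_before_after.items()}

for model, drop in drops.items():
    print(f"{model}: -{drop} points")
# Qwen3-14B: -19.9 points
# Llama-3.1-8B-Instruct: -37.3 points
# Mistral-7B-Instruct: -1.7 points
```

The spread from 1.7 to 37.3 points is the ceiling in practice: the same instruction changes behavior a lot for one model and barely at all for another.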

LegalCiteBench construction pipeline.

Why this matters for Phala

At Phala, we care about AI systems that can be checked. LegalCiteBench is a good example of the broader problem: users need more than a fluent answer. They need evidence, provenance, and a way to tell whether the system used the right source.

For legal AI, that means citation pipelines. For confidential and agentic AI more broadly, it means verifiable execution, auditable data flow, and systems that expose when they know the answer and when they need a source.

LegalCiteBench points to a practical conclusion: legal citation workflows need grounding, verification, and calibrated abstention. A deployable legal AI system should treat citations as verifiable data objects that must be retrieved, normalized, checked, and explained.

The retrieval layer should find candidate authority. The normalization layer should parse and validate citation strings. The verification layer should check case identity and legal relevance. The generation layer should expose uncertainty when a citation cannot be confirmed.
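The normalization and verification layers can be sketched in a few lines. The example below is a simplified illustration, not a production design: `CASE_INDEX` is a hypothetical in-memory stand-in for a real legal database, the regular expression only handles plain "volume reporter page" strings, and the status labels are invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedCitation:
    volume: int
    reporter: str
    page: int


# Hypothetical stand-in for a legal database, keyed by (volume, reporter, page).
CASE_INDEX = {
    (384, "U.S.", 436): "Miranda v. Arizona",
    (347, "U.S.", 483): "Brown v. Board of Education",
}

# Simplified pattern: "<volume> <reporter> <page>", e.g. "384 U.S. 436".
CITATION_RE = re.compile(r"^\s*(\d+)\s+([A-Za-z0-9. ]+?)\s+(\d+)\s*$")


def normalize(raw: str) -> Optional[ParsedCitation]:
    """Normalization layer: parse a citation string, or return None if it does not parse."""
    match = CITATION_RE.match(raw)
    if match is None:
        return None
    volume, reporter, page = match.groups()
    # Collapse repeated whitespace inside the reporter abbreviation.
    return ParsedCitation(int(volume), " ".join(reporter.split()), int(page))


def verify(case_name: str, raw_citation: str) -> str:
    """Verification layer: check that the citation exists and points to the named case."""
    parsed = normalize(raw_citation)
    if parsed is None:
        return "unparseable"
    known = CASE_INDEX.get((parsed.volume, parsed.reporter, parsed.page))
    if known is None:
        return "unconfirmed"  # surface uncertainty instead of trusting the model
    if known.lower() == case_name.strip().lower():
        return "confirmed"
    return "wrong_case"


print(verify("Miranda v. Arizona", "384 U.S. 436"))  # confirmed
print(verify("Miranda v. Arizona", "347 U.S. 483"))  # wrong_case
print(verify("Miranda v. Arizona", "384 U.S. 463"))  # unconfirmed
```

The point of returning a status instead of a boolean is that the generation layer can treat "unconfirmed" and "wrong_case" differently: the first calls for abstention or another retrieval attempt, the second for a correction.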

The benchmark is diagnostic. It shows where model memory is weak, where verification is stronger, and why reliable legal products need a citation pipeline around the model.

Resources

Paper: https://arxiv.org/abs/2605.10186

Code/data: https://github.com/Sijia711/LegalCiteBench

Workshop: https://sites.google.com/view/ai4law-icml2026
