AI Hallucination in Legal Research: What the Stanford Study Found and Why It Matters

A critical look at the reliability of AI-powered legal research tools, the Stanford study that tested them, and what practitioners need to know before relying on AI-generated citations.

In February 2025, researchers at Stanford Law School published a study that should concern every solicitor and barrister who has started using AI for legal research. The paper — “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools,” published in the Journal of Empirical Legal Studies — was the first preregistered empirical evaluation of the AI research tools offered by the two dominant legal information providers, Thomson Reuters (Westlaw) and RELX (LexisNexis).

The results were not what those companies' marketing departments would have chosen.

What the Researchers Tested

The Stanford team designed a systematic evaluation of the AI legal research tools sold by those providers, including Westlaw's AI-Assisted Research and LexisNexis's Lexis+ AI, with OpenAI's GPT-4 as a baseline comparison. They posed legal research queries across multiple practice areas and evaluated the responses for hallucinations, defined as outputs that were either factually incorrect or “misgrounded,” meaning the AI cited a real source that did not actually support the claim being made.

This second category is particularly important. A hallucination isn't just a fabricated case that doesn't exist. It includes citing a real case for a proposition it doesn't stand for, attributing a rule to a statute that doesn't contain it, or describing a holding that doesn't match what the court actually decided. These subtle errors are arguably more dangerous than outright fabrication, because a solicitor who checks whether the case exists will find it — and may stop checking there, assuming the citation is sound.

The Numbers

Lexis+ AI hallucinated in 17% of queries tested. Westlaw's AI-Assisted Research hallucinated in 33% of queries. GPT-4, used without any legal database integration, hallucinated in 43% of queries.

Read those numbers again. One in three queries to Westlaw's AI tool produced unreliable output. One in six for Lexis+ AI. These aren't tools being used casually — they're marketed to practising lawyers for professional legal research, often with claims of high or even perfect accuracy.

LexisNexis had promoted Lexis+ AI as delivering “hallucination-free linked legal citations.” The Stanford study demonstrated that this claim was overstated. Thomson Reuters, to its credit, had been somewhat more cautious in its marketing language — but a 33% hallucination rate is not a margin-of-error problem. It's a fundamental reliability problem.

Why “Retrieval-Augmented Generation” Doesn't Solve the Problem

Both Westlaw and LexisNexis use a technique called Retrieval-Augmented Generation (RAG), where the AI retrieves relevant documents from a legal database before generating its response. The theory is that by grounding the AI's output in real documents, you eliminate hallucination. The Stanford study showed that this theory is wrong — or at least, that current implementations don't deliver on it.

The issue is architectural. In a RAG system, the AI retrieves documents, then generates a natural language response based on what it found. The generation step is where hallucination occurs. The AI might retrieve the right case but mischaracterise its holding. It might retrieve several cases and conflate their ratios. It might generate a plausible-sounding legal principle and attribute it to a source that says something different. The retrieval step helps, but it doesn't constrain the generation step enough to guarantee accuracy.
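
The failure mode described above can be made concrete with a toy pipeline. This is a minimal sketch, not how Westlaw or LexisNexis build their products: the case name is invented, the retrieval is a naive keyword match, and `overconfident_model` is a hypothetical stand-in for a language model. It shows that retrieval can succeed while the generated sentence still misstates the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Document:
    citation: str  # invented placeholder, not a real authority
    text: str


def retrieve(query: str, corpus: list[Document]) -> list[Document]:
    """Toy retrieval: keep documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.text.lower().split())]


def answer(
    query: str, corpus: list[Document], model: Callable[[str], str]
) -> tuple[str, list[Document]]:
    """Retrieve first, then let the model write free text over what was found."""
    docs = retrieve(query, corpus)
    context = "\n".join(f"{d.citation}: {d.text}" for d in docs)
    prompt = f"Sources:\n{context}\n\nQuestion: {query}"
    return model(prompt), docs


def overconfident_model(prompt: str) -> str:
    """Hypothetical model: cites the retrieved source for a rule it does not state."""
    return "Notice must be given within 7 days (Example v Sample [2000] HYPO 1)."


CORPUS = [
    Document("Example v Sample [2000] HYPO 1", "notice must be given within 28 days"),
]

response, retrieved = answer("notice period", CORPUS, overconfident_model)

# The cited source is real and was retrieved...
cited_source_exists = any(d.citation in response for d in retrieved)
# ...but the generated claim does not match what the source says.
claim_matches_source = "28 days" in response

print(cited_source_exists, claim_matches_source)
```

Nothing in the pipeline forces the model's sentence to agree with the text it was given, which is the gap the study measured.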

This is not a criticism of RAG as a technique. It's a recognition that the “last mile” — the step where AI turns retrieved documents into a narrative response — introduces error that cannot be fully eliminated by better retrieval alone.

What This Means for Practitioners

If you're a solicitor or barrister using AI legal research tools, the Stanford study has practical implications for how you work.

You cannot treat AI-generated citations as verified. Every citation in an AI-generated research memo needs to be checked against the original source. This is true for Westlaw, true for LexisNexis, and true for any general-purpose AI like ChatGPT or Claude used directly. The hallucination rates are too high to justify reliance without verification.

The risk is highest for misgrounded citations. A fabricated case is easy to catch — you search for it and it doesn't exist. A real case cited for the wrong proposition is much harder to detect, because the case exists, the citation format is correct, and the AI's description of the holding sounds plausible. Catching this error requires actually reading the case, which is exactly the work the AI was supposed to save you.

Practice area matters. The Stanford study found variation across practice areas, with some areas showing higher hallucination rates than others. If you're researching a niche or complex area of law — tax, immigration, regulatory — the risk may be higher because the AI has less training data to draw from and the legal principles are more technical.

Professional conduct obligations haven't changed. The SRA Code of Conduct still requires competent and diligent service. Citing a case that doesn't say what you claim it says is a professional conduct issue whether the error originated with you or with an AI tool. “Westlaw's AI told me” is not a defence to a complaint about misleading the court.
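
The difference between checking that a citation exists and checking that it supports the proposition, described earlier in this section, can be sketched as a simple triage step. Everything here is illustrative: the case names are invented, `official_index` stands in for whatever authoritative database you check against, and the `read_and_confirmed` flag is assumed to be set only by a person who has read the judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_FOUND = "not found: possible fabrication"
    EXISTS_UNREAD = "exists, proposition not yet checked"
    VERIFIED = "read and confirmed by a practitioner"


@dataclass
class CitedProposition:
    citation: str
    proposition: str
    read_and_confirmed: bool = False  # set by a human after reading the source


def status_of(item: CitedProposition, official_index: set[str]) -> Status:
    """An existence check alone never yields VERIFIED; that needs a human read."""
    if item.citation not in official_index:
        return Status.NOT_FOUND
    if not item.read_and_confirmed:
        return Status.EXISTS_UNREAD
    return Status.VERIFIED


# Hypothetical index and memo, for illustration only.
official_index = {"Example v Sample [2000] HYPO 1", "Alpha v Beta [2001] HYPO 2"}
memo = [
    CitedProposition("Example v Sample [2000] HYPO 1", "notice period is 28 days", True),
    CitedProposition("Alpha v Beta [2001] HYPO 2", "the duty extends to third parties"),
    CitedProposition("Gamma v Delta [2002] HYPO 3", "the limitation period is six years"),
]

for item in memo:
    print(item.citation, "->", status_of(item, official_index).value)
```

The middle category is the one that matters: a citation that passes the existence check but has not been read is exactly where a misgrounded authority hides.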

The Deeper Problem: Trust Architecture

The hallucination problem is fundamentally about trust architecture — how a system is designed to earn and deserve reliance.

Traditional Westlaw and LexisNexis (without AI) are databases. You search, you get documents. The documents are what they are. The trust architecture is simple: the database reproduces the source material, and you read it yourself. Errors can occur in transcription or indexing, but the system doesn't generate claims about what cases mean — that's your job.

AI-assisted features reverse this architecture. The system generates claims — “this case establishes that…” or “the leading authority on this point is…” — and you're expected to trust those claims. The trust now depends not just on the accuracy of the underlying database, but on the accuracy of the AI's interpretation of that database. That's a much harder problem, and the Stanford study shows it hasn't been solved.

A different approach is possible. Instead of asking AI to generate claims and hoping the retrieval step keeps it honest, you can design a system where AI assists with retrieval and organisation — finding relevant authorities, mapping citation networks, identifying how cases treat each other — while keeping the interpretive claims grounded in verifiable data rather than generated narrative.

This is the approach taken by Search the Law, which searches 21 official UK legal databases simultaneously, maps citation networks across 191,500+ citation pairs, and classifies how each authority has been treated by subsequent courts (applied, distinguished, doubted, or overruled). Every citation links directly to the original judgment on The National Archives or the relevant government database. The platform uses AI to structure and analyse research outputs, but the citations themselves are retrieved from and verified against official sources through a multi-stage verification pipeline — not generated by the AI model.

The distinction matters. When an AI generates a citation, it's producing text that looks like a citation. When a system retrieves a citation from an official database and verifies it against the source, the citation is the source. The error modes are categorically different. A retrieval system might miss a relevant case or misclassify a citation treatment, but it cannot fabricate a case that doesn't exist or attribute a holding to a case that doesn't contain it. The floor of reliability is structurally higher.
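
One way to picture the structural difference is a data model in which a citation can only come from a stored record. This is a hedged sketch under assumed names, not Search the Law's actual implementation: `CitationIndex`, `Judgment`, and `CitationPair` are hypothetical, and the case names and URLs are placeholders. The four treatment labels are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Treatment(Enum):
    APPLIED = "applied"
    DISTINGUISHED = "distinguished"
    DOUBTED = "doubted"
    OVERRULED = "overruled"


@dataclass(frozen=True)
class Judgment:
    citation: str
    source_url: str  # link back to the official publication


@dataclass(frozen=True)
class CitationPair:
    citing: Judgment
    cited: Judgment
    treatment: Treatment  # can be misclassified: the remaining error mode


class CitationIndex:
    def __init__(self, judgments: list[Judgment]) -> None:
        self._by_citation = {j.citation: j for j in judgments}

    def get(self, citation: str) -> Judgment:
        """Return the stored record or fail; never build a citation from text."""
        try:
            return self._by_citation[citation]
        except KeyError:
            raise LookupError(f"no record for {citation!r}") from None

    def link(self, citing: str, cited: str, treatment: Treatment) -> CitationPair:
        """Both ends of a pair must already exist in the index."""
        return CitationPair(self.get(citing), self.get(cited), treatment)


# Invented records with placeholder URLs.
index = CitationIndex([
    Judgment("Example v Sample [2000] HYPO 1", "https://example.org/hypo/2000/1"),
    Judgment("Alpha v Beta [2001] HYPO 2", "https://example.org/hypo/2001/2"),
])

pair = index.link(
    "Alpha v Beta [2001] HYPO 2", "Example v Sample [2000] HYPO 1", Treatment.APPLIED
)
print(pair.cited.source_url, pair.treatment.value)
```

In a design like this, asking for an unknown citation raises an error instead of producing plausible text, while the `treatment` label remains something a reader should still check against the judgment.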

What About Accuracy Rates?

It's worth putting the Stanford numbers in context. A 17% hallucination rate (Lexis+ AI) means 83% of responses contained no hallucination; a 33% rate (Westlaw AI) means 67% did not. A response that avoids hallucination is not necessarily complete or useful, so these figures are a ceiling on reliability, not a measure of it. Both tools still did better than GPT-4 used directly (57% of responses free of hallucination), Lexis+ AI by a wide margin and Westlaw by a narrower one. But “better than a general-purpose chatbot” and “reliable enough for professional legal work” are different standards.

Consider what a 17–33% error rate means in practice. A solicitor who runs ten AI-assisted research queries in a week can expect, on average, two or three responses containing hallucinated content. If they don't catch those errors (and the Stanford study shows that the errors are often subtle enough to pass casual review), those errors enter advice letters, skeleton arguments, and court submissions. The consequences range from professional embarrassment to negligence claims.
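
The arithmetic behind the ten-query example is easy to check. The sketch below takes the study's 17% and 33% rates and assumes, purely for illustration, that each query is an independent trial with the same chance of hallucination; real workloads will not behave that neatly.

```python
def expected_hallucinations(rate: float, queries: int) -> float:
    """Average number of hallucinated responses across a batch of queries."""
    return rate * queries


def prob_at_least_one(rate: float, queries: int) -> float:
    """Chance that at least one response in the batch is hallucinated,
    assuming independent queries (an illustrative simplification)."""
    return 1 - (1 - rate) ** queries


for tool, rate in [("Lexis+ AI", 0.17), ("Westlaw AI-Assisted Research", 0.33)]:
    expected = expected_hallucinations(rate, 10)
    at_least_one = prob_at_least_one(rate, 10)
    print(f"{tool}: {expected:.1f} expected, {at_least_one:.0%} chance of at least one")
```

Under that assumption, ten queries give an expected 1.7 to 3.3 hallucinated responses, and the chance of seeing at least one is roughly 84% at the lower rate and 98% at the higher one.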

By comparison, a system that retrieves citations from official databases and verifies them against source material has a fundamentally different error profile. Search the Law currently estimates its citation chain accuracy at approximately 90% across more than 191,500 citation pairs, and the errors that do occur are classification errors (characterising a treatment as “applied” when it should be “distinguished”), not fabricated citations. The practical difference is significant: a misclassified treatment attaches to a real judgment, so a competent practitioner will catch it when reading the case, whereas a generated citation that is fabricated or misgrounded can survive multiple levels of review if nobody goes back to the source.

The Path Forward

The Stanford study doesn't mean AI has no place in legal research. It means the current generation of AI tools from the dominant providers hasn't solved the reliability problem, and practitioners need to adjust their expectations and workflows accordingly.

The most promising direction is not to make AI better at generating legal claims — it's to make AI better at retrieving, organising, and verifying legal information while keeping the practitioner in the interpretive role. Find the cases, map the citation networks, classify the treatments, present the authorities — but let the solicitor or barrister decide what they mean.

That's not a step backward from the promise of AI-assisted research. It's a recognition that the most valuable thing AI can do for legal professionals is not to replace their judgment, but to give them better raw material to exercise it on.

The Stanford study “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” by Varun Magesh et al. was published in the Journal of Empirical Legal Studies in 2025. The full paper is available at Stanford Law School.

Search the Law searches 21 official UK legal databases with citation verification against original sources. Every citation links to the published judgment. Try it free at searchthe.law.