← labs/skillcheck
experiment · evaluation harness · open source

Which AI skills actually produce better legal work?

AI "skills" — structured prompts that tell models how to do a specific job — are proliferating across the legal AI ecosystem. Anthropic ships them, indie developers publish them, startups sell them. But which ones actually work? Skillcheck answers that with data: a controlled bake-off harness that measures models, skills, and rubrics against expert-graded answer keys.

§ 01 — the premise

Skills are everywhere. Quality is not.

The premise is compelling: package expert knowledge into a skill, hand it to a model, and get work product that reflects professional judgment. Ethan Mollick has written about this, while warning that organizations need to stop relying on benchmarks and vibes and start systematically testing AI on their actual work. He's right — but how?

Does a 2,500-token methodology outperform a 130-word sticky note? Does either beat a bare prompt? Without a controlled evaluation, every shop is left to vibe-check its own playbooks against its own AI of choice — and the field accumulates skills faster than evidence.

§ 02 — the legal plugin panic

Three signals from a $285B market selloff.

When Anthropic dropped a set of legal skills for Claude on a Friday afternoon — one of eleven plugin packs, treated internally as a side project — the legal-data and SaaS markets shed roughly $285B between them. The market is fickle, but it's also a signal worth listening to. Three things were being repriced at once:

  • SaaS is software consumed by humans, priced per human. How many seats does an agent need? The market voted.
  • Expertise is knowledge delivered by humans, priced by the hour. What happens when expertise can be packaged and shared at scale? The market voted on that too — and since it can't short law firms, it punished the legal-data companies instead.
  • The skills shipped with the plugin pack used the same LLMs that every legaltech provider uses. So the only differentiator left is the instructions themselves. But do they work? Can they be trusted? How do we judge?

Skillcheck exists to answer that third question with data. Two findings come up over and over: model differences are real — all else equal, frontier models meaningfully outperform mid-tier and open-source ones, which is why a serious evaluation has to run across providers, not just one. And without data, the model only has the instructions you give it. Relying on training data alone is largely useless and borderline dangerous. Your expertise, written down clearly — and the institutional data behind it — is the moat.

§ 03 — the evaluation problem

You can't grade judgment with a checklist.

Evaluating AI on knowledge work isn't like testing an app or scoring a math test. It requires judgment — the same kind of professional judgment the skills are trying to encode. And that judgment is beyond a simple checklist. It involves pattern-matching based on experience, where subjectivity and objectivity intertwine and professional opinions are formed.

The goal of skillcheck isn't to eliminate subjectivity. It's to make judgment structured, repeatable, and transparent — grounded in established research on LLM-as-judge methodology rather than vibes.

§ 04 — what it does

Pick a task, pick your models, pick your skills, run the bake-off.

Skillcheck runs AI models through legal document review tasks with different skills and scores the results against expert-graded answer keys. Every prompt is a skill — from a bare "review this NDA" to a 2,500-token methodology built on academic benchmarks. It measures four things:

  • Issue detection. Did the model find what a senior attorney would find?
  • Skill comparison. Which skill produces the best work product on which model?
  • Model comparison. How do frontier, mid-tier, and open-source models stack up on the same task with the same playbook?
  • LLM-as-judge scoring. Optionally, a judge model evaluates response quality across rubric dimensions — identification, characterization, severity, and actionability.
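
As a rough mental model of the bake-off itself (every function and name below is invented for illustration, not the harness's actual API), a run is a matrix of models and skills on a task, with each cell scored against the same expert answer key:

    # Hypothetical sketch of the bake-off matrix; all names are illustrative only.
    from itertools import product

    def run_bakeoff(models, skills, task, answer_key, run_model, score):
        """Run every (model, skill) pair on one task and score it against the expert key."""
        results = {}
        for model, skill in product(models, skills):
            response = run_model(model, prompt=skill.render(task.document))
            results[(model.name, skill.name)] = score(response, answer_key)
        return results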

The first task is NDA Review: a deliberately one-sided "mutual" NDA with 16 planted issues across three severity tiers. More tasks and skills can be added without code changes — drop a directory under skills/ with a skill manifest, test documents, and an answer key, and they appear in the dashboard.
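
The exact layout and file names are the repo's to define; purely as an illustration, a new task might look something like this (all names hypothetical):

    skills/
      nda-review/
        skill.yaml          # skill manifest: name, prompt text, notes
        documents/
          mutual-nda.md     # test document with planted issues
        answer-key.yaml     # expert-graded issues, tiered must/should/nice-to-catch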

§ 05 — scoring

Two scoring layers, grounded in published research.

Each test document has an expert answer key with issues classified into three tiers — must-catch (3×), should-catch (2×), and nice-to-catch (1×) — so a model that misses malpractice-tier issues is penalized harder than one that misses a polish item.
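
One plausible way those weights combine, sketched here as weighted recall over the answer key (the exact formula and field names are assumptions, not the repo's schema):

    # Illustrative tier-weighted recall; weights and field names are assumptions.
    TIER_WEIGHTS = {"must_catch": 3, "should_catch": 2, "nice_to_catch": 1}

    def weighted_score(detected_issue_ids, answer_key):
        """Weight of issues the model caught divided by total weight of issues in the key."""
        total = sum(TIER_WEIGHTS[issue["tier"]] for issue in answer_key)
        caught = sum(TIER_WEIGHTS[issue["tier"]]
                     for issue in answer_key if issue["id"] in detected_issue_ids)
        return caught / total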

  • Quick scoring. Keyword detection from the answer key. Conservative — it can miss a valid detection but rarely produces false positives. Useful as a fast first pass.
  • LLM-as-judge scoring. A judge model evaluates each response against the answer key using research-backed techniques: chain-of-thought reasoning before scoring (G-Eval, Liu et al. 2023); binary per-issue detection with calibration examples for reproducibility (Husain 2024); explicit anti-verbosity instructions to prevent inflated scores for longer responses (Zheng et al. 2024); multi-judge panels across model families aggregated by majority vote to reduce correlated bias (PoLL, Verga et al. 2024); and self-enhancement bias detection that warns when judge and evaluated model share a provider (Wataoka et al. 2024).
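
To make the panel idea concrete, here is a minimal sketch of PoLL-style aggregation: each judge returns a binary verdict per planted issue and the panel takes the majority. The judge invocation, vote format, and panel size are assumptions, not the shipped configuration.

    # Illustrative majority-vote aggregation across an odd-sized judge panel.
    # ask_judge() is a placeholder for one judge model's binary per-issue verdict.
    from collections import Counter

    def panel_verdicts(response, answer_key, judges, ask_judge):
        verdicts = {}
        for issue in answer_key:
            votes = [ask_judge(judge, response, issue) for judge in judges]
            verdicts[issue["id"]] = Counter(votes).most_common(1)[0][0]
        return verdicts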

§ 06 — what 300+ runs taught us

Findings from the first NDA bake-off.

We tested nine NDA review skills across frontier, mid-tier, and open-source models — including skills from an Am Law firm, legal-AI startups, indie developers, and a one-line baseline. The data was unkind to the detailed playbooks.

  • Many "expert" skills make models worse. A one-line instruction ("Review this NDA and identify issues") beat five of seven detailed methodologies. A blank prompt nearly matched it. Tell a model to check 12 specific things and it dutifully flags all 12, whether they're problems or not — the skill becomes a checklist to complete rather than a judgment to exercise.
  • The best skills coach calibration, not detection. Models have already "seen" what's in an NDA from training data. What they need is judgment — when to flag, when to move on, how to translate findings into a recommendation. The top-scoring skill didn't tell models exactly what to look for; it calibrated their judgment on how to spot risk and decide what matters.
  • The model matters more than the skill. The best skill on the weakest model (GPT-5 Nano: 0.689) couldn't match a bare prompt on the strongest (Claude Opus: ~0.93). Skills amplify what's already there; they don't substitute for it.
  • Models can't say "this is fine." On a well-drafted vanilla NDA where the right answer is essentially sign it with minor notes, models averaged 0.578 — versus 0.993 on a document with obvious red flags. They redline standard provisions, recommend NEGOTIATE when the answer is SIGN, and struggle with the hardest professional judgment call: knowing when the work product in front of you is good enough.

§ 07 — the ceiling

Why universal skills hit a wall.

The winning skill scored 92% on commercial NDAs. We then threw it two curveballs — an M&A NDA and an investor-oriented NDA. Both scored terribly: 33.5% and 42.6%.

Imagine a skill that's excellent at evaluating single-family home inspections. It knows the checklist: roof, foundation, electrical, plumbing. Now hand it a commercial warehouse with environmental remediation requirements, or a historic brownstone with landmark preservation restrictions. The document looks similar — it's still a building inspection. But the issues that matter are completely different. The home-inspection skill doesn't know a Phase II environmental report is missing, because that concept doesn't exist in its world.

That's what happened with our NDA evaluations. The universal skill applied commercial-NDA thinking to documents that required deal-specific knowledge (M&A) or situational calibration (investor). Same document label, different expertise required. And document type is only half the problem — the same NDA reads differently depending on who you are: buyer or seller, partner or competitor, routine deal or bet-the-company.

§ 08 — what's next

Skills + tools + data.

Four directions show promise, and all of them point past the standalone skill toward something more like an orchestration layer:

  • Multi-step approaches. Classify the situation and document type before applying a skill. Triage before treatment. Not one skill to rule them all, but an orchestration layer that picks the right skill for the job (a minimal sketch follows this list).
  • Meta-skills. The principles that emerged from these evaluations — calibrating judgment rather than prescribing detection, defining explicit risk and decision thresholds — aren't NDA-specific. They're principles for encoding professional judgment in any domain.
  • Better processing. Chunking documents into clauses or provisions for more precise analysis and comparison. Small NDAs don't really need it; longer agreements and deal-specific documents do, and early testing with frameworks like Docling and the Isaacus API is showing interesting results.
  • Better data. Imagine a system that could draw on every NDA your firm has ever reviewed — the redlines, the back-and-forth with clients, the partner notes explaining why this clause matters in this context but not that one. Skills built on a firm's actual judgment across thousands of matters are a fundamentally different proposition than a generic playbook.
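
A minimal sketch of that triage step, with the document types, contexts, skill names, and classifier all hypothetical:

    # Hypothetical triage-then-treat router: classify first, then dispatch to the
    # narrow skill that matches. Every name here is illustrative only.
    SKILL_FOR = {
        ("nda", "commercial"): "nda-commercial-review",
        ("nda", "m&a"):        "nda-ma-review",
        ("nda", "investor"):   "nda-investor-review",
    }

    def route(document, classify):
        # classify() is assumed to return (doc_type, deal_context) from a cheap first pass.
        doc_type, context = classify(document)
        return SKILL_FOR.get((doc_type, context), "generic-review")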

Submitting a skill to a model is the tip of the iceberg. The data underneath is what makes it work — and it's what makes it uniquely valuable from firm to firm, and lawyer to lawyer.

Skillcheck is MIT-licensed. Read the source, run the harness, send a PR.