experiment · synthetic firm simulator

An engine for generating synthetic law firm data.

FirmGen produces synthetic firm data by decoding, modeling, and replicating law firm "DNA" — the true factors and behaviors that define a firm's size, shape, and outcomes — calibrated against industry benchmarks and proprietary market research.

§ 01 — the problem

The biggest barrier to building legal data products is access to real data. Real firm data is gated by client confidentiality, engagement-letter terms, and security-review cycles that don't move at development pace. When teams do get access, they can't use it freely.

Existing synthetic alternatives fail the moment a practitioner reviews them. Vendor-supplied demo environments are theater, and customers see through them. The hard cases (write-off cascades, realization shifts, lateral hire spikes, year-end edge cases) surface only in production, against real customer data, with real consequences. FirmGen exists to remove that barrier.

§ 02 — firm DNA

A firm is the sum of its parts — and especially of its partners.

Two firms with the same headcount, the same Am Law tier, and the same nominal practice mix can produce radically different economics and behavior because their compositions differ. Different partner books and origination patterns. Different leverage models, practice and sector mix, office cost bases.

Recognizable firm shapes — Big Law generalist, mid-market regional, boutique specialist — emerge from coherent compositional DNA patterns observed across decades of real firms. They are not categories; they are signatures of compositional choices that real firms have made and that produce the operational shapes practitioners recognize on sight.

§ 03 — design philosophy

Revenue bottom-up. Expenses top-down. They meet at the cost card.

Most synthetic data efforts pick one orientation and force everything through it; the result is realistic on one side and obviously wrong on the other. FirmGen uses each methodology where it is structurally correct.

On the revenue side, the system models the underlying drivers of work — partner books, leverage shape, matter intensity, timekeeper utilization — and revenue emerges as time entries become bills become receipts. The lumpy December push, per-matter rate variance, partner-share concentration: none of these are imposed as rules. They come out of the model because the model captures the structure that produces them in real firms.
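A minimal sketch of the bottom-up flow, assuming a flat billing realization and collection realization applied to recorded time. The names (`TimeEntry`, `revenue`, `billing_realization`, `collection_realization`) and every figure are hypothetical illustrations, not FirmGen's schema or calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeEntry:
    timekeeper: str
    matter: str
    hours: float
    rate: float  # standard hourly rate (hypothetical figure)


def worked_value(entries):
    """Standard value of recorded time: hours x rate, summed over entries."""
    return sum(e.hours * e.rate for e in entries)


def revenue(entries, billing_realization, collection_realization):
    """Revenue is never set directly; it falls out of worked -> billed -> collected."""
    worked = worked_value(entries)
    billed = worked * billing_realization
    collected = billed * collection_realization
    return {"worked": worked, "billed": billed, "collected": collected}


# Illustrative inputs only.
entries = [
    TimeEntry("partner_a", "M-001", 10.0, 1000.0),
    TimeEntry("associate_b", "M-001", 40.0, 500.0),
    TimeEntry("paralegal_c", "M-001", 20.0, 250.0),
]
result = revenue(entries, billing_realization=0.90, collection_realization=0.95)
```

In a fuller model the two realization factors would themselves vary by matter, client, and month, which is where effects such as the December push would come from.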

On the expense side, total firm overhead is anchored to industry expense ratios (typically 22–26% of revenue), split across categories real firms actually budget against, and allocated via the appropriate driver. Per-timekeeper fully-burdened cost emerges from this top-down allocation and closes the loop with the bottom-up revenue model.
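The top-down allocation can be sketched as follows, assuming two overhead categories with one driver each. The category names, drivers, compensation, and hours are hypothetical; only the overhead ratio (0.24, inside the 22–26% band quoted above) comes from the text.

```python
def allocate_overhead(revenue, overhead_ratio, category_split, drivers):
    """Top-down: total overhead -> category pools -> timekeepers via a driver."""
    total = revenue * overhead_ratio
    allocated = {}
    for category, share in category_split.items():
        pool = total * share
        weights = drivers[category]
        weight_sum = sum(weights.values())
        for timekeeper, weight in weights.items():
            allocated[timekeeper] = (
                allocated.get(timekeeper, 0.0) + pool * weight / weight_sum
            )
    return allocated


def burdened_rate(direct_comp, overhead, billable_hours):
    """Fully-burdened hourly cost: (compensation + allocated overhead) / hours."""
    return (direct_comp + overhead) / billable_hours


# Illustrative inputs only.
overhead = allocate_overhead(
    revenue=2_000_000.0,
    overhead_ratio=0.24,
    category_split={"occupancy": 0.5, "technology": 0.5},
    drivers={
        "occupancy": {"partner_a": 300.0, "associate_b": 100.0},  # square feet
        "technology": {"partner_a": 1.0, "associate_b": 1.0},     # headcount
    },
)
rate_b = burdened_rate(
    direct_comp=200_000.0,
    overhead=overhead["associate_b"],
    billable_hours=1_800.0,
)
```

Because every pool is fully distributed, the per-timekeeper amounts sum back to total overhead, which is the reconciliation the cost card relies on.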

§ 04 — six dimensions of realism

Six interlocking dimensions, each calibrated against named industry sources.

  • Workforce composition & lifecycle. How a firm staffs itself; how its people age, churn, and refresh. Calibrated against NALP attrition data, ABA Profile demographics, and Altman Weil reduced-hours surveys.
  • Engagement structure & matter shape. How clients and matters get formed, classified, and staffed. Pareto-distributed matters per client with a whale cap, lognormal matter intensity, Gamma-distributed origination across partners.
  • Pricing dynamics. Rate-card distributions, per-matter discount and premium patterns, year-over-year rate inflation, and AFAs with material-deviation realization.
  • Cash-flow dynamics. Billing realization, collection lag, December-push seasonality, and the bad-debt and credit-memo lifecycle. Calibrated against Citi-Hildebrandt and LawVision benchmarks.
  • Cost economics. Firm-level overhead budgeted top-down and allocated to people via real budgeting drivers, producing per-timekeeper fully-burdened cost rates.
  • Organization structure. A 32-practice / 11-department hierarchy, multi-office geographic footprint, and partner-share distributions shaped by the firm's compensation model.
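The three distributions named under engagement structure can be sampled with Python's standard library. This is a sketch under stated assumptions: the shape parameters are placeholders rather than calibrated values, and reading the whale cap as a hard upper bound on matters per client is an interpretation, not something the text confirms.

```python
import random


def matters_per_client(rng, alpha=1.5, cap=50):
    """Pareto-distributed matter count, capped so no single client dominates."""
    return min(cap, int(rng.paretovariate(alpha)))  # paretovariate() >= 1


def matter_hours(rng, mu=4.0, sigma=1.0):
    """Lognormal matter intensity: many small matters, a few very large ones."""
    return rng.lognormvariate(mu, sigma)


def origination_shares(rng, n_partners, shape=0.8):
    """Gamma draws normalized to shares; a small shape concentrates origination."""
    draws = [rng.gammavariate(shape, 1.0) for _ in range(n_partners)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(7)  # seeded so a run is reproducible
counts = [matters_per_client(rng) for _ in range(1000)]
hours = [matter_hours(rng) for _ in range(1000)]
shares = origination_shares(rng, n_partners=20)
```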

These dimensions are not independent. A firm's compensation shape (lockstep through eat-what-you-kill) affects its origination concentration, which affects its matter distribution, which affects its leverage shape, which affects its staffing economics. Realism in any single dimension depends on coherence with the others.

§ 05 — what's modeled

Core entities, end-to-end.

  • Clients. Segmentation, churn, parent-client structures, industry/sector mix, pre-window history.
  • Matters. Lifecycle from open to close, billable classification, team composition and shape, pricing and fee arrangements, work intensity and pace.
  • Timekeepers. Seven-tier lawyer-and-paralegal taxonomy plus business-services functions; rate cards, hire patterns, class-year offsets, attrition, FTE tracking, and compensation.
  • Timesheets, bills, and payments. The complete work-to-cash lifecycle: time capture, pre-bill and bill generation, AFA and material-deviation distributions, seasonal collection cycle with lag, realistic adjustments, and accurate realization and profitability.

Cross-cutting dynamics layered on top: seasonality, multi-year rate inflation, overall demand and staffing growth, leverage trends, and compensation factors. Cross-system referential integrity, temporal consistency, and structural firm dynamics are inherent to the model, not bolted on.
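One way to picture referential integrity and temporal consistency is a check that should return nothing on clean output. The sketch below assumes a deliberately tiny schema; `Matter`, `TimeEntry`, and `integrity_violations` are hypothetical names, not FirmGen's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Matter:
    matter_id: str
    client_id: str
    opened: date
    closed: Optional[date] = None  # None means the matter is still open


@dataclass(frozen=True)
class TimeEntry:
    entry_id: str
    matter_id: str
    worked_on: date


def integrity_violations(client_ids, matters, entries):
    """Return (record id, problem) pairs; an empty list means the data is coherent."""
    problems = []
    by_id = {m.matter_id: m for m in matters}
    for m in matters:
        if m.client_id not in client_ids:
            problems.append((m.matter_id, "unknown client"))
    for e in entries:
        m = by_id.get(e.matter_id)
        if m is None:
            problems.append((e.entry_id, "unknown matter"))
        elif e.worked_on < m.opened or (
            m.closed is not None and e.worked_on > m.closed
        ):
            problems.append((e.entry_id, "outside matter window"))
    return problems


# Illustrative records, two of them deliberately broken.
matters = [
    Matter("M1", "C1", date(2024, 1, 1), date(2024, 6, 30)),
    Matter("M2", "C9", date(2024, 2, 1)),
]
entries = [
    TimeEntry("T1", "M1", date(2024, 3, 1)),
    TimeEntry("T2", "M1", date(2024, 7, 15)),
    TimeEntry("T3", "M7", date(2024, 3, 1)),
]
violations = integrity_violations({"C1"}, matters, entries)
```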

§ 06 — archetypes

Pre-configured bundles of correlated factors.

A real firm is not a random selection of factors. It is a tightly correlated bundle. A trophy boutique has high partner share and premium rates and lockstep compensation and narrow practice mix and single-office scope, all together. A high-leverage global volume firm has the opposite cluster across most dimensions.

Six archetypes ship today, anchored by empirical clustering of public Am Law rankings data from 2015 through 2022. Three are published and validated to high fidelity against real firm data: Am Law 100 — Diversified Generalist, Am Law 200 — Mid-Market Regional, and Boutique Specialist. Selecting an archetype configures dozens of correlated factors at once. Custom archetypes for buyer-specific firm profiles are scoped per engagement.
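An archetype can be thought of as a named, immutable bundle of factors with optional per-engagement overrides. The sketch below is an assumption about shape only: `FirmFactors`, `ARCHETYPES`, `configure`, the five fields, and all values are hypothetical placeholders, and a real bundle would carry dozens of factors.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FirmFactors:
    leverage: float          # non-partner lawyers per partner
    compensation_model: str
    practice_count: int
    office_count: int
    rate_premium: float      # multiplier on a baseline rate card


# Placeholder values, not calibrated archetype definitions.
ARCHETYPES = {
    "boutique_specialist": FirmFactors(1.5, "lockstep", 3, 1, 1.25),
    "mid_market_regional": FirmFactors(2.5, "modified_lockstep", 12, 4, 0.90),
}


def configure(archetype, **overrides):
    """Start from the correlated bundle, then apply explicit overrides."""
    return replace(ARCHETYPES[archetype], **overrides)


base = configure("boutique_specialist")
custom = configure("boutique_specialist", office_count=2)
```

Starting from the bundle and overriding selectively keeps the remaining factors correlated, which is the point of an archetype.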

§ 07 — calibration

Provenance, refresh-friendliness, audit trail.

Every numeric parameter that materially affects FirmGen output is sourced from a versioned dataset with a documented citation index. There are no hardcoded constants based on a developer's intuition. Annual updates land as new vintage files; old vintages remain available for reproducibility. Output rows carry their dataset version, so any number in any output can be traced backward to the calibration source that produced it.
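The traceability described above can be sketched as parameters that carry their own provenance and output rows that carry a dataset version. Everything named here (`CalibratedParam`, `VINTAGES`, `generate_row`, `trace`, the citation keys, and the values) is a hypothetical illustration, not FirmGen's file layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibratedParam:
    name: str
    value: float
    citation: str  # key into a citation index (placeholder keys below)


# One entry per vintage; older vintages stay available for reproducibility.
VINTAGES = {
    "2024": {"overhead_ratio": CalibratedParam("overhead_ratio", 0.23, "CIT-001")},
    "2025": {"overhead_ratio": CalibratedParam("overhead_ratio", 0.24, "CIT-001")},
}


def generate_row(vintage, revenue):
    """Produce an output row stamped with the vintage that produced it."""
    param = VINTAGES[vintage]["overhead_ratio"]
    return {"overhead": revenue * param.value, "dataset_version": vintage}


def trace(row, param_name):
    """Walk an output row back to the calibrated parameter behind it."""
    return VINTAGES[row["dataset_version"]][param_name]


row = generate_row("2024", revenue=1_000_000.0)
source = trace(row, "overhead_ratio")
```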

The current calibration draws on roughly twenty distinct industry sources — including American Lawyer / ALM, Citi-Hildebrandt, the Thomson Reuters / Georgetown State of the U.S. Legal Market, the ABA Profile of the Legal Profession, NALP attrition surveys and associate compensation data, Cushman & Wakefield / JLL / CBRE / Colliers commercial real estate reports, IBA practice surveys, and LawVision and Hildebrandt benchmarks for AR aging and bad-debt patterns. Published sources are supplemented by anonymized DNA artifacts extracted from multiple reference firms under NDA, and backed by 10+ years of direct firm advisory and analytics practice across the Am Law 100.

§ 08 — validation

Three layers of validation, on every run.

  • Sanity reporting. Entity counts, event counts, aggregate amounts, realization metrics, per-tier hours distributions, and matter size distributions are emitted as a structured report on every generation. Outputs that fall outside expected bands surface immediately.
  • DNA fidelity comparison. A structured profile is extracted from any populated output — synthetic or real — covering per-partner book composition, cohort segmentation, realization patterns, and origination-credit splits. Synthetic outputs are compared against the same profile extracted from reference firms validated under NDA, producing per-metric deviation reports.
  • Dashboard validation. Outputs run through a full Power BI semantic model with curated measures, then year-over-year metrics are gut-checked against Am Law industry medians. Practitioner review by experienced firm operators is the final gate; outputs that don't pass don't ship.

Latest validation run · 2026-05-04 · 24 calibration factors · 22 pass / 2 fail · 91.7% pass rate · Am Law 100 reference · 5-year window · 8.76M time entries.
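A per-metric deviation report and the pass rate derived from it can be sketched as below. The metric names, reference values, and the single 10% tolerance are assumptions for illustration; the text does not state how bands are set per factor.

```python
def deviation_report(synthetic, reference, tolerance):
    """Relative deviation of each synthetic metric from its reference value."""
    report = []
    for metric, ref in reference.items():
        deviation = (synthetic[metric] - ref) / ref
        report.append(
            {
                "metric": metric,
                "deviation": deviation,
                "passed": abs(deviation) <= tolerance,
            }
        )
    return report


def pass_rate(report):
    """Share of metrics inside tolerance, as a percentage."""
    return 100.0 * sum(r["passed"] for r in report) / len(report)


# Illustrative values only.
reference = {"realization": 0.90, "leverage": 3.0, "collection_lag_days": 60.0}
synthetic = {"realization": 0.88, "leverage": 3.6, "collection_lag_days": 63.0}
report = deviation_report(synthetic, reference, tolerance=0.10)
rate = pass_rate(report)
```

Applied to the run summarized above, the same arithmetic gives 22 / 24 = 91.7% after rounding to one decimal place.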

§ 09 — scenario modeling

A validation harness, not just a dataset.

A realistic substrate is necessary but not sufficient. FirmGen also supports test-driven legal data infrastructure work through two layers of deliberate-imperfection injection.

  • Defect injection. Controlled adversarial conditions with ground-truth labels: duplicate timekeeper IDs from rehires, mid-period rate-card overlaps, AFA overflows, time entries that never reach a bill. Downstream ETL or dashboard logic can be validated against known-correct labels.
  • Scenario injection. Identifiable real-world complications with ground-truth tags: lateral hires arriving mid-window, partner promotions, books transferring between partners, client mergers, fee renegotiations, late-paying clients, contingency payouts. Scenario tags let consumers validate that their products handle the cases they claim to handle.
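Defect injection with ground-truth labels can be sketched for one of the defects listed above: time entries that never reach a bill. The record shape, the `never_billed` label, and both function names are hypothetical; the idea is only that the injector returns labels a downstream check can be scored against.

```python
import copy
import random


def inject_unbilled_entries(entries, fraction, seed):
    """Strip bill_id from a random subset and return ground-truth labels."""
    rng = random.Random(seed)
    corrupted = copy.deepcopy(entries)  # leave the clean data untouched
    k = round(len(corrupted) * fraction)
    chosen = set(rng.sample(range(len(corrupted)), k))
    labels = {}
    for i, entry in enumerate(corrupted):
        if i in chosen:
            entry["bill_id"] = None
            labels[entry["entry_id"]] = "never_billed"
    return corrupted, labels


def find_unbilled(entries):
    """Example downstream check to validate against the labels."""
    return {e["entry_id"] for e in entries if e["bill_id"] is None}


clean = [{"entry_id": f"T{i}", "bill_id": f"B{i}"} for i in range(5)]
corrupted, labels = inject_unbilled_entries(clean, fraction=0.4, seed=11)
detected = find_unbilled(corrupted)
```

A consumer's ETL or dashboard logic passes when the set it detects matches the label keys exactly, with no misses and no false positives.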

See FirmGen output, or design a firm to your spec.