Layer L1

Data

The raw input. What data do you have that nobody else can get?

Why it matters

Proprietary datasets are the raw fuel. More agents = more demand for data. The L1b test: if your data is public, the model layer wins.

The Raw Gold Ore

The unrefined material pulled from the earth. Some mines have pure veins (proprietary data), others have common dirt (public data). Public data is already mined by everyone.

The 5 sublayers

L1a

Public & Open Data

Common Crawl, Wikipedia, government data, open datasets

L1b

Proprietary Data

Licensed, paywalled, or internally generated training corpora

L1c

Behavioral & Sensor Data

Clicks, sessions, interaction logs, and camera, LiDAR, IMU, telemetry, and physical-world sensor streams for robotics and autonomy

L1d

Outcome Data

Labels, results, conversions, win/loss, audit trails: what actually happened after the model acted

L1e

Synthetic & Simulation Data

Machine-generated corpora and simulated environments (Isaac Sim, CARLA, Omniverse, world-sim) for training, augmentation, and embodied agent rollout

Layer diagnostic card · SCOI v1

Is a company really at L1?

Raw input the stack learns from, and crucially, data nobody else can legally or practically obtain.

Inclusion tests · include if ALL

Holds data that is proprietary (L1b), behavioral (L1c), or outcome (L1d), not public crawl.
Data refreshes from a source the company structurally controls (a workflow, a relationship, a contract).
Removing the data degrades the product in a way no public source can repair.

Exclusion tests · exclude if ANY

Trained on public web data only, which every competitor can match.
Data is licensed non-exclusively from a third party (renting, not owning).
User-generated content the user can take elsewhere with no friction.
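The inclusion and exclusion tests above combine into a single decision rule: all three inclusion tests must pass and no exclusion test may fire. A minimal sketch of that rule follows, assuming each test has already been answered as a yes/no judgment. The class name, field names, and the two example companies are hypothetical illustrations, not part of the SCOI card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L1Evidence:
    """Yes/no answers to the six diagnostic tests (field names are hypothetical)."""
    # Inclusion tests: include only if ALL are true.
    holds_non_public_data: bool    # proprietary (L1b), behavioral (L1c), or outcome (L1d)
    controls_refresh_source: bool  # a workflow, a relationship, or a contract
    removal_unrepairable: bool     # no public source can repair the loss
    # Exclusion tests: exclude if ANY is true.
    public_web_only: bool          # every competitor can match
    non_exclusive_license: bool    # renting, not owning
    frictionless_ugc: bool         # users can take their content elsewhere


def is_at_l1(e: L1Evidence) -> bool:
    """Return True only when every inclusion test passes and no exclusion test fires."""
    include = all([e.holds_non_public_data,
                   e.controls_refresh_source,
                   e.removal_unrepairable])
    exclude = any([e.public_web_only,
                   e.non_exclusive_license,
                   e.frictionless_ugc])
    return include and not exclude


# Two made-up companies, used only to exercise the rule.
workflow_vendor = L1Evidence(True, True, True, False, False, False)
scrape_wrapper = L1Evidence(False, False, False, True, False, False)

print(is_at_l1(workflow_vendor))  # True
print(is_at_l1(scrape_wrapper))   # False
```

Note that a company can pass all three inclusion tests and still fail: a single exclusion, such as a non-exclusive license, is enough to disqualify it.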

The L1 removal test

Remove the proprietary L1 and the product collapses into a generic L2 call with a prompt. If the answer to 'why us' is the model, you're not at L1.

Economic work this layer does

Provides the only ingredient L2 cannot synthesize: a verifiable, exclusive view of some slice of reality.

Canonical examples

Bloomberg
Decades of proprietary financial data + terminal-locked behavioral signal.
Tempus
Clinical + genomic outcome data from real treatment, structurally hard to replicate.
Apollo.io
Behavioral + outcome data on B2B contacts compounds with usage.

Anti-examples · look-alikes that fail

Most 'data + AI' decks
Public scrapes relabeled as proprietary. L1a, not L1b.
Stability AI training corpus
Open data → open weights → no L1 moat after release.
Stack Overflow
Once-defensible L1 commoditized by models that trained on it.


Who's playing here

Apollo.io, Bloomberg, ZoomInfo, Scale AI

Verdict: Structurally safe. API-first wins.

Case studies touching L1

Stack Overflow: When Your Community Becomes Training Data

Stack Overflow's traffic dropped roughly 35–50% after ChatGPT shipped. Fifteen years of community-built knowledge, packaged as L7b content and scraped into L2 training sets. The community that built the data captured none of the value; the model layer captured all of it. A textbook case of L1 data mis-packaged as L7 content.

Apollo vs ZoomInfo: Same Layer, Opposite Strategies, Different Fates

Both sit at L1b, proprietary data. But Apollo went API-first and headless, while ZoomInfo charges a premium for an L7b UI wrapper. In an agent-first world, the UI tax is a liability. The data refinery wins.

Sierra's Memory Moat: Why L8 Beats Salesforce's Agentforce

Sierra and Salesforce Agentforce look like the same product on stage: an AI agent that resolves customer issues. The Cube projection shows they are structurally opposite. Sierra was architected as L1c behavioral data + L5d operating playbooks + L8c network learning from day one: every resolution compounds into per-customer memory. Agentforce is L5 bolted onto Salesforce's existing L1, with no compounding loop. Same demo, opposite trajectories.

Harvey AI Through the Layers

Harvey is built across four sublayers: L1b (licensed case law), L3a (compliance gates), L5b (legal reasoning scaffolds), and L8d (institutional memory of matters). A useful case for mapping how a vertical-AI company actually stacks up, and where horizontal platforms can and can't reach.
