The raw input. What data do you have that nobody else can get?
Why it matters
Proprietary datasets are the raw fuel. More agents = more demand for data. The L1b test: if your data is public, the model layer wins.
The Raw Gold Ore
The unrefined material pulled from the earth. Some mines have pure veins (proprietary data); others have common dirt (public data). Public data has already been mined by everyone.
The 5 sublayers
L1a
Public & Open Data
Common Crawl, Wikipedia, government data, open datasets
L1b
Proprietary Data
Licensed, paywalled, or internally generated training corpora
L1c
Behavioral & Sensor Data
Clicks, sessions, and interaction logs, plus camera, LiDAR, IMU, telemetry, and other physical-world sensor streams for robotics and autonomy
L1d
Outcome Data
Labels, results, conversions, win/loss records, and audit trails: what actually happened after the model acted
L1e
Synthetic & Simulation Data
Machine-generated corpora and simulated environments (Isaac Sim, CARLA, Omniverse, world-sim) for training, augmentation, and embodied agent rollout
Layer diagnostic card · SCOI v1
Is a company really at L1?
Raw input the stack learns from, and crucially, data nobody else can legally or practically obtain.
Inclusion tests · include if ALL
Holds data that is proprietary (L1b), behavioral (L1c), or outcome (L1d), not public crawl.
Data refreshes from a source the company structurally controls (a workflow, a relationship, a contract).
Removing the data degrades the product in a way no public source can repair.
Exclusion tests · exclude if ANY
Trained on public web data only, which every competitor can match.
Data is licensed non-exclusively from a third party (renting, not owning).
User-generated content the user can take elsewhere with no friction.
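The inclusion and exclusion tests above amount to a simple decision rule: every inclusion test must pass and no exclusion test may fire. A minimal sketch in Python, assuming a hypothetical DataProfile record whose field names merely paraphrase the card's wording (none of these names come from SCOI itself):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Sublayer(Enum):
    PUBLIC = "L1a"
    PROPRIETARY = "L1b"
    BEHAVIORAL_SENSOR = "L1c"
    OUTCOME = "L1d"
    SYNTHETIC = "L1e"


# Sublayers the first inclusion test accepts: proprietary, behavioral, outcome.
QUALIFYING = {Sublayer.PROPRIETARY, Sublayer.BEHAVIORAL_SENSOR, Sublayer.OUTCOME}


@dataclass
class DataProfile:
    # Hypothetical fields; each one mirrors a line of the diagnostic card.
    sublayers: Set[Sublayer]            # which L1 sublayers the company holds
    controls_refresh_source: bool       # workflow, relationship, or contract
    irreplaceable_by_public_data: bool  # removal cannot be repaired publicly
    public_web_only: bool               # trained on public web data only
    non_exclusive_license: bool         # renting the data, not owning it
    frictionless_user_export: bool      # users can take their content elsewhere


def is_l1(p: DataProfile) -> bool:
    """Return True only if ALL inclusion tests pass and NO exclusion test fires."""
    include = all([
        bool(p.sublayers & QUALIFYING),
        p.controls_refresh_source,
        p.irreplaceable_by_public_data,
    ])
    exclude = any([
        p.public_web_only,
        p.non_exclusive_license,
        p.frictionless_user_export,
    ])
    return include and not exclude


# Two made-up profiles for illustration (not real companies).
scraper = DataProfile({Sublayer.PUBLIC}, False, False, True, False, False)
workflow_tool = DataProfile(
    {Sublayer.BEHAVIORAL_SENSOR, Sublayer.OUTCOME}, True, True, False, False, False
)

print(is_l1(scraper))        # False
print(is_l1(workflow_tool))  # True
```

Treating each test as a boolean is a simplification: in practice most of these answers are judgment calls, so the function is best read as a checklist, not a scoring model.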
The L1 removal test
Remove the proprietary L1 data: a true L1 company's product collapses into a generic L2 call with a prompt. If the answer to 'why us' is the model, you're not at L1.
Economic work this layer does
Provides the only ingredient L2 cannot synthesize: a verifiable, exclusive view of some slice of reality.
Canonical examples
Bloomberg
Decades of proprietary financial data + terminal-locked behavioral signal.
Tempus
Clinical + genomic outcome data from real treatment, structurally hard to replicate.
Apollo.io
Behavioral + outcome data on B2B contacts that compounds with usage.
Anti-examples · look-alikes that fail
Most 'data + AI' decks
Public scrapes relabeled as proprietary. L1a, not L1b.
Stability AI training corpus
Open data → open weights → no L1 moat after release.
Stack Overflow
Once-defensible L1 commoditized by models that trained on it.