Full framework
    Layer L1

    Data

    The raw input. What data do you have that nobody else can get?

    Why it matters

    Proprietary datasets are the raw fuel, and more agents mean more demand for data. The L1b test: if your data is public, the model layer wins.

    The Raw Gold Ore

    The unrefined material pulled from the earth. Some mines have pure veins (proprietary data), others have common dirt (public data). Public data is already mined by everyone.

    The 5 sublayers

    L1a

    Public & Open Data

    Common Crawl, Wikipedia, government data, open datasets

    L1b

    Proprietary Data

    Licensed, paywalled, or internally generated training corpora

    L1c

    Behavioral & Sensor Data

    Clicks, sessions, and interaction logs; camera, LiDAR, IMU, telemetry, and other physical-world sensor streams for robotics and autonomy

    L1d

    Outcome Data

    Labels, results, conversions, win/loss records, and audit trails: what actually happened after the model acted

    L1e

    Synthetic & Simulation Data

    Machine-generated corpora and simulated environments (Isaac Sim, CARLA, Omniverse, world-sim) for training, augmentation, and embodied agent rollout

    Layer diagnostic card · SCOI v1

    Is a company really at L1?

    Raw input the stack learns from and, crucially, data nobody else can legally or practically obtain.

    Inclusion tests · include if ALL

    • Holds data that is proprietary (L1b), behavioral (L1c), or outcome (L1d), not public crawl.
    • Data refreshes from a source the company structurally controls (a workflow, a relationship, a contract).
    • Removing the data degrades the product in a way no public source can repair.

    Exclusion tests · exclude if ANY

    • Trained on public web data only, so every competitor can match it.
    • Data is licensed non-exclusively from a third party (renting, not owning).
    • User-generated content the user can take elsewhere with no friction.
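
    The inclusion and exclusion tests above combine as "all of the first list, none of the second". A minimal sketch of that rule follows; the class name, field names, and function name are hypothetical labels for the card's six questions and are not part of SCOI itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L1Profile:
    """Hypothetical yes/no answers to the six questions on the diagnostic card."""

    # Inclusion tests: ALL must be true.
    holds_non_public_data: bool    # proprietary (L1b), behavioral (L1c), or outcome (L1d)
    controls_refresh_source: bool  # a workflow, a relationship, or a contract
    removal_is_irreparable: bool   # no public source can repair the loss

    # Exclusion tests: ANY one disqualifies.
    public_web_only: bool          # every competitor can match it
    non_exclusive_license: bool    # renting, not owning
    portable_user_content: bool    # users can leave with their content, no friction


def is_at_l1(p: L1Profile) -> bool:
    """Return True only if every inclusion test holds and no exclusion test fires."""
    included = all(
        (p.holds_non_public_data, p.controls_refresh_source, p.removal_is_irreparable)
    )
    excluded = any(
        (p.public_web_only, p.non_exclusive_license, p.portable_user_content)
    )
    return included and not excluded


# Illustrative profiles only; they do not describe any real company.
owner = L1Profile(True, True, True, False, False, False)
renter = L1Profile(True, True, True, False, True, False)
print(is_at_l1(owner), is_at_l1(renter))
```

    In this sketch a single exclusion flag overrides any number of passed inclusion tests, which mirrors the card's "exclude if ANY" wording.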

    The L1 removal test

    Remove the proprietary L1 and the product collapses into a generic L2 call with a prompt. If the answer to 'why us' is the model, you're not at L1.

    Economic work this layer does

    Provides the only ingredient L2 cannot synthesize: a verifiable, exclusive view of some slice of reality.

    Canonical examples

    • Bloomberg

      Decades of proprietary financial data + terminal-locked behavioral signal.

    • Tempus

      Clinical + genomic outcome data from real treatment, structurally hard to replicate.

    • Apollo.io

      Behavioral + outcome data on B2B contacts that compounds with usage.

    Anti-examples · look-alikes that fail

    • Most 'data + AI' decks

      Public scrapes relabeled as proprietary. L1a, not L1b.

    • Stability AI training corpus

      Open data → open weights → no L1 moat after release.

    • Stack Overflow

      Once-defensible L1 commoditized by models that trained on it.

    Who's playing here

    Apollo.io, Bloomberg, ZoomInfo, Scale AI

    Verdict: Structurally safe. API-first wins.

    Case studies touching L1