Recipe — PBMC 3K, end-to-end

A full walkthrough of STELLAR against the 10X Genomics public PBMC 3K dataset (~2,700 cells × ~13k genes, scanpy-processed). This is the example shipped under examples/pbmc_3k/ and the same one CI ingests on every PR.

By the end of this recipe you'll have:

  • a pbmc_3k.h5ad with HVGs + leiden clustering + UMAP,
  • the LanceDB + DuckDB stores under data/,
  • uvicorn serving the API on 127.0.0.1:18901,
  • the SPA available at http://localhost:18901/pbmc_3k/.

No screenshots in this first pass — they'll land in v0.2 once the SPA's layout has stabilised.

Step 0 — Install

pip install 'stellar-atlas[dev]' scanpy

You need:

  • stellar-atlas itself (any extras you want);
  • [dev] to pull in httpx so the bootstrap.py script can call scanpy's loader cleanly;
  • scanpy for sc.datasets.pbmc3k_processed() and the standard preprocessing steps.

A bare pip install stellar-atlas is sufficient if you already have a processed h5ad; the [dev] extra and scanpy are only needed to fetch and build the example data.

Step 1 — Build the example h5ad

python examples/pbmc_3k/bootstrap.py

What this does (read the script for the gory details):

  1. Calls sc.datasets.pbmc3k_processed(), which downloads the 10X PBMC 3K dataset (~6 MB) on first use and caches it locally.
  2. Picks 2,000 highly-variable genes.
  3. Runs sc.pp.neighbors, sc.tl.leiden, sc.tl.umap with the tutorial defaults.
  4. Rewrites the leiden labels into friendlier names (CD14+ Mono, Memory T, …) and stores them in obs["leiden_label"].
  5. Stamps every cell with condition="healthy" and donor_id="PBMC_3K" so the STELLAR cohort schema has values to read.
  6. Writes data/raw/pbmc_3k.h5ad.

You should see ~30 seconds of scanpy output and a ~6 MB h5ad on disk:

ls -lh data/raw/pbmc_3k.h5ad
# -rw-r--r--  1 you you  5.8M ... data/raw/pbmc_3k.h5ad

Step 2 — Inspect the example stellar.yaml

cat examples/pbmc_3k/stellar.yaml

The interesting bits:

project:
  slug: pbmc_3k
  title: "PBMC 3K  STELLAR Example"
  base_url: "/pbmc_3k/"

input:
  matrices:
    - { name: primary, path: data/raw/pbmc_3k.h5ad, role: primary }

cohort:
  cell_type_column: leiden_label   # bootstrap stored friendly names here
  condition_column: condition      # bootstrap set every cell to "healthy"
  donor_column:     donor_id       # bootstrap set every cell to "PBMC_3K"
  umap:
    obsm_key: X_umap

Every optional module is set to enabled: false in this example, since we're only exercising the core path.
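Before ingesting, it can be worth confirming that the h5ad really carries the columns the cohort block points at. The sketch below is not part of STELLAR: missing_cohort_fields and check_h5ad are hypothetical helpers written for this recipe, and check_h5ad assumes the anndata package (installed alongside scanpy) is available.

```python
"""Pre-flight check: does the h5ad carry what the cohort block asks for?"""

# Names taken from the cohort section of examples/pbmc_3k/stellar.yaml.
REQUIRED_OBS = ("leiden_label", "condition", "donor_id")
REQUIRED_OBSM = ("X_umap",)


def missing_cohort_fields(obs_columns, obsm_keys):
    """Return the required obs columns / obsm keys that are absent."""
    obs_columns, obsm_keys = set(obs_columns), set(obsm_keys)
    missing = [f"obs[{name!r}]" for name in REQUIRED_OBS if name not in obs_columns]
    missing += [f"obsm[{name!r}]" for name in REQUIRED_OBSM if name not in obsm_keys]
    return missing


def check_h5ad(path="data/raw/pbmc_3k.h5ad"):
    """Open the file read-only and report missing fields (needs anndata)."""
    import anndata  # imported lazily so the pure helper works without it

    adata = anndata.read_h5ad(path, backed="r")
    return missing_cohort_fields(adata.obs.columns, adata.obsm.keys())
```

Calling check_h5ad() after Step 1 should return an empty list if bootstrap.py did what it describes; anything else names the field to fix before running the ingest.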

Step 3 — Ingest

stellar ingest --config examples/pbmc_3k/stellar.yaml

The orchestrator:

  1. Validates stellar.yaml against the pydantic schema.
  2. Reads the h5ad, builds the gene-major Lance store at data/lance/expression_primary.lance/.
  3. Writes per-cell metadata (cells_v view) and per-gene metadata into data/duckdb/atlas.duckdb.
  4. Bakes the UMAP coordinates to data/static/coords_primary.arrow.
  5. Calls ingest() on each enabled module (none for this example).

The step is idempotent: re-running drops and recreates the stores, so nothing is left over from the previous run.

ls data/
# duckdb  lance  parquet  raw  static
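If you'd rather check the outputs from a script than eyeball ls, a small sketch like the following works. The three paths are the ones named in the ingest steps above; missing_outputs is a hypothetical helper, not something STELLAR ships.

```python
from pathlib import Path

# Store locations named in the ingest steps, relative to the data/ root.
EXPECTED_OUTPUTS = (
    "lance/expression_primary.lance",
    "duckdb/atlas.duckdb",
    "static/coords_primary.arrow",
)


def missing_outputs(data_root="data"):
    """Return the expected ingest outputs that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]
```

An empty list from missing_outputs() means all three stores are in place; a non-empty one tells you which ingest stage to look at.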

Step 4 — Serve

stellar serve --config examples/pbmc_3k/stellar.yaml

uvicorn comes up on 127.0.0.1:18901. Expect log lines like:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:18901 (Press CTRL+C to quit)

Leave it running and open another terminal for the next step.
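If you are scripting the walkthrough (in CI, say), you need some way to know the server is up before firing requests. One minimal approach, sketched here with the standard library only, is to poll the TCP port; wait_for_port is a hypothetical helper, and the defaults simply mirror the host and port from the log lines above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=18901, timeout=30.0, interval=0.25):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait a little and try again.
            time.sleep(interval)
    return False
```

An open port is only a proxy for readiness, so a stricter check would follow up with a request to /api/config.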

Step 5 — Hit the API

The runtime config the SPA pulls at boot:

curl -s http://localhost:18901/api/config | python -m json.tool

You should see your project title, branding color, module map (all enabled: false), and the matrix descriptors.
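The same request from Python, as a rough sketch using only the standard library. fetch_json and top_level_keys are hypothetical helpers; the exact shape of the /api/config payload is not documented here, so the code only assumes it is a JSON object and lists its top-level keys.

```python
import json
from urllib.request import ProxyHandler, build_opener

# The server is local, so bypass any HTTP proxy set in the environment.
_opener = build_opener(ProxyHandler({}))


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the response body as JSON."""
    with _opener.open(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def top_level_keys(payload):
    """Sorted top-level keys, a quick way to see what the payload contains."""
    return sorted(payload)
```

With the server from Step 4 running, top_level_keys(fetch_json("http://localhost:18901/api/config")) gives a quick overview of what the SPA receives at boot.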

A coords request for the first 500 cells. This is the same call the SPA's UMAP renderer makes at boot, apart from the limit:

curl -sX POST http://localhost:18901/api/embedding/coords \
     -H 'content-type: application/json' \
     -d '{"limit": 500}' \
     -o /tmp/coords.arrow
ls -lh /tmp/coords.arrow

The response is Apache Arrow IPC (content-type: application/vnd.apache.arrow.stream); use pyarrow.ipc.open_stream in Python to read it.
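To do the whole round trip from Python, something along these lines should work. It is a sketch under assumptions: post_coords and read_arrow_stream are hypothetical helper names, the request body mirrors the curl call above, and the column names in the returned table are not documented here, so inspect the schema rather than relying on any particular field.

```python
import json
from urllib.request import ProxyHandler, Request, build_opener

# The server is local, so bypass any HTTP proxy set in the environment.
_opener = build_opener(ProxyHandler({}))


def post_coords(base_url="http://localhost:18901", limit=500, timeout=30.0):
    """POST {"limit": N} to /api/embedding/coords and return the raw bytes."""
    req = Request(
        f"{base_url}/api/embedding/coords",
        data=json.dumps({"limit": limit}).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with _opener.open(req, timeout=timeout) as resp:
        return resp.read()


def read_arrow_stream(payload):
    """Decode Arrow IPC stream bytes into a pyarrow.Table (needs pyarrow)."""
    import pyarrow.ipc as ipc  # lazy import: only needed when decoding

    return ipc.open_stream(payload).read_all()
```

With the server running, table = read_arrow_stream(post_coords()) followed by print(table.schema) and print(table.num_rows) shows what the renderer receives.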

The per-cell-type roster:

curl -s http://localhost:18901/api/describe | python -m json.tool

…lists every distinct leiden_label with its cell count.

Step 6 — Open the SPA

In a browser:

http://localhost:18901/pbmc_3k/

You should see the STELLAR header (teal accent, project title from stellar.yaml), the UMAP coloured by cell type, and the Intro + UMAP tabs. Other tabs (DE, Network, …) are absent — they're gated on the module being enabled in stellar.yaml.

Step 7 — Tear down

Ctrl-C in the uvicorn terminal stops the server. Re-running stellar serve picks up the exact same state: everything the server needs lives on disk under data/, and only the Python process itself is ephemeral.

To start fresh:

rm -rf data/lance data/duckdb data/static data/parquet
stellar ingest --config examples/pbmc_3k/stellar.yaml

Going further

  • Enable a module: pick one from Modules, edit stellar.yaml, drop the input parquet under the module's source_dir, and re-run stellar ingest.
  • Build your own atlas: stellar init my_atlas scaffolds a fresh project ready for your own h5ad.
  • Deploy: see Deploy for the nginx + systemd recipes.