Standardize and append a batch of data#

Here, we’ll learn

  • how to standardize a less well curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.settings.verbosity = "hint"
bt.settings.auto_save_parents = False
💡 connected lamindb: testuser1/test-scrna
ln.transform.stem_uid = "ManDYgmftZ8C"
ln.transform.version = "1"
ln.track()
💡 Assuming editor is Jupyter Lab.
💡 notebook imports: bionty==0.42.3 lamindb==0.69.2
💡 saved: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type=notebook, updated_at=2024-03-26 12:02:46 UTC, created_by_id=1)
💡 saved: Run(uid='0SDPuvwMMvHbz8XZ6Kxl', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_0SDPuvwMMvHbz8XZ6Kxl.txt

Standardize a data shard#

Let’s now consider a collection with less well curated features:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"
validator = ln.Validate.from_anndata(adata, var_field=bt.Gene.symbol, obs_fields={"cell_type": bt.CellType.name})
3 non-validated features are not registered with Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup categories, use .lookup().['feature']
      → to register, run register_features(validated_only=False)
✅ registered 5 labels from public with Gene.symbol: ['GPX1', 'SOD2', 'RN7SL1', 'SNORD3B-2', 'IGLL5']
11 non-validated labels are not registered with Gene.symbol: ['RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5']!
      → to lookup categories, use .lookup().['variables']
      → to register, set validated_only=False

Standardize & validate genes #

This data shard is indexed by gene symbols, which we’ll want to map to Ensembl ids.

Let’s convert the symbols to Ensembl ids via .standardize(). Note that this mapping is ambiguous: a symbol can match more than one Ensembl id, and the first match is kept because the keep arg of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# We only want to register data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
💡 standardized 754/765 terms
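
To make the "first match is kept" behavior and the subsequent filtering concrete, here is a minimal sketch in plain Python. It does not call bionty: the reference table, the gene symbols, and the helper build_first_match_lookup are all made up for illustration. It also assumes that symbols without a match pass through unchanged, which is consistent with the 754/765 count above but remains an assumption of this sketch.

```python
# Toy illustration of resolving an ambiguous symbol -> id mapping by keeping
# the first match. The table is hypothetical, not taken from bionty or Ensembl.
reference = [
    ("GENE_A", "ENSG_TOY_001"),
    ("GENE_A", "ENSG_TOY_002"),  # second id for the same symbol (ambiguous)
    ("GENE_B", "ENSG_TOY_003"),
]


def build_first_match_lookup(pairs):
    """Return {symbol: id}, keeping the first id seen for each symbol."""
    lookup = {}
    for symbol, gene_id in pairs:
        lookup.setdefault(symbol, gene_id)  # later duplicates are ignored
    return lookup


lookup = build_first_match_lookup(reference)
known_ids = {gene_id for _, gene_id in reference}

symbols = ["GENE_A", "GENE_B", "GENE_UNKNOWN"]
# Assumption: symbols without a match are passed through unchanged.
mapped = [lookup.get(symbol, symbol) for symbol in symbols]
# Validate against the id field and keep only entries that are real ids.
validated = [value in known_ids for value in mapped]
kept = [value for value, ok in zip(mapped, validated) if ok]

print(mapped)
print(kept)
```

The unmatched symbol survives the mapping step but fails validation against the id field, which is why the boolean mask is needed before subsetting.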

Here, we’ll also keep .raw, subset to the validated genes and re-indexed by Ensembl id:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
validator = ln.Validate.from_anndata(adata_validated, var_field=bt.Gene.ensembl_gene_id, obs_fields={"cell_type": bt.CellType.name})
3 non-validated features are not registered with Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup categories, use .lookup().['feature']
      → to register, run register_features(validated_only=False)
validator.validate()
💡 inspecting 'variables' by Gene.ensembl_gene_id
✅    all variabless are validated
💡 inspecting 'cell_type' by CellType.name
9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → register terms via .register_labels('cell_type')
False

Standardize & validate cell types #

None of the cell types can be automatically registered:

validator.register_labels("cell_type")
9 non-validated labels are not registered with CellType.name: ['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']!
      → to lookup categories, use .lookup().['cell_type']
      → to register, set validated_only=False

Let’s search the public ontology for each cell type name and add the name found in the AnnData object as a synonym of the top match.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
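
The idea of a search-based mapper can be sketched without a database. The example below uses difflib from the Python standard library as a stand-in for the ontology search; it is not how bionty ranks results, and both the ontology names and the observed names are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a public ontology: a few standardized names.
ontology_names = ["dendritic cell", "natural killer cell", "B cell"]

# Free-text names as they might appear in a dataset (made up for this sketch).
observed_names = ["Dendritic cells", "B cells"]

# Compare case-insensitively, but return the ontology's original spelling.
by_lowercase = {name.lower(): name for name in ontology_names}

name_mapper = {}
for observed in observed_names:
    # n=1 keeps only the top-ranked candidate; cutoff=0.0 always returns one.
    top = get_close_matches(observed.lower(), list(by_lowercase), n=1, cutoff=0.0)[0]
    name_mapper[observed] = by_lowercase[top]

print(name_mapper)
```

A top-ranked hit is a heuristic, not a guarantee of a correct match, so it is worth inspecting a mapper like this before saving synonyms based on it.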

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
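
In this notebook the mapper covers every name because it was built from the unique values of the same column. When a mapper is reused on another shard, that may no longer hold, and pandas Series.map with a dict turns uncovered values into NaN. A minimal plain-Python sketch of a coverage check follows; the mapper entries and column values are illustrative only and are not the notebook's actual search results.

```python
# Illustrative mapper and column values (assumptions for this sketch).
name_mapper = {
    "CD19+ B": "B cell, CD19-positive",
    "Dendritic cells": "dendritic cell",
}
cell_types = ["CD19+ B", "Dendritic cells", "CD34+"]

# Names that the mapper does not cover would be lost by a plain dict lookup.
missing = sorted(set(cell_types) - set(name_mapper))

# Fall back to the original name so nothing is silently dropped.
standardized = [name_mapper.get(name, name) for name in cell_types]

print(missing)
print(standardized)
```

Names reported as missing would still fail validation afterwards, but they remain visible instead of turning into empty values.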

Now, all cell types are validated:

validator.validate()
💡 inspecting 'variables' by Gene.ensembl_gene_id
✅    all variabless are validated
💡 inspecting 'cell_type' by CellType.name
✅    all cell_types are validated
True

Register #

artifact = validator.register_artifact(description="10x reference adata")
💡    path content will be copied to default storage upon `save()` with key `None` ('.lamindb/cD1MwWlcJMDdjKYODq2Y.h5ad')
✅    storing artifact 'cD1MwWlcJMDdjKYODq2Y' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/cD1MwWlcJMDdjKYODq2Y.h5ad'
💡    parsing feature names of X stored in slot 'var'
754 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='XLsQYrWC54iwHYMFvtsT', n=754, type='number', registry='bionty.Gene', hash='j8QkIeLBgJwsscY4vVPx', created_by_id=1)
💡 parsing feature names of slot 'obs'
1 term (25.00%) is validated for name
3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='PUm82besGePEuqcruz5h', n=1, registry='core.Feature', hash='vyi9dafdTLuO6cTBncnM', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ registered artifact in testuser1/test-scrna
artifact.view_lineage()
_images/705d4192ce010df7ba7e1b51c7c59b0ddef452ad7e01a7c59d24eeb344208f4c.svg

Append the shard to the collection#

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifact],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
✅ loaded: FeatureSet(uid='c5eCX4WUUMsPX69Enzc4', n=4, registry='core.Feature', hash='taDIZs0vy7CqVljgeqVZ', updated_at=2024-03-26 12:02:39 UTC, created_by_id=1)
💡 adding collection [1] as input for run 2, adding parent transform 1
💡 adding artifact [1] as input for run 2, adding parent transform 1
✅ saved 1 feature set for slot: 'var'
💡 transferring cell_type
💡 transferring donor
💡 transferring tissue
💡 transferring cell_type
💡 transferring assay
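
As a mental model for the step above, the toy classes below show a new version that is built from the new shard plus the shards of the previous version, with labels merged from both sources. ToyArtifact, ToyCollection, and new_version are hypothetical names for this sketch and say nothing about how lamindb implements collections internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyArtifact:
    uid: str
    labels: set = field(default_factory=set)


@dataclass
class ToyCollection:
    name: str
    version: str
    artifacts: list
    previous: Optional["ToyCollection"] = None
    labels: set = field(default_factory=set)


def new_version(previous, new_artifact):
    """Create the next version: the new shard plus all shards of the previous one."""
    collection = ToyCollection(
        name=previous.name,
        version=str(int(previous.version) + 1),
        artifacts=[new_artifact, *previous.artifacts],
        previous=previous,
    )
    # Labels are the union over everything the new version is built from.
    collection.labels |= new_artifact.labels
    collection.labels |= previous.labels
    return collection


shard_1 = ToyArtifact("shard-1", {"tissue:blood", "cell_type:T cell"})
v1 = ToyCollection("toy collection", "1", [shard_1], labels=set(shard_1.labels))
shard_2 = ToyArtifact("shard-2", {"cell_type:dendritic cell"})
v2 = new_version(v1, shard_2)

print(v2.version, [a.uid for a in v2.artifacts], sorted(v2.labels))
```

The previous version is left untouched, so both versions remain queryable side by side.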

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='IXzSUQjHF6tJklaahF5o', name='My versioned scRNA-seq collection', version='2', hash='HNR3VFV60_yqRnUka11E', visibility=1, updated_at=2024-03-26 12:03:09 UTC)

Provenance:
  💫 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type=notebook, updated_at=2024-03-26 12:02:46 UTC, created_by_id=1)
  👣 run: Run(uid='0SDPuvwMMvHbz8XZ6Kxl', started_at=2024-03-26 12:02:46 UTC, is_consecutive=True, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-26 12:00:16 UTC)
Features:
  var: FeatureSet(uid='JEqpMZS8zoJ3MS1sRj3e', n=36508, type='number', registry='bionty.Gene', hash='b5NMddLHEyZqn-vSYvBI', updated_at=2024-03-26 12:03:08 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='c5eCX4WUUMsPX69Enzc4', n=4, registry='core.Feature', hash='taDIZs0vy7CqVljgeqVZ', updated_at=2024-03-26 12:02:39 UTC, created_by_id=1)
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 cell_type (40, bionty.CellType): 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
    🔗 assay (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1'
Labels:
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (40, bionty.CellType): 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'CD4-positive, alpha-beta T cell', 'classical monocyte', 'T follicular helper cell', ...
  🏷️ experimental_factors (3, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

View data lineage:

collection_v2.view_lineage()
_images/85bb05895339e494572a8ce6c672297a6b2bc4c4ccf6195c4c9799e9c70372c5.svg