A scanpy extension to analyse single-cell TCR and BCR data.

Overview

Scirpy: A Scanpy extension for analyzing single-cell immune-cell receptor sequencing data

Badges: Build Status | Documentation Status | PyPI | Bioconda | AIRR-compliant | black (the uncompromising Python formatter)

Scirpy is a scalable Python toolkit for analyzing T-cell receptor (TCR) and B-cell receptor (BCR) repertoires from single-cell RNA-sequencing (scRNA-seq) data. It seamlessly integrates with the popular scanpy library and provides modules for data import, analysis, and visualization.
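
A minimal sketch of a typical workflow (function names as documented in scirpy; the input path is a placeholder):

    import scirpy as ir

    # placeholder path: any 10x Cell Ranger V(D)J output works here
    adata = ir.io.read_10x_vdj("filtered_contig_annotations.csv")

    ir.pp.ir_dist(adata)            # sequence distances (identity on nucleotide sequences by default)
    ir.tl.define_clonotypes(adata)  # assign a clone_id to each cell
    ir.tl.clonotype_network(adata)  # compute a layout for the clonotype network
    ir.pl.clonotype_network(adata)  # plot it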

[Figure: the scirpy workflow]

Getting started

Please refer to the documentation, in particular the tutorials and the API documentation.

In the documentation, you can also learn more about our immune-cell receptor model.

Case-study

The case study from our preprint is available here.

Installation

You need to have Python 3.7 or newer installed on your system. If you don't have Python installed, we recommend installing Miniconda.

There are several options for installing scirpy:

  1. Install the latest release of scirpy from PyPI:

     pip install scirpy

  2. Get it from Bioconda:

     conda install -c conda-forge -c bioconda scirpy

  3. Install the latest development version:

     pip install git+https://github.com/icbi-lab/scirpy.git@master

  4. Run it in a container using Docker or Podman:

     docker pull quay.io/biocontainers/scirpy:<tag>

where <tag> is one of the available tags.
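
For example, to verify that the container works (keeping <tag> as a placeholder):

    docker run --rm quay.io/biocontainers/scirpy:<tag> python -c "import scirpy; print(scirpy.__version__)"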

Support

We are happy to assist with problems when using scirpy. Please report any bugs, feature requests, or help requests using the issue tracker. We try to respond within two working days; however, fixing bugs or implementing new features can take substantially longer, depending on the availability of our developers.

Release notes

See the release section.

Contact

Please use the issue tracker.

Citation

Sturm, G., Szabo, T., Fotakis, G., Haider, M., Rieder, D., Trajanoski, Z., Finotello, F. (2020). Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data. Bioinformatics. doi:10.1093/bioinformatics/btaa611
Comments
  • Vdj plot - [merged]

    In GitLab by @szabogtamas on Mar 25, 2020, 19:57

    Merges vdj_plot -> master

    Added a much faster version of #24
    Fixes #24.

    Test cases still need to be added.

    opened by grst 70
  • Issues with installing SCIRPY

    I am trying to install Scirpy using Anaconda/Jupyter on my Windows desktop.

    **When I try this: conda install -c conda-forge -c bioconda scirpy

    I got the error message:**

    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    Collecting package metadata (repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working... 
    Found conflicts! Looking for incompatible packages.
    This can take several minutes.  Press CTRL-C to abort.
    failed
    
    Note: you may need to restart the kernel to use updated packages.
    
    
    Building graph of deps:   0%|          | 0/2 [00:00<?, ?it/s]
    Examining python=3.8:   0%|          | 0/2 [00:00<?, ?it/s]  
    Examining scirpy:  50%|#####     | 1/2 [00:00<00:00,  2.94it/s]
    Examining scirpy: 100%|##########| 2/2 [00:00<00:00,  5.88it/s]
                                                                   
    
    Determining conflicts:   0%|          | 0/2 [00:00<?, ?it/s]
    Examining conflict for python scirpy:   0%|          | 0/2 [00:00<?, ?it/s]
                                                                               
    
    UnsatisfiableError: The following specifications were found to be incompatible with each other:
    
    Output in format: Requested package -> Available versions
    

    Then I tried this: pip install scirpy, and got another error message:

    ERROR: Command errored out with exit status 1:
       command: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\tpeng\AppData\Local\Temp\pip-wheel-3n51vnw0'
           cwd: C:\Users\tpeng\AppData\Local\Temp\pip-install-jpyjqkeg\python-levenshtein\
      Complete output (27 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.8
      creating build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\__init__.py -> build\lib.win-amd64-3.8\Levenshtein
      running egg_info
      writing python_Levenshtein.egg-info\PKG-INFO
      writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
      writing entry points to python_Levenshtein.egg-info\entry_points.txt
      writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
      writing requirements to python_Levenshtein.egg-info\requires.txt
      writing top-level names to python_Levenshtein.egg-info\top_level.txt
      reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no previously-included files matching '*pyc' found anywhere in distribution
      warning: no previously-included files matching '*so' found anywhere in distribution
      warning: no previously-included files matching '.project' found anywhere in distribution
      warning: no previously-included files matching '.pydevproject' found anywhere in distribution
      writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
      copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.8\Levenshtein
      running build_ext
      building 'Levenshtein._levenshtein' extension
      error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
      ----------------------------------------
      ERROR: Failed building wheel for python-levenshtein
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\tpeng\AppData\Local\Temp\pip-record-spj3ycwj\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\tpeng\Anaconda3\Include\python-levenshtein'
             cwd: C:\Users\tpeng\AppData\Local\Temp\pip-install-jpyjqkeg\python-levenshtein\
        Complete output (27 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.8
        creating build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\__init__.py -> build\lib.win-amd64-3.8\Levenshtein
        running egg_info
        writing python_Levenshtein.egg-info\PKG-INFO
        writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
        writing entry points to python_Levenshtein.egg-info\entry_points.txt
        writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
        writing requirements to python_Levenshtein.egg-info\requires.txt
        writing top-level names to python_Levenshtein.egg-info\top_level.txt
        reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no previously-included files matching '*pyc' found anywhere in distribution
        warning: no previously-included files matching '*so' found anywhere in distribution
        warning: no previously-included files matching '.project' found anywhere in distribution
        warning: no previously-included files matching '.pydevproject' found anywhere in distribution
        writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.8\Levenshtein
        running build_ext
        building 'Levenshtein._levenshtein' extension
        error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\tpeng\AppData\Local\Temp\pip-record-spj3ycwj\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\tpeng\Anaconda3\Include\python-levenshtein' Check the logs for full command output.
    
      Stored in directory: c:\users\tpeng\appdata\local\pip\cache\wheels\8c\51\cb\423184c62cc06c302d2f54f9853e5acee6c15a3b04d49a5eb3
      Building wheel for python-levenshtein (setup.py): started
      Building wheel for python-levenshtein (setup.py): finished with status 'error'
      Running setup.py clean for python-levenshtein
      Building wheel for yamlordereddictloader (setup.py): started
      Building wheel for yamlordereddictloader (setup.py): finished with status 'done'
      Created wheel for yamlordereddictloader: filename=yamlordereddictloader-0.4.0-py3-none-any.whl size=4058 sha256=05980b7e37960621917874dd26c59adf1eb9d304ca1f21d7c81d38adc8bc2674
      Stored in directory: c:\users\tpeng\appdata\local\pip\cache\wheels\50\9a\6f\9cb3312fd9cd01ea93c3fdc1dbee95f5fa0133125d4c7cb09a
    Successfully built airr yamlordereddictloader
    Failed to build python-levenshtein
    Installing collected packages: yamlordereddictloader, airr, python-levenshtein, squarify, parasail, pytoml, scirpy
        Running setup.py install for python-levenshtein: started
        Running setup.py install for python-levenshtein: finished with status 'error'
    
    question 
    opened by taopeng1100 40
  • List of plots [REPLACEMENT ISSUE]

    The original issue

    Id: 9
    Title: List of plots
    

    could not be created. This is a dummy issue replacing the original one. It contains everything but the original issue description. If the GitLab repository still exists, visit the following link to view the original issue:

    TODO

    opened by grst 38
  • Memory usage ir_neighbors

    Dear authors,

    Currently, the ir_neighbors algorithm consumes over 100 GB of memory on data from 55k cells, causing memory failures on the server. Is this normal behavior, and is there any way to limit memory consumption? Which parameters can I adapt to control memory usage?

    Kind Regards,

    opened by vladie0 27
  • Support for BCR

    BCR support meta issue.

    Initial PR #183 addresses:

    • [x] change data structure
    • Instead of TRA_1, TRA_2, TRB_1 and TRB_2 have arm1_primary, arm1_secondary, arm2_primary and arm2_secondary
    • have additional 4 columns: arm1_primary_type, ...; These can accept values such as TRA, TRB, TRG, TRD, IGH, IGL (see AIRR locus names)
    • [x] adapt chain_categories to identify bona fide vs other pairings (e.g. flag a cell with TRA + IGH).
    • [x] clonotype network: separate by receptor_type (by default, no connections between BCR and TCR)
    • [x] update glossary and documentation. Make clear that now there are VJ and VDJ chains. (mostly done)
    • [x] rename tcr_ to cdr3_ or receptor_ or vdj_ or ir_

    To be resolved before next release (v0.5 "adding experimental BCR support")

    • [x] #194 vdj_usage is broken with new data structure
    • [x] #195 improve BCR-related documentation
    • [x] read_tracer() with gamma/delta
    • [x] read_bracer()
    • [x] #198 Add BCR example dataset

    BCR-related issues that can be resolved at a later point

    • [ ] #197 function to infer antibody class
    • [ ] #196 function to infer somatic hypermutation status
    • [ ] #199 BCR-tutorial
    • [ ] add support for CDR1 and 2 (#185)
    opened by grst 27
  • Find better name

    In GitLab by @grst on Mar 20, 2020, 18:16

    The current name, sctcrpy, is hard to pronounce and remember.

    Also, it would be nice if the name left open the option to expand to BCRs later on.

    imm, sc, py, receptor, cr, ... ??

    opened by grst 21
  • Cannot convert output from Scirpy to dandelion

    Description of the bug

    I am trying to convert the AnnData object, after clonal assignment by Scirpy, into dandelion format. The conversion fails with "field productive has invalid bool T + T". I would like to convert it in order to update the germline sequence of each BCR sequence using dandelion, because I did not find this function in Scirpy. However, if you could suggest other ways, feel free to let me know.

    Minimal reproducible example

    import scirpy as ir
    
    ABC_irdata_exclude_orphan_dandelion = ir.io.to_dandelion(ABC_irdata_exclude_orphan)
    ABC_irdata_exclude_orphan_dandelion
    

    The error message produced by the code above

    ~/.conda/envs/dandelion/lib/python3.8/site-packages/airr/schema.py in validate_row(self, row)
        276                 if spec == 'number':  self.to_float(row[f], validate=True)
        277             except ValidationError as e:
    --> 278                 raise ValidationError('field %s has %s' %(f, e))
        279 
        280         return True
    
    ValidationError: field productive has invalid bool T + T
    

    Version information

    
    
    bug 
    opened by sbenjamaporn 19
  • Ranking genes between specific clusters in clonotype network

    Dear ICBI lab,

    I have some questions regarding the Scirpy package; I would really appreciate your help. For my analyses, I first merged the TCR and transcriptomics data for each of two samples (organ 1 and organ 2), and subsequently merged these two files.

    1. a) If, using the Scirpy clonotype_network tool, 'sequence' is set to 'nt', are the clusters in the clonotype network (where each node represents a cell) based on identical nucleotide sequences, or on similarity, meaning that one cluster could consist of cells with slightly different nucleotide sequences?

    As I understand, each node represents a cell, and edges connect cells belonging to the same clonotype. The function makes visualization of the clonotype network possible, analogous to the construction of a neighborhood graph from transcriptomics data with the Scanpy package; based on the above, it computes a neighborhood graph of CDR3 nucleotide sequences with scirpy.pp.tcr_neighbors(). However, I couldn't find the answer to my question, and I was hoping you could help me out.

    b) Do the lines that connect the cells within a cluster have any meaning?

    [Screenshot: clonotype network plot]

    c) In the plot above, is it correct that the closer the different clusters are together, the more similar their nucleotide sequences are? Meaning that the sequences of the clonotypes consisting of only 2 cells in this case are most different from the clonotype clusters consisting of >5 cells (as they are further apart)?

    2. a) The package allows for specifying what organs the clusters consist of:

    [Screenshot: clonotype network colored by sample]

    My question is: how can I select specific clusters in the clonotype network graph? I would only like to include the clonotypes that are shared between the two samples from my data (i.e. organ 1 and organ 2), meaning that in the plot above I would like to filter for only those clusters that display both blue and orange nodes.

    b) How can I assign numbers to my clusters based on identical nucleotide sequences shared between two samples?

    c) How can I add a legend to my plot, and how can I change the name 'batch' into 'sample'?

    3. Using Scirpy, how can one best identify differentially expressed genes (based on the transcriptomics data) between the different clusters based on shared nucleotide sequences between blood and fat, for example cluster 1 vs. the rest of the clusters, or clusters 1 and 2 vs. clusters 3 and 4? I have tried implementing Scanpy's tool to rank genes using Wilcoxon, but unfortunately I can't make it work.

    sc.tl.rank_genes_groups(adata, 'clonotype', groups=['1','2'], reference=['3','4'], method='wilcoxon')
    sc.pl.rank_genes_groups(adata, groups=['1', '2'], n_genes=20)

    • The key of the observations grouping to be considered would be: "clonotype clusters based on shared identical nucleotide sequences between organ 1 and organ 2".
    • Subset of groups to which the comparison would be restricted: "clonotype 1 and clonotype 2".
    • Comparison: compare with respect to a specific group.
    • Group identifier with respect to which to compare: "clonotype 3 and clonotype 4".
    • "The number of genes that appear in the returned tables": 100.
    • "Method": Wilcoxon rank-sum.
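
    A hedged sketch of one way to make this comparison work (the ct_group helper column is illustrative, not a scirpy feature): scanpy's reference argument expects a single group label, so clonotype groups 1+2 and 3+4 first need to be combined into one grouping column.

        import scanpy as sc

        # `adata` and its `clonotype` column come from the workflow above
        adata.obs["ct_group"] = "other"
        adata.obs.loc[adata.obs["clonotype"].isin(["1", "2"]), "ct_group"] = "ct_1_2"
        adata.obs.loc[adata.obs["clonotype"].isin(["3", "4"]), "ct_group"] = "ct_3_4"

        sc.tl.rank_genes_groups(adata, "ct_group", groups=["ct_1_2"], reference="ct_3_4", method="wilcoxon")
        sc.pl.rank_genes_groups(adata, groups=["ct_1_2"], n_genes=20)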

    Thanks in advance,

    Josine

    question 
    opened by josinejansen 19
  • BD Rhapsody data import

    BD Rhapsody data import

    I have a question about BD Rhapsody CDR3 VDJ data import. Following the scirpy introduction, the imported data object is very similar to BD VDJ data.

    If you are interested in BD VDJ data, please check the possibility of importing BD VDJ data into scirpy. I can share my data with you.

    opened by wajm 18
  • Plot overhaul - [merged]

    In GitLab by @grst on Mar 2, 2020, 14:21

    Merges feat/plot-overhaul -> master

    This PR aims at addressing issues in the "plot overhaul" milestone.

    • [x] Simplify and restructure tools (Don't add to uns when inexpensive, Fixes #25)
    • [x] Rudimentary support for figure themes (atm, only a default theme is supported, but can be extended easily, Fixes #18)
    • [ ] Simplify and restructure plots

    Plot checklist: Make sure every plotting function

    • [x] accepts
      • styling kwargs
      • ax object
      • (this is implicitly given by kwargs forwarding. Maybe requires better documentation)
    • [x] returns ax object
    • [x] has sensible defaults for ax labelling and title
    opened by grst 17
  • List of Tools

    In GitLab by @grst on Jan 30, 2020, 12:59

    Tools are functions that work with the data parsed from 10x/tracer and add any of the following (see the toy example after this list):

    • new columns to obs
    • new matrices to obsm (e.g. distance matrices)
    • other summary data to uns.
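
    A toy illustration of this convention (not a real tool):

        import numpy as np
        import anndata as ad

        adata = ad.AnnData(X=np.zeros((3, 2)))
        adata.obs["clonotype"] = ["ct1", "ct1", "ct2"]    # new column in obs
        adata.obsm["X_tcr_dist"] = np.zeros((3, 3))       # per-cell matrix in obsm
        adata.uns["alpha_diversity"] = {"sample1": 1.58}  # summary data in uns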

    They are usually required as an additional processing step before running certain plotting functions. Here's a list of tools we want to implement.

    @szabogtamas, feel free to add to/edit the list.

    List of tools

    • [x] st.tl.define_clonotypes(adata) assigns clonotypes to cells based on their CDR3 sequences
    • [x] st.tl.tcr_dist(adata, chains=["TRA_1", "TRB_1"], combination=np.min) adds TCR dist to obsm (#11)
    • [x] st.tl.kidera_dist adds Kidera distances to obsm
    • [x] st.tl.chain_convergence(adata, groupby) adds a column to obs that contains the number of nucleotide versions for each CDR3 AA sequence
    • [x] st.tl.alpha_diversity(adata, groupby, diversityforgroup) So far, we have only been thinking about calculating the diversity of clonotypes in different groups, but the diversity of any group could just as well be calculated.
    • [ ] st.tl.sequence_logos(adata, ?forgroup?) Precompute MSAs and sequence logos for plotting with st.pl.sequence_logos.
    • [ ] st.tl.dendrogram(adata, groupby) Compute a dendrogram on an arbitrary distance matrix (e.g. from tcr_dist).

    Needs discussion

    • [ ] st.tl.create_group(group_membership={'Group1': ['barcode1', 'barcode2']}) adds a group membership to each cell by adding a column to obsm and the name of the grouping to a list in uns (by default, groups based on samples, V gene usage and even clonotypes could be created at initial run); might call the chain_convergence and alpha_diversity functions to calculate these measures right when creating a group

    Ideas, might be implemented at later stage

    • [ ] Shared Kmers
    • [ ] GLIPH
    • [ ] Chains recognizing the same epitopes based on McPAS-TCR
    • [ ] epitope reactivity -> query external database
    • [ ] tcellmatch (Fischer, Theis et al.)
    opened by grst 17
  • antigen specificity prediction

    Description of feature

    The first reasonable methods for predicting antigen specificity are emerging, e.g.

    • ERGO-II (https://www.frontiersin.org/articles/10.3389/fimmu.2021.664514/full, https://github.com/IdoSpringer/ERGO-II)

    These are conceptually different from querying databases through sequence distance metrics or autoencoders, as they do not simply model the sequence similarity, but explicitly model the specificity.

    It would be nice to call them directly from scirpy.

    @FFinotello, potentially another good student task.

    opened by grst 0
  • Scalability to >1M cells

    Description of feature

    I have been playing with Omniscope's COVID dataset, which provides 8M TCRs. By doing so, I identified several bottlenecks that make working with >1M cells in scirpy painful or impossible.

    This meta issue gives an overview of the progress in improving scirpy's scalability.

    graph TB
        subgraph legend
             legend1(could be faster -- minutes)
             OK(OK -- seconds)
             legend2(prohibitively slow -- hours)
             legend3(not profiled yet)
             style legend1 stroke:#ff7f00
             style OK stroke:#4daf4a
             style legend2 stroke:#e41a1c
        end
    
    graph TB
        subgraph preprocessing
          IO --> QC
          QC --> dist_id[ir_dist identity]
          QC --> dist_levenshtein[ir_dist levenshtein]
          QC --> dist_alignment[ir_dist alignment]
          dist_id --> define_clonotypes
          dist_levenshtein --> define_clonotypes
          dist_alignment --> define_clonotypes
          define_clonotypes --> clonotypes
          QC -.-> autoencoder
          autoencoder -.-> clonotypes
          autoencoder -.-> define_clonotypes
    
          clonotypes[(CLONOTYPES)]
          
          style IO stroke:#ff7f00
          style QC stroke:#4daf4a
          style dist_id stroke:#4daf4a
          style define_clonotypes stroke:#e41a1c
          style dist_levenshtein stroke:#e41a1c
          style dist_alignment stroke:#e41a1c
          style clonotypes stroke:white
       end
       
       subgraph downstream
          clonotypes --> clonotype_network
          clonotypes --> other[other tools]
       end
    

    Action items

    1. data structure (#356). The foundation for other changes. Might also speed up saving the anndata object.
    2. reading data (#367). User experience can be improved, but not a top priority atm.
    3. ir_dist (#304). Needs more scalable methods for computing sequence distances.
    4. define_clonotypes (#368). At the very least needs a better parallelization. Maybe there's room for some jax/numba.
    5. autoencoder-based embedding (#369). Possible alternative to ir_dist. Maybe it even makes sense to combine ir_dist and define_clonotypes into a single step.
    opened by grst 0
  • Autoencoder-based sequence embedding

    Description of feature

    IMO autoencoder-based sequence embedding has huge potential for finding similar immune receptors, potentially improving both the speed and the accuracy compared to alignment-based metrics. In particular, finding similar sequences is important in two scirpy functions:

    • defining clonotypes
    • querying immune receptor databases.

    For the database query, an online-update algorithm similar to scArches for gene expression would be nice: the autoencoder could be trained once on the database (which might have millions of unique receptors). A new dataset (which might only have 10k-100k unique receptors) could be projected into the same latent space as the database, significantly improving query time.

    An extension to this idea is to embed gene expression and TCR/BCR data into the same latent space.
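
    A hypothetical sketch of this query workflow (the encode function stands in for the trained autoencoder; all names are illustrative):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def encode(seqs):
            # stand-in for the autoencoder's encoder: maps sequences to latent vectors
            rng = np.random.default_rng(0)
            return rng.normal(size=(len(seqs), 16))

        database_seqs = ["CASSLGTDTQYF", "CASSIRSSYEQYF"] * 1000  # stand-in for a large reference DB
        dataset_seqs = ["CASSLGQDTQYF", "CASSPTSGGYEQYF"]         # small new dataset

        db_latent = encode(database_seqs)                    # computed once, offline
        index = NearestNeighbors(n_neighbors=10).fit(db_latent)

        query_latent = encode(dataset_seqs)                  # only the new dataset is embedded
        dist, idx = index.kneighbors(query_latent)           # fast lookup of similar receptors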

    Existing tools

    • Trex by @ncborcherding. Based on keras.
    • mvTCR by @b-schubert's lab. Combines receptor/Gex data. Based on pytorch.
    • TESSA. Combines receptor/Gex data. Not even sure it's an autoencoder, need yet to check in detail, but it seems to use some clever sequence embeddings.
    • There are likely more...

    @drEast mentioned a few months ago that he is working on something like that. Are you willing to share a few details, and would you be interested in integrating it with scirpy? @adamgayoso, any chance there's AirrVI soon? :stuck_out_tongue_winking_eye:

    opened by grst 7
  • speed up define_clonotypes

    Description of feature

    The define_clonotypes function scales badly. There are two problems with it:

    • it could be faster (while it relies heavily on numpy, there are parts implemented in Python)
    • parallelization doesn't work properly with large data. Due to how multiprocessing is implemented in Python, parallelization involves a lot of copying. If parallelization worked properly, the speed would still be bearable if one throws enough cores at the problem.

    Where's the bottleneck of the function?

    INPUT:

    • 2 distance matrices, one for unique VJ sequences, one for unique VDJ sequences

    OUTPUT:

    • a clonotype id for each cell

    CURRENT IMPLEMENTATION (steps 4 and 5 are sketched after this list):

    1. compute unique receptor configurations (i.e. combining cells with the same sequences into a single entry) (fast)
    2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
    3. loop through all unique receptor configurations and find neighbors (SLOW)
    4. build a distance matrix (fast)
    5. graph partition using igraph (fast)
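
    For reference, a toy sketch (assumed names) of steps 4-5, turning a sparse neighbor matrix into clonotype ids with igraph:

        import igraph as ig
        import numpy as np
        from scipy.sparse import csr_matrix

        # toy adjacency over 4 unique receptor configurations; 1 = within distance cutoff
        adj = csr_matrix(np.array([
            [0, 1, 0, 0],
            [1, 0, 0, 0],
            [0, 0, 0, 1],
            [0, 0, 1, 0],
        ]))
        sources, targets = adj.nonzero()
        g = ig.Graph(n=adj.shape[0], edges=list(zip(sources.tolist(), targets.tolist())))
        clonotype_ids = g.connected_components().membership  # -> [0, 0, 1, 1]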

    ALTERNATIVE IMPLEMENTATIONS I considered but discarded

    • reindexing sequence distance matrices such that they match the table of unique receptor configurations
    • Then perform matrix operations to combine primary/secondary and TRA/TRB matrices.
    • The problem with this approach is that large dense blocks in the sparse matrices can arise if many unique receptors have the same sequence (e.g. same TRA but different TRB).

    Possible solutions

    • fix parallelization (shared memory)
    • reimplement using jax/numba (this may also solve the parallelization and provide GPU support)
    • Combine 2-4 into a single step (maybe possible with sequence embedding -- see #369 ). Note that this would be an alternative route and wouldn't replace ir_dist/define_clonotypes completely.
    • Special-casing: In the case of omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix. If using only one pair of sequences per cell, the problem is likely also simpler.
    opened by grst 0
  • Speed up read_airr

    Description of feature

    Loading AIRR data with 1.5M rows takes ~10 minutes. This is not too bad, but it could be made less annoying:

    • Make validation optional (I expect the validation of the airr implementation takes a good chunk of that time)
    • Parallelization (read different parts of the file in parallel - or first read into pandas and parse several chunks of the dataframe in parallel; a sketch of chunked reading follows this list)
    • Progress bar (this at least shows that this will eventually finish)
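
    A sketch of the chunked-reading idea, assuming a tab-separated AIRR rearrangement file (path and chunk size are placeholders):

        import pandas as pd
        from tqdm import tqdm

        chunks = pd.read_csv("airr_rearrangements.tsv", sep="\t", chunksize=100_000)
        parts = [chunk for chunk in tqdm(chunks, desc="reading AIRR data")]  # progress bar per chunk
        airr_df = pd.concat(parts, ignore_index=True)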
    opened by grst 0
Releases(v0.11.2)
  • v0.11.2(Nov 20, 2022)

  • v0.11.1(Aug 18, 2022)

    Fixes

    • Solve incompatibility with scipy v1.9.0 (#360)

    Internal changes

    • do not autodeploy docs via CI (currently broken)
    • updated patched version of scikit-learn
  • v0.11.0(Jul 5, 2022)

    Additions

    • Add data loader for BD Rhapsody single-cell immune-cell receptor data (io.read_bd_rhapsody) (#351)

    Fixes

    • Fix type conversions in from_dandelion (#349).
    • Update minimal dandelion version

    Documentation

    • Rebranding to scverse (#324, #326)
    • Add issue templates
    • Fix IMGT typos (#344 by @emjbishop)

    Internal changes

    • Bump default CI python version to 3.9
    • Use patched version of scikit-bio in CI until https://github.com/biocore/scikit-bio/pull/1813 gets merged
  • v0.10.1(Nov 22, 2021)

  • v0.10.0(Nov 15, 2021)

    Additions

    This release adds a new feature to query reference databases (#298), comprising the following (a usage sketch follows this list):

    • an extension of pp.ir_dist to compute distances to a reference dataset,
    • tl.ir_query, to match immune receptors to a reference database based on the distances computed with ir_dist,
    • tl.ir_query_annotate and tl.ir_query_annotate_df to annotate cells based on the result of tl.ir_query, and
    • datasets.vdjdb which conveniently downloads and processes the latest version of VDJDB.
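
    A usage sketch of the new functions (assumes adata already contains IR data; the include_ref_cols argument reflects the docs as I recall them and may differ):

        import scirpy as ir

        vdjdb = ir.datasets.vdjdb()  # download and process the latest VDJDB
        ir.pp.ir_dist(adata, vdjdb, metric="identity", sequence="aa")
        ir.tl.ir_query(adata, vdjdb, metric="identity", sequence="aa")
        ir.tl.ir_query_annotate(adata, vdjdb, metric="identity", sequence="aa",
                                include_ref_cols=["antigen.species"])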

    Fixes

    • Bump minimal dependencies for networkx and tqdm (#300)
    • Fix issue with repertoire_overlap (Fix #302 via #305)
    • Fix issue with define_clonotype_clusters (Fix #303 via #305)
    • Suppress FutureWarnings from pandas in tutorials (#307)

    Internal changes

    • Update sphinx to >= 4.1 (#306)
    • Update black version
    • Update the internal folder structure: tl, pp etc. are now real packages instead of aliases
  • v0.9.1(Sep 24, 2021)

    Fixes

    • Scirpy can now import additional columns from Cellranger 6 (#279 by @naity)
    • Fix minor issue with include_fields in AirrCell (#297)

    Documentation

    • Fix broken link in README (#296)
    • Add developer documentation (#294)
  • v0.9.0(Sep 7, 2021)

    Additions

    • Add the new "clonotype modularity" tool which ranks clonotypes by how strongly connected their gene expression neighborhood graph is. (#282).

    The example below shows three clonotypes (164, 1363, 942), two of which consist of cells that are transcriptionally related.
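
    A minimal usage sketch (assumes adata already has clonotypes defined and a transcriptomics neighborhood graph from sc.pp.neighbors):

        import scirpy as ir

        ir.tl.clonotype_modularity(adata)  # ranks clonotypes by the connectivity of their GEX neighborhood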

    [Figures: example clonotypes; clonotype modularity vs. FDR]

    Deprecations

    • tl.clonotype_imbalance is now deprecated in favor of the new clonotype modularity tool.

    Fixes

    • Fix calling locus from gene name in some cases (#288)
    • Compatibility with networkx>=2.6 (#292)

    Minor updates

    • Fix some links in README (#284)
    • Fix old instances of clonotype in docs (should be clone_id) (#287)
  • v0.8.0(Jul 22, 2021)

    Additions

    • tl.alpha_diversity now supports all metrics from scikit-bio, the D50 metric and custom callback functions (#277 by @naity)

    Fixes

    • Handle input data with "productive" chains which don't have a junction_aa sequence annotated (#281)
    • Fix issue with serialized "extra chains" not being imported correctly (#283 by @zktuong)

    Minor changes

    • The CI can now build documentation from pull-requests from forks. PR docs are not deployed to github-pages anymore, but can be downloaded as artifact from the CI run.
  • v0.7.1(Jul 2, 2021)

    Fixes

    • Ensure compatibility with the latest version of dandelion (e78701c)
    • Add links to older versions of the documentation (#275)
    • Fix an issue where clonotype analysis couldn't be continued after saving and reloading an h5ad object (#274)
    • Allow "None" values to be present as cell-level attributes during merge_airr_chains (#273)

    Minor changes

    • Require anndata >= 0.7.6 in conda tests (#266)
  • v0.7.0(Apr 28, 2021)

    This update features:

    • a change of Scirpy's data structure to improve interoperability with the AIRR standard, and
    • a complete rewrite of the clonotype definition module for improved performance.

    This required several backwards-incompatible changes. Please read the release notes below and the updated tutorials.

    Backwards-incompatible changes

    Improve Interoperability by fully supporting the AIRR standard (#241)

    Scirpy stores receptor information in adata.obs. In this release, we updated the column names to match the AIRR Rearrangement standard. Our data model is now much more flexible, allowing the import of arbitrary immune-receptor (IR) chain-related information. Use scirpy.io.upgrade_schema() to update existing AnnData objects to the latest format.
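
    For example (placeholder path), upgrading an object created with an older scirpy version:

        import scanpy as sc
        import scirpy as ir

        adata = sc.read_h5ad("old_scirpy_object.h5ad")  # object created with scirpy < 0.7
        ir.io.upgrade_schema(adata)  # renames adata.obs columns in place to the AIRR-based schema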

    Closed issues #240, #253, #258, #255, #242, #215.

    This update includes the following changes:

    • IrCell is now replaced by AirrCell which has additional functionality
    • IrChain has been removed. Use a plain dictionary instead.
    • CDR3 information is now read from the junction and junction_aa columns instead of cdr3_nt and cdr3, respectively.
    • Clonotype assignments are now stored by default in the clone_id column.
    • expr and expr_raw are now duplicate_count and consensus_count.
    • {v,d,j,c}_gene is now {v,d,j,c}_call.
    • There's now an extra_chains column containing all IR-chains that don't fit into our receptor model. These chains are not used by scirpy, but can be re-exported to different formats.
    • merge_with_ir is now split up into merge_with_ir (to merge IR data with transcriptomics data) and merge_airr_chains (to merge several adatas with IR information, e.g. BCR and TCR data).
    • Tutorial and documentation updates, to reflect these changes
    • Sequences are not converted to upper case on import. Scirpy tools that consume the sequences convert them to upper case on-the-fly.
    • {to,from}_ir_objs has been renamed to {to,from}_airr_cells.

    Refactor CDR3 network creation (#230)

    Previously, pp.ir_neighbors constructed a cell x cell network based on clonotype similarity. This led to performance issues with highly expanded clonotypes (i.e. thousands of cells with exactly the same receptor configuration). Such cells would form dense blocks in the sparse adjacency matrix (see issue #217). Another downside was that expensive alignment-distances had to be recomputed every time the parameters of ir_neighbors was changed.

    The new implementation computes distances between all unique receptor configurations, only considering one instance of highly expanded clonotypes.
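
    In code, the new two-step workflow looks roughly like this (assumes adata holds IR data):

        import scirpy as ir

        ir.pp.ir_dist(adata, metric="identity", sequence="nt")  # distances between unique sequences
        ir.tl.define_clonotypes(adata, receptor_arms="all", dual_ir="any")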

    Closed issues #243, #217, #191, #192, #164.

    This update includes the following changes:

    • pp.ir_neighbors has been replaced by pp.ir_dist.
    • The options receptor_arms and dual_ir have been moved from pp.ir_neighbors to tl.define_clonotypes and tl.define_clonotype_clusters.
    • The default key for clonotype clusters is now cc_{distance}_{metric} instead of ct_cluster_{distance}_{metric}.
    • same_v_gene now fully respects the options dual_ir and receptor_arms
    • v-genes and receptor types were previously simply appended to clonotype ids (when same_v_gene=True). Now clonotypes with different v-genes get assigned a different numeric id.
    • Distance metric classes have been moved from ir_dist to ir_dist.metrics.
    • Distances matrices generated by ir_dist are now square and symmetric instead of triangular.
    • The default value for dual_ir is now any instead of primary_only (Closes #164).
    • The API of clonotype_network has changed.
    • Clonotype network now visualizes cells with identical receptor configurations. The number of cells with identical receptor configurations is shown as point size (and optionally, as color). Clonotype network does not support plotting multiple colors at the same time any more.

    | Clonotype network (previous implementation) | Clonotype network (now) |
    | ------------------------------------------- | ----------------------- |
    | Each dot represents a cell. Cells with identical receptors form a fully connected subnetwork. | Each dot represents cells with identical receptors. The dot size refers to the number of cells. |
    | (image) | (image) |

    Drop Support for Python 3.6

    • Support Python 3.9, drop support for Python 3.6, following the numpy guidelines. (#229)

    Fixes

    • tl.clonal_expansion and tl.clonotype_convergence now respect cells with missing receptors and return nan for those cells. (#252)

    Additions

    • util.graph.igraph_from_sparse_matrix allows converting a sparse connectivity or distance matrix to an igraph object.
    • ir_dist.sequence_dist now also works with sequence arrays that contain duplicate entries (#192)
    • from_dandelion and to_dandelion facilitate interaction with the Dandelion package (#240)
    • write_airr allows writing scirpy's adata.obs back to the AIRR Rearrangement format (see the sketch after this list).
    • read_airr now tries to infer the locus from gene names if no locus column is present.
    • ir.io.upgrade_schema allows upgrading an existing scirpy AnnData object to be compatible with the latest version of scirpy
    • define_clonotypes and define_clonotype_clusters now print a logging message indicating where the results have been stored (#215)
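
    A short round-trip sketch for write_airr/read_airr (placeholder filename; assumes adata holds IR data):

        import scirpy as ir

        ir.io.write_airr(adata, "rearrangements.tsv")   # export adata.obs as AIRR Rearrangement TSV
        adata2 = ir.io.read_airr("rearrangements.tsv")  # re-import; locus is inferred from gene names if absent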

    Minor changes

    • tqdm now uses IPython widgets to display progress bars, if available
    • the process_map from tqdm is now used to display progress bars for parallel computations instead of the custom implementation used previously (f307c2b)
    • matplotlib's grid lines are now suppressed by default in all plots.
    • Docs from the master branch are now deployed to icbi-lab.github.io/scirpy/develop instead of the main documentation website. The main website only gets updated on releases.
    • Refactored the _is_na function that checks if a string evaluates to None.
    • Fixed outdated documentation of the receptor_arms parameter (#264)
  • v0.6.1(Jan 30, 2021)

    Fixes

    • Fix an issue where define_clonotype failed when the clonotype network had no edges (#236).
    • Require pandas >= 1.0 and fix a pandas incompatibility in merge_with_ir (#238).
    • Ensure consistent order of the spectratype dataframe (#238).

    Minor changes

    • Fix missing bibtex_bibfiles option in sphinx configuration
    • Work around https://github.com/takluyver/flit/issues/383.
  • v0.6.0(Dec 10, 2020)

    Backwards-incompatible changes:

    • Set more sensible defaults for the cutoff parameter in ir_neighbors. The default is now 2 for the hamming and levenshtein distance metrics and 10 for the alignment distance metric.

    Additions:

    • Add Hamming-distance as additional distance metric for ir_neighbors (#216 by @ktpolanski)

    Minor changes:

    • Fix MacOS CI (#221)
    • Use mamba instead of conda in CI (#216)
  • v0.5.0(Oct 20, 2020)

    Add support for BCRs and gamma-delta TCRs

    Backwards-incompatible changes:

    • The data structure has changed. Columns have been renamed from TRA_xxx and TRB_xxx to IR_VJ_xxx and IR_VDJ_xxx. Additionally, a locus column has been added for each chain.
    • All occurrences of tcr in function and class names have been replaced with ir. Aliases for the old names have been created and emit a FutureWarning.

    Additions:

    • There's now a mixed TCR/BCR example dataset (maynard2020) available (#211)
    • BCR-related amendments to the documentation (#206)
    • tl.chain_qc, which supersedes chain_pairing. It additionally provides information about the receptor type (see the sketch after this list).
    • io.read_tracer now supports gamma-delta T-cells (#207)
    • io.to_ir_objs allows converting adata to a list of IrCells (#210)
    • io.read_bracer allows reading BraCeR BCR data (#208)
    • The pp.merge_with_ir function can now handle the case where both the left and the right AnnData object contain immune receptor information. This is useful when integrating both TCR and BCR data into the same dataset. (#210)
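
    A minimal sketch of the new QC function (assumes adata contains IR data):

        import scirpy as ir

        ir.tl.chain_qc(adata)  # adds receptor_type and chain_pairing annotations to adata.obs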

    Fixes:

    • Fix a bug in vdj_usage which has been triggered by the new data structure (#203)

    Minor changes:

    • Removed the tqdm monkey patch, as the issue has been resolved upstream (#200)
    • Add AIRR badge, as scirpy is now certified to comply with the AIRR software standard v1. (#202)
    • Require pycairo >1.20, which provides a Windows wheel, eliminating the CI problems.
  • d0.1.0(Oct 20, 2020)

  • v0.4.2(Oct 1, 2020)

  • v0.4.1(Sep 30, 2020)

    • Fix pythonpublish CI action
    • Update black version (and code style, accordingly)
    • Changes for AIRR-complicance:
      • Add support level to README
      • Add Biocontainer instructions to README
      • Add a minimal test suite to be ran on conda CI
  • v0.4(Aug 26, 2020)

    • Adapt tcr_dist to support second array of sequences (#166). This enables comparing CDR3 sequences against a list of reference sequences.
    • Add tl.clonotype_convergence which helps to find evidence of convergent evolution (#168)
    • Optimize parallel sequence distance calculation (#171). There is now less communication overhead with the worker processes.
    • Fixed an error when running pp.tcr_neighbors (#177)
    • Improve packaging. Use setuptools_scm instead of get_version. Remove redundant metadata. (#180). More tests for conda (#180).
  • v0.3(Jun 5, 2020)

    • More extensive CI tests (now also testing on Windows, MacOS and testing the conda recipe) (#136, #138)
    • Add example images to API documentation (#140)
    • Refactor IO to expose TcrCell and TcrChain (#139)
    • Create data loading tutorial (#139)
    • Add a progressbar to TCR neighbors (#143)
    • Move clonotype_network_igraph to tools (#144)
    • Add read_airr to support the AIRR rearrangement format (#147)
    • Add option to take v-gene into account during clonotype definition (#148)
    • Store colors in AnnData to ensure consistent coloring across plots (#151)
    • Divide define_clonotypes into define_clonotypes and define_clonotype_clusters (#152). Now, the user has to explicitly specify sequence and metric for tl.tcr_neighbors, tl.define_clonotype_clusters and tl.clonotype_network. This makes it more straightforward to have multiple, different versions of the clonotype network at the same time. The default parameters changed to sequence="nt" and metric="identity" to comply with the traditional definition of clonotypes. The changes are also reflected in the glossary and the tutorial.
    • Update the workflow figure (#154)
    • Fix a bug that caused labels in the repertoire_overlap heatmap to be mixed up. (#157)
    • Add a label to the heatmap annotation in repertoire_overlap (#158).
  • v0.2(May 22, 2020)

    • Documentation overhaul. A lot of docstrings got corrected and improved and the formatting of the documentation now matches scanpy's.
    • Experimental function to assess bias in clonotype abundance between conditions (#92)
    • Scirpy now has a logo (#123)
    • Update default parameters for clonotype_network:
      • Edges are now only automatically displayed if plotting < 1000 nodes
      • If plotting variables with many categories, the legend is hidden.
    • Update default parameters for alignment-based tcr_neighbors
      • The gap extend penalty now equals the gap open penalty (11).
  • v0.1.2(Apr 15, 2020)

    • Make 10x csv and json import consistent (#109)
    • Fix version requirements (#112)
    • Fix compatibility issues with pandas > 1 (#112)
    • Updates to tutorial and README
  • v0.1.1(Apr 10, 2020)

Owner
ICBI
Institute of Bioinformatics @ Medical University of Innsbruck