Bulk2Space is a spatial deconvolution method based on deep learning frameworks

Overview

Bulk2Space

Spatially resolved single-cell deconvolution of bulk transcriptomes using Bulk2Space

Bulk2Space is a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles.

Installation

Bulk2Space requires Python 3.8 or later. If you have Python 3.6 or Python 3.7 installed, consider installing Anaconda and creating a new environment:

conda create -n bulk2space python=3.8.5
conda activate bulk2space

cd bulk2space
pip install -r requirements.txt 
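
Before installing the requirements, it can help to confirm that the active interpreter meets the version requirement stated above. The following is a minimal sketch; the helper name `meets_minimum_python` is illustrative and not part of Bulk2Space.

```python
import sys

# Minimum version stated in the installation notes above.
MIN_PYTHON = (3, 8)


def meets_minimum_python(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 8) >= (3, 8).
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum_python():
        print(f"Python {current} is new enough for bulk2space")
    else:
        print(f"Python {current} is too old; create the conda environment above")
```

Run it inside the activated `bulk2space` environment to check that `conda activate` picked up the intended interpreter.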

Usage

Run the demo data

If you use spatial barcoding-based data (such as 10x Genomics or ST) as the spatial reference, run the following command:

python bulk2space.py --project_name test1 --data_path example_data/demo1 --input_sc_meta_path demo1_sc_meta.csv --input_sc_data_path demo1_sc_data.csv --input_bulk_path demo1_bulk.csv --input_st_data_path demo1_st_data.csv --input_st_meta_path demo1_st_meta.csv --BetaVAE_H --epoch 10 --spot_data True

Otherwise, if you use image-based in situ hybridization data (such as MERFISH, SeqFISH, or STARmap) as the spatial reference, run the following command:

python bulk2space.py --project_name test2 --data_path example_data/demo2 --input_sc_meta_path demo2_sc_meta.csv --input_sc_data_path demo2_sc_data.csv --input_bulk_path demo2_bulk.csv --input_st_data_path demo2_st_data.csv --input_st_meta_path demo2_st_meta.csv --BetaVAE_H --epoch 10 --spot_data False
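
The two demo commands differ only in the project name, the demo folder and file prefix, and the value of `--spot_data`. As a sketch of that pattern, the hypothetical helper below (`build_demo_command` is not part of Bulk2Space) assembles the same argument list using only the flags shown above.

```python
import shlex


def build_demo_command(project_name, demo_name, spot_data):
    """Assemble the bulk2space.py argument list used in the demo commands.

    `demo_name` is the folder and file prefix (for example "demo1"). This
    helper is hypothetical and only mirrors the flags shown in this README.
    """
    return [
        "python", "bulk2space.py",
        "--project_name", project_name,
        "--data_path", f"example_data/{demo_name}",
        "--input_sc_meta_path", f"{demo_name}_sc_meta.csv",
        "--input_sc_data_path", f"{demo_name}_sc_data.csv",
        "--input_bulk_path", f"{demo_name}_bulk.csv",
        "--input_st_data_path", f"{demo_name}_st_data.csv",
        "--input_st_meta_path", f"{demo_name}_st_meta.csv",
        "--BetaVAE_H",
        "--epoch", "10",
        # True for spot-based references (10x Genomics, ST),
        # False for image-based references (MERFISH, SeqFISH, STARmap).
        "--spot_data", str(bool(spot_data)),
    ]


if __name__ == "__main__":
    # Print both demo commands as shell-ready strings.
    print(shlex.join(build_demo_command("test1", "demo1", spot_data=True)))
    print(shlex.join(build_demo_command("test2", "demo2", spot_data=False)))
```

To execute a command rather than print it, the returned list can be passed to `subprocess.run(...)` from the repository root.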

Run your own data

When using your own data, check the following:

  • the bulk.csv file must contain one column of gene expression

                 Sample
    Gene1        5.22
    Gene2        3.67
    ...          ...
    GeneN        15.76
  • the sc_meta.csv file must contain two columns, cell name and cell type. Make sure the column names are correct, i.e., Cell and Cell_type

                 Cell         Cell_type
    Cell_1       Cell_1       T cell
    Cell_2       Cell_2       B cell
    ...          ...          ...
    Cell_n       Cell_n       Monocyte
  • the st_meta.csv file must contain at least two columns of spatial coordinates. Make sure the column names are correct, i.e., xcoord and ycoord

                         xcoord       ycoord
    Cell_1 / Spot_1      1.2          5.2
    Cell_2 / Spot_2      5.4          4.3
    ...                  ...          ...
    Cell_n / Spot_n      11.3         6.3
  • the sc_data.csv and st_data.csv files are gene expression matrices
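
The column-name requirements above can be checked before a run. The sketch below reads only the header row of each metadata file; it assumes comma-separated files, and the helper `missing_columns` is illustrative rather than part of Bulk2Space.

```python
import csv
import io

# Column names required by this README for each metadata file.
REQUIRED_COLUMNS = {
    "sc_meta": {"Cell", "Cell_type"},
    "st_meta": {"xcoord", "ycoord"},
}


def missing_columns(csv_file, kind):
    """Return the required column names absent from the header of `csv_file`.

    `csv_file` is any open text file object; `kind` is "sc_meta" or "st_meta".
    Assumes a comma-separated file whose first row is the header.
    """
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS[kind] - present)


if __name__ == "__main__":
    # Small in-memory stand-ins for real files, using values from the examples above.
    sc_meta = io.StringIO(",Cell,Cell_type\nCell_1,Cell_1,T cell\n")
    st_meta = io.StringIO(",x,y\nSpot_1,1.2,5.2\n")
    print(missing_columns(sc_meta, "sc_meta"))  # header is complete
    print(missing_columns(st_meta, "st_meta"))  # coordinate columns are misnamed
```

For real files, pass `open("your_sc_meta.csv", newline="")` in place of the `io.StringIO` objects; an empty result means the header has the required names.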

Then you will get your results in the output_data folder.

For more details, see the user guide in the documentation.

About

The Bulk2Space manuscript is under major revision. Should you have any questions, please contact Jie Liao at [email protected], Jingyang Qian at [email protected], or Yin Fang at [email protected].

Comments
  • Data availability

    Hey team, thanks for coming up with this useful tool. I'm looking to follow your tutorial on hypothalamus deconvolution, and it seems the lcm.gz data file on your GitHub only contains a single file, without the processed count matrices and cell metadata table. Is that supposed to be the case? If so, I wonder how I should process this single file to generate the input data I need. Thanks for any heads up!

    opened by loganminhdang 6
  • Cannot locate the bulk2space.py script and directory after installation

    Hi, I'm writing to seek your assistance with an issue I'm having. After installing the conda environment, I cannot locate the bulk2space directory, which should contain the bulk2space Python script to run the algorithm. The installation also seems incomplete: after I manually retrieved the Python script from your GitHub page, I received the following error message:

        Traceback (most recent call last):
          File "bulk2space.py", line 2, in <module>
            from utils.tool import *
        ModuleNotFoundError: No module named 'utils'

    I would appreciate any guidance. Thanks!

    opened by loganminhdang 5
  • Preprocessed PDAC data

    Hello,

    I am trying to understand how to use bulk2space by going through the tutorials. I am currently going through the first tutorial with the PDAC datasets. I would like to know how you generated the preprocessed files "st_data" and "st_meta".

    I went to the original data from Moncada et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672) but I don't know which files you used from there to make the above preprocessed files. Could you clarify that and explain a bit more in detail how you generated "st_data" and "st_meta"? This will be helpful to understand how to process other reference datasets.

    opened by AlexUOM 4
  • the question of "quick start" section

    Dear professors, we are very sorry to bother you. We recently downloaded bulk2space and ran the demo1 test data, but no results were output, and we don't know whether the data were written correctly. After running, the Bulk2space-1.0.0-Py3.8egg appeared empty. Some information is shown below. Could you help check it when you have time, or point us to any other step-by-step guidance?

    opened by coconutll 2
  • Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Hi, thanks for coming up with this useful tool. I have bulk RNA-seq data and scRNA-seq data from the same patient, both generated by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profiles. Here are my questions:

    1. Why do I only obtain three cell types (tumor cells, macrophages, and neutrophils) from the bulk data, when my scRNA-seq reference contains many other cell types such as fibroblasts, T cells, and B cells?
    2. How should I normalize my bulk data?

    Thanks, Qi.

    opened by zhangqi234 2
  • convert bulk transcriptomes into spatially resolved single-cell expression profile

    Hi, I'm new to bulk2space, and I only have bulk RNAseq data from mouse brain which was made by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profile. I know how to convert bulk RNAseq data into single cell data. Here are my questions:

    1. How do I get the spatial information for my bulk RNAseq data? Do I have to run spatial experiments using laser capture microdissection (LCM) technology?
    2. Since my bulk RNAseq data are from brain tissue, the tissue contains many layers of cells. How do you distinguish between different layers of cells? Or do I have to do bulk RNAseq on single layers?

    Thanks, Echo.

    opened by Echoloria 2
  • Cannot import CascadeForestClassifier from deepforest

    I am running the bulk2space.py script via Python 3.8.5. The deepforest package is installed and imports successfully, but I am still receiving the following error message:

    ImportError: cannot import name 'CascadeForestClassifier' from 'deepforest'

    I would appreciate any help you could offer.

    opened by sarah-chapin 2
  • Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Hi,

    Thank you very much for putting together this code.

    I would like to better understand when Bulk2Space might help versus when there are limits to applicability to Bulk2Space, following a journal club presentation where I learned more about the paper and method.

    I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

    Example 1 (Exact Code for Concrete Test):

    In the spirit of a GitHub “issue,” I tried to start with concrete examples for discussion based upon issue #8 .

    I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

    However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example, which otherwise still used only files derived from the same samples used for the PDAC example. I also have some more detailed notes in the uploaded PDF.

    Nevertheless, if that might possibly help the discussion, I have provided those.

    If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

    Example 2 (Theoretical Question):

    Is it possible to run Bulk2Space as described below?

    1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

    2) Export model from Patient A.

    3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

    • Additionally, if I understand correctly, then I think an image of the tissue for Patient B cannot be provided. If so, I think the shape of the tissue section for Patient B can't be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable to interpreting Patient B. However, if I am misunderstanding anything, then please let me know.

    Example 3 (Summary Questions):

    Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like a projection of the estimations onto the histology image for slide 2 was not (or could not be?) provided.

    Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

    For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

    Similarly, if your training sample had 90% tumor, then I would expect limitations in looking at the projection from a spatial transcriptomics project where the tissues had a very different percent tumor, such as closer to 20% tumor. I would also expect there often could be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

    I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

    However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

    Thank you very much for your help!

    Sincerely, Charles

    Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

    SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

    PDAC_Test.pdf

    opened by cwarden45 0
  • Confused about the train/test steps

    Dear Professors,

    Thanks for coming up with this great tool. However, after following the tutorials I'm confused about how to use it. The PDAC deconvolution tutorial only uses the train_vae function, whereas the demo1 tutorial, for example, additionally uses the load_vae_and_generate function with the .pth VAE model produced by the train_vae function.

    So here comes to my question, if I only focus on the first step to transform bulkRNA to single-cell RNA (i.e., no consideration of further scRNA to spatial RNA):

    If I have e.g., two bulkRNAseq from 2-month-old and 7-month-old mice lung cancer tissue, say bulkA and bulkB. I also have one single-cell RNA reference, say scRNAref. When I deconvolute bulkA using scRNAref to a new, bulk2space-generated scRNA data (name it "generated-scRNA from bulkA"), I will get a .pth vae model (name it "A.pth"). Next, when I'd like to deconvolute bulkB, which step should I use? Should I 1) use "load_vae_and_generate" function that use the previous A.pth model, or 2) use "train_vae" function that will generate a new B.pth model?

    I believe this is crucial because it directly determines how we use this tool. In CIBERSORT, we provide only two variables, the bulkRNAseq and the reference immune cell expression profile. The reference does not change most of the time, so we just feed CIBERSORT many bulkRNAseq datasets and it returns many generated immune cell expression dataframes. Simple and easy. But in Bulk2space, we get a new .pth model every time if we follow step 2, and to be honest, I don't know what this .pth model is for if we don't follow step 1 and use it to load and generate a new scRNA dataset.

    Besides issues above, if we use step 1), there'll also be problems. What if bulkA and bulkB are from different status of tissues as the example above? I see that in the article, you mentioned that "the state of each cell type still fluctuates within a relatively stable high-dimensional space". But if bulkA was from a pre-cancerous tissue, and bulkB was from a cancerous tissue, would bulk2space still work fine? This is important because if we'd like to deconvolute bulkRNAseq from longitudinal dataset, for example, a series of bulkRNAseq data from 10 timepoints along cancer progression that contains normal, pre-cancerous, turning stage and finally cancerous tissue, or a series of bulkRNAseq data from different development stages of liver, what is the correct way of using bulk2space if I want single-cell RNA dataset from bulkRNA? Would bulk2space still work under this scenario?

    Also, does bulk2space require that the scRNA ref and bulkRNA are from tissue in a similar state? For example, can bulk2space deconvolute bulkRNA derived from cancerous lung using reference scRNA derived from normal lung?

    Actually, I've tried to use step 1 (i.e., the same model) on my longitudinal dataset, but the cell type distributions that bulk2space returned were nearly identical (there should be some difference, at least in immune cell types, since I'm deconvoluting bulkRNA from normal and cancer tissues using the same scRNA ref). Another key issue is that I don't know whether the generated sc_cell_type and sc_data dataframes can be treated as a standard Seurat object for a standard analysis pipeline (such as filtering on nfeature and nCount, scaling, centering, PCA, UMAP, or reassigning cell types with the FindMarkers function). I've actually tried this, but PCA, tSNE, and UMAP could not separate the cell types well. I also don't know whether different scRNA datasets generated by bulk2space can be integrated into a single Seurat object, as normal single-cell data can.

    Thank you so much. It would be a great help if the experts on your team who developed this nice tool could answer the questions above.

    opened by Bennylikescoding 1
  • β-VAE algorithm in the paper

    Hello, author. In Figure 1b of your paper, I don't understand how β-VAE can analyze the proportion of cells of each cell type. I have studied this algorithm carefully, and its input and output should correspond, so I don't understand why the input is a cell type while the output is single cells. Could you please explain, or clarify what the input data of this step is?

    opened by wxpbioinfo 0
  • Question: Scalability

    Good day,

    I am eager to test this excellent tool on our data. I have seen in the tutorial and demo data that the vignette uses only one bulk RNA sample as well as an ST experiment.

    Is it possible to scale up and process several bulk RNA samples and ST experiments in one go? And for the inferred single-cell data derived from the bulk, can we integrate those across multiple biological replicates, as if they were true scRNA-seq data?

    Thanks in advance!

    opened by ccruizm 2
  • model.train_df_and_spatial_deconvolution error

    Hi, thanks for coming up with this useful tool. When I ran the model.train_df_and_spatial_deconvolution function to decompose ST data into spatially resolved single-cell transcriptomics data, I got the following error: "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False". I don't know what caused this error.

    opened by zhangqi234 7
Releases(v1.0.0)
Owner
Dr. FAN, Xiaohui
single-cell omics; spatial transcriptomics; TCM network biology