
Overview

Bulk2Space

Spatially resolved single-cell deconvolution of bulk transcriptomes using Bulk2Space


Bulk2Space is a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles.


Installation

Bulk2Space requires Python 3.8 or later. If you have Python 3.6 or 3.7 installed, consider installing Anaconda and then creating a new environment:

conda create -n bulk2space python=3.8.5
conda activate bulk2space

cd bulk2space
pip install -r requirements.txt 
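
After installing the requirements, you can optionally run a quick sanity check before trying the demos. The snippet below is only a sketch and assumes a few dependencies referenced elsewhere on this page (PyTorch for the VAE .pth models, pandas for the CSV inputs, and the deep-forest package that provides CascadeForestClassifier, mentioned in one of the issues below); the authoritative list is requirements.txt.

import sys

assert sys.version_info >= (3, 8), "Bulk2Space expects Python 3.8 or later"

try:
    import torch    # used for the beta-VAE models (.pth files)
    import pandas   # used for reading the CSV inputs
    from deepforest import CascadeForestClassifier  # provided by the deep-forest package
except ImportError as err:
    raise SystemExit(f"Missing dependency; re-run 'pip install -r requirements.txt' ({err})")

print("Environment looks OK:", sys.version.split()[0])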

Usage

Run the demo data

If you choose spatial barcoding-based data (such as 10x Genomics or ST) as the spatial reference, run the following command:

python bulk2space.py --project_name test1 --data_path example_data/demo1 --input_sc_meta_path demo1_sc_meta.csv --input_sc_data_path demo1_sc_data.csv --input_bulk_path demo1_bulk.csv --input_st_data_path demo1_st_data.csv --input_st_meta_path demo1_st_meta.csv --BetaVAE_H --epoch 10 --spot_data True

Otherwise, if you choose image-based in situ hybridization data (such as MERFISH, seqFISH, or STARmap) as the spatial reference, run the following command:

python bulk2space.py --project_name test2 --data_path example_data/demo2 --input_sc_meta_path demo2_sc_meta.csv --input_sc_data_path demo2_sc_data.csv --input_bulk_path demo2_bulk.csv --input_st_data_path demo2_st_data.csv --input_st_meta_path demo2_st_meta.csv --BetaVAE_H --epoch 10 --spot_data False

Run your own data

When using your own data, make sure the following requirements are met (a minimal pandas sketch for preparing these files follows the list):

  • the bulk.csv file must contain one column of gene expression, indexed by gene name

             Sample
    Gene1    5.22
    Gene2    3.67
    ...      ...
    GeneN    15.76
  • the sc_meta.csv file must contain two columns, cell name and cell type (the unnamed first column is the row index). Make sure the column names are correct, i.e., Cell and Cell_type

              Cell      Cell_type
    Cell_1    Cell_1    T cell
    Cell_2    Cell_2    B cell
    ...       ...       ...
    Cell_n    Cell_n    Monocyte
  • the st_meta.csv file must contain at least two columns of spatial coordinates (the unnamed first column is the cell or spot name). Make sure the column names are correct, i.e., xcoord and ycoord

                         xcoord    ycoord
    Cell_1 / Spot_1      1.2       5.2
    Cell_2 / Spot_2      5.4       4.3
    ...                  ...       ...
    Cell_n / Spot_n      11.3      6.3
  • the sc_data.csv and st_data.csv files are gene expression matrices
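
As referenced above, here is a minimal pandas sketch of the expected CSV layouts; the file names and values are placeholders, and the resulting paths are what you pass to the --input_*_path arguments shown in the demo commands.

import pandas as pd

# bulk.csv: one expression column, indexed by gene name
bulk = pd.DataFrame({"Sample": [5.22, 3.67, 15.76]},
                    index=["Gene1", "Gene2", "GeneN"])
bulk.to_csv("bulk.csv")

# sc_meta.csv: the columns must be named exactly Cell and Cell_type
sc_meta = pd.DataFrame({"Cell": ["Cell_1", "Cell_2"],
                        "Cell_type": ["T cell", "B cell"]},
                       index=["Cell_1", "Cell_2"])
sc_meta.to_csv("sc_meta.csv")

# st_meta.csv: the columns must be named exactly xcoord and ycoord
st_meta = pd.DataFrame({"xcoord": [1.2, 5.4],
                        "ycoord": [5.2, 4.3]},
                       index=["Spot_1", "Spot_2"])
st_meta.to_csv("st_meta.csv")

# Quick checks before running bulk2space.py
assert {"Cell", "Cell_type"} <= set(sc_meta.columns)
assert {"xcoord", "ycoord"} <= set(st_meta.columns)
assert bulk.shape[1] == 1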

Then you will get your results in the output_data folder.

For more details, see the user guide in the documentation.

About

The Bulk2Space manuscript is under major revision. Should you have any questions, please contact Jie Liao at [email protected], Jingyang Qian at [email protected], or Yin Fang at [email protected].

Comments
  • Data availability

    Hey team, thanks for coming up with this useful tool. I'm looking to follow your tutorial on hypothalamus deconvolution, and it seems the lcm.gz data file on your GitHub only contains a single file, without all the processed count matrices and the cell metadata table. Is that supposed to be the case? If so, I wonder how I should process this single file to generate the input data I need. Thanks for any heads-up!

    opened by loganminhdang 6
  • Cannot locate the bulk2space.py script and directory after installation

    Hi, I'm writing to seek your assistance with an issue I'm having. After installing the conda environment, I cannot locate the bulk2space directory, which should contain the bulk2space Python script used to run the algorithm. The installation also seems incomplete: after manually retrieving the Python script from your GitHub page, I received the following error message:

    Traceback (most recent call last):
      File "bulk2space.py", line 2, in <module>
        from utils.tool import *
    ModuleNotFoundError: No module named 'utils'

    I would appreciate any guidance. Thanks!

    opened by loganminhdang 5
  • Preprocessed PDAC data

    Hello,

    I am trying to understand how to use bulk2space by going through the tutorials. I am currently going through the first tutorial with the PDAC datasets. I would like to know how you generated the preprocessed files "st_data" and "st_meta".

    I went to the original data from Moncada et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672), but I don't know which files you used from there to make the above preprocessed files. Could you clarify that and explain in a bit more detail how you generated "st_data" and "st_meta"? This would be helpful for understanding how to process other reference datasets.

    opened by AlexUOM 4
  • the question of "quick start" section

    Dear professors, we are very sorry to bother you. We recently downloaded bulk2space and used the test data of demo1, but we don't know why there is no result output, and we don't know whether the data were written normally. After running, the Bulk2space-1.0.0-Py3.8 egg appeared empty. Some information is as follows. We wonder if you could help check it in your busy schedule, or if there is any other step-by-step guidance.

    opened by coconutll 2
  • Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Hi, thanks for coming up with this useful tool. I have bulk RNA-seq data and scRNA-seq data from the same patient, both generated by our lab. I want to convert the bulk transcriptomes into spatially resolved single-cell expression profiles. Here are my questions: 1. Why do I only obtain three cell types (tumor cells, macrophages, and neutrophils) from the bulk data, even though there are many other cell types, such as fibroblasts, T cells, and B cells, in my scRNA reference? 2. How should I normalize my bulk data?

    Thanks, Qi.

    opened by zhangqi234 2
  • convert bulk transcriptomes into spatially resolved single-cell expression profile

    Hi, I'm new to bulk2space, and I only have bulk RNA-seq data from mouse brain generated by our lab. I want to convert the bulk transcriptomes into spatially resolved single-cell expression profiles. I know how to convert bulk RNA-seq data into single-cell data. Here are my questions:

    1. How can I get spatial information from my bulk RNA-seq data? Do I have to do additional experiments, e.g., with laser capture microdissection (LCM) technology?
    2. Since my bulk RNA-seq data are from brain tissue, which contains many layers of cells, how do you distinguish between different layers of cells? Or do I have to do bulk RNA-seq on single layers?

    Thanks, Echo.

    opened by Echoloria 2
  • Cannot import CascadeForestClassifier from deepforest

    I am running the bulk2space.py script via Python 3.8.5. The deepforest package is installed and imports successfully, but I am still receiving the following error message:

    ImportError: cannot import name 'CascadeForestClassifier' from 'deepforest'

    I would appreciate any help you could offer.

    opened by sarah-chapin 2
  • Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Hi,

    Thank you very much for putting together this code.

    I would like to better understand when Bulk2Space might help versus when there are limits to the applicability of Bulk2Space, following a journal club presentation where I learned more about the paper and method.

    I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

    Example 1 (Exact Code for Concrete Test):

    In the spirit of a GitHub "issue," I tried to start with concrete examples for discussion based upon issue #8.

    I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

    However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example that otherwise still used only files derived from the same samples used for the PDAC example. I also have some more detailed notes in the uploaded PDF.

    Nevertheless, if that might possibly help the discussion, I have provided those.

    If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

    Example 2 (Theoretical Question):

    Is it possible to run bulk2space as described below?

    1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

    2) Export model from Patient A.

    3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

    • Additionally, if I understand correctly, then I think an image of the tissue for Patient B cannot be provided. If so, I think the shape of the tissue section for Patient B can't be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable for interpreting Patient B. However, if I might be misunderstanding anything, then please let me know.

    Example 3 (Summary Questions):

    Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like a projection of the estimations onto the histology image for slide 2 was not (or could not be?) provided.

    Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

    For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

    Similarly, if your training sample had 90% tumor, then I would expect limitations in looking at the projection from a spatial transcriptomics project where the tissue had a very different tumor percentage, such as closer to 20% tumor. I would also expect there could often be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

    I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

    However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

    Thank you very much for your help!

    Sincerely, Charles

    Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

    SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

    PDAC_Test.pdf

    opened by cwarden45 0
  • Confused about the train/test steps

    Dear Professors,

    Thanks for coming up with this great tool. However, I'm confused about how to use it from the tutorial. In the PDAC deconvolution tutorial, only the train_vae function is used; however, in the demo1 tutorial, for example, an additional load_vae_and_generate function is used to load the .pth VAE model produced by train_vae.

    So here is my question, if I only focus on the first step of transforming bulk RNA into single-cell RNA (i.e., without considering the further scRNA-to-spatial step):

    If I have, e.g., two bulk RNA-seq samples from 2-month-old and 7-month-old mouse lung cancer tissue, say bulkA and bulkB, and I also have one single-cell RNA reference, say scRNAref. When I deconvolute bulkA using scRNAref into a new, bulk2space-generated scRNA dataset (name it "generated-scRNA from bulkA"), I will get a .pth VAE model (name it "A.pth"). Next, when I'd like to deconvolute bulkB, which step should I use? Should I 1) use the load_vae_and_generate function with the previous A.pth model, or 2) use the train_vae function, which will generate a new B.pth model?

    I believe this is crucial because it directly guides us in how to use this tool. In CIBERSORT, we provide only two variables, the bulk RNA-seq data and the reference immune cell expression profile. The reference would not change most of the time, so we just feed CIBERSORT many bulk RNA-seq datasets and it returns many generated immune cell expression dataframes. Simple and easy. But in Bulk2space, we get a new .pth model every time if we follow step 2, and to be honest, I don't know what this .pth model is used for if we don't follow step 1 and use it to load and generate a new scRNA dataset.

    Besides the issues above, if we use step 1), there will also be problems. What if bulkA and bulkB are from tissues in different states, as in the example above? I see that in the article you mention that "the state of each cell type still fluctuates within a relatively stable high-dimensional space". But if bulkA was from a pre-cancerous tissue and bulkB was from a cancerous tissue, would bulk2space still work fine? This is important because if we'd like to deconvolute bulk RNA-seq from a longitudinal dataset, for example, a series of bulk RNA-seq data from 10 timepoints along cancer progression that contains normal, pre-cancerous, turning-stage and finally cancerous tissue, or a series of bulk RNA-seq data from different developmental stages of the liver, what is the correct way of using bulk2space if I want a single-cell RNA dataset from bulk RNA? Would bulk2space still work under this scenario?

    Also, does bulk2space require that the scRNA reference and the bulk RNA come from tissues in a similar state? For example, can bulk2space deconvolute bulk RNA derived from cancerous lung using a reference scRNA derived from normal lung?

    Actually, I've tried using step 1 (i.e., the same model) to deal with my longitudinal dataset, but the results seemed nearly identical with respect to the distribution of cell types that bulk2space returned (which should show some differences, at least in immune cell types, since I'm deconvoluting bulk RNA from normal and cancer tissues using the same scRNA reference). Also, another key issue is that I don't know whether the generated sc_cell_type and sc_data dataframes can be treated as a standard Seurat object to which we can apply the standard analysis pipeline (filtering nFeature and nCount, scaling, centering, PCA, UMAP, or newly assigning cell types according to the FindMarkers function, etc. Actually, I've tried this, but PCA, tSNE, and UMAP can't separate the cell types well), and whether different scRNA datasets generated by bulk2space can be integrated into a single Seurat object like other normal single-cell data.

    Thank you so much; it would be of great help if the experts in your team who developed this nice tool could answer the issues above.

    opened by Bennylikescoding 1
  • β-VAE algorithm in the paper

    Hello, author. In Figure 1b of your paper, I don't understand why β-VAE can estimate the proportion of cells of each cell type. I have studied this algorithm carefully, and its input and output should correspond, so I don't understand why the input cell type is changed into the output of a single cell. Could you please explain, or tell me what the input data of this step is?

    opened by wxpbioinfo 0
  • Question: Scalability

    Good day,

    I am eager to test this excellent tool on our data. I have seen in the tutorial and demo data that the vignette uses only one bulk RNA sample as well as an ST experiment.

    Is it possible to scale up and process several bulk RNA samples and ST experiments in one go? And for the inferred single-cell data derived from the bulk, can those be integrated across multiple biological replicates, as if they were truly scRNA-seq data?

    Thanks in advance!

    opened by ccruizm 2
  • model.train_df_and_spatial_deconvolution error

    Hi, thanks for coming up with this useful tool. When I ran the model.train_df_and_spatial_deconvolution function to decompose ST data into spatially resolved single-cell transcriptomics data, I got the following error: "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False". I don't know what caused this error.

    opened by zhangqi234 7
Releases (v1.0.0)

Owner
Dr. FAN, Xiaohui
single-cell omics; spatial transcriptomics; TCM network biology