scRNA-seq preprocessing pipeline
This pipeline is developed for scRNA-seq data pre-processing of IReNA. This pipline finally generates a seurat object(refined_seurat) which is input of IReNA. Test data used in IReNA can be download from https://github.com/jiang-junyao/IReNA-test-data
Inputs of this pipline
(1) seurat object
(2) Raw counts
For users who only have raw counts, IReNA provides the function ‘load_counts’ to load raw counts of scRNA-seq data, and return seurat object. If the data is 10X format, set the parameter ‘datatype = 0’. If the data is normal counts format (‘txt’ as the suffix of the file name), set the parameter ‘dayatype =1’.
### load 10X counts
seurat_object <- load_counts('10X_data/sample1/', datatype = 0)
### load normal counts
seurat_object <- load_counts('test_data.txt',datatype = 1)
(3) Motif information of transcription factors
IReNA contains DNA motif datasets for four species (Homo sapiens, Mus musculus, Zebrafish and Chicken) derived from TRANSFAC version 201803. Following codes are used to get the motif dataset from TRANSFAC or user-defined motif dataset which should have the same format as these from TRANSFAC database.
library(IReNA)
###call Mus musculus motif database
<- Tranfac201803_Mm_MotifTFsF
motif1 ###call Homo sapiens motif database
<- Tranfac201803_Hs_MotifTFsF
motif1 ###call Zebrafish motif database
<- Tranfac201803_Zf_MotifTFsF
motif1 ###call Chicken motif database
<- Tranfac201803_Ch_MotifTFsF motif1
Step 1: Calculate the pseudotime
In this step, IReNA integrate R package monocle2 to calculate the pseudotime of each cell, and then add the pseudotime to the metadata of seurat object. Alternatively, you can use slingshot, monocle3, TSCAN or other related methods to infer pseudotime, and add it to the metadata of seurat object by your own codes (function ‘add_pseudotime’ only support monocle object from monocle2).
Parameter ‘gene.use’ in ‘get_pseudotime’ function indicate the genes use to calculate pseudotime, if it’s null, this function will automatically use variable genes calculated by ‘FindVariableFeatures’ function in Seurat package to calculate pseudotime.
### Read seurat_object
library(IReNA)
<- readRDS('seurat_object.rds')
seurat_object ### Calculate the pseudotime and return monocle object
<- get_pseudotime(seurat_object,gene.use = rownames(seurat_object))
monocle_object ###Add pseudotime to the Seurat object
### This function only support monocle object from monocle2
<- add_pseudotime(seurat_object, monocle_object) seurat_with_time
Step 2: Identify DEGs and expressed transcription factors(TFs)
In this step, firstly we identify DEGs across the pseudotime and exprssed TFs (expressed in more than 5% of cells). Then, using above genes to refine the seurat object. We recommend ‘differentialGeneTest’ function in monocle2 to calculate DEGs across pseudotime. DEGs and expressed TFs will be used to make expression profile in further analysis.
### Identify DEGs across pseudotime (qvalue < 0.05 and num_cells_expressed > 0.1)
library(monocle)
<- detectGenes(monocle_object, min_expr = 1)
monocle_object <- estimateDispersions(monocle_object)
monocle_object <- monocle::differentialGeneTest(monocle_object,fullModelFormulaStr = "~Pseudotime",relative_expr = TRUE)
diff1 <- subset(diff1, qval < 0.05)
sig_genes <- subset(sig_genes, num_cells_expressed > 0.1)
sig_genes ### select Candidate TFs.
<- c()
Candidate_TFs for (i in 1:nrow(motif1)) {
<- strsplit(motif1[i,5],';')[[1]]
gene1 <- c(Candidate_TFs,gene1)
Candidate_TFs
}### Identify expressed TFs
### Canidate TFs in our motif database are ensemble ID, If your gene names in
### seurat object are Symbol ID, you need to transfer
### Canidate TFs to Symbol ID first.
<- rownames(extract_expressed_TFs(seurat_object,Candidate_TFs))
expressed_tf <- expressed_tf[!expressed_tf%in%rownames(sig_genes)]
expressed_tf ### Refine the seurat object
<- subset(seurat_with_time, features = c(expressed_tf,rownames(sig_genes))) refined_seurat
if you already have identified DEGs, you just need to run subset function in seurat:
### DEGs used here is the character class
<- subset(seurat_with_time, features = DEGs) seurat_with_time
Hint : In most situations, i highly recommend using DEGs and expressed TFs to refine the seurat object, then reconstruct the regulatory networks. However, if your project has some special purpose, you could skip step2, and use all genes to reconstruct the regulatory networks.