scRNA-seq preprocessing pipeline

This pipeline is developed for scRNA-seq data pre-processing of IReNA. This pipline finally generates a seurat object(refined_seurat) which is input of IReNA. Test data used in IReNA can be download from https://github.com/jiang-junyao/IReNA-test-data

Inputs of this pipline

(2) Raw counts

For users who only have raw counts, IReNA provides the function ‘load_counts’ to load raw counts of scRNA-seq data, and return seurat object. If the data is 10X format, set the parameter ‘datatype = 0’. If the data is normal counts format (‘txt’ as the suffix of the file name), set the parameter ‘dayatype =1’.

### load 10X counts
seurat_object <- load_counts('10X_data/sample1/', datatype = 0)
### load normal counts
seurat_object <- load_counts('test_data.txt',datatype = 1)

(3) Motif information of transcription factors

IReNA contains DNA motif datasets for four species (Homo sapiens, Mus musculus, Zebrafish and Chicken) derived from TRANSFAC version 201803. Following codes are used to get the motif dataset from TRANSFAC or user-defined motif dataset which should have the same format as these from TRANSFAC database.

library(IReNA)
###call Mus musculus motif database
motif1 <- Tranfac201803_Mm_MotifTFsF
###call Homo sapiens motif database
motif1 <- Tranfac201803_Hs_MotifTFsF
###call Zebrafish motif database
motif1 <- Tranfac201803_Zf_MotifTFsF
###call Chicken motif database
motif1 <- Tranfac201803_Ch_MotifTFsF

Step 1: Calculate the pseudotime

In this step, IReNA integrate R package monocle2 to calculate the pseudotime of each cell, and then add the pseudotime to the metadata of seurat object. Alternatively, you can use slingshot, monocle3, TSCAN or other related methods to infer pseudotime, and add it to the metadata of seurat object by your own codes (function ‘add_pseudotime’ only support monocle object from monocle2).

Parameter ‘gene.use’ in ‘get_pseudotime’ function indicate the genes use to calculate pseudotime, if it’s null, this function will automatically use variable genes calculated by ‘FindVariableFeatures’ function in Seurat package to calculate pseudotime.

### Read seurat_object
library(IReNA)
seurat_object <- readRDS('seurat_object.rds')
### Calculate the pseudotime and return monocle object
monocle_object <- get_pseudotime(seurat_object,gene.use = rownames(seurat_object))
###Add pseudotime to the Seurat object
### This function only support monocle object from monocle2
seurat_with_time <- add_pseudotime(seurat_object, monocle_object)

Step 2: Identify DEGs and expressed transcription factors(TFs)

In this step, firstly we identify DEGs across the pseudotime and exprssed TFs (expressed in more than 5% of cells). Then, using above genes to refine the seurat object. We recommend ‘differentialGeneTest’ function in monocle2 to calculate DEGs across pseudotime. DEGs and expressed TFs will be used to make expression profile in further analysis.

### Identify DEGs across pseudotime (qvalue < 0.05 and num_cells_expressed > 0.1)
library(monocle)
monocle_object <- detectGenes(monocle_object, min_expr = 1)
monocle_object <- estimateDispersions(monocle_object)
diff1 <- monocle::differentialGeneTest(monocle_object,fullModelFormulaStr = "~Pseudotime",relative_expr = TRUE)
sig_genes <- subset(diff1, qval < 0.05)
sig_genes <- subset(sig_genes, num_cells_expressed > 0.1)
### select Candidate TFs.
Candidate_TFs <- c()
for (i in 1:nrow(motif1)) {
  gene1 <- strsplit(motif1[i,5],';')[[1]]
  Candidate_TFs <- c(Candidate_TFs,gene1)
}
### Identify expressed TFs
### Canidate TFs in our motif database are ensemble ID, If your gene names in 
### seurat object are Symbol ID, you need to transfer 
### Canidate TFs to Symbol ID first.
expressed_tf <- rownames(extract_expressed_TFs(seurat_object,Candidate_TFs))
expressed_tf <- expressed_tf[!expressed_tf%in%rownames(sig_genes)]
### Refine the seurat object
refined_seurat <- subset(seurat_with_time, features = c(expressed_tf,rownames(sig_genes)))

if you already have identified DEGs, you just need to run subset function in seurat:

### DEGs used here is the character class
seurat_with_time <- subset(seurat_with_time, features = DEGs)

Hint : In most situations, i highly recommend using DEGs and expressed TFs to refine the seurat object, then reconstruct the regulatory networks. However, if your project has some special purpose, you could skip step2, and use all genes to reconstruct the regulatory networks.