Regulatory network analysis through only scRNA-seq data

IReNA (Integrated Regulatory Network Analysis) is an R package to perform regulatory network analysis. When scATAC-seq/ATAC-seq data is not available, IReNA analyzes the binding motifs of target gene in each gene pair to refine potential regulatory relationships inferred through GENIE3. Then, performing network analysis through the hypergeometric test.

To run this pipeline, Computer or server of linux system and fimo software are needed.

1. Test data download

Test data used in the following steps can be downloaded from google drive.

2. Workflow

3. Inputs of IReNA

(1) Seurat object with Pseudotime in the metadata

This seurat object can be generated in scRNA-seq preprocessing pipeline

Hint: seurat object used here should only contain interest genes(e.g. differentially expressed genes and expressed transcription factors)

If bulk RNA-seq data are used to identify basic regulatory relationships, just load raw counts of bulk RNA-seq data, then use the same codes as the scRNA-seq data for further analysis. We suggest to load the expression profile of only differentially expressed genes, otherwise much more time will be used to calculate the correlation of each gene pair. Deseq2 and edgeR can be used to identify differentially expressed genes in bulk RNA-seq data.

(2) GTF file

Gene transfer format, you can download it from http://www.ensembl.org/info/data/ftp/index.html

(3) Motif information of transcription factors

IReNA contains DNA motif datasets for four species (Homo sapiens, Mus musculus, Zebrafish and Chicken) derived from TRANSFAC version 201803. Following codes are used to get the motif dataset from TRANSFAC or user-defined motif dataset which should have the same format as these from TRANSFAC database.

library(IReNA)
###call Mus musculus motif database
motif1 <- Tranfac201803_Mm_MotifTFsF
###call Homo sapiens motif database
motif1 <- Tranfac201803_Hs_MotifTFsF
###call Zebrafish motif database
motif1 <- Tranfac201803_Zf_MotifTFsF
###call Chicken motif database
motif1 <- Tranfac201803_Ch_MotifTFsF

(4) Motif position weight matirx (PWM)

PWM of all species can be download from https://github.com/jiang-junyao/MEMEmotif

4. Tutorial

IReNA contains four main parts to reconstruct regulatory networks:

Part 1: Analyze scRNA-seq or bulk RNA-seq data to get basic regulatory relationships

Part 2: Use Fimo (or alternative option: RcisTarget) to refine regulatory relaionships (without ATAC-seq data)

Part 3: Regulatory network analysis and visualization

Part 1: Analyze scRNA-seq or bulk RNA-seq data to get basic regulatory relationships

IReNA supports three formats of the inputs:

Input data of this part: Seurat object with Pseudotime in the metadata, which can be generated in scRNA-seq preprocessing pipeline

The test data for this part (‘seurat_with_time.rds’) can be downloaded from google drive. The test data are from the study of human spermatogonial stem cell development which is available in https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(17)30370-3.

First,IReNA divide cells into 50 bins across pseudotime. The bin is removed if all genes in this bin have no expression. The gene is filtered if absolute fold change < 0.01 (set by the parameter ‘FC’). Then, genes will be clustered through K-means algorithm (the number of clusters ‘K’ is set by the parameter ‘K1’).

If bulk RNA-seq data are used to identify regulatory relationships, load your expression matrix as expression_profile that generated by get_SmoothByBin_PseudotimeExp(), then run the following codes. I suggest to input expression profile that only contains differentially expressed genes, or you will a huge of time to calculate correlation of each gene pair.

###Load the test data
seurat_with_time <- readRDS('seurat_with_time.rds')
###Get expression profiles ordered by pseudotime
expression_profile <- get_SmoothByBin_PseudotimeExp(seurat_with_time, Bin = 50)
###Filter noise and logFC in expression profile
expression_profile_filter <- filter_expression_profile(expression_profile, FC=0.01)
###K-means clustering
clustering <- clustering_Kmeans(expression_profile_filter, K1=4)
clustering[1:5,1:5]
##        KmeansGroup FoldChangeQ95     SmExp1     SmExp2     SmExp3
## TCEB3            1      2.395806 -0.2424532 -0.8964990 -0.9124960
## CLK1             1      2.508335 -0.1819044  0.7624798  0.4867972
## MATR3            1      2.700294 -1.4485729  0.7837425  0.3028892
## AKAP11           1      2.415084 -0.6120681 -0.3849580  0.3898393
## HSF2             1      2.528111 -0.8125698 -0.6166004  0.8533309

Visualize the clustering of gene expression profiles through the heatmap

plot_kmeans_pheatmap(clustering,ModuleColor1 = c('#67C7C1','#5BA6DA','#FFBF0F','#C067A9'))

Because of the characteristics of Kmeans algorithm, different results will be obtained each time clustering.

Then, Add Ensmble ID of the genes in the first column.

###Add Ensembl ID as the first column of clustering results
Kmeans_clustering_ENS <- add_ENSID(clustering, Spec1='Hs')
Kmeans_clustering_ENS[1:5,1:5]
##                 Symbol KmeansGroup FoldChangeQ95     SmExp1     SmExp2
## ENSG00000011007  TCEB3           1      2.395806 -0.2424532 -0.8964990
## ENSG00000013441   CLK1           1      2.508335 -0.1819044  0.7624798
## ENSG00000015479  MATR3           1      2.700294 -1.4485729  0.7837425
## ENSG00000023516 AKAP11           1      2.415084 -0.6120681 -0.3849580
## ENSG00000025156   HSF2           1      2.528111 -0.8125698 -0.6166004

Then IReNA infers the potential regulation for each gene in the expression profile through GENIE3, one of the top performing method for network inference as listed in Beeline benchmark. Futhermore, based on smoothed expression profile, regulation type of each gene pair is defined by Person’s correlation, correlation > 0 reprsents activation, correlation < 0 reprsents repression.

weightMat <- GENIE3(as.matrix(seurat_with_time@assays$RNA@data),nCores = 50)
weightMat <- getLinkList(weightMat)
regulation <- weightMat[weightMat[,3]>0.0002,]
### add regulation type for each gene pair
regulatory_relationships <- add_regulation_type(Kmeans_clustering_ENS,regulation)
### check whether source genes are transcription factors
motifTF <- c()
for (i in 1:nrow(motif1)) {
  TF <- strsplit(motif1[i,5],';')[[1]]
  motifTF <- c(motifTF,TF)
}
regulatory_relationships <- regulatory_relationships[regulatory_relationships[,1] %in% motifTF,]

However, GENIE3 costs a huge of running time and memory compare to Person’s correlation. So, we also provide functions to infer potential regulation through Person’s correlation.

regulatory_relationships <- get_cor(Kmeans_clustering_ENS, motif = motif1, correlation_filter = 0.6, start_column = 4)

Part 2: Analyze binding motifs of target genes to refine regulatory relaionships (without ATAC-seq data)

For ATAC-seq data is not available. IReNA uses fimo to check whether motifs of transcription factors that regulate the target gene occur in the upstream of target genes. If motifs of transcription factors that regulate the target gene exist, this gene pair will be retained.

First, get TSS (transcription start site) regions (default is upstream 1000 to downstream 500) of target genes. Gtf file used here is available from http://www.ensembl.org/info/data/ftp/index.html

gtf <- read.delim("/public/home/user/gtf/Homo_sapiens.GRCh38.104.gtf", header=FALSE, comment.char="#")
### modify chromosome names in the gtf file to let it match chromosome names in the reference genome
gtf[,1] <- paste0('chr',gtf[,1])
gene_tss <- get_tss_region(gtf,rownames(Kmeans_clustering_ENS))
head(gene_tss)
##              gene  chr    start      end
## 1 ENSG00000237973 chr1   630074   631574
## 2 ENSG00000116285 chr1  8027309  8025809
## 3 ENSG00000162441 chr1  9944407  9942907
## 4 ENSG00000116273 chr1  6612731  6614231
## 5 ENSG00000177674 chr1 11735084 11736584
## 6 ENSG00000171608 chr1  9650731  9652231

Then, get the sequence of these TSS regions based on reference genome (reference genome can be download from UCSC database), and use fimo to scan these sequence to check whether the motif of transcription factor that regulates the target gene occur in the promoter regions of the target gene. Because fimo software only supports linux environment, we generate some shell script to run Fimo software.

You should set the following four parameters:

  1. refdir: path of reference genome

  2. fimodir: path of Fimo software. If Fimo has been set to the global environment variable, just set ‘fimodir <- fimo’.

  3. outputdir1: output path of the shell scripts and sequence of target genes tss regions.(function ‘find_motifs_targetgenes’ will automatically generate two folders (‘fasta’ and ‘fimo’) in the path ‘outputdir1’, and store sequence of target genes tss regions in the ‘fasta’ and shell scripts in the ‘fimo’)

  4. Motifdir: path of the motif file, which can be downloaded from https://github.com/jiang-junyao/MEMEmotif or TRANSFAC database.

Please note that, at the end of ‘outputdir1’,‘motifdir’ and ‘sequencedir’, the symbol ’/’should be contained. What’s more, the chromosome name of your reference genome used here should be the same as the chromosome name in the gene_tss

### run the following codes in the R under linux environment
refdir='/public/home/user/genome/hg38.fa'
fimodir <- 'fimo'
outputdir1 <- '/public/home/user/fimo/'
motifdir <- '/public/home/user/fimo/Mememotif/'
find_motifs_targetgenes(gene_tss,motif1,refdir,fimodir,outputdir1,motifdir)
### run fimo_all script in shell
shell_code <- paste0('sh ',outputdir1,'fimo/fimoall.sh')
system(shell_code,wait=TRUE)

Then, Fimo result are stored in the ‘fimo’ folder under outputdir1, and we run the following R codes to refine regulatory relationships

motif1 <- Tranfac201803_Hs_MotifTFsF
outputdir <- paste0(outputdir1,'fimo/')
fimo_regulation <- generate_fimo_regulation(outputdir,motif1)
filtered_regulatory_relationships <- filter_regulation_fimo(fimo_regulation, regulatory_relationships)

Alternative, motifmatchr can bed used to generate the same stuff

library(BSgenome.Hsapiens.UCSC.hg38)
gene_tss_GR <- GRanges(paste0(gene_tss[,2],':',gene_tss[,3],'-',gene_tss[,4]))
fimo_regulation <- identify_region_tfs(gene_tss_GR,gene_tss[,1],motif1,BSdb=BSgenome.Hsapiens.UCSC.hg38)
filtered_regulatory_relationships <- filter_regulation_fimo(fimo_regulation, regulatory_relationships)

Part 3: Regulatory network analysis and visualization

After we get ‘filtered_regulatory_relationships’ and ‘Kmeans_clustering_ENS’, we can reconstruct regulatory network. Run network_analysis() to get regulatory, this function will generate a list which contain the following 9 elements:

(1)Cor_TFs.txt: list of expressed TFs in the gene networks.

(2)Cor_EnTFs.txt: list of TFs which significantly regulate gene modules (or enriched TFs).

(3)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs in which the source gene is enriched TF.

(4)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs in which both source gene and target gene are enriched TFs.

(5)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs only including regulations within each module but not those between modules, in this step

(6)TF_list: enriched TFs which significantly regulate gene modules

(7)TF_module_regulation: details of enriched TFs which significantly regulate gene modules

(8)TF_network: regulatory network for enriched transcription factors of each module

(9)intramodular_network: intramodular regulatory network

Here, we use refined regulatory relationships from part2 to reconstruct regulatory networks

TFs_list <- network_analysis(filtered_regulatory_relationships,Kmeans_clustering_ENS,TFFDR1 = 10,TFFDR2 = 10)

We can also make enrichment analysis for differentially expressed genes in each module. Before you run this function, you need to download the org.db of your species through BiocManager.

### Download Homo sapiens org.db
#BiocManger::install('org.Hs.eg.db')
library(org.Hs.eg.db)
### Enrichment analysis (KEGG)
enrichment_KEGG <- enrich_module(Kmeans_clustering_ENS, org.Hs.eg.db, enrich.db = 'KEGG',organism = 'hsa')
#enrichment_GO <- enrich_module(Kmeans_cluster_ENS, org.Hs.eg.db, 'GO')

Moreover, you can do GO analysis

library(org.Hs.eg.db)
### Enrichment analysis (GO)
enrichment_GO <- enrich_module(Kmeans_clustering_ENS, enrich.db = 'GO',org.Hs.eg.db)

You can visualize regulatory networks for enriched transcription factors of each module through plot_network() function by setting type parameter as ‘TF’. This plot shows regulatory relationships between transcription factors in different modules that significantly regulating other modules. The size of each vertex determine the significance of this transcription factor. Yellow edges are positive regulation, grey edges are negative regulation.

plot_tf_network(TFs_list)

You can visualize intramodular network with enriched function through plot_intramodular_network() function. Before run this function, you can select enriched functions of each module that you want to present in the plot. If you input all enriched functions, this function will automatically select the function with the highest -log10(qvalue) in each module to present in the plot. What’s more transcription factor with the most edge numbers in each module will be presented in the plot too.

### select functions that you want to present in the figure
enrichment_GO_select <- enrichment_GO[c(1,11,21),]
### plotting
plot_intramodular_network(TFs_list,enrichment_GO_select,layout = 'circle')

It is strongly recommended to use Cytoscape to display the regulatory networks. We provide a function that can provide different Cytoscape styles. You need to install and open Cytoscape before running the function.

###optional: display the network in cytoscape, open cytoscape before running this function
initiate_cy(TFs_list, layout1='degree-circle', type='TF')
initiate_cy(TFs_list, layout1='grid', type='module')

These are the picture we processed through Cytoscape, which can show the regulatory relationship of modularized transcription factors.

Use Cytoscape to visualize regulatory network for enriched transcription factors of each module Cytoscape_network

Use Cytoscape to visualize intramodular network Cytoscape_intramodular