Integrate bulk ATAC-seq and scRNA-seq data to reconstruct regulatory networks
Compare to reconstruct regulatory network only using scRNA-seq data, integrated analysis of bulk ATAC-seq with scRNA-seq overall substantially improve the reconstruction of gene regulatory networks, and have higher precision of identifying known transcription factors. Here, we provide an example of integrating scRNA-seq data and bulk ATAC-seq data to reconstruct regulatory networks for spermatogonial stem cell development. IReNA analyzes scRNA-seq to infer potential regulatory relationships, and then uses footprints with high footprint occupancy score to refine potential regulatory relationships. Finally performing network analysis through hypergeometric test.
To run this pipeline, Computer or server of linux system and the following software are needed: samtools, bedtools and fimo.
1. Test data download
Test data used in the following steps can be downloaded from google drive. The raw data of ATAC-seq used to run ATAC-seq preprocessing pipline are available from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92279
2. Workflow
3. Inputs of IReNA
(1) Bam, peak and footprint files of ATAC-seq data
Bam, peak and footprint files of ATAC-seq data can be separately generated by the step 4, step 8, and step 10 in ATAC-seq analysis pipline.
(2) Seurat object with Pseudotime in the metadata
This seurat object can be generated in scRNA-seq preprocessing pipeline
Hint: seurat object used here should only contain interest genes(e.g. differentially expressed genes and expressed transcription factors)
If bulk RNA-seq data are used to identify basic regulatory relationships, just load raw counts of bulk RNA-seq data, then use the same codes as the scRNA-seq data for further analysis. We suggest to load the expression profile of only differentially expressed genes, otherwise much more time will be used to calculate the correlation of each gene pair. Deseq2 and edgeR can be used to identify differentially expressed genes in bulk RNA-seq data.
(3) Reference genome of the species
Reference genome should be the same as that used for mapping ATAC-seq data, which can be downloaded from UCSC database.
(4) Motif information of transcription factors
IReNA contains DNA motif datasets for four species (Homo sapiens, Mus musculus, Zebrafish and Chicken) derived from TRANSFAC version 201803. Following codes are used to get the motif dataset from TRANSFAC or user-defined motif dataset which should have the same format as these from TRANSFAC database.
library(IReNA)
###call Mus musculus motif database
<- Tranfac201803_Mm_MotifTFsF
motif1 ###call Homo sapiens motif database
<- Tranfac201803_Hs_MotifTFsF
motif1 ###call Zebrafish motif database
<- Tranfac201803_Zf_MotifTFsF
motif1 ###call Chicken motif database
<- Tranfac201803_Ch_MotifTFsF motif1
(5) Motif position weight matirx (PWM)
PWM of all species can be download from https://github.com/jiang-junyao/MEMEmotif
4. Tutorial
IReNA contains four main parts to reconstruct regulatory networks:
Part 1: Analyze scRNA-seq or bulk RNA-seq data to get basic regulatory relationships
Part 2: Analyze bulk ATAC-seq data to refine regulatory relationships (with ATAC-seq data)
Part 3: Regulatory network analysis and visualization
Part 1: Analyze scRNA-seq or bulk RNA-seq data to get basic regulatory relationships
IReNA supports three formats of the inputs:
Input data of this part: Seurat object with Pseudotime in the metadata, which can be generated in scRNA-seq preprocessing pipeline
The test data for this part (‘seurat_with_time.rds’) can be downloaded from google drive. The test data are from the study of human spermatogonial stem cell development which is available in https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(17)30370-3.
First,IReNA divide cells into 50 bins across pseudotime. The bin is removed if all genes in this bin have no expression. The gene is filtered if absolute fold change < 0.01 (set by the parameter ‘FC’). Then, genes will be clustered through K-means algorithm (the number of clusters ‘K’ is set by the parameter ‘K1’).
If bulk RNA-seq data are used to identify regulatory relationships, load your expression matrix as expression_profile that generated by get_SmoothByBin_PseudotimeExp(), then run the following codes. I suggest to input expression profile that only contains differentially expressed genes, or you will a huge of time to calculate correlation of each gene pair.
###Load the test data
seurat_with_time <- readRDS('seurat_with_time.rds')
###Get expression profiles ordered by pseudotime
expression_profile <- get_SmoothByBin_PseudotimeExp(seurat_with_time, Bin = 50)
###Filter noise and logFC in expression profile
expression_profile_filter <- filter_expression_profile(expression_profile, FC=0.01)
###K-means clustering
clustering <- clustering_Kmeans(expression_profile_filter, K1=4)
1:5,1:5] clustering[
## KmeansGroup FoldChangeQ95 SmExp1 SmExp2 SmExp3
## TCEB3 1 2.395806 -0.2424532 -0.8964990 -0.9124960
## CLK1 1 2.508335 -0.1819044 0.7624798 0.4867972
## MATR3 1 2.700294 -1.4485729 0.7837425 0.3028892
## AKAP11 1 2.415084 -0.6120681 -0.3849580 0.3898393
## HSF2 1 2.528111 -0.8125698 -0.6166004 0.8533309
Visualize the clustering of gene expression profiles through the heatmap
plot_kmeans_pheatmap(clustering,ModuleColor1 = c('#67C7C1','#5BA6DA','#FFBF0F','#C067A9'))
Because of the characteristics of Kmeans algorithm, different results will be obtained each time clustering.
Then, Add Ensmble ID of the genes in the first column.
###Add Ensembl ID as the first column of clustering results
<- add_ENSID(clustering, Spec1='Hs')
Kmeans_clustering_ENS 1:5,1:5] Kmeans_clustering_ENS[
## Symbol KmeansGroup FoldChangeQ95 SmExp1 SmExp2
## ENSG00000011007 TCEB3 1 2.395806 -0.2424532 -0.8964990
## ENSG00000013441 CLK1 1 2.508335 -0.1819044 0.7624798
## ENSG00000015479 MATR3 1 2.700294 -1.4485729 0.7837425
## ENSG00000023516 AKAP11 1 2.415084 -0.6120681 -0.3849580
## ENSG00000025156 HSF2 1 2.528111 -0.8125698 -0.6166004
Then IReNA infers the potential regulation for each gene in the expression profile through GENIE3, one of the top performing method for network inference as listed in Beeline benchmark. Futhermore, based on smoothed expression profile, regulation type of each gene pair is defined by Person’s correlation, correlation > 0 reprsents activation, correlation < 0 reprsents repression.
<- GENIE3(as.matrix(seurat_with_time@assays$RNA@data),nCores = 50)
weightMat <- getLinkList(weightMat)
weightMat <- weightMat[weightMat[,3]>0.0002,]
regulation ### add regulation type for each gene pair
<- add_regulation_type(Kmeans_clustering_ENS,regulation)
regulatory_relationships ### check whether source genes are transcription factors
<- c()
motifTF for (i in 1:nrow(motif1)) {
<- strsplit(motif1[i,5],';')[[1]]
TF <- c(motifTF,TF)
motifTF
}<- regulatory_relationships[regulatory_relationships[,1] %in% motifTF,] regulatory_relationships
However, GENIE3 costs a huge of running time and memory compare to Person’s correlation. So, we also provide functions to infer potential regulation through Person’s correlation.
<- get_cor(Kmeans_clustering_ENS, motif = motif1, correlation_filter = 0.6, start_column = 4) regulatory_relationships
Part 2: Analyze bulk ATAC-seq data to refine regulatory relationships (with bulk ATAC-seq data)
For ATAC-seq data is available, IReNA calculates the footprints of transcription factors and footprint occupancy score (FOS) to refine regulatory relationships. The footprints whose distance is less than 4 are merged and then the sequence of each footprint is obtained from the reference genome through the function ‘get_merged_fasta’. The reference genome should be in fasta/fa format which can be downloaded from UCSC or other genome database.
The footprint file used below is available in goole drive
The human genome used below can be downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
###merge footprints whose distance is less than 4
<- read.table('footprints.bed',sep = '\t')
filtered_footprints <- 'Genome/hg38.fa'
fastadir <- get_merged_fasta(filtered_footprints,fastadir)
merged_fasta write.table(merged_fasta,'merged_footprints.fasta',row.names=F,quote=F,col.names=F)
After obtaining the motif sequences, use fimo software to identify binding motifs in the footprints. Because fimo software only supports linux environment, we generate a shell script to run Fimo software.
First, identify differentially expressed genes related motifs through the function ‘motif_select’, which will reduce the running time of the subsequent analysis process.
Then, you should set the following five parameters:
fimodir: path of Fimo software. If Fimo has been set to the global environment variable, just set ‘fimodir <- fimo’.
outputdir1: output path of the shell scripts.
outputdir: output path of Fimo result.
motifdir: path of the motif file, which can be downloaded from https://github.com/jiang-junyao/MEMEmotif or TRANSFAC database.
sequencedir: path of the sequence which is generated by the function ‘get_merged_fasta’.
Please note that, at the end of ‘outputdir’ and ‘sequencedir’ the symbol ’/’should be contained.
### Identify differentially expressed genes related motifs
<- motifs_select(Tranfac201803_Hs_MotifTFsF, rownames(Kmeans_clustering_ENS)) ###Kmeans_clustering_ENS was obtained in part1
motif1 ### run find_motifs()
<- 'fimo'
fimodir <- '/public/home/user/fimo/output/'
outputdir1 <- '/public/home/user/fimo/output/'
outputdir <- '/public/home/user/fimo/Mememotif/'
motifdir <- '/public/home/user/fimo/merged_footprints.fasta'
sequencedir find_motifs(motif1,step=20,fimodir, outputdir1, outputdir, motifdir, sequencedir)
### run fimo_all script in shell
<- paste0('sh ',outputdir1,'Fimo_All.sh')
shell_code system(shell_code,wait=TRUE)
### delete shell scripts
<- paste0('rm ',outputdir1,'Fimo*.sh')
shell_code2 system(shell_code2,wait=TRUE)
Then, we combine these Fimo consequence in ‘outputdir’. Notably outputdir folder should only contain fimo result files. Next, we load the peaks file and overlap differential peaks and motif footprints through overlap_footprints_peaks() function
###Combine all footprints of motifs
<- combine_footprints(outputdir)
combined <- read.delim('differential_peaks.bed')
peaks <- overlap_footprints_peaks(combined,peaks) overlapped
Next, we intergrate bioconductor package ChIPseeker to get footprint-related genes. Before we run get_related_genes(), we need to specify TxDb, which can be download from: Bioconductor TxDb list. Kmeans_clustering_ENS used here was obtained in part1.
###get footprint-related genes
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
<- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb <- get_related_genes(overlapped,txdb = txdb,motif=Tranfac201803_Hs_MotifTFsF,Species = 'Hs')
list1 ###Get candidate genes/TFs-related peaks
<- get_related_peaks(list1,Kmeans_clustering_ENS)
list2 ### output filtered footprints
write.table(list2[[1]],'filtered_footprints.bed', quote = F, row.names = F, col.names = F, sep = '\t')
Then, because of the size of original BAM file is too large, so we need to use samtools to extract footprints realated regions in BAM to reduce running time of function which analyze bam files in IReNA. (If you use our test data, just skip this step)
### run samtools in shell
shell_code1 <- 'samtools view -hb -L filtered_footprint.bed SSC_patient1.bam > SSC1_filter.bam'
shell_code2 <- 'samtools view -hb -L filtered_footprint.bed SSC_patient2.bam > SSC2_filter.bam'
shell_code3 <- 'samtools view -hb -L filtered_footprint.bed esc.bam > esc_filter.bam'
system(shell_code1,,wait=TRUE)
system(shell_code2,,wait=TRUE)
system(shell_code3,,wait=TRUE)
Optional Locate the center on the Tn5 cut site thorugh R package ATACseqQC. ATACseqQC shifts reads on the forward strand by +4 bp and reads on the reverse strand by -5 bp from the bam-files.
library(ATACseqQC)
library(Rsamtools)
<- 'SSC1_filter.bam'
bamfilepath1 <- 'SSC2_filter.bam'
bamfilepath2 <- 'esc_filter.bam'
bamfilepath3 indexBam(bamfilepath1)
<- readBamFile(bamfilepath1, tag=tags, asMates=TRUE, bigFile=TRUE)
gal1 <- readBamFile(bamfilepath2, tag=tags, asMates=TRUE, bigFile=TRUE)
gal2 <- readBamFile(bamfilepath3, tag=tags, asMates=TRUE, bigFile=TRUE)
gal3 <- shiftGAlignmentsList(gal, 'SSC1_filter_shift.bam')
galout1 <- shiftGAlignmentsList(gal, 'SSC2_filter_shift.bam')
galout2 <- shiftGAlignmentsList(gal, 'esc_filter_shift.bam') galout3
Furthermore, we count the cuts of each position in footprints by wig_track(), and use these cuts to calculate the FOS of footprints to identify enriched TFs which determine the regulatory relationship. regulatory_relationships used here was calculated in part1. Parameter FOS_threshold used here is the threshold to filter low quality footprints, you can increase it to reduce the number of exported footprints.
### calculate cuts of each each position in footprints
<- 'SSC1_filter.bam'
bamfilepath1 <- 'SSC2_filter.bam'
bamfilepath2 <- 'esc_filter.bam'
bamfilepath3 ### set parameter 'workers' to make this function run in parallel
<- cal_footprint_cuts(bamfilepath = bamfilepath1,bedfile = list2[[1]],workers = 40,index_bam = T)
cuts1 <- cal_footprint_cuts(bamfilepath = bamfilepath2,bedfile = list2[[1]],workers = 40,index_bam = T)
cuts2 <- cal_footprint_cuts(bamfilepath = bamfilepath3,bedfile = list2[[1]],workers = 40,index_bam = T)
cuts3 <- list(cuts1,cuts2,cuts3)
cut_list ### get related genes of footprints with high FOS
<- Footprints_FOS(cut_list,list2[[2]], FOS_threshold = 0.1)
potential_regulation ### Use information of footprints with high FOS to refine regulatory relationships
<- filter_ATAC(potential_regulation,regulatory_relationships) filtered_regulatory
Part 3: Regulatory network analysis and visualization
After we get ‘filtered_regulatory_relationships’ and ‘Kmeans_clustering_ENS’, we can reconstruct regulatory network. Run network_analysis() to get regulatory, this function will generate a list which contain the following 9 elements:
(1)Cor_TFs.txt: list of expressed TFs in the gene networks.
(2)Cor_EnTFs.txt: list of TFs which significantly regulate gene modules (or enriched TFs).
(3)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs in which the source gene is enriched TF.
(4)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs in which both source gene and target gene are enriched TFs.
(5)FOSF_RegMTF_Cor_EnTFs.txt: regulatory pairs only including regulations within each module but not those between modules, in this step
(6)TF_list: enriched TFs which significantly regulate gene modules
(7)TF_module_regulation: details of enriched TFs which significantly regulate gene modules
(8)TF_network: regulatory network for enriched transcription factors of each module
(9)intramodular_network: intramodular regulatory network
Here, we use refined regulatory relationships from part2 to reconstruct regulatory networks
<- network_analysis(filtered_regulatory_relationships,Kmeans_clustering_ENS,TFFDR1 = 10,TFFDR2 = 10) TFs_list
We can also make enrichment analysis for differentially expressed genes in each module. Before you run this function, you need to download the org.db of your species through BiocManager.
### Download Homo sapiens org.db
#BiocManger::install('org.Hs.eg.db')
library(org.Hs.eg.db)
### Enrichment analysis (KEGG)
<- enrich_module(Kmeans_clustering_ENS, org.Hs.eg.db, enrich.db = 'KEGG',organism = 'hsa')
enrichment_KEGG #enrichment_GO <- enrich_module(Kmeans_cluster_ENS, org.Hs.eg.db, 'GO')
Moreover, you can do GO analysis
library(org.Hs.eg.db)
### Enrichment analysis (GO)
enrichment_GO <- enrich_module(Kmeans_clustering_ENS, enrich.db = 'GO',org.Hs.eg.db)
You can visualize regulatory networks for enriched transcription factors of each module through plot_network() function by setting type parameter as ‘TF’. This plot shows regulatory relationships between transcription factors in different modules that significantly regulating other modules. The size of each vertex determine the significance of this transcription factor. Yellow edges are positive regulation, grey edges are negative regulation.
plot_tf_network(TFs_list)
You can visualize intramodular network with enriched function through plot_intramodular_network() function. Before run this function, you can select enriched functions of each module that you want to present in the plot. If you input all enriched functions, this function will automatically select the function with the highest -log10(qvalue) in each module to present in the plot. What’s more transcription factor with the most edge numbers in each module will be presented in the plot too.
### select functions that you want to present in the figure
<- enrichment_GO[c(1,11,21),]
enrichment_GO_select ### plotting
plot_intramodular_network(TFs_list,enrichment_GO_select,layout = 'circle')
It is strongly recommended to use Cytoscape to display the regulatory networks. We provide a function that can provide different Cytoscape styles. You need to install and open Cytoscape before running the function.
###optional: display the network in cytoscape, open cytoscape before running this function
initiate_cy(TFs_list, layout1='degree-circle', type='TF')
initiate_cy(TFs_list, layout1='grid', type='module')
These are the picture we processed through Cytoscape, which can show the regulatory relationship of modularized transcription factors.
Use Cytoscape to visualize regulatory network for enriched transcription factors of each module
Use Cytoscape to visualize intramodular network