CACIMAR tutorial

Jiang junyao, Li jinlian

2024-05-14

Installation

Follow the steps below to install CACIMAR package from GitHub and run it:

# install
# install.packages("devtools") # Install devtools if not already installed
devtools::install_github("jiang-junyao/CACIMAR")
# require
library(CACIMAR)
library(reshape2)

Introduction

In this tutorial, we analyze retina regeneration scRNA-seq data from mouse and zebrafish to identify conserved markers, cell types, intracellular regulations and intercellular interactions in the retina between mice and zebrafish. This tutorial contains following 5 steps: (1) Identify markers; (2) Identify evolutionarily conserved cell types; (3) Identify evolutionarily conserved markers; (4) Identify conserved intracellular regulation; (5) Identify conserved intercellular interactions (cell-cell interaction)

Tutorial

In this tutorial, we analyze retina regeneration scRNA-seq data from mouse and zebrafish to identify conserved markers, cell types, intracellular regulations and intercellular interactions in the retina between mice and zebrafish. This tutorial contains following 5 steps:

Input data of this tutorial

Seurat object of mouse and zebrafish

Seurat object should include cell types information in the ‘active.ident’ slot. Test data can be download from Google drive

Homologous gene database

A homologous gene database is required. CACIMAR has already generated a homologous gene database for human, mouse, zebrafish, and chicken. You can use the following code to load the pre-built database, or you can generate your own homologous gene database using the instructions provided in the “Build homologous gene database” section of the tutorial. To load the pre-built database:

### load homologous gene database of human and mouse
OrthG <- OrthG_Hs_Mm
### load homologous gene database of human and chicken
OrthG <- OrthG_Hs_Ch
### load homologous gene database of human and zebrafish
OrthG <- OrthG_Hs_Zf
### load homologous gene database of mouse and chicken
OrthG <- OrthG_Mm_Ch
### load homologous gene database of mouse and zebrafish
OrthG <- OrthG_Mm_Zf
### load homologous gene database of zebrafish and chicken
OrthG <- OrthG_Zf_Ch

Gene regulatory network (for step4)

Using the following code to load cell type specific networks

load(system.file("extdata", "gene_network.rda", package = "CACIMAR"))

Using the following code to load modularized regulatory networks

Step1: Identify markers

In this part, CACIMAR first used receiver operating characteristic (ROC) based method in ‘FindAllMarkers’ function of Seurat to identify markers in each cluster for every species individually. Then, based on marker genes identified above, CACIMAR calculates the power of markers(MP, MP= 2 × abs(AUC-0.5)) in each cluster and the expression differences of markers(logFC) between clusters. Markers with MP > 0.35 and logFC > 0.1 will be retained.

### identify markers
Zf_seurat <- readRDS('Zf_seurat.rds')
Mm_seurat <- readRDS('Mm_seurat.rds')
zf_marker <- Identify_Markers(Zf_seurat,PowerCutoff=0.35, DifferenceCutoff=0.1)
mm_marker <- Identify_Markers(Mm_seurat,PowerCutoff=0.35, DifferenceCutoff=0.1)

Step2:Identify evolutionarily conserved cell types

CACIMAR identifies evolutionarily conserved cell types and poorly conserved cell types utilizing conservation scores of cell types (CSCT) and a phylogenetic tree.

CSCT is computed based on the powers (or MP values) of shared markers in each pair of cell types, relative to the sum of the powers of all markers in each pair of cell types. Shared markers indicate the markers of one species that have homologs in the other species. If genes have multiple homologs, the gene with the highest power is selected. If the pair of cell types is mutually matched, and CSCT is greater than the third quartile of CSCT values, then this pair of cell types is considered a conserved cell type between two species. Mutually matched pair cell types mean that for every cell type in species 1, we look for the corresponding cell type in species 2 that has the highest CSCT value, and vice versa. The corresponding cell types in both species are said to be “mutually matched,” and they have the highest CSCT value relative to all other cell types in the other species.

Besides, we generate a phylogenetic tree based on the CSCT value using the hierarchical clustering method UPGMA (unweighted pair group method using arithmetic mean) to further identify the conserved cell types and the poorly conserved cell types. If cell types from two species are from the same clade of bipartition, the two cell types are considered conserved cell types across species. However, if these conserved cell types are only identified by the phylogenetic tree of cell types, they are regarded as poorly conserved cell types.

If you haven’t run step1, you can just use the following code to get input data

load(system.file("extdata", "zf_mm_markers.rda", package = "CACIMAR"))
### identify evolutionarily conserved cell types based on conservation score of cell types (CSCT)
OrthG <- OrthG_Mm_Zf
expression <- Identify_ConservedCellTypes(OrthG, zf_marker, mm_marker,'zf','mm')
SCT_matrix <- expression[[2]]
SNT_h <- expression[[2]][grep('mm',rownames(SCT_matrix)),as.numeric(grep('zf',colnames(SCT_matrix)))]
### show the CSCT value with a heatmap
Heatmap_Cor(SNT_h,cluster_cols=F, cluster_rows=F,Color1 = c(rgb(102/255,46/255,115/255),rgb(31/255,153/255,139/255),rgb(251/255,232/255,48/255)))

Then, we identify conserved cell type based on the CSCT.

conserved_celltypes = identify_conserved_pair(SNT_h)
print(conserved_celltypes)
##  [1] "mmResting MG-zfResting MG"     "mmMicroglia-zfMicroglia"      
##  [3] "mmV/E cells-zfV/E cells"       "mmGABAergic AC-zfGABAergic AC"
##  [5] "mmActivated MG-zfActivated MG" "mmRod BC-zfCone BC"           
##  [7] "mmCones-zfCones"               "mmRods-zfRods"                
##  [9] "mmRGC-zfRGC"                   "mmRPE-zfRPE"                  
## [11] "mmPericytes-zfPericytes"

Then we can evaluate evolutionarily conserved cell types and poorly conserved cell types through phylogenetic tree. By default, distinct fonts are employed for tip labels to indicate the evolutionary conservation status of cell types. This allows for differentiation between evolutionarily conserved cell types(bold.italic), poorly conserved cell types(italic), and unconserved cell types(plain). To customize the colors for the tip labels, you can utilize the tiplab_cols parameter and specify the desired color scheme. The default color scheme assigns dark blue to represent evolutionarily conserved cell types, light blue to represent poorly conserved cell types, and grey to represent unconserved cell types. Additionally, if you wish to change the method used for constructing the phylogenetic tree, you can modify the tree_method parameter to “classic Neighbor-Joining” method.

### get the conserved cell types based on the mutually matched and bigger than 3/4 CSCT score
conserved_hm_celltypes <- get_conserved_hm_celltypes(SNT_h)
### used my pointed colors
load(system.file("extdata", "bind_colors.rda", package = "CACIMAR"))
### generate a phylogenetic tree
species.vector <- substr(rownames(SCT_matrix), 1, 2)
p <- Plot_phylogenetic_tree(SCT_matrix = SCT_matrix,
                            species.vector = species.vector, 
                            conserved_hm_celltype = conserved_hm_celltypes,
                            annotation_colors_df = bind_colors,
                            colors_labels = c("Mouse", "Zebrafish")) 
phylogenetic tree of cross-species cell types
phylogenetic tree of cross-species cell types

Step3:Identify evolutionarily conserved markers

CACIMAR initially employs a homologous gene database to enhance the markers identified by the “FindAllMarkers” function using ROC. Markers that are present in the homologous database are then considered as shared markers. Subsequently, CACIMAR identifies markers present in the conserved cell types from step2 and retains them as evolutionarily conserved markers.

load(system.file("extdata", "zf_mm_markers.rda", package = "CACIMAR"))
conserved_markers = Identify_ConservedMarkers(OrthG_Mm_Zf,mm_marker,zf_marker,
                                              'mm','zf',conserved_celltypes)

Plot conserved markers

mm_conserved_markers = mm_marker[conserved_markers$mmgene,6:22]
mm_conserved_markers = cbind(conserved_markers$mmcluster,mm_conserved_markers)
zf_conserved_markers = zf_marker[conserved_markers$zfgene,6:20]
zf_conserved_markers = cbind(conserved_markers$zfcluster,zf_conserved_markers)
zf_conserved_markers = zf_conserved_markers[order(zf_conserved_markers[,1]),]
### plot mouse conserved markers
Plot_MarkersHeatmap(mm_conserved_markers,cellheight = 1,cellwidth = 6,
                    legend = T)

### plot zebrafish conserved markers
Plot_MarkersHeatmap(zf_conserved_markers,cellheight = 1,cellwidth = 6,
                    legend = T)

Step4: Identify conserved intracellular regulation

We employed CACIMAR to assess the evolutionary conservation of two separate regulatory networks: cell type-specific regulatory networks and regulatory subnetworks, otherwise known as modules, across various species. CACIMAR calculates the conservation score of regulatory networks (CSRN) for different cell types/modules in regulatory networks of distinct species. CSRN is computed based on two criteria: (I) the proportion of homologous genes (nodes) compared to all genes, and (II) the proportion of interactions (edges) among homologous genes compared to all interactions. A higher CSRN indicates a more conserved regulatory network.

Assess the evolutionary conservation of cell type-specific regulatory networks

The following cell type specific regulatory networks are constructed by IReNA, which can integrate scRNA-seq and scATAC-seq/ATAC-seq data or use scRNA-seq/RNA-seq data alone to construct modularized regulatory networks. You can use the following codes to load test networks.

load(system.file("extdata", "gene_network.rda", package = "CACIMAR"))
ConservedNetworks_ct <- identify_ct_ConservedNetworks(OrthG, Species1_GRN, Species2_GRN, 'mm', 'zf')
Heatmap_Cor(ConservedNetworks_ct[[2]], cluster_cols=F, cluster_rows=F, Color1 = c(rgb(102/255,46/255,115/255), rgb(31/255,153/255,139/255), rgb(251/255,232/255,48/255)))
CSRN of cell type-specific regulatory networks
CSRN of cell type-specific regulatory networks

Assess the evolutionary conservation of Modularized regulatory networks

Modular regulatory networks used here are constructed by R package IReNA, which can integrate scRNA-seq and scATAC-seq/ATAC-seq data or use scRNA-seq/RNA-seq data alone to construct modularized regulatory networks. WGCNA is an optional method that can generate the input. However, WGCNA only can construct regulatory networks through bulk RNA-seq data.

ConservedNetworks <- Identify_ConservedNetworks(OrthG, mm_gene_network, zf_gene_network, 'mm', 'zf')
Heatmap_Cor(ConservedNetworks[[2]], cluster_cols=F, cluster_rows=F, Color1 = c(rgb(102/255,46/255,115/255), rgb(31/255,153/255,139/255), rgb(251/255,232/255,48/255)))
CSRN of modularized regulatory networks
CSRN of modularized regulatory networks

Step5:Identify conserved intercellular interactions (cell-cell interaction)

CACIMAR employs the SingleCellSignalR algorithm from liana to identify evolutionarily conserved cell-cell interactions (CCI) based on the expression levels of ligands and receptors. CCI that satisfy the following criteria are regarded as evolutionarily conserved: (I) the interactions between ligands and receptors demonstrate consistency, and (II) the cell type of the ligand and receptor in species 1 should correspond to the cell type of the ligand and receptor in species 2, respectively. Subsequently, we compute the proportion of consistent intercellular interactions to gauge the conservation score for intercellular interactions (CSII).

Perform cell-cell interaction analysis with SingleCellSignalR algorithm

# By default, we contained the interactions that have a LRscore greater than 0.5 and scale the LRscore
SingleCellSignalR_mouse_result <- perform_CCI_analysis(seurat_obj=Mm_seurat, target_organism=10090)
SingleCellSignalR_zebrafish_result <- perform_CCI_analysis(seurat_obj=Zf_seurat, target_organism=7955)

Overall cell-cell interactions profile across two species

Here, we summarize the weight of all interactions within each pair of cell types and display the profile with a sankey plot. The left side represents the source of ligands in mouse, the right side represents the source of ligands in zebrafish, and the middle represents all the target receptors shared by both species.

load(system.file("extdata", "SingleCellSignalR_results.rda", package = "CACIMAR"))
# sum Weights of each cell types
all_weight_df_long_Mm_zf <- calculate_Weights(species1_cci = SingleCellSignalR_mouse_result,
                                              species2_cci = SingleCellSignalR_zebrafish_result,
                                              specie_name1 = "mm",
                                              specie_name2 = "zf")

# sankey plot will be saved in current directory by default
# If the colors_file is set to NULL, the colors will be automatically assigned.
create_sankey(links = all_weight_df_long_Mm_zf[, c("Source2", "target", "scale_weight")],
              output_file = "Mm_Zf_sankey.html",
              specie_name1 = "mm",
              specie_name2 = "zf")
cell-cell interaction weight profile across species
cell-cell interaction weight profile across species

Identified the evolutionarily conserved cell-cell interactions (CCI)

If you don’t want to run the cell-cell interaction above, you can load test data with load(system.file(“extdata”, “mouse_cci_test.rda”, package = “CACIMAR”)) for this part to caculate the conservation score for intercellular interactions(CSII).

##### ChordDiagram, two species
## caculate the conservation score for intercellular interactions(CSII)
# mm and zf conserved score of interaction
conserved_cell_types_mm_zf <- data.frame("mm" = c("mmRods", "mmRod BC", "mmCones", "mmPericytes", "mmV/E cells", "mmRGC", "mmGABAergic AC", "mmRPE", "mmResting MG", "mmActivated MG", "mmMicroglia", "mmGlycinergic AC", "mmHorizontal cells"), "zf" = c("zfRods", "zfCone BC", "zfCones", "zfPericytes", "zfV/E cells", "zfRGC", "zfGABAergic AC", "zfRPE", "zfResting MG", "zfActivated MG", "zfMicroglia", "zfGlycinergic AC", "zfHorizontal cells"))
conserved_result_mm_zf <- Identify_Conserved_CCI1(species1_cci=SingleCellSignalR_mouse_result,
                                                 species2_cci=SingleCellSignalR_zebrafish_result,
                                                 conserved_cell_types_df=conserved_cell_types_mm_zf,
                                                 species_name1 = "mm",
                                                 species_name2 = "zf")
cci_conserved_Weights_table_mm_zf <- Caculate_cell_pair_cci_score(conserved_result_df=conserved_result_mm_zf,
                                                                  species1_cci = SingleCellSignalR_mouse_result,
                                                                  species2_cci = SingleCellSignalR_zebrafish_result,
                                                                  conserved_cell_types_df = conserved_cell_types_mm_zf,
                                                                  species_name1 = "mm",
                                                                  species_name2 = "zf")
cci_data_mm_zf <- Make_ChordDiagram_data(cci_conserved_Weights_table = cci_conserved_Weights_table_mm_zf, species_name1 = "mm", species_name2 = "zf")
## ChordDiagram shows the CSII
ChordDiagram(net = cci_data_mm_zf, filename = paste("mm", "zf", "chordDiagram.pdf", sep = "_"))
conservation score for intercellular interactions
conservation score for intercellular interactions

Show the conserved ligand-receptor pairs

# We display the ligand-receptor pairs in the cell type pairs where the CSII is greater than or equal to the third quantile value of CSII by default
merge_avg_width_df <- make_pheatmap_LR_data1(cci_conserved_results = conserved_result_mm_zf,
                                            seurat_object_sp1 = Mm_seurat,
                                            seurat_object_sp2 = Zf_seurat,
                                            avg_group = "cellname",
                                            species_name1 = "mm",
                                            species_name2 = "zf",
                                            cci_conserved_Weights = cci_conserved_Weights_table_mm_zf)
pheatmap_LR1(width_df = merge_avg_width_df$merge_width_df_width, filename = "lr_avg_pheatmap.pdf")
conserved ligand-receptor pairs
conserved ligand-receptor pairs

Addtional function: Build homologous gene database

Under the situation that homologous gene database of your interested species is not included in our package, CACIMAR provides to build your own homologous gene database according to MGI database.

Hint Only github version of CACIMAR supports this part

(1) Inputs data of this part

1) 'mm' for mouse; 
2) 'zf' for zebrafish; 
3) 'hs' for human; 
4) 'ch' for chicken; 
5) 'cf' for dog; 
6) 'pt' for chimpanzee; 
7) 'xt' for frog; 
8) 'rn' for rat; 
9) 'bt' for cattle; 
10) 'rh' for macaque

(2) Example for build homologous gene database

### download data
download.file(destfile = 'D:\\GIBH\\platform\\OrthG database/HOM.txt',url = 'http://www.informatics.jax.org/downloads/reports/HOM_AllOrganism.rpt')
### load data
HOM=read.delim('D:\\GIBH\\platform\\OrthG database/HOM.txt')
### build homologous gene database for dog and mouse
Species_name1 <- 'cf'
Species_name2 <- 'mm'
OrthG_Cf_Mm <- buildHomDatabase(HOM, Species_name1, Species_name2)

Addtional function: Identify conserved gene based on gene list from two species

mm_gene = c('Sox2','Oct4')
zf_gene = c('sox2','ube2')
identify_conserved_gene(OrthG_Mm_Zf,mm_gene,zf_gene,'mm','zf')