--- title: "Brain cell type proportion analysis using BRETIGEA" author: "Andrew McKenzie, Minghui Wang, Bin Zhang" date: "`r Sys.Date()`" output: pdf_document: number_sections: yes toc: yes vignette: > %\VignetteIndexEntry{Basic BRETIGEA Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - id: memoryagingtbi2016 title: Allen Institute for Cell Science. Aging, Dementia and TBI URL: 'http://aging.brain-map.org/' year: 2016 --- # Background for BRETIGEA Several comprehensive RNA-seq experiments in different brain cell types have now been published in humans and mice. Some of these experiments have profiled gene expression of cell populations isolated through immunopanning procedures. Immunopanning involves immunoprecipitation of particular cell types in cell culture plates, based on selection for an antibody adsorbed to the plate surface. Others studies have performed RNA profiling of single cells with microfluidics devices and used clustering methods to identify cell types from the resulting RNA expression profiles. The devices used for single cell RNA sequencing (scRNA-seq) often select cells based on size or via encapsulation in a droplet and involve the creation of a cDNA library from the transcriptome from a theoretical maximum of one cell. Existing studies have been mainly based on individual datasets, and are therefore subject to systematic noise, including sampling bias due to sample collection or preparation technique, as well as stochasticity in gene expression. As an increasing number of RNA-seq cell type-specific transcriptomic experiments have become available for both human and mouse, we set out to conduct a comprehensive meta-analysis of brain cell type gene signatures, which is now published in McKenzie et al (2018), . We created cell type-specific (i.e. marker) gene signatures for six cell types: astrocytes (ast), endothelial cells (end), microglia (mic), neurons (neu), oligodendrocytes (oli), and oligodendrocyte precursor cells (opc). The goal of our cell type specificity measure, which is fully described in our manuscript, is to measure whether a gene is expressed in only one cell type relative to the others. The five data sets used in the creation of the cell type marker signatures can [be found in the manuscript](https://www.nature.com/articles/s41598-018-27293-5/tables/1). # Introduction to BRETIGEA A major goal of **BRETIGEA** (BRain cEll Type specIfic Gene Expression Analysis) is to simplify the process of defining your own set of brain cell type marker genes by using a well-validated set of cell type-specific marker genes derived from both immunopanning and single cell microfluidic experiments, as described in McKenzie et al (2018), . There are brain cell type markers available that have been developed from human data, mouse data,, and combinations using data from both species (the default). Notably, if you use your own marker data, the functions in **BRETIGEA** are applicable to bulk gene expression data from any tissue. This vignette shows how you can perform cell type proportion estimation and adjustment on your own bulk gene expression data. # Data loading and input format First, we will load the package and read in example bulk RNA-sequencing data from four brain regions (frontal white matter, temporal cortex, parietal cortex, and hippocampus), which was generated by the Allen Brain Atlas [@memoryagingtbi2016] and filtered to contain primarily brain marker genes. We also will load a data frame with additional immunohistochemistry quantification measurements from each brain sample, to use as a validation of the method. ```{r, results="hide", warning=FALSE, message=FALSE} library(BRETIGEA, quietly = TRUE) library(knitr) #only used for vignette creation ``` Here is the format of the inputs: ```{r, message = FALSE} str(aba_marker_expression, list.len = 5) str(aba_pheno_data, list.len = 5) ``` # Relative cell type proportion estimation To run the brain cell type proportion estimation analysis and extract the matrix of surrogate proportion variables for each of the major six brain cell types (astrocytes, endothelial cells, microglia, neurons, oligodendrocytes, and OPCs), run this: ```{r, message = FALSE} ct_res = brainCells(aba_marker_expression, nMarker = 50) kable(head(ct_res)) ``` ## Selecting the nMarker parameter Note that the above analysis uses *nMarker* = 50 marker genes. A notable trade-off in the selection of the number of marker genes to include in the analysis is that the more marker genes you use, the more likely you are to average out any cell type-specific expression changes that may occur across groups in your sample. On the other hand, the fewer marker genes you use, the higher-quality these marker genes will tend to be in terms of strength of cell type specificity. We have chose *nMarker* = 50 because it has been a reasonable number in our experience, but the goals of your analysis may differ and you may want to choose a different number of marker genes for each cell type. Note that only marker genes which have been measured in your data set will be used by the cell type proportion estimates, so if your data set has fewer gene measurements (e.g., in a proteomics data set), that may be a reason to use fewer marker genes. Comparing these cell type proportion estimates to the independent immunohistochemistry quantifications of two marker genes (IBA1 and GFAP), you can see that the correlation is strong. ```{r, fig.width = 7, fig.height = 7, message = FALSE, warning = FALSE} cor_mic = cor.test(ct_res[, "mic"], as.numeric(aba_pheno_data$ihc_iba1_ffpe), method = "spearman") print(cor_mic) cor_ast = cor.test(ct_res[, "ast"], as.numeric(aba_pheno_data$ihc_gfap_ffpe), method = "spearman") print(cor_ast) ``` The default cell type proportion estimation method is singular value decomposition, but if you want to use PCA, that is an option as well. ```{r, message = FALSE} ct_res = brainCells(aba_marker_expression, nMarker = 50, species = "combined", method = "PCA") kable(head(ct_res)) ``` The *species* argument controls which species the marker genes are derived from, and can be set to "human" and "mouse" for data specific to those species. If you want to only estimate the proportion of particular cell types, you can do so by setting the *celltypes* argument. Here, we only estimate the proportions of astrocytes, neurons, and oligodendrocytes. Note that the estimates of each cell type is done independently, so choosing to estimate the proportions of one cell type or not will not affect the estimates of the other cell types. ```{r, message = FALSE} ct_res = brainCells(aba_marker_expression, nMarker = 50, species = "combined", celltypes = c("ast", "neu", "oli")) kable(head(ct_res)) ``` # Using alternative cell type marker genes from Kelley *et al.* In addition to the default data set built from a meta-analysis across cell type-specific gene expression data, BRETIGEA also offers access to cell type markers based on leveraging variation across intact tissue samples. The cell types for which markers are available based on this data set are astrocytes, neurons, microglia, and oligodendrocytes. To use this, change the *data_set* parameter to "kelley" (referring to Kelley *et al.*, 2018, PMID: 30154505) when you call brainCells(). Note that the *species* argument will be ignored if *data_set* is set to "kelley". ```{r, message = FALSE} ct_res = brainCells(aba_marker_expression, nMarker = 50, data_set = "kelley") kable(head(ct_res)) ``` In the Allen Brain Atlas RNA-seq data, the estimated proportions are overall very similar between the "mckenzie" and "kelley" data sets. ```{r, message = FALSE} ct_res_mckenzie = brainCells(aba_marker_expression, nMarker = 50, data_set = "mckenzie", species = "human") ct_res_kelley = brainCells(aba_marker_expression, nMarker = 50, data_set = "kelley") cell_types_compare = colnames(ct_res_kelley) for(i in 1:length(cell_types_compare)){ cor_res = cor.test(ct_res_mckenzie[ , cell_types_compare[i]], ct_res_kelley[ , cell_types_compare[i]], method = "spearman") df_compare_cor = data.frame(Cell = cell_types_compare[i], Rho = cor_res$estimate, PVal = cor_res$p.value) if(i ==1) df_compare_cor_tot = df_compare_cor if(i > 1) df_compare_cor_tot = rbind(df_compare_cor_tot, df_compare_cor)} kable(df_compare_cor_tot) ``` This alternative data set also offers marker genes derived from several specific brain regions: ```{r, message = FALSE, echo = FALSE} print(unique(unlist(lapply(strsplit(unique(kelley_df_brain$cell)[-c(1, 2, 3, 4)], "_"), "[[", 1)))) ``` # Using your own cell type marker genes If you have access to your own marker genes, you can use the *findCells* function instead. This has the same functionality otherwise; *brainCells* is simply a wrapper function for users who want to use the brain cell type marker genes that are provided by **BRETIGEA**. Note the format of the *markers* data frame: you must have one column with the gene symbol, named *markers*, and one column with the corresponding cell type, named *cell*. ```{r, fig.width = 6, fig.height = 5, message = FALSE, warning = FALSE} str(markers_df_brain) ct_res = findCells(aba_marker_expression, markers = markers_df_brain, nMarker = 50) kable(head(ct_res)) ``` # Adjusting bulk gene expression data for estimated cell type proportions **BRETIGEA** also offers users the ability to adjust their original gene expression matrices for the estimated cell type proportion estimates, in order to deconvolute the signal. ```{r, fig.width = 6, fig.height = 5, message = FALSE, warning = FALSE} brain_cells_adjusted = adjustBrainCells(aba_marker_expression, nMarker = 50, species = "combined") expression_data_adj = brain_cells_adjusted$expression ``` Note that *adjustBrainCells* is a wrapper function to *adjustCells* and if you have your own markers (e.g., for a non-brain data set), then you can use that interface instead for deconvolution of more general cell types. As you can see, following adjustment, there is no longer a correlation between the RNA expression of the microglia marker gene AIF1 and its encoded protein IHC quantification (IBA1), nor between the RNA and protein expression of the astrocyte marker gene GFAP. (Note there *is* a non-significant trend towards a residual correlation here, which may be because GFAP is not a perfect marker of astrocyte proportion in this data set, but instead varies across samples based on disease state, region, and other factors). ```{r, fig.width = 6, fig.height = 5, message = FALSE, warning = FALSE} cor.test(as.numeric(aba_marker_expression["AIF1", ]), as.numeric(aba_pheno_data$ihc_iba1_ffpe), method = "spearman") cor.test(expression_data_adj["AIF1", ], as.numeric(aba_pheno_data$ihc_iba1_ffpe), method = "spearman") cor.test(as.numeric(aba_marker_expression["GFAP", ]), as.numeric(aba_pheno_data$ihc_gfap_ffpe), method = "spearman") cor.test(expression_data_adj["GFAP", ], as.numeric(aba_pheno_data$ihc_gfap_ffpe), method = "spearman") ``` # Help and other resources If you have any problems with or questions about using this package, please open an issue on Github or contact the package maintainer. **References**