What is KeyGenes

KeyGenes is an algorithm to predict the identity and determines identity scores of queried samples (test set) based on a provided group of samples (training set). It uses transcriptional profiles of the test set and matches them to the transcriptional profiles of the training set. KeyGenes uses 10-fold cross validation on the basis of LASSO (Least Absolute Shrinkage and Selection Operator) regression available in the glmnet R-package (Friedman et al., 2010).

Included training data

Included in the KeyGenes package is a collection of RNA-seq count data usable for training. These data have been collected in a SummarizedExperiment object.

library(KeyGenes)
data("training.data")
training.data
## class: SummarizedExperiment 
## dim: 63670 176 
## metadata(0):
## assays(1): counts
## rownames(63670): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(176): hESC stem cells.DH3 hESC stem cells.DH6 ...
##   muscle_adult gonad_adult.10
## colData names(2): age organ

This dataset contains hESC samples (Forster et al., 2014; Roost et al., 2015), 1st and 2nd trimester fetal samples (Roost et al., 2015) and adult samples (Cnop et al., 2014; Fagerberg et al., 2014; Illumina Body Map 2.0). Each sample is annotated in the colData with their age and organ of origin.

table(training.data$age)
## 
##   1st trimester   2nd trimester           Adult hESC stem cells 
##              36              75              61               4
table(training.data$organ)
## 
##         adrenal          amnion           brain         chorion 
##               6               7               9               3 
##             eye           gonad           heart         heart A 
##               6              15               5               4 
##         heart V hESC stem cells       intestine          kidney 
##               5               4              18              11 
##           liver            lung          mother          muscle 
##              10              12               3               7 
##        pancreas        placenta            skin     spinal cord 
##               9               7               8               2 
##          spleen         stomach          tongue  umbilical cord 
##               8               9               6               2

One can use these annotations to select the samples approriate for their use case.

my.training.set <- training.data[, training.data$age == "2nd trimester"]
table(my.training.set$age)
## 
## 2nd trimester 
##            75

Counts may not be available for all genes for all samples. The complete.cases function can be used to remove genes with missing values.

library(SummarizedExperiment)
my.training.set <- my.training.set[complete.cases(assay(my.training.set)),]
my.training.set
## class: SummarizedExperiment 
## dim: 28978 75 
## metadata(0):
## assays(1): counts
## rownames(28978): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000267799 ENSG00000267801
## rowData names(0):
## colnames(75): pancreas_16-18 skin_16-18 ... lung_22.1 stomach_22.1
## colData names(2): age organ

Running KeyGenes

The keygenes.NGS function canbe used to run the KeyGenes algorithm. The input for this function is a matrix or SummarizedExperiment object for both the test and training data, as well as a (character) vector either giving the class (eg. organ) for each sample in the training data or the name of the column in the SummarizedExperiment’s colData containing these classes.

# Make a subset of samples for testing
set.seed(111)
test.samples <- sample(1:ncol(my.training.set), 4)

# Run KeyGenes
result <- keygenes.NGS(my.training.set[, test.samples],
                       my.training.set[, -test.samples],
                       "organ")

The results

The returned object contains a number of slots. The most important ones are the following: result, class.genes and prediction.matrix.

The result slot

The result slot contains the final classifications of the test samples:

result@result
##                truth predicted
## muscle_22         NA    muscle
## intestine_22.1    NA intestine
## stomach_16-18     NA   stomach
## pancreas_22       NA  pancreas

If we had provided the test.classes parameter to the keygenes.NGS function, the truth column would have contained these values and the accuracy slot would have contained the a percentage of correct predictions.

The class.genes slot

The class.genes slot contains the key genes in the classification. For each class, the genes which had an effect on the prediction (a coeficient larger than 0) is returned.

result@class.genes
## $adrenal
## [1] "ENSG00000105398" "ENSG00000112183" "ENSG00000147256"
## 
## $amnion
## [1] "ENSG00000094755" "ENSG00000113196" "ENSG00000128709" "ENSG00000167916"
## [5] "ENSG00000177494" "ENSG00000181143" "ENSG00000185668"
## 
## $brain
##  [1] "ENSG00000104435" "ENSG00000128709" "ENSG00000128918"
##  [4] "ENSG00000132130" "ENSG00000148798" "ENSG00000160808"
##  [7] "ENSG00000164434" "ENSG00000170091" "ENSG00000171885"
## [10] "ENSG00000196361" "ENSG00000233639"
## 
## $chorion
## [1] "ENSG00000002726" "ENSG00000069812" "ENSG00000079689" "ENSG00000105492"
## [5] "ENSG00000106366" "ENSG00000143867" "ENSG00000166183" "ENSG00000186652"
## [9] "ENSG00000205221"
## 
## $eye
## [1] "ENSG00000080166" "ENSG00000086548" "ENSG00000147655" "ENSG00000180660"
## [5] "ENSG00000184613" "ENSG00000185960"
## 
## $gonad
## [1] "ENSG00000101098" "ENSG00000115596" "ENSG00000136574" "ENSG00000152669"
## [5] "ENSG00000160886" "ENSG00000166917"
## 
## $`heart A`
## [1] "ENSG00000089225" "ENSG00000148942" "ENSG00000242349"
## 
## $`heart V`
## [1] "ENSG00000111245" "ENSG00000135902" "ENSG00000160808" "ENSG00000197893"
## [5] "ENSG00000241135"
## 
## $intestine
## [1] "ENSG00000016490" "ENSG00000162365" "ENSG00000173702" "ENSG00000253293"
## 
## $kidney
## [1] "ENSG00000074803" "ENSG00000116218" "ENSG00000166589"
## 
## $liver
## [1] "ENSG00000104760" "ENSG00000115221" "ENSG00000143867" "ENSG00000172482"
## [5] "ENSG00000187957" "ENSG00000189058"
## 
## $lung
## [1] "ENSG00000164265" "ENSG00000168878"
## 
## $mother
## [1] "ENSG00000116183" "ENSG00000131203" "ENSG00000153292" "ENSG00000172179"
## [5] "ENSG00000185269" "ENSG00000198759"
## 
## $muscle
## [1] "ENSG00000088002" "ENSG00000143632" "ENSG00000163209" "ENSG00000169583"
## [5] "ENSG00000183631" "ENSG00000185960" "ENSG00000253293"
## 
## $pancreas
## [1] "ENSG00000114204" "ENSG00000142789" "ENSG00000143954" "ENSG00000219073"
## 
## $placenta
## [1] "ENSG00000073754" "ENSG00000120211" "ENSG00000164707" "ENSG00000170498"
## [5] "ENSG00000243130"
## 
## $skin
## [1] "ENSG00000107165" "ENSG00000139648" "ENSG00000167768" "ENSG00000188508"
## [5] "ENSG00000189001"
## 
## $spleen
## [1] "ENSG00000073754" "ENSG00000133135" "ENSG00000136931" "ENSG00000183072"
## [5] "ENSG00000204128" "ENSG00000211899"
## 
## $stomach
## [1] "ENSG00000134812" "ENSG00000160181" "ENSG00000160182" "ENSG00000215182"
## 
## $tongue
## [1] "ENSG00000133055" "ENSG00000163209" "ENSG00000169474"

The prediction.matrix slot

The prediction.matrix slot contains linear predictor values. The final predictions are chosen by picking the class with the highest value for each sample.

result@prediction.matrix
##              muscle_22 intestine_22.1 stomach_16-18  pancreas_22
## adrenal   0.0030482516   6.938455e-04  2.734436e-04 0.0001096321
## amnion    0.0004918695   6.056142e-05  1.267502e-04 0.0005240460
## brain     0.0008408268   7.948537e-04  2.199399e-03 0.0046494785
## chorion   0.0001980717   3.001224e-05  8.444381e-05 0.0015856262
## eye       0.0016380161   3.283603e-05  9.814531e-04 0.0005749678
## gonad     0.0018096818   2.799284e-04  8.080725e-04 0.0007133202
## heart A   0.0004796314   2.166242e-04  1.171348e-03 0.0001408455
## heart V   0.0046946609   6.698917e-05  4.377744e-05 0.0001476011
## intestine 0.0003977874   9.955755e-01  2.078633e-02 0.0003250604
## kidney    0.0003899224   3.402144e-05  3.241324e-04 0.0002840320
## liver     0.0001604832   3.521787e-04  2.782975e-04 0.0009157060
## lung      0.0011152981   3.899581e-05  6.570957e-03 0.0004231700
## mother    0.0081841593   3.929244e-04  5.845890e-04 0.0016040991
## muscle    0.9597838614   1.481000e-04  2.318691e-04 0.0006464012
## pancreas  0.0019271654   3.287257e-04  1.786403e-04 0.9729857642
## placenta  0.0004823425   3.020367e-05  7.321994e-05 0.0002511530
## skin      0.0002708697   9.984445e-05  1.016793e-04 0.0007573200
## spleen    0.0068305872   1.013678e-04  2.313699e-04 0.0126593590
## stomach   0.0005990487   5.441142e-04  9.644811e-01 0.0003159813
## tongue    0.0066574647   1.783637e-04  4.691018e-04 0.0003864363

Plotting the predictions

The predictions can be visualized using the keygenes.heatmap function. This will draw a heatmap based on the prediction.matrix.

keygenes.heatmap(result)

A more complex example

In many cases it will be hard to tell exactly what data from the training set is most approriate for our test data. For example, when differentiating stem cell it is unknow whether the resulting cells are more akin to fetal or adult cells and this might differ between samples, depending on the timepoints which are measured.

Predicting age

In order to select the appropriate training data for each sample, the age of each sample is predicted first.

# Create a test and training set
set.seed(111)
training.data <- training.data[complete.cases(assay(training.data)),]
test.samples <- sample(1:ncol(training.data), 4)
test.set <- training.data[,test.samples]
training.set <- training.data[,-test.samples]

# Run KeyGenes
age.result <- keygenes.NGS(test.set, training.set, "age")
age.result@result
##                   truth     predicted
## amnion_22.1          NA 2nd trimester
## intestine_adult.6    NA         Adult
## skin_22              NA 2nd trimester
## eye_16-18.1          NA 2nd trimester

Predicting organ

Now that the ages of the samples are known, an appropriate set of test data can be selected for predicting the organs.

# Get the possible ages
ages <- as.character(unique(age.result@result$predicted)) 

# Run KeyGenes for each age category
organ.results <- sapply(ages, function(age) {
    # Further predictions for hESC stem cells are not possible!
    if (age != "hESC stem cells") {
        # Select the samples for each age.
        train.subset <- training.set[, training.set$age == age]
        test.subset <- test.set[, age.result@result$predicted == age]
        # Test sets need at least two samples
        if (ncol(test.subset) < 2) {
            test.subset <- cbind(test.subset, test.subset)
            colnames(test.subset) <- c(colnames(test.subset)[1], "duplicate")
        }
        # Run KeyGenes for each subset.
        #return(keygenes.NGS(test.subset, train.subset, "organ"))
        r <- keygenes.NGS(test.subset, train.subset, "organ")
    }
})
names(organ.results) <- ages
organ.results[sapply(organ.results, is.null)] <- NULL

# Merge the predictions
all.organs <- organ.results[[1]]@result
if (length(organ.results) > 1) {
    for (x in 2:length(organ.results)) {
        all.organs <- rbind(all.organs, organ.results[[x]]@result)
    }
}
all.organs <- all.organs[rownames(all.organs) != c("duplicate"),]
all.organs
##                   truth predicted
## amnion_22.1          NA    amnion
## skin_22              NA      skin
## eye_16-18.1          NA       eye
## intestine_adult.6    NA intestine

Making a heatmap

The following code can be used to generate a heatmap for these results. This heatmap will show both the age prediction values and the organ prediction values.

library(ggplot2)

# 
melted.prediction.matrix.list <- lapply(names(organ.results), function(age) {
    # Melt the prediction matrix
    melted <- melt(organ.results[[age]]@prediction.matrix)
    # Annotate each value with the age
    melted["age"] <- age
    # Label it as a tissue prediction value
    melted["type"] <- "tissue"
    melted <- melted[melted$Var2 != "duplicate",]
})

# Merge all the ages into one table
melted.prediction.matrix <- melted.prediction.matrix.list[[1]]
if (length(organ.results) > 1) {
    for (x in 2:length(organ.results)){
        melted.prediction.matrix <- rbind(melted.prediction.matrix, 
                                          melted.prediction.matrix.list[[x]])
    }
}

# Melt the age prediction matrix and add it as well.
melted.ages <- melt(age.result@prediction.matrix)
melted.ages["age"] <- age.result@result[melted.ages$Var2, "predicted"]
melted.ages["type"] <- "age"
melted.prediction.matrix <- rbind(melted.prediction.matrix, melted.ages)

# Plot it
g <- ggplot(data=melted.prediction.matrix)
g <- g + theme_minimal()
g <- g + geom_tile(aes(fill=value, x=Var2, y=Var1), color="gray")
g <- g + scale_fill_gradient2(low="black", mid="white", high = "green2", 
                              midpoint = 0.5)
g <- g + labs(x="", y="")
g <- g + facet_grid(type~age, space = "free", scales = "free")
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1), 
               axis.ticks = element_blank())
g

SessionInfo

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.2             ggplot2_2.2.1             
##  [3] SummarizedExperiment_1.6.5 DelayedArray_0.2.7        
##  [5] matrixStats_0.52.2         Biobase_2.36.2            
##  [7] GenomicRanges_1.28.6       GenomeInfoDb_1.12.3       
##  [9] IRanges_2.10.5             S4Vectors_0.14.7          
## [11] BiocGenerics_0.22.1        KeyGenes_0.1              
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16            pillar_1.2.1           
##  [3] plyr_1.8.4              compiler_3.4.4         
##  [5] XVector_0.16.0          bitops_1.0-6           
##  [7] iterators_1.0.10        tools_3.4.4            
##  [9] zlibbioc_1.22.0         digest_0.6.12          
## [11] tibble_1.4.2            evaluate_0.10.1        
## [13] gtable_0.2.0            lattice_0.20-38        
## [15] rlang_0.2.0             Matrix_1.2-14          
## [17] foreach_1.4.4           GenomeInfoDbData_0.99.0
## [19] stringr_1.2.0           knitr_1.17             
## [21] rprojroot_1.2           glmnet_2.0-16          
## [23] grid_3.4.4              rmarkdown_1.8          
## [25] limma_3.32.10           magrittr_1.5           
## [27] backports_1.1.1         scales_0.5.0           
## [29] codetools_0.2-15        htmltools_0.3.6        
## [31] colorspace_1.3-2        labeling_0.3           
## [33] stringi_1.1.6           lazyeval_0.2.1         
## [35] RCurl_1.95-4.11         munsell_0.4.3

References

Cnop, M., Abdulkarim, B., Bottu, G., Cunha, D.A., Igoillo-Esteve, M., Masini, M., Turatsinze, J.V., Griebel, T., Villate, O., Santin, I., et al. (2014). RNA sequencing identifies dysregulation of the human pancreatic islet transcriptome by the saturated fatty acid palmitate. Diabetes. 63(6): 1978-1993.

Fagerberg, L., Hallstrom, B.M., Oksvold, P., Kampf, C., Djureinovic, D., Odeberg, J., Habuka, M., Tahmasebpoor, S., Danielsson, A., Edlund, K., et al. (2014). Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & cellular proteomics: MCP. 13(2): 397-406.

Forster R., Chiba K., Shaeffer L., Regalado S.G., Lai C.S., Gao Q., Kiani S., Farin H.F., Clevers H., Cost G.J., et al. (2014) Human intestinal tissue with adult stem cell properties derived from pluripotent stem cells. Stem cell reports. 2(6): 838-852.

Friedman J., Hastie T., and Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 33(1): 1-22.

Roost M.S., van Iperen L., Ariyurek Y., Buermans H.P., Arindrarto W., Devalla H.D., Passier R., Mummery C.L., Carlotti F., de Koning E.J.P., van Zwet E.W., Goeman J.J., and Chuva de Sousa Lopes S.M. (2015). KeyGenes, a tool to probe tissue differentiation using a human fetal transcriptional atlas. Stem Cell Reports. 4(6):1112-24.