KeyGenes is an algorithm to predict the identity and determines identity scores of queried samples (test set) based on a provided group of samples (training set). It uses transcriptional profiles of the test set and matches them to the transcriptional profiles of the training set. KeyGenes uses 10-fold cross validation on the basis of LASSO (Least Absolute Shrinkage and Selection Operator) regression available in the glmnet
R-package (Friedman et al., 2010).
Included in the KeyGenes package is a collection of RNA-seq count data usable for training. These data have been collected in a SummarizedExperiment object.
library(KeyGenes)
data("training.data")
training.data
## class: SummarizedExperiment
## dim: 63670 176
## metadata(0):
## assays(1): counts
## rownames(63670): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(176): hESC stem cells.DH3 hESC stem cells.DH6 ...
## muscle_adult gonad_adult.10
## colData names(2): age organ
This dataset contains hESC samples (Forster et al., 2014; Roost et al., 2015), 1st and 2nd trimester fetal samples (Roost et al., 2015) and adult samples (Cnop et al., 2014; Fagerberg et al., 2014; Illumina Body Map 2.0). Each sample is annotated in the colData
with their age
and organ
of origin.
table(training.data$age)
##
## 1st trimester 2nd trimester Adult hESC stem cells
## 36 75 61 4
table(training.data$organ)
##
## adrenal amnion brain chorion
## 6 7 9 3
## eye gonad heart heart A
## 6 15 5 4
## heart V hESC stem cells intestine kidney
## 5 4 18 11
## liver lung mother muscle
## 10 12 3 7
## pancreas placenta skin spinal cord
## 9 7 8 2
## spleen stomach tongue umbilical cord
## 8 9 6 2
One can use these annotations to select the samples approriate for their use case.
my.training.set <- training.data[, training.data$age == "2nd trimester"]
table(my.training.set$age)
##
## 2nd trimester
## 75
Counts may not be available for all genes for all samples. The complete.cases
function can be used to remove genes with missing values.
library(SummarizedExperiment)
my.training.set <- my.training.set[complete.cases(assay(my.training.set)),]
my.training.set
## class: SummarizedExperiment
## dim: 28978 75
## metadata(0):
## assays(1): counts
## rownames(28978): ENSG00000000003 ENSG00000000005 ...
## ENSG00000267799 ENSG00000267801
## rowData names(0):
## colnames(75): pancreas_16-18 skin_16-18 ... lung_22.1 stomach_22.1
## colData names(2): age organ
The keygenes.NGS
function canbe used to run the KeyGenes algorithm. The input for this function is a matrix
or SummarizedExperiment
object for both the test and training data, as well as a (character) vector either giving the class (eg. organ) for each sample in the training data or the name of the column in the SummarizedExperiment
’s colData
containing these classes.
# Make a subset of samples for testing
set.seed(111)
test.samples <- sample(1:ncol(my.training.set), 4)
# Run KeyGenes
result <- keygenes.NGS(my.training.set[, test.samples],
my.training.set[, -test.samples],
"organ")
The returned object contains a number of slots. The most important ones are the following: result
, class.genes
and prediction.matrix
.
The result
slot contains the final classifications of the test samples:
result@result
## truth predicted
## muscle_22 NA muscle
## intestine_22.1 NA intestine
## stomach_16-18 NA stomach
## pancreas_22 NA pancreas
If we had provided the test.classes
parameter to the keygenes.NGS
function, the truth
column would have contained these values and the accuracy
slot would have contained the a percentage of correct predictions.
The class.genes
slot contains the key genes in the classification. For each class, the genes which had an effect on the prediction (a coeficient larger than 0) is returned.
result@class.genes
## $adrenal
## [1] "ENSG00000105398" "ENSG00000112183" "ENSG00000147256"
##
## $amnion
## [1] "ENSG00000094755" "ENSG00000113196" "ENSG00000128709" "ENSG00000167916"
## [5] "ENSG00000177494" "ENSG00000181143" "ENSG00000185668"
##
## $brain
## [1] "ENSG00000104435" "ENSG00000128709" "ENSG00000128918"
## [4] "ENSG00000132130" "ENSG00000148798" "ENSG00000160808"
## [7] "ENSG00000164434" "ENSG00000170091" "ENSG00000171885"
## [10] "ENSG00000196361" "ENSG00000233639"
##
## $chorion
## [1] "ENSG00000002726" "ENSG00000069812" "ENSG00000079689" "ENSG00000105492"
## [5] "ENSG00000106366" "ENSG00000143867" "ENSG00000166183" "ENSG00000186652"
## [9] "ENSG00000205221"
##
## $eye
## [1] "ENSG00000080166" "ENSG00000086548" "ENSG00000147655" "ENSG00000180660"
## [5] "ENSG00000184613" "ENSG00000185960"
##
## $gonad
## [1] "ENSG00000101098" "ENSG00000115596" "ENSG00000136574" "ENSG00000152669"
## [5] "ENSG00000160886" "ENSG00000166917"
##
## $`heart A`
## [1] "ENSG00000089225" "ENSG00000148942" "ENSG00000242349"
##
## $`heart V`
## [1] "ENSG00000111245" "ENSG00000135902" "ENSG00000160808" "ENSG00000197893"
## [5] "ENSG00000241135"
##
## $intestine
## [1] "ENSG00000016490" "ENSG00000162365" "ENSG00000173702" "ENSG00000253293"
##
## $kidney
## [1] "ENSG00000074803" "ENSG00000116218" "ENSG00000166589"
##
## $liver
## [1] "ENSG00000104760" "ENSG00000115221" "ENSG00000143867" "ENSG00000172482"
## [5] "ENSG00000187957" "ENSG00000189058"
##
## $lung
## [1] "ENSG00000164265" "ENSG00000168878"
##
## $mother
## [1] "ENSG00000116183" "ENSG00000131203" "ENSG00000153292" "ENSG00000172179"
## [5] "ENSG00000185269" "ENSG00000198759"
##
## $muscle
## [1] "ENSG00000088002" "ENSG00000143632" "ENSG00000163209" "ENSG00000169583"
## [5] "ENSG00000183631" "ENSG00000185960" "ENSG00000253293"
##
## $pancreas
## [1] "ENSG00000114204" "ENSG00000142789" "ENSG00000143954" "ENSG00000219073"
##
## $placenta
## [1] "ENSG00000073754" "ENSG00000120211" "ENSG00000164707" "ENSG00000170498"
## [5] "ENSG00000243130"
##
## $skin
## [1] "ENSG00000107165" "ENSG00000139648" "ENSG00000167768" "ENSG00000188508"
## [5] "ENSG00000189001"
##
## $spleen
## [1] "ENSG00000073754" "ENSG00000133135" "ENSG00000136931" "ENSG00000183072"
## [5] "ENSG00000204128" "ENSG00000211899"
##
## $stomach
## [1] "ENSG00000134812" "ENSG00000160181" "ENSG00000160182" "ENSG00000215182"
##
## $tongue
## [1] "ENSG00000133055" "ENSG00000163209" "ENSG00000169474"
The prediction.matrix
slot contains linear predictor values. The final predictions are chosen by picking the class with the highest value for each sample.
result@prediction.matrix
## muscle_22 intestine_22.1 stomach_16-18 pancreas_22
## adrenal 0.0030482516 6.938455e-04 2.734436e-04 0.0001096321
## amnion 0.0004918695 6.056142e-05 1.267502e-04 0.0005240460
## brain 0.0008408268 7.948537e-04 2.199399e-03 0.0046494785
## chorion 0.0001980717 3.001224e-05 8.444381e-05 0.0015856262
## eye 0.0016380161 3.283603e-05 9.814531e-04 0.0005749678
## gonad 0.0018096818 2.799284e-04 8.080725e-04 0.0007133202
## heart A 0.0004796314 2.166242e-04 1.171348e-03 0.0001408455
## heart V 0.0046946609 6.698917e-05 4.377744e-05 0.0001476011
## intestine 0.0003977874 9.955755e-01 2.078633e-02 0.0003250604
## kidney 0.0003899224 3.402144e-05 3.241324e-04 0.0002840320
## liver 0.0001604832 3.521787e-04 2.782975e-04 0.0009157060
## lung 0.0011152981 3.899581e-05 6.570957e-03 0.0004231700
## mother 0.0081841593 3.929244e-04 5.845890e-04 0.0016040991
## muscle 0.9597838614 1.481000e-04 2.318691e-04 0.0006464012
## pancreas 0.0019271654 3.287257e-04 1.786403e-04 0.9729857642
## placenta 0.0004823425 3.020367e-05 7.321994e-05 0.0002511530
## skin 0.0002708697 9.984445e-05 1.016793e-04 0.0007573200
## spleen 0.0068305872 1.013678e-04 2.313699e-04 0.0126593590
## stomach 0.0005990487 5.441142e-04 9.644811e-01 0.0003159813
## tongue 0.0066574647 1.783637e-04 4.691018e-04 0.0003864363
The predictions can be visualized using the keygenes.heatmap
function. This will draw a heatmap based on the prediction.matrix
.
keygenes.heatmap(result)
In many cases it will be hard to tell exactly what data from the training set is most approriate for our test data. For example, when differentiating stem cell it is unknow whether the resulting cells are more akin to fetal or adult cells and this might differ between samples, depending on the timepoints which are measured.
In order to select the appropriate training data for each sample, the age of each sample is predicted first.
# Create a test and training set
set.seed(111)
training.data <- training.data[complete.cases(assay(training.data)),]
test.samples <- sample(1:ncol(training.data), 4)
test.set <- training.data[,test.samples]
training.set <- training.data[,-test.samples]
# Run KeyGenes
age.result <- keygenes.NGS(test.set, training.set, "age")
age.result@result
## truth predicted
## amnion_22.1 NA 2nd trimester
## intestine_adult.6 NA Adult
## skin_22 NA 2nd trimester
## eye_16-18.1 NA 2nd trimester
Now that the ages of the samples are known, an appropriate set of test data can be selected for predicting the organs.
# Get the possible ages
ages <- as.character(unique(age.result@result$predicted))
# Run KeyGenes for each age category
organ.results <- sapply(ages, function(age) {
# Further predictions for hESC stem cells are not possible!
if (age != "hESC stem cells") {
# Select the samples for each age.
train.subset <- training.set[, training.set$age == age]
test.subset <- test.set[, age.result@result$predicted == age]
# Test sets need at least two samples
if (ncol(test.subset) < 2) {
test.subset <- cbind(test.subset, test.subset)
colnames(test.subset) <- c(colnames(test.subset)[1], "duplicate")
}
# Run KeyGenes for each subset.
#return(keygenes.NGS(test.subset, train.subset, "organ"))
r <- keygenes.NGS(test.subset, train.subset, "organ")
}
})
names(organ.results) <- ages
organ.results[sapply(organ.results, is.null)] <- NULL
# Merge the predictions
all.organs <- organ.results[[1]]@result
if (length(organ.results) > 1) {
for (x in 2:length(organ.results)) {
all.organs <- rbind(all.organs, organ.results[[x]]@result)
}
}
all.organs <- all.organs[rownames(all.organs) != c("duplicate"),]
all.organs
## truth predicted
## amnion_22.1 NA amnion
## skin_22 NA skin
## eye_16-18.1 NA eye
## intestine_adult.6 NA intestine
The following code can be used to generate a heatmap for these results. This heatmap will show both the age prediction values and the organ prediction values.
library(ggplot2)
#
melted.prediction.matrix.list <- lapply(names(organ.results), function(age) {
# Melt the prediction matrix
melted <- melt(organ.results[[age]]@prediction.matrix)
# Annotate each value with the age
melted["age"] <- age
# Label it as a tissue prediction value
melted["type"] <- "tissue"
melted <- melted[melted$Var2 != "duplicate",]
})
# Merge all the ages into one table
melted.prediction.matrix <- melted.prediction.matrix.list[[1]]
if (length(organ.results) > 1) {
for (x in 2:length(organ.results)){
melted.prediction.matrix <- rbind(melted.prediction.matrix,
melted.prediction.matrix.list[[x]])
}
}
# Melt the age prediction matrix and add it as well.
melted.ages <- melt(age.result@prediction.matrix)
melted.ages["age"] <- age.result@result[melted.ages$Var2, "predicted"]
melted.ages["type"] <- "age"
melted.prediction.matrix <- rbind(melted.prediction.matrix, melted.ages)
# Plot it
g <- ggplot(data=melted.prediction.matrix)
g <- g + theme_minimal()
g <- g + geom_tile(aes(fill=value, x=Var2, y=Var1), color="gray")
g <- g + scale_fill_gradient2(low="black", mid="white", high = "green2",
midpoint = 0.5)
g <- g + labs(x="", y="")
g <- g + facet_grid(type~age, space = "free", scales = "free")
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.ticks = element_blank())
g
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] reshape2_1.4.2 ggplot2_2.2.1
## [3] SummarizedExperiment_1.6.5 DelayedArray_0.2.7
## [5] matrixStats_0.52.2 Biobase_2.36.2
## [7] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3
## [9] IRanges_2.10.5 S4Vectors_0.14.7
## [11] BiocGenerics_0.22.1 KeyGenes_0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 pillar_1.2.1
## [3] plyr_1.8.4 compiler_3.4.4
## [5] XVector_0.16.0 bitops_1.0-6
## [7] iterators_1.0.10 tools_3.4.4
## [9] zlibbioc_1.22.0 digest_0.6.12
## [11] tibble_1.4.2 evaluate_0.10.1
## [13] gtable_0.2.0 lattice_0.20-38
## [15] rlang_0.2.0 Matrix_1.2-14
## [17] foreach_1.4.4 GenomeInfoDbData_0.99.0
## [19] stringr_1.2.0 knitr_1.17
## [21] rprojroot_1.2 glmnet_2.0-16
## [23] grid_3.4.4 rmarkdown_1.8
## [25] limma_3.32.10 magrittr_1.5
## [27] backports_1.1.1 scales_0.5.0
## [29] codetools_0.2-15 htmltools_0.3.6
## [31] colorspace_1.3-2 labeling_0.3
## [33] stringi_1.1.6 lazyeval_0.2.1
## [35] RCurl_1.95-4.11 munsell_0.4.3
Cnop, M., Abdulkarim, B., Bottu, G., Cunha, D.A., Igoillo-Esteve, M., Masini, M., Turatsinze, J.V., Griebel, T., Villate, O., Santin, I., et al. (2014). RNA sequencing identifies dysregulation of the human pancreatic islet transcriptome by the saturated fatty acid palmitate. Diabetes. 63(6): 1978-1993.
Fagerberg, L., Hallstrom, B.M., Oksvold, P., Kampf, C., Djureinovic, D., Odeberg, J., Habuka, M., Tahmasebpoor, S., Danielsson, A., Edlund, K., et al. (2014). Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & cellular proteomics: MCP. 13(2): 397-406.
Forster R., Chiba K., Shaeffer L., Regalado S.G., Lai C.S., Gao Q., Kiani S., Farin H.F., Clevers H., Cost G.J., et al. (2014) Human intestinal tissue with adult stem cell properties derived from pluripotent stem cells. Stem cell reports. 2(6): 838-852.
Friedman J., Hastie T., and Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 33(1): 1-22.
Roost M.S., van Iperen L., Ariyurek Y., Buermans H.P., Arindrarto W., Devalla H.D., Passier R., Mummery C.L., Carlotti F., de Koning E.J.P., van Zwet E.W., Goeman J.J., and Chuva de Sousa Lopes S.M. (2015). KeyGenes, a tool to probe tissue differentiation using a human fetal transcriptional atlas. Stem Cell Reports. 4(6):1112-24.