vignettes/RAIDS.Rmd
RAIDS.Rmd
Package: RAIDS
Authors: Pascal Belleau [cre, aut] (https://orcid.org/0000-0002-0802-1071), Astrid Deschênes
[aut] (https://orcid.org/0000-0001-7846-6749), David A. Tuveson
[aut] (https://orcid.org/0000-0002-8017-2712), Alexander
Krasnitz [aut]
Version: 1.3.3
Compiled date: 2024-10-22
License: Apache License (>= 2)
The RAIDS
package and the underlying RAIDS code are
distributed under
the https://opensource.org/licenses/Apache-2.0 license. You
are free to use and redistribute this software.
If you use the RAIDS package for a publication, we would ask you to cite the following:
Pascal Belleau, Astrid Deschênes, Nyasha Chambwe, David A. Tuveson, Alexander Krasnitz; Genetic Ancestry Inference from Cancer-Derived Molecular Data across Genomic and Transcriptomic Platforms. Cancer Res 1 January 2023; 83 (1): 49–58. https://doi.org/10.1158/0008-5472.CAN-22-0682
Multiple methods have been implemented to infer ancestry from germline DNA sequence (Price et al. 2006; Pritchard, Stephens, and Donnelly 2000; Alexander, Novembre, and Lange 2009). However, genotyping of DNA from matched normal specimens is not part of standard clinical practice and is not performed routinely outside academic clinical centers. In sum, matched germline DNA sequence is often missing for cancer-derived molecular data. In such cases, having the possibility to infer ancestry from tumor-derived data would be beneficial.
The RAIDS package implements an inference procedure that has been specifically developed to accurately infer genetic ancestry from cancer-derived sequences. The current version can handle cancer-derived sequences of:
The RAIDS package implements a data synthesis method that, for any given cancer-derived sequence profile, enables on the one hand, profile-specific inference parameter optimization and on the other hand, a profile-specific inference accuracy estimate.
To install this package from Bioconductor, start R (version 4.3 or later) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RAIDS")
This is an overview of the genetic ancestry inference from cancer-derived molecular data:
The main steps are:
Step 1. Set-up and provide population reference files
Step 2 Sample the reference data for donor genotypes to be used for synthesis and optimize ancestry inference parameters
Step 3 Infer ancestry for the subjects of the external study
Step 4 Present and interpret the results
These steps are described in detail in the following.
First, a specific directory structure should be created. The structure must correspond to this:
#############################################################################
## Working directory structure
#############################################################################
workingDirectory/
data/
refGDS
profileGDS
This following code creates a temporary working directory structure where the example will be run.
#############################################################################
## Create a temporary working directory structure
#############################################################################
pathWorkingDirectory <- file.path(tempdir(), "workingDirectory")
pathWorkingDirectoryData <- file.path(pathWorkingDirectory, "data")
if (!dir.exists(pathWorkingDirectory)) {
dir.create(pathWorkingDirectory)
dir.create(pathWorkingDirectoryData)
dir.create(file.path(pathWorkingDirectoryData, "refGDS"))
dir.create(file.path(pathWorkingDirectoryData, "profileGDS"))
}
The population reference files should be downloaded in the data/refGDS sub-directory. This following code downloads the complete pre-processed files for 1000 Genomes (1KG), in hg38. The size of the 1KG GDS file is 15GB.
#############################################################################
## How to download the pre-processed files for 1000 Genomes (1KG) (15 GB)
#############################################################################
cd workingDirectory
cd data/refGDS
wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matGeno1000g.gds
wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matAnnot1000g.gds
cd -
For demonstration purpose, a small population reference GDS file (called ex1_good_small_1KG.gds) and a small population reference SNV Annotation GDS file (called ex1_good_small_1KG_Annot.gds) are included in this package. Beware that those two files should not be used to run a real ancestry inference. The results obtained with those files won’t be reliable.
In this running example, the demonstration files are copied in the required data/refGDS directory.
#############################################################################
## Load RAIDS package
#############################################################################
library(RAIDS)
#############################################################################
## The population reference GDS file and SNV Annotation GDS file
## need to be located in the same sub-directory.
## Note that the population reference GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#############################################################################
## Path to the demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
pathReference <- file.path(dataDir, "tests")
fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
file.copy(fileGDS, file.path(pathWorkingDirectoryData, "refGDS"))
## [1] TRUE
file.copy(fileAnnotGDS, file.path(pathWorkingDirectoryData, "refGDS"))
## [1] TRUE
All required directories need to be created. In addition, the path to the reference files are kept in variables that will be used later.
#############################################################################
## The file path to the population reference GDS file
## is required (refGenotype will be used as input later)
## The file path to the population reference SNV Annotation GDS file
## is also required (refAnnotation will be used as input later)
#############################################################################
pathReference <- file.path(pathWorkingDirectoryData, "refGDS")
refGenotype <- file.path(pathReference, "ex1_good_small_1KG.gds")
refAnnotation <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
#############################################################################
## The output profileGDS directory, inside workingDirectory/data, must be
## created (pathProfileGDS will be used as input later)
#############################################################################
pathProfileGDS <- file.path(pathWorkingDirectoryData, "profileGDS")
if (!dir.exists(pathProfileGDS)) {
dir.create(pathProfileGDS)
}
With the 1KG reference, we recommend sampling 30 donor profiles per population. For reproducibility, be sure to use the same random-number generator seed.
In the following code, only 2 profiles per population are sampled from the demo population GDS file:
#############################################################################
## Fix seed to ensure reproducible results
#############################################################################
set.seed(3043)
#############################################################################
## Select the profiles from the population reference GDS file for
## the synthetic data.
## Here we select 2 profiles from the simplified 1KG GDS for each
## subcontinental-level.
## Normally, we would use 30 profiles for each subcontinental-level.
## The 1KG files in this example only have 6 profiles for each
## subcontinental-level (for demo purpose only).
#############################################################################
dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype,
nbProfiles=2L)
The output object is going to be used later.
Ancestry inference can be done in one function call. Within a single function call, data synthesis is performed, the synthetic data are used to optimize the inference parameters and, with these, the ancestry of the input profile donor is inferred.
According to the type of input data (RNA or DNA), a specific function should be called. The inferAncestry() function is used for DNA profiles while the inferAncestryGeneAware() function is RNA specific.
The inferAncestry() function requires a specific profile input format. The format is set by the genoSource parameter.
One of those formats is in a VCF format (genoSource=c(“VCF”)). This format follows the VCF standard with at least those genotype fields: GT, AD and DP. The SNVs must be germline variants and should include the genotype of the wild-type homozygous at the selected positions in the reference. The VCF file must be gzipped.
A generic SNP file can replace the VCF file (genoSource=c(“generic”)). The format is coma separated and the mandatory columns are:
Beware that the starting position in the population reference GDS file is zero (like BED files). The generic SNP file should also start at position zero.
In this example, the profile is from DNA source and requires the use of the inferAncestry() function.
###########################################################################
## GenomeInfoDb and BSgenome are required libraries to run this example
###########################################################################
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
#######################################################################
## Chromosome length information is required
## chr23 is chrX, chr24 is chrY and chrM is 25
#######################################################################
genome <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens
chrInfo <- GenomeInfoDb::seqlengths(genome)[1:25]
#######################################################################
## The demo SNP VCF file of the DNA profile donor
#######################################################################
fileDonorVCF <- file.path(dataDir, "example", "snpPileup", "ex1.vcf.gz")
#######################################################################
## The ancestry inference call
#######################################################################
resOut <- inferAncestry(profileFile=fileDonorVCF,
pathProfileGDS=pathProfileGDS,
fileReferenceGDS=refGenotype,
fileReferenceAnnotGDS=refAnnotation,
chrInfo=chrInfo,
syntheticRefDF=dataRef,
genoSource=c("VCF"))
}
A profile GDS file is created in the pathProfileGDS directory while all the ancestry and optimal parameters information are integrated in the output object.
At last, all temporary files created in this example should be deleted.
#######################################################################
## Remove temporary files created for this demo
#######################################################################
unlink(pathWorkingDirectory, recursive=TRUE, force=TRUE)
The inferred ancestry and the optimal parameters are present in the list object generated by the inferAncestry() and inferAncestryGeneAware() functions.
###########################################################################
## The output is a list object with multiple entries
###########################################################################
class(resOut)
## [1] "list"
names(resOut)
## [1] "pcaSample" "paraSample" "KNNSample" "KNNSynthetic" "Ancestry"
For the global ancestry inference using PCA followed by nearest neighbor classification these parameters are D (the number of the top principal directions retained) and k (the number of nearest neighbors).
The information is stored in the Ancestry entry as a data.frame object. It is a contains those columns:
###########################################################################
## The ancestry information is stored in the 'Ancestry' entry
###########################################################################
print(resOut$Ancestry)
## sample.id D K SuperPop
## 173 ex1 14 6 AFR
The createAUROCGraph() function enable the visualization of RAIDS performance for the synthetic data, as a function of D and k.
###########################################################################
## Create a graph showing the perfomance for the synthetic data
## The output is a ggplot object
###########################################################################
createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1")
In this specific example, the performances are lower than expected with a real profile and a complete reference population file.
A population reference dataset with known ancestry is required to infer ancestry.
Three important reference files, containing formatted information about the reference dataset, are required:
The format of those files are described the Population reference dataset GDS files vignette.
The reference files associated to the Cancer Research associated paper are available. Note that these pre-processed files are for 1000 Genomes (1KG), in hg38. The files are available here:
https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper
The size of the 1KG GDS file is 15GB.
The 1KG GDS file is mapped on hg38 (Lowy-Gallego et al. 2019).
This section can be skipped if you choose to use the pre-processed files.
Here is the output of sessionInfo()
in the environment
in which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Matrix_1.7-0 RAIDS_1.3.3 GENESIS_2.34.0 SNPRelate_1.38.1
## [5] gdsfmt_1.40.2 knitr_1.48 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 shape_1.4.6.1
## [3] magrittr_2.0.3 jomo_2.7-6
## [5] GenomicFeatures_1.56.0 farver_2.1.2
## [7] logistf_1.26.0 nloptr_2.1.1
## [9] rmarkdown_2.28 fs_1.6.4
## [11] BiocIO_1.14.0 zlibbioc_1.50.0
## [13] ragg_1.3.2 vctrs_0.6.5
## [15] memoise_2.0.1 minqa_1.2.8
## [17] Rsamtools_2.20.0 RCurl_1.98-1.16
## [19] quantsmooth_1.70.0 htmltools_0.5.8.1
## [21] S4Arrays_1.4.1 curl_5.2.3
## [23] broom_1.0.7 pROC_1.18.5
## [25] SparseArray_1.4.8 mitml_0.4-5
## [27] sass_0.4.9 bslib_0.8.0
## [29] htmlwidgets_1.6.4 desc_1.4.3
## [31] plyr_1.8.9 sandwich_3.1-1
## [33] zoo_1.8-12 cachem_1.1.0
## [35] GenomicAlignments_1.40.0 lifecycle_1.0.4
## [37] iterators_1.0.14 pkgconfig_2.0.3
## [39] R6_2.5.1 fastmap_1.2.0
## [41] GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0
## [43] digest_0.6.37 colorspace_2.1-1
## [45] GWASTools_1.50.0 AnnotationDbi_1.66.0
## [47] S4Vectors_0.42.1 textshaping_0.4.0
## [49] GenomicRanges_1.56.2 RSQLite_2.3.7
## [51] labeling_0.4.3 fansi_1.0.6
## [53] httr_1.4.7 abind_1.4-8
## [55] mgcv_1.9-1 compiler_4.4.1
## [57] withr_3.0.1 bit64_4.5.2
## [59] backports_1.5.0 BiocParallel_1.38.0
## [61] DBI_1.2.3 highr_0.11
## [63] pan_1.9 MASS_7.3-61
## [65] quantreg_5.98 DelayedArray_0.30.1
## [67] rjson_0.2.23 DNAcopy_1.78.0
## [69] tools_4.4.1 lmtest_0.9-40
## [71] nnet_7.3-19 glue_1.8.0
## [73] restfulr_0.0.15 nlme_3.1-165
## [75] grid_4.4.1 generics_0.1.3
## [77] gtable_0.3.5 operator.tools_1.6.3
## [79] BSgenome_1.72.0 formula.tools_1.7.1
## [81] class_7.3-22 tidyr_1.3.1
## [83] ensembldb_2.28.1 data.table_1.16.2
## [85] utf8_1.2.4 XVector_0.44.0
## [87] BiocGenerics_0.50.0 GWASExactHW_1.2
## [89] stringr_1.5.1 foreach_1.5.2
## [91] pillar_1.9.0 splines_4.4.1
## [93] dplyr_1.1.4 lattice_0.22-6
## [95] survival_3.7-0 rtracklayer_1.64.0
## [97] bit_4.5.0 SparseM_1.84-2
## [99] BSgenome.Hsapiens.UCSC.hg38_1.4.5 tidyselect_1.2.1
## [101] SeqVarTools_1.42.0 Biostrings_2.72.1
## [103] bookdown_0.41 ProtGenerics_1.36.0
## [105] IRanges_2.38.1 SummarizedExperiment_1.34.0
## [107] stats4_4.4.1 xfun_0.48
## [109] Biobase_2.64.0 matrixStats_1.4.1
## [111] stringi_1.8.4 UCSC.utils_1.0.0
## [113] lazyeval_0.2.2 yaml_2.3.10
## [115] boot_1.3-30 evaluate_1.0.1
## [117] codetools_0.2-20 tibble_3.2.1
## [119] BiocManager_1.30.25 cli_3.6.3
## [121] rpart_4.1.23 systemfonts_1.1.0
## [123] munsell_0.5.1 jquerylib_0.1.4
## [125] Rcpp_1.0.13 GenomeInfoDb_1.40.1
## [127] png_0.1-8 XML_3.99-0.17
## [129] parallel_4.4.1 MatrixModels_0.5-3
## [131] ggplot2_3.5.1 pkgdown_2.1.1
## [133] blob_1.2.4 AnnotationFilter_1.28.0
## [135] bitops_1.0-9 lme4_1.1-35.5
## [137] glmnet_4.1-8 SeqArray_1.44.3
## [139] VariantAnnotation_1.50.0 scales_1.3.0
## [141] purrr_1.0.2 crayon_1.5.3
## [143] rlang_1.1.4 KEGGREST_1.44.1
## [145] mice_3.16.0