vignettes/Create_Reference_GDS_File.Rmd
Create_Reference_GDS_File.Rmd
Package: RAIDS
Authors: Pascal Belleau [cre, aut] (https://orcid.org/0000-0002-0802-1071), Astrid Deschênes
[aut] (https://orcid.org/0000-0001-7846-6749), David A. Tuveson
[aut] (https://orcid.org/0000-0002-8017-2712), Alexander
Krasnitz [aut]
Version: 1.3.3
Compiled date: 2024-10-22
License: Apache License (>= 2)
This vignette explains, in further details, the format of the population reference files that are required to run the ancestry inference tool.
Two different files are generated from a reference dataset:
The Population Reference GDS file should contain the genome-wide SNV information related to the population data set with known genetic ancestry. This reference data set will be used to generate the simulated samples. It is also used to generate the PCA on which the samples of interest are going to be projected.
The Population Reference GDS file is a GDS object of class SNPGDSFileClass from SNPRelate package (Zheng et al. 2012).
Beware that related profiles should be flagged in the Population Reference GDS file files.
#############################################################################
## Load required packages
#############################################################################
library(RAIDS)
library(SNPRelate)
pathRef <- system.file("extdata/", package="RAIDS")
fileReferenceGDS <- file.path(pathRef, "PopulationReferenceDemo.gds")
gdsRef <- snpgdsOpen(fileReferenceGDS)
## Show the file format
print(gdsRef)
## File: /__w/_temp/Library/RAIDS/extdata/PopulationReferenceDemo.gds (3.2K)
## + [ ]
## |--+ sample.id { Str8 10, 80B }
## |--+ sample.annot [ data.frame ] *
## | |--+ sex { Str8 10, 20B }
## | |--+ pop.group { Str8 10, 40B }
## | |--+ superPop { Str8 10, 40B }
## | \--+ batch { Float64 10, 80B }
## |--+ snp.id { Str8 7, 21B }
## |--+ snp.chromosome { UInt16 7, 14B }
## |--+ snp.position { Int32 7, 28B }
## |--+ snp.allele { Str8 7, 28B }
## |--+ snp.AF { PackedReal24 7, 21B }
## |--+ snp.EAS_AF { PackedReal24 7, 21B }
## |--+ snp.EUR_AF { PackedReal24 7, 21B }
## |--+ snp.AFR_AF { PackedReal24 7, 21B }
## |--+ snp.AMR_AF { PackedReal24 7, 21B }
## |--+ snp.SAS_AF { PackedReal24 7, 21B }
## |--+ genotype { Bit2 7x10, 18B }
## \--+ sample.ref { Bit1 10, 2B }
closefn.gds(gdsRef)
This output lists all variables stored in the Population Reference GDS file. At the first level, it stores variables sample.id, snp.id, etc. The additional information displayed in the braces indicate the data type, size, compressed or not with compression ratio.
The mandatory fields are:
This following example shows how to create a Population GDS Reference file. This example is for demonstration purpose only and use hard coded values. A working Population GDS Reference file would have to contain multiple samples from each continental population and would also have to contain the SNVs from the entire genome.
To generate a real Population GDS Reference file, the pipeline to process the information would depend of the selected source. If the source files are in VCF format, you can use Bioconductor VariationAnnotation package to extract the genotypic information (beware it may use a lot of memory). Often, you will need to parse metadata files to get information such as the sex and population of the profiles. In addition, the Bioconductor GENESIS package can be used to compute kinship coefficients to identify the unrelated profiles.
#############################################################################
## Load required packages
#############################################################################
library(RAIDS)
library(SNPRelate)
library(gdsfmt)
## Create a temporary GDS Reference file in the temporary directory
fileNewReferenceGDS <- file.path(tempdir(), "reference_DEMO.gds")
gdsRefNew <- createfn.gds(fileNewReferenceGDS)
## The entry 'sample.id' contain the unique identifiers of 10 samples
## that constitute the reference dataset
sample.id <- c("HG00243", "HG00150", "HG00149", "HG00246", "HG00138",
"HG01334", "HG00268", "HG00275", "HG00290", "HG00364")
add.gdsn(node=gdsRefNew, name="sample.id", val=sample.id,
storage="string", check=TRUE)
## A data frame containing the information about the 10 samples
## (in the same order than in the 'sample.id') is created and added to
## the 'sample.annot' entry
## The data frame must contain those columns:
## 'sex': '1'=male, '2'=female
## 'pop.group': acronym for the population (ex: GBR, CDX, MSL, ASW, etc..)
## 'superPop': acronym for the super-population (ex: AFR, EUR, etc...)
## 'batch': number identifying the batch of provenance
sampleInformation <- data.frame(sex=c("1", "2", "1", "1", "1",
"1", "2", "2", "1", "2"), pop.group=c(rep("GBR", 6), rep("FIN", 4)),
superPop=c(rep("EUR", 10)), batch=rep(0, 10), stringsAsFactors=FALSE)
add.gdsn(node=gdsRefNew, name="sample.annot", val=sampleInformation,
check=TRUE)
## The identifier of each SNV is added in the 'snp.id' entry
snvID <- c("s29603", "s29605", "s29633", "s29634", "s29635", "s29637",
"s29638", "s29663", "s29664", "s29666", "s29667", "s29686",
"s29687", "s29711", "s29741", "s29742", "s29746", "s29750",
"s29751", "s29753")
add.gdsn(node=gdsRefNew, name="snp.id", val=snvID, check=TRUE)
## The chromosome of each SNV is added to the 'snp.chromosome' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvChrom <- c(rep(1, 20))
add.gdsn(node=gdsRefNew, name="snp.chromosome", val=snvChrom, storage="uint16",
check=TRUE)
## The position on the chromosome of each SNV is added to
## the 'snp.position' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvPos <- c(3467333, 3467428, 3469375, 3469387, 3469502, 3469527,
3469737, 3471497, 3471565, 3471618)
add.gdsn(node=gdsRefNew, name="snp.position", val=snvPos, storage="int32",
check=TRUE)
## The allele information of each SNV is added to the 'snp.allele' entry
## The order of the SNVs is the same than in the 'snp.allele' entry
snvAllele <- c("A/G", "C/G", "C/T", "C/T", "T/G", "C/T",
"G/A", "A/G", "G/A", "G/A")
add.gdsn(node=gdsRefNew, name="snp.allele", val=snvAllele, storage="string",
check=TRUE)
## The allele frequency in the general population (between 0 and 1) of each
## SNV is added to the 'snp.AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.86, 0.01, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.01)
add.gdsn(node=gdsRefNew, name="snp.AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The allele frequency in the East Asian population (between 0 and 1) of each
## SNV is added to the 'snp.EAS_AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.80, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01, 0.00, 0.02, 0.00)
add.gdsn(node=gdsRefNew, name="snp.EAS_AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The allele frequency in the European population (between 0 and 1) of each
## SNV is added to the 'snp.EUR_AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.91, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.03)
add.gdsn(node=gdsRefNew, name="snp.EUR_AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The allele frequency in the African population (between 0 and 1) of each
## SNV is added to the 'snp.AFR_AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.85, 0.04, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00)
add.gdsn(node=gdsRefNew, name="snp.AFR_AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The allele frequency in the American population (between 0 and 1) of each
## SNV is added to the 'snp.AMR_AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.83, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02)
add.gdsn(node=gdsRefNew, name="snp.AMR_AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The allele frequency in the South Asian population (between 0 and 1) of each
## SNV is added to the 'snp.SAS_AF' entry
## The order of the SNVs is the same than in the 'snp.id' entry
snvAF <- c(0.89, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.01, 0.00, 0.00)
add.gdsn(node=gdsRefNew, name="snp.SAS_AF", val=snvAF, storage="packedreal24",
check=TRUE)
## The genotype of each SNV for each sample is added to the 'genotype' entry
## The genotype correspond to the number of A alleles
## The rows represent the SNVs is the same order than in 'snp.id' entry
## The columns represent the samples is the same order than in 'sample.id' entry
genotypeInfo <- matrix(data=c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), nrow=10, byrow=TRUE)
add.gdsn(node=gdsRefNew, name="genotype", val=genotypeInfo,
storage="bit2", check=TRUE)
## The entry 'sample.ref' is filled with 1 indicating that all 10
## samples are retained to be used as reference
## The order of the samples is the same than in the 'sample.id' entry
add.gdsn(node=gdsRefNew, name="sample.ref", val=rep(1L, 10),
storage="bit1", check=TRUE)
## Show the file format
print(gdsRefNew)
## File: /tmp/Rtmptpa656/reference_DEMO.gds (1.6K)
## + [ ]
## |--+ sample.id { Str8 10, 80B }
## |--+ sample.annot [ data.frame ] *
## | |--+ sex { Str8 10, 20B }
## | |--+ pop.group { Str8 10, 40B }
## | |--+ superPop { Str8 10, 40B }
## | \--+ batch { Float64 10, 80B }
## |--+ snp.id { Str8 20, 140B }
## |--+ snp.chromosome { UInt16 20, 40B }
## |--+ snp.position { Int32 10, 40B }
## |--+ snp.allele { Str8 10, 40B }
## |--+ snp.AF { PackedReal24 10, 30B }
## |--+ snp.EAS_AF { PackedReal24 10, 30B }
## |--+ snp.EUR_AF { PackedReal24 10, 30B }
## |--+ snp.AFR_AF { PackedReal24 10, 30B }
## |--+ snp.AMR_AF { PackedReal24 10, 30B }
## |--+ snp.SAS_AF { PackedReal24 10, 30B }
## |--+ genotype { Bit2 10x10, 25B }
## \--+ sample.ref { Bit1 10, 2B }
closefn.gds(gdsRefNew)
unlink(fileNewReferenceGDS, force=TRUE)
The Population Reference Annotation GDS file contains phase
information and block group information for all the SNVs present in
Population Reference GDS file.
If the source files are in VCF format, you can use Bioconductor VariationAnnotation
package to extract the phase information (beware it may use a lot of
memory). A block can be a linkage disequelibrium block relative to a
population or a gene. A bioconductor package like GENESIS
can be used to get the block information.
#############################################################################
## Load required packages
#############################################################################
library(RAIDS)
library(SNPRelate)
pathReference <- system.file("extdata/tests", package="RAIDS")
fileReferenceAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG.gds")
gdsRefAnnot <- openfn.gds(fileReferenceAnnotGDS)
## Show the file format
print(gdsRefAnnot)
## File: /__w/_temp/Library/RAIDS/extdata/tests/ex1_good_small_1KG.gds (830.0K)
## + [ ] *
## |--+ sample.id { Str8 156, 1.2K }
## |--+ sample.annot [ data.frame ] *
## | |--+ sex { Str8 156, 312B }
## | |--+ pop.group { Str8 156, 624B }
## | |--+ superPop { Str8 156, 624B }
## | \--+ batch { Float64 156, 1.2K }
## |--+ snp.id { Str8 11000, 103.5K }
## |--+ snp.chromosome { UInt16 11000, 21.5K }
## |--+ snp.position { Int32 11000, 43.0K }
## |--+ snp.allele { Str8 11000, 43.0K }
## |--+ snp.AF { PackedReal24 11000, 32.2K }
## |--+ snp.EAS_AF { PackedReal24 11000, 32.2K }
## |--+ snp.EUR_AF { PackedReal24 11000, 32.2K }
## |--+ snp.AFR_AF { PackedReal24 11000, 32.2K }
## |--+ snp.AMR_AF { PackedReal24 11000, 32.2K }
## |--+ snp.SAS_AF { PackedReal24 11000, 32.2K }
## |--+ genotype { Bit2 11000x156, 418.9K }
## \--+ sample.ref { Bit1 156, 20B }
closefn.gds(gdsRefAnnot)
This output lists all variables stored in the Population Reference Annotation GDS file. At the first level, it stores variables phase, block.annot, etc. The additional information displayed in the braces indicate the data type, size, compressed or not + compression ratio.
The mandatory fields are:
This following example shows how to create a Population Reference Annotation GDS file. This example is for demonstration purpose only. A working Population Reference Annotation GDS file would have to contain multiple samples from each continental population and would also have to contain the SNVs from the entire genome.
#############################################################################
## Load required packages
#############################################################################
library(RAIDS)
library(gdsfmt)
## Create a temporary GDS Reference file in the temporary directory
fileNewReferenceAnnotGDS <-
file.path(tempdir(), "reference_SNV_Annotation_DEMO.gds")
gdsRefAnnotNew <- createfn.gds(fileNewReferenceAnnotGDS)
## The entry 'phase' contain the phase of the SNVs in the
## Population Annotation GDS file
## 0 means the first allele is a reference; 1 means the first allele is
## the alternative and 3 means unknown
## The SNVs (rows) and samples (columns) in phase are in the same order as
## in the Population Annotation GDS file.
phase <- matrix(data=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 1, 1, 1), ncol=10, byrow=TRUE)
add.gdsn(node=gdsRefAnnotNew, name="phase", val=phase, storage="bit2",
check=TRUE)
## The entry 'blockAnnot' contain the information for each group of blocks
## that are present in the 'block' entry.
blockAnnot <- data.frame(block.id=c("EAS.0.05.500k", "EUR.0.05.500k",
"AFR.0.05.500k", "AMR.0.05.500k", "SAS.0.05.500k"),
block.desc=c(
"EAS populationblock base on SNP 0.05 and windows 500k",
"EUR populationblock base on SNP 0.05 and windows 500k",
"AFR populationblock base on SNP 0.05 and windows 500k",
"AMR populationblock base on SNP 0.05 and windows 500k",
"SAS populationblock base on SNP 0.05 and windows 500k"),
stringsAsFactors=FALSE)
add.gdsn(node=gdsRefAnnotNew, name="block.annot", val=blockAnnot, check=TRUE)
## The entry 'block' contain the block information for the SNVs in the
## Population Annotation GDS file.
## The SNVs (rows) are in the same order as in
## the Population Annotation GDS file.
## The block groups (columns) are in the same order as in
## the 'block.annot' entry.
block <- matrix(data=c(-1, -1, -1, -1, -1,
-2, -2, 1, -2, -2,
-2, 1, 1, 1, -2,
-2, 1, 1, 1, -2,
-2, -3, -2, -3, -2,
1, 2, 2, 2, 1,
1, 2, 2, 2, 1,
-3, -4, -3, -4, -3,
2, -4, 3, -4, -3,
2, -4, 3, -4, -3), ncol=5, byrow=TRUE)
add.gdsn(node=gdsRefAnnotNew, name="block", val=block, storage="int32",
check=TRUE)
## Show the file format
print(gdsRefAnnotNew)
## File: /tmp/Rtmptpa656/reference_SNV_Annotation_DEMO.gds (427B)
## + [ ]
## |--+ phase { Bit2 10x10, 25B }
## |--+ block.annot [ data.frame ] *
## | |--+ block.id { Str8 5, 70B }
## | \--+ block.desc { Str8 5, 270B }
## \--+ block { Int32 10x5, 200B }
closefn.gds(gdsRefAnnotNew)
unlink(fileNewReferenceAnnotGDS, force=TRUE)
Pre-processed files used in the RAIDS associated publication, are available at this address:
https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper
Beware that some of those files are voluminous.
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RAIDS_1.3.3 GENESIS_2.34.0 SNPRelate_1.38.1 gdsfmt_1.40.2
## [5] knitr_1.48 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 shape_1.4.6.1
## [3] magrittr_2.0.3 jomo_2.7-6
## [5] GenomicFeatures_1.56.0 logistf_1.26.0
## [7] nloptr_2.1.1 rmarkdown_2.28
## [9] fs_1.6.4 BiocIO_1.14.0
## [11] zlibbioc_1.50.0 ragg_1.3.2
## [13] vctrs_0.6.5 memoise_2.0.1
## [15] minqa_1.2.8 Rsamtools_2.20.0
## [17] RCurl_1.98-1.16 quantsmooth_1.70.0
## [19] htmltools_0.5.8.1 S4Arrays_1.4.1
## [21] curl_5.2.3 broom_1.0.7
## [23] pROC_1.18.5 SparseArray_1.4.8
## [25] mitml_0.4-5 sass_0.4.9
## [27] bslib_0.8.0 htmlwidgets_1.6.4
## [29] desc_1.4.3 plyr_1.8.9
## [31] sandwich_3.1-1 zoo_1.8-12
## [33] cachem_1.1.0 GenomicAlignments_1.40.0
## [35] lifecycle_1.0.4 iterators_1.0.14
## [37] pkgconfig_2.0.3 Matrix_1.7-0
## [39] R6_2.5.1 fastmap_1.2.0
## [41] GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0
## [43] digest_0.6.37 colorspace_2.1-1
## [45] GWASTools_1.50.0 AnnotationDbi_1.66.0
## [47] S4Vectors_0.42.1 textshaping_0.4.0
## [49] GenomicRanges_1.56.2 RSQLite_2.3.7
## [51] fansi_1.0.6 httr_1.4.7
## [53] abind_1.4-8 mgcv_1.9-1
## [55] compiler_4.4.1 bit64_4.5.2
## [57] backports_1.5.0 BiocParallel_1.38.0
## [59] DBI_1.2.3 pan_1.9
## [61] MASS_7.3-61 quantreg_5.98
## [63] DelayedArray_0.30.1 rjson_0.2.23
## [65] DNAcopy_1.78.0 tools_4.4.1
## [67] lmtest_0.9-40 nnet_7.3-19
## [69] glue_1.8.0 restfulr_0.0.15
## [71] nlme_3.1-165 grid_4.4.1
## [73] generics_0.1.3 gtable_0.3.5
## [75] operator.tools_1.6.3 BSgenome_1.72.0
## [77] formula.tools_1.7.1 class_7.3-22
## [79] tidyr_1.3.1 ensembldb_2.28.1
## [81] data.table_1.16.2 utf8_1.2.4
## [83] XVector_0.44.0 BiocGenerics_0.50.0
## [85] GWASExactHW_1.2 stringr_1.5.1
## [87] foreach_1.5.2 pillar_1.9.0
## [89] splines_4.4.1 dplyr_1.1.4
## [91] lattice_0.22-6 survival_3.7-0
## [93] rtracklayer_1.64.0 bit_4.5.0
## [95] SparseM_1.84-2 tidyselect_1.2.1
## [97] SeqVarTools_1.42.0 Biostrings_2.72.1
## [99] bookdown_0.41 ProtGenerics_1.36.0
## [101] IRanges_2.38.1 SummarizedExperiment_1.34.0
## [103] stats4_4.4.1 xfun_0.48
## [105] Biobase_2.64.0 matrixStats_1.4.1
## [107] stringi_1.8.4 UCSC.utils_1.0.0
## [109] lazyeval_0.2.2 yaml_2.3.10
## [111] boot_1.3-30 evaluate_1.0.1
## [113] codetools_0.2-20 tibble_3.2.1
## [115] BiocManager_1.30.25 cli_3.6.3
## [117] rpart_4.1.23 systemfonts_1.1.0
## [119] munsell_0.5.1 jquerylib_0.1.4
## [121] Rcpp_1.0.13 GenomeInfoDb_1.40.1
## [123] png_0.1-8 XML_3.99-0.17
## [125] parallel_4.4.1 MatrixModels_0.5-3
## [127] ggplot2_3.5.1 pkgdown_2.1.1
## [129] blob_1.2.4 AnnotationFilter_1.28.0
## [131] bitops_1.0-9 lme4_1.1-35.5
## [133] glmnet_4.1-8 SeqArray_1.44.3
## [135] VariantAnnotation_1.50.0 scales_1.3.0
## [137] purrr_1.0.2 crayon_1.5.3
## [139] rlang_1.1.4 KEGGREST_1.44.1
## [141] mice_3.16.0