R/processStudy.R
runExomeAncestry.Rd
This function runs most of the steps leading to the ancestry inference call on a specific exome profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population reference GDS file.
runExomeAncestry(
pedStudy,
studyDF,
pathProfileGDS,
pathGeno,
pathOut,
fileReferenceGDS,
fileReferenceAnnotGDS,
chrInfo,
syntheticRefDF,
genoSource = c("snp-pileup", "generic", "VCF"),
np = 1L,
verbose = FALSE
)
a data.frame
with these mandatory columns: "Name.ID",
"Case.ID", "Sample.Type", "Diagnosis", "Source". All columns must be
character
strings (no factor). The data.frame
must contain the information for all the samples passed in the
listSamples
parameter. Only one of filePedRDS
or pedStudy
can be defined.
a data.frame
containing the information about the
study associated with the analysed sample(s). The data.frame
must have
these 3 columns: "study.id", "study.desc", "study.platform". All columns
must be character
strings (no factor).
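As a minimal sketch of the two data frames described above, the following builds a pedStudy and a studyDF with the documented columns and checks that every required column is a character vector. The helper hasCharacterColumns() is hypothetical and not part of the package; the values mirror the demo data used in the examples.

```r
## Hypothetical helper (not part of RAIDS): TRUE when all required
## columns are present and stored as character vectors
hasCharacterColumns <- function(df, requiredCols) {
    all(requiredCols %in% colnames(df)) &&
        all(vapply(df[, requiredCols, drop=FALSE], is.character,
            logical(1)))
}

## Profile description with the five mandatory columns
pedStudy <- data.frame(Name.ID="ex1", Case.ID="ex1",
    Sample.Type="Primary Tumor", Diagnosis="Cancer",
    Source="Databank B", stringsAsFactors=FALSE)

## Study description with the three mandatory columns
studyDF <- data.frame(study.id="MYDATA", study.desc="Description",
    study.platform="PLATFORM", stringsAsFactors=FALSE)

pedOK <- hasCharacterColumns(pedStudy,
    c("Name.ID", "Case.ID", "Sample.Type", "Diagnosis", "Source"))
studyOK <- hasCharacterColumns(studyDF,
    c("study.id", "study.desc", "study.platform"))
```

Setting stringsAsFactors=FALSE keeps the columns as character strings on R versions older than 4.0.0, where factors were the default.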
a character
string representing the path to
the directory where the GDS Profile files will be created.
Default: NULL.
a character
string representing the path to the
directory containing the VCF output of SNP-pileup for each sample. The
SNP-pileup files must be compressed (gz files) and be named using the
identifiers of the samples. A sample with the "Name.ID" identifier would
have an associated file called:
"Name.ID.vcf.gz" if genoSource is "VCF",
"Name.ID.generic.txt.gz" if genoSource is "generic",
"Name.ID.txt.gz" if genoSource is "snp-pileup".
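The naming rule above can be sketched as a small function. The helper expectedGenoFile() and the directory "data/geno" are hypothetical and only illustrate how the file name is derived from the profile identifier and genoSource; this is not how the package itself resolves the files.

```r
## Hypothetical helper (not part of RAIDS): expected genotype file path
## for one profile, following the naming rules listed above
expectedGenoFile <- function(pathGeno, nameID,
        genoSource=c("snp-pileup", "generic", "VCF")) {
    genoSource <- match.arg(genoSource)
    suffix <- switch(genoSource,
        "VCF"=".vcf.gz",
        "generic"=".generic.txt.gz",
        "snp-pileup"=".txt.gz")
    file.path(pathGeno, paste0(nameID, suffix))
}

## Example with a made-up directory and the demo identifier "ex1"
genoFile <- expectedGenoFile("data/geno", "ex1", genoSource="generic")
```

Checking file.exists() on the returned path before calling runExomeAncestry() is one way to catch a misnamed genotype file early.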
a character
string representing the path to
the directory where the output files are created.
a character
string representing the file
name of the Reference GDS file. The file must exist.
a character
string representing the
file name of the Population Reference GDS Annotation file. The file must
exist.
a vector
of positive integer
values
representing the length of the chromosomes. See 'details' section.
a data.frame
containing a subset of
reference profiles for each sub-population present in the Reference GDS
file. The data.frame
must have those columns:
a character
string representing the sample
identifier.
a character
string representing the
subcontinental population assigned to the sample.
a character
string representing the
super-population assigned to the sample.
a character
string with three possible values:
'snp-pileup', 'generic' or 'VCF'. It specifies whether the genotype files
are generated by snp-pileup (Facets) or are generic format CSV files
with at least these columns:
'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'.
The 'Count' is the depth at the specified position;
'File1R' is the depth of the reference allele and
'File1A' is the depth of the specific alternative allele.
Finally, the file can be a VCF file with at least these genotype
fields: GT, AD, DP.
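For illustration, a generic format file with the columns listed above could be written as follows. This is a sketch only: the read counts are invented, and the "chr" prefix on chromosome names and the comma separator are assumptions that should be checked against the package documentation before use.

```r
## Toy read counts for three positions (invented values, illustration only)
genericDF <- data.frame(
    Chromosome=c("chr1", "chr1", "chr2"),   # naming style is an assumption
    Position=c(10000L, 20000L, 15000L),
    Ref=c("A", "C", "G"),
    Alt=c("G", "T", "A"),
    Count=c(30L, 25L, 40L),      # depth at the position
    File1R=c(18L, 25L, 22L),     # depth of the reference allele
    File1A=c(12L, 0L, 18L),      # depth of the alternative allele
    stringsAsFactors=FALSE)

## Compressed file named after the profile identifier "ex1"
genericFile <- file.path(tempdir(), "ex1.generic.txt.gz")
con <- gzfile(genericFile, open="wt")
write.csv(genericDF, file=con, row.names=FALSE, quote=FALSE)
close(con)

## Read the compressed file back to confirm the layout
readBack <- read.csv(genericFile, stringsAsFactors=FALSE)
```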
a single positive integer
specifying the number of
threads to be used. Default: 1L.
a logical
indicating if messages should be printed
to show the different steps in the function. Default: FALSE.
The integer 0L
when successful. See details section for
more information about the generated output files.
The runExomeAncestry() function generates 3 types of files in the OUTPUT directory.
The ancestry inference CSV file (".Ancestry.csv" file)
The inference information RDS file (".infoCall.rds" file)
The parameter information RDS files from the synthetic inference ("KNN.synt.*.rds" files in a sub-directory)
In addition, a sub-directory (named using the profile ID) is also created.
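Once a run has finished, the generated files can be located by their suffixes. The sketch below uses a hypothetical helper, listAncestryOutputs(), and a mock output directory with empty placeholder files; the placeholder file names and their exact location relative to the profile sub-directory are assumptions made for illustration.

```r
## Hypothetical helper (not part of RAIDS): group the files found under
## the output directory by the three documented file types
listAncestryOutputs <- function(pathOut) {
    allFiles <- list.files(pathOut, recursive=TRUE)
    list(ancestry=grep("\\.Ancestry\\.csv$", allFiles, value=TRUE),
        infoCall=grep("\\.infoCall\\.rds$", allFiles, value=TRUE),
        synthetic=grep("KNN\\.synt\\..*\\.rds$", allFiles, value=TRUE))
}

## Mock output directory with empty placeholder files (illustration only)
mockOut <- file.path(tempdir(), "mock.res.out")
dir.create(file.path(mockOut, "ex1"), recursive=TRUE, showWarnings=FALSE)
invisible(file.create(file.path(mockOut, c("ex1.Ancestry.csv",
    "ex1.infoCall.rds", file.path("ex1", "KNN.synt.ex1.1.rds")))))

outputs <- listAncestryOutputs(mockOut)

## Remove the mock directory
unlink(mockOut, recursive=TRUE, force=TRUE)
```

With real results, the ancestry file could then be loaded with read.csv() and the inference information with readRDS().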
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
## Required library for GDS
library(SNPRelate)
## The demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
#################################################################
## Load the information about the profile
#################################################################
data(demoPedigreeEx1)
head(demoPedigreeEx1)
#> Name.ID Case.ID Sample.Type Diagnosis Source
#> ex1 ex1 ex1 Primary Tumor Cancer Databank B
#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")
fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")
#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
pathGeno <- file.path(dataDir, "example", "snpPileup")
#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")
pathOut <- file.path(tempdir(), "res.out")
#################################################################
## A data frame containing general information about the study
## is also required. The data frame must have
## these 3 columns: "study.id", "study.desc", "study.platform"
#################################################################
studyDF <- data.frame(study.id="MYDATA",
                      study.desc="Description",
                      study.platform="PLATFORM",
                      stringsAsFactors=FALSE)
####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)
gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)
## Required library for this example to run correctly
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
    requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chr25 is chrM
    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]
    # \donttest{
    runExomeAncestry(pedStudy=demoPedigreeEx1, studyDF=studyDF,
        pathProfileGDS=pathProfileGDS,
        pathGeno=pathGeno,
        pathOut=pathOut,
        fileReferenceGDS=fileReferenceGDS,
        fileReferenceAnnotGDS=fileAnnotGDS,
        chrInfo=chrInfo,
        syntheticRefDF=dataRef,
        genoSource="snp-pileup")
    unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
    unlink(pathOut, recursive=TRUE, force=TRUE)
    # }
}