R/processStudy.R
inferAncestryGeneAware.Rd
This function runs most steps leading to the ancestry inference call on a specific RNA profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population Reference GDS file.
inferAncestryGeneAware(
profileFile,
pathProfileGDS,
fileReferenceGDS,
fileReferenceAnnotGDS,
chrInfo,
syntheticRefDF,
genoSource = c("snp-pileup", "generic", "VCF", "bam"),
np = 1L,
blockTypeID,
verbose = FALSE
)
profileFile
a character string representing the path and the file name of the genotype file or of the BAM file. If genoSource is 'snp-pileup', the file extension must be .txt.gz; if genoSource is 'VCF', the extension must be .vcf.gz.
pathProfileGDS
a character string representing the path to the directory where the GDS Profile files will be created. Default: NULL.
fileReferenceGDS
a character string representing the file name of the Population Reference GDS file. The file must exist.
fileReferenceAnnotGDS
a character string representing the file name of the Population Reference GDS Annotation file. The file must exist.
chrInfo
a vector of positive integer values representing the length of the chromosomes. See 'details' section.
syntheticRefDF
a data.frame containing a subset of reference profiles for each sub-population present in the Reference GDS file. The data.frame must have the following columns:
a character string representing the sample identifier.
a character string representing the subcontinental population assigned to the sample.
a character string representing the super-population assigned to the sample.
genoSource
a character string with four possible values: 'snp-pileup', 'generic', 'VCF' or 'bam'. It specifies whether the genotype files are generated by snp-pileup (Facets) or are generic-format CSV files with at least the following columns: 'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'. 'Count' is the depth at the specified position; 'File1R' is the depth of the reference allele and 'File1A' is the depth of the specific alternative allele. Finally, the file can be a VCF file with at least the following genotype fields: GT, AD, DP.
np
a single positive integer specifying the number of threads to be used. Default: 1L.
blockTypeID
a character string corresponding to the block type used to extract the block identifiers. The block type must be present in the GDS Reference Annotation file.
verbose
a logical indicating if messages should be printed to show the different steps in the function. Default: FALSE.
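As an illustration of the 'generic' genotype format described for genoSource, the following minimal sketch builds a small CSV file with the seven required columns. The read depths, the positions, the chromosome naming ("chr1") and the file name demoProfile.generic.csv are hypothetical values chosen for the illustration; the file extension or compression expected for a generic file is not stated here, so the name used below is only an assumption.

```r
## Hypothetical read depths for three SNV positions (illustrative values only)
genericDF <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position   = c(10583L, 13302L, 10180L),
    Ref        = c("G", "C", "T"),
    Alt        = c("A", "T", "C"),
    Count      = c(24L, 31L, 18L),   # total depth at the position
    File1R     = c(24L, 17L, 0L),    # depth of the reference allele
    File1A     = c(0L, 14L, 18L),    # depth of the alternative allele
    stringsAsFactors = FALSE
)

## Write the table as a CSV file (hypothetical file name)
genericFile <- file.path(tempdir(), "demoProfile.generic.csv")
write.csv(genericDF, file = genericFile, row.names = FALSE, quote = FALSE)

## Read the file back and verify that every required column is present
requiredCols <- c("Chromosome", "Position", "Ref", "Alt",
                    "Count", "File1R", "File1A")
checkDF <- read.csv(genericFile, stringsAsFactors = FALSE)
missingCols <- setdiff(requiredCols, colnames(checkDF))
if (length(missingCols) > 0) {
    stop("Missing columns: ", paste(missingCols, collapse = ", "))
}
```

A file of this shape would presumably be passed through profileFile with genoSource="generic"; checking the column names beforehand gives a clearer error than a failure deep inside the inference.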
a list containing 5 entries:
pcaSample
a list containing the information related to the eigenvectors. The list contains the following 3 entries:
sample.id
a character string representing the unique identifier of the current profile.
eigenvector.ref
a matrix of numeric containing the eigenvectors for the reference profiles.
eigenvector
a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.
paraSample
a list containing the results with different D and K values that lead to optimal parameter selection. The list contains the following entries:
dfPCA
a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains the following columns:
D
a numeric representing the value of D (the number of dimensions).
median
a numeric representing the median of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.
mad
a numeric representing the MAD of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.
upQuartile
a numeric representing the upper quartile of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.
k
a numeric representing the optimal K value (the number of neighbors) for a fixed D value.
dfPop
a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains the following columns:
D
a numeric representing the value of D (the number of dimensions).
K
a numeric representing the value of K (the number of neighbors).
AUROC.min
a numeric representing the minimum accuracy obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.
AUROC
a numeric representing the accuracy obtained by grouping all the synthetic results for the specified values of D and K.
Accu.CM
a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.
dfAUROC
a data.frame containing the summary of the results by super-population. The data.frame contains the following columns:
D
a numeric representing the value of D (the number of dimensions).
K
a numeric representing the value of K (the number of neighbors).
Call
a character string representing the super-population.
L
a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
AUROC
a numeric representing the AUROC obtained for the fixed values of super-population, D and K.
H
a numeric representing the higher value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
D
a numeric representing the optimal D value (the number of dimensions) for the specific profile.
K
a numeric representing the optimal K value (the number of neighbors) for the specific profile.
listD
a numeric representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.
KNNSample
a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains the following columns:
sample.id
a character string representing the unique identifier of the current profile.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop
a character string representing the inferred ancestry for the specified D and K values.
KNNSynthetic
a data.frame containing the inferred ancestry for each synthetic data for different values of K and D. The data.frame contains the following columns:
sample.id
a character string representing the unique identifier of the current synthetic data.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
infer.superPop
a character string representing the inferred ancestry for the specified D and K values.
ref.superPop
a character string representing the known ancestry from the reference.
Ancestry
a data.frame containing the inferred ancestry for the current profile. The data.frame contains the following columns:
sample.id
a character string representing the unique identifier of the current profile.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop
a character string representing the inferred ancestry.
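To make the structure of the returned list more concrete, the following minimal sketch builds a mock result with hypothetical values (the AUROC numbers, the profile identifier "ex1" and the "EUR" call are invented for the illustration and are not real results). It then shows how the final call can be read from the Ancestry entry, and how per-D summaries of the kind described for dfPCA (median, MAD and upper quartile of the minimum AUROC over all tested K) could be recomputed from a dfPop-like table. The estimators used here are base R defaults (for instance, mad() applies its default scaling constant); whether the function uses the same settings is an assumption.

```r
## Mock 'dfPop' table with hypothetical AUROC values (not real results)
dfPop <- data.frame(
    D = rep(c(3, 4), each = 3),
    K = rep(c(5, 7, 9), times = 2),
    AUROC.min = c(0.80, 0.84, 0.82, 0.90, 0.88, 0.86)
)

## Mock result list mimicking two of the documented entries
res <- list(
    paraSample = list(dfPop = dfPop, D = 4, K = 5),
    Ancestry = data.frame(sample.id = "ex1", D = 4, K = 5,
                            SuperPop = "EUR", stringsAsFactors = FALSE)
)

## Final ancestry call and the parameters used to obtain it
inferredCall <- res$Ancestry$SuperPop
usedD <- res$Ancestry$D
usedK <- res$Ancestry$K

## For each fixed D, summarize the minimum AUROC over all tested K values
popByD <- split(res$paraSample$dfPop, res$paraSample$dfPop$D)
summaryByD <- do.call(rbind, lapply(popByD, function(x) {
    data.frame(
        D = x$D[1],
        median = median(x$AUROC.min),
        mad = mad(x$AUROC.min),
        upQuartile = unname(quantile(x$AUROC.min, 0.75))
    )
}))
rownames(summaryByD) <- NULL
```

With a real result, the same accessors (res$Ancestry, res$paraSample$dfPop) would apply directly; only the mock construction at the top would be dropped.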
The runExomeAncestry() function generates 3 types of files in the OUTPUT directory.
The ancestry inference CSV file (".Ancestry.csv" file)
The inference information RDS file (".infoCall.rds" file)
The parameter information RDS files from the synthetic inference ("KNN.synt.*.rds" files in a sub-directory)
In addition, a sub-directory (named using the profile ID) is also created.
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
## Required libraries (SNPRelate provides the GDS functions)
library(RAIDS)
library(SNPRelate)
## The demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")
fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")
#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
demoProfileEx1 <- file.path(dataDir, "example", "snpPileup", "ex1.txt.gz")
#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")
####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)
gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)
## Required library for this example to run correctly
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
        requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {

    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chrM is 25
    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]

    # \donttest{
    res <- inferAncestryGeneAware(profileFile=demoProfileEx1,
        pathProfileGDS=pathProfileGDS,
        fileReferenceGDS=fileReferenceGDS,
        fileReferenceAnnotGDS=fileAnnotGDS,
        chrInfo=chrInfo,
        syntheticRefDF=dataRef,
        blockTypeID="GeneS.Ensembl.Hsapiens.v86",
        genoSource="snp-pileup")

    unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
    # }
}