R/processStudy_internal.R
computeAncestryFromSynthetic.Rd
The function selects the optimal K and D parameters for a specific profile. The results on the synthetic data are used for the parameter selection. Once the optimal parameters are selected, the ancestry is inferred for the specific profile.
computeAncestryFromSynthetic(
  gdsReference,
  gdsProfile,
  syntheticKNN,
  pedSyn,
  currentProfile,
  spRef,
  studyIDSyn,
  np = 1L,
  listCatPop = c("EAS", "EUR", "AFR", "AMR", "SAS"),
  fieldPopIn1KG = "superPop",
  fieldPopInfAnc = "SuperPop",
  kList = seq(2, 15, 1),
  pcaList = seq(2, 15, 1),
  algorithm = c("exact", "randomized"),
  eigenCount = 32L,
  missingRate = NaN,
  verbose = FALSE
)
gdsReference
an object of class gds.class (a GDS file), the opened 1KG GDS file.
gdsProfile
an object of class gds.class (a GDS file), the opened Profile GDS file.
syntheticKNN
a vector of character strings representing the names of the files that contain the results of ancestry inference done on the synthetic profiles for multiple values of D and K. The files must exist.
pedSyn
a data.frame containing the columns extracted from the GDS Sample 'study.annot' node with an extra column, named after the 'popName' parameter, that has been extracted from the 1KG GDS 'sample.annot' node.
currentProfile
a character string representing the profile identifier of the current profile on which ancestry will be inferred.
spRef
a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.
studyIDSyn
a character string corresponding to the study identifier. The study identifier must be present in the GDS Sample file.
np
a single positive integer representing the number of threads. Default: 1L.
listCatPop
a vector of character strings representing the list of possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").
fieldPopIn1KG
a character string representing the name of the column that contains the known ancestry for the reference profiles in the Reference GDS file. Default: "superPop".
fieldPopInfAnc
a character string representing the name of the column that will contain the inferred ancestry for the specified profiles. Default: "SuperPop".
kList
a vector of integers representing the list of values tested for the K parameter. The K parameter represents the number of neighbors used in the K-nearest neighbor analysis. If NULL, the value seq(2,15,1) is assigned. Default: seq(2,15,1).
pcaList
a vector of integers representing the list of values tested for the D parameter. The D parameter represents the number of dimensions used in the PCA analysis. If NULL, the value seq(2,15,1) is assigned. Default: seq(2,15,1).
algorithm
a character string representing the algorithm used to calculate the PCA. The 2 choices are "exact" (traditional exact calculation) and "randomized" (fast PCA with randomized algorithm introduced in Galinsky et al. 2016). Default: "exact".
eigenCount
a single integer indicating the number of eigenvectors that will be in the output of the snpgdsPCA function; if 'eigenCount' <= 0, then all eigenvectors are returned. Default: 32L.
missingRate
a numeric value representing the missing rate threshold at which the SNVs are discarded; only the SNVs with "<= missingRate" are retained in snpgdsPCA; if NaN, no missing rate threshold is applied. Default: NaN.
verbose
a logical indicating if messages should be printed to show the different steps in the function. Default: FALSE.
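The defaults described above for kList, pcaList and algorithm can be illustrated with a small base R sketch. The helper resolveParams() is hypothetical and is not part of RAIDS; it assumes that NULL falls back to seq(2, 15, 1) as documented, and that the algorithm choice is resolved with match.arg(), the usual R idiom for an argument whose default is a vector of choices.

```r
## Hypothetical helper (not part of RAIDS): shows how the documented
## defaults could be resolved before any PCA or KNN step is run
resolveParams <- function(kList = seq(2, 15, 1), pcaList = seq(2, 15, 1),
                          algorithm = c("exact", "randomized")) {
    ## NULL falls back to the documented default sequence
    if (is.null(kList)) kList <- seq(2, 15, 1)
    if (is.null(pcaList)) pcaList <- seq(2, 15, 1)
    ## Assumption: match.arg() keeps the first choice when none is given
    algorithm <- match.arg(algorithm)
    ## Every (D, K) combination that would be evaluated
    grid <- expand.grid(D = pcaList, K = kList)
    list(algorithm = algorithm, grid = grid)
}

res <- resolveParams(kList = NULL)
res$algorithm
nrow(res$grid)
```

With the default sequences, the grid holds one row per combination of the 14 values of D and the 14 values of K.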
a list containing 4 entries:
pcaSample
a list containing the information related to the eigenvectors. The list contains these 3 entries:
  sample.id
  a character string representing the unique identifier of the current profile.
  eigenvector.ref
  a matrix of numeric containing the eigenvectors for the reference profiles.
  eigenvector
  a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.
paraSample
a list containing the results with different D and K values that lead to optimal parameter selection. The list contains these entries:
  dfPCA
  a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains these columns:
    D
    a numeric representing the value of D (the number of dimensions).
    median
    a numeric representing the median of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
    mad
    a numeric representing the MAD of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
    upQuartile
    a numeric representing the upper quartile of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
    k
    a numeric representing the optimal K value (the number of neighbors) for a fixed D value.
  dfPop
  a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains these columns:
    D
    a numeric representing the value of D (the number of dimensions).
    K
    a numeric representing the value of K (the number of neighbors).
    AUROC.min
    a numeric representing the minimum accuracy obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.
    AUROC
    a numeric representing the accuracy obtained by grouping all the synthetic results for the specified values of D and K.
    Accu.CM
    a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.
  dfAUROC
  a data.frame containing the summary of the results by super-population. The data.frame contains these columns:
    D
    a numeric representing the value of D (the number of dimensions).
    K
    a numeric representing the value of K (the number of neighbors).
    Call
    a character string representing the super-population.
    L
    a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
    AUROC
    a numeric representing the AUROC obtained for the fixed values of super-population, D and K.
    H
    a numeric representing the upper value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
  D
  a numeric representing the optimal D value (the number of dimensions) for the specific profile.
  K
  a numeric representing the optimal K value (the number of neighbors) for the specific profile.
  listD
  a numeric representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.
KNNSample
a list containing the inferred ancestry using different D and K values. The list contains these entries:
  sample.id
  a character string representing the unique identifier of the current profile.
  matKNN
  a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains these columns:
    sample.id
    a character string representing the unique identifier of the current profile.
    D
    a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
    K
    a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
    SuperPop
    a character string representing the inferred ancestry for the specified D and K values.
Ancestry
a data.frame containing the inferred ancestry for the current profile. The data.frame contains these columns:
  sample.id
  a character string representing the unique identifier of the current profile.
  D
  a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
  K
  a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
  SuperPop
  a character string representing the inferred ancestry.
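The relationship between the dfPop, dfPCA, matKNN and Ancestry entries described above can be sketched in base R on toy data. This is only an illustration under stated assumptions: the AUROC values and ancestry calls below are made up, and the selection rule (keep, for each D, the K with the highest minimum AUROC, then keep the D with the highest upper quartile) is an assumption, since the documentation lists the columns but does not state the exact rule used internally.

```r
## Toy stand-in for paraSample$dfPop (values are made up for illustration)
dfPop <- data.frame(
    D = rep(c(2, 3), each = 3),
    K = rep(c(2, 3, 4), times = 2),
    AUROC.min = c(0.80, 0.85, 0.82, 0.90, 0.95, 0.93)
)

## One row per D, mirroring the documented dfPCA columns
dfPCA <- do.call(rbind, lapply(split(dfPop, dfPop$D), function(x) {
    data.frame(
        D = x$D[1],
        median = median(x$AUROC.min),
        mad = mad(x$AUROC.min),
        upQuartile = unname(quantile(x$AUROC.min, 0.75)),
        ## Assumed rule: the K with the highest minimum AUROC for this D
        k = x$K[which.max(x$AUROC.min)]
    )
}))

## Assumed rule: keep the D with the highest upper quartile
best <- dfPCA[which.max(dfPCA$upQuartile), ]

## Toy stand-in for KNNSample$matKNN: one call per (D, K) combination
matKNN <- data.frame(
    sample.id = "ex1",
    D = dfPop$D,
    K = dfPop$K,
    SuperPop = c("EUR", "EUR", "AMR", "EUR", "EUR", "EUR")
)

## The final call is the matKNN row matching the selected D and K
ancestry <- matKNN[matKNN$D == best$D & matKNN$K == best$k, ]
ancestry
```

The resulting one-row data.frame has the same columns as the Ancestry entry (sample.id, D, K, SuperPop).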
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
## Required libraries
library(RAIDS)
library(gdsfmt)
library(SNPRelate)  ## provides snpgdsOpen()
## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)
## The Reference GDS file
path1KG <- system.file("extdata/tests", package="RAIDS")
## Open the Reference GDS file
gdsRef <- snpgdsOpen(file.path(path1KG, "ex1_good_small_1KG.gds"))
## Path to the demo synthetic results files
dataDirRes <- system.file("extdata/demoAncestryCall/ex1", package="RAIDS")
## List of the KNN result files from PCA run on synthetic data
listFiles <- dir(dataDirRes, pattern="\\.rds$", full.names=TRUE)
## Read each result file and combine the results into a single object
syntheticKNN <- do.call(rbind, lapply(listFiles, readRDS))
## The name of the synthetic study
studyID <- "MYDATA.Synthetic"
## Path to the demo Profile GDS file is located in this package
dataDir <- system.file("extdata/demoAncestryCall", package="RAIDS")
## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))
if (FALSE) { # \dontrun{
    ## Build the data.frame describing the synthetic profiles
    pedSyn <- RAIDS:::prepPedSynthetic1KG(gdsReference=gdsRef,
        gdsSample=gdsProfile, studyID=studyID, popName="superPop")

    ## Run the ancestry inference on one profile called 'ex1'
    ## The values of K and D used for the inference are selected using the
    ## synthetic results
    resCall <- RAIDS:::computeAncestryFromSynthetic(gdsReference=gdsRef,
        gdsProfile=gdsProfile, syntheticKNN=syntheticKNN, pedSyn=pedSyn,
        currentProfile="ex1", spRef=demoKnownSuperPop1KG,
        studyIDSyn=studyID, np=1L)

    ## The ancestry called with the optimal D and K values
    resCall$Ancestry
} # }
## Close the GDS files (important)
closefn.gds(gdsProfile)
closefn.gds(gdsRef)