This function selects the optimal K and D parameters for a specific profile. The selection is based on the ancestry inference results obtained on the synthetic data. Once the optimal parameters are selected, the ancestry of the specific profile is inferred using those parameters.
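As a rough illustration of how synthetic results can drive the parameter choice, the sketch below summarizes a toy table of minimum AUROC values for each D and then picks one (D, K) pair. The toy values and the selection rule (highest upper quartile for D, then highest minimum AUROC for K) are assumptions made for illustration only; the rule actually applied by the function may differ.

```r
## Toy synthetic results: one minimum AUROC per (D, K) pair (made-up values)
dfPop <- expand.grid(K = 2:4, D = 2:4)
dfPop$AUROC.min <- c(0.80, 0.82, 0.81,
                     0.90, 0.93, 0.91,
                     0.85, 0.86, 0.84)

## Summarize the minimum AUROC across all tested K for each fixed D
dfPCA <- do.call(rbind, lapply(split(dfPop, dfPop$D), function(x) {
    data.frame(D = x$D[1],
               median = median(x$AUROC.min),
               mad = mad(x$AUROC.min),
               upQuartile = unname(quantile(x$AUROC.min, 0.75)),
               k = x$K[which.max(x$AUROC.min)])
}))

## Assumed rule: keep the D with the highest upper quartile, then its best K
best <- dfPCA[which.max(dfPCA$upQuartile), ]
optimalD <- best$D
optimalK <- best$k
```

The column names mirror the dfPCA and dfPop entries described in the Value section, so the toy objects can be compared with the real output.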

computeAncestryFromSyntheticFile(
  gdsReference,
  gdsProfile,
  listFiles,
  currentProfile,
  spRef,
  studyIDSyn,
  np = 1L,
  listCatPop = c("EAS", "EUR", "AFR", "AMR", "SAS"),
  fieldPopIn1KG = "superPop",
  fieldPopInfAnc = "SuperPop",
  kList = seq(2, 15, 1),
  pcaList = seq(2, 15, 1),
  algorithm = c("exact", "randomized"),
  eigenCount = 32L,
  missingRate = NaN,
  verbose = FALSE
)

Arguments

gdsReference

an object of class gds.class (a GDS file), the opened 1KG GDS file.

gdsProfile

an object of class gds.class (a GDS file), the opened Profile GDS file.

listFiles

a vector of character strings representing the names of the files that contain the results of the ancestry inference done on the synthetic profiles for multiple values of D and K. The files must exist.

currentProfile

a character string representing the profile identifier of the current profile on which ancestry will be inferred.

spRef

a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.

studyIDSyn

a character string corresponding to the study identifier. The study identifier must be present in the GDS Sample file.

np

a single positive integer representing the number of threads. Default: 1L.

listCatPop

a vector of character strings representing the list of possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").

fieldPopIn1KG

a character string representing the name of the column that contains the known ancestry for the reference profiles in the Reference GDS file. Default: "superPop".

fieldPopInfAnc

a character string representing the name of the column that will contain the inferred ancestry for the specified profiles. Default: "SuperPop".

kList

a vector of integers representing the list of values tested for the K parameter. The K parameter represents the number of neighbors used in the K-nearest neighbor analysis. If NULL, the value seq(2,15,1) is assigned. Default: seq(2,15,1).

pcaList

a vector of integers representing the list of values tested for the D parameter. The D parameter represents the number of dimensions used in the PCA analysis. If NULL, the value seq(2,15,1) is assigned. Default: seq(2,15,1).

algorithm

a character string representing the algorithm used to calculate the PCA. The 2 choices are "exact" (traditional exact calculation) and "randomized" (fast PCA with randomized algorithm introduced in Galinsky et al. 2016). Default: "exact".

eigenCount

a single integer indicating the number of eigenvectors that will be in the output of the snpgdsPCA function; if 'eigenCount' <= 0, then all eigenvectors are returned. Default: 32L.

missingRate

a numeric value representing the missing rate threshold above which the SNVs are discarded; only the SNVs with a missing rate "<= missingRate" are retained in snpgdsPCA; if NaN, no missing rate threshold is applied. Default: NaN.

verbose

a logical indicating whether messages should be printed to show the progress of the different steps in the function. Default: FALSE.
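Several of the arguments above come with defaulting or validity rules: a NULL kList or pcaList falls back to seq(2,15,1), algorithm resolves to "exact" when left at its default, and every file in listFiles must exist. The sketch below shows one way to express such checks in base R. The helper name checkArgsSketch is hypothetical and is not part of the package; how the function performs these checks internally is not stated in this page.

```r
## Hypothetical helper (not part of RAIDS) mirroring the documented rules
checkArgsSketch <- function(listFiles, kList = seq(2, 15, 1),
                            pcaList = seq(2, 15, 1),
                            algorithm = c("exact", "randomized")) {
    ## The result files must exist
    if (!all(file.exists(listFiles))) {
        stop("All files in 'listFiles' must exist.")
    }
    ## NULL falls back to the documented default
    if (is.null(kList)) kList <- seq(2, 15, 1)
    if (is.null(pcaList)) pcaList <- seq(2, 15, 1)
    ## match.arg() returns the first choice when the default vector is used
    algorithm <- match.arg(algorithm)
    list(kList = kList, pcaList = pcaList, algorithm = algorithm)
}

## Usage with a temporary file standing in for a synthetic result file
tmpFile <- tempfile(fileext = ".rds")
saveRDS(list(), tmpFile)
resArgs <- checkArgsSketch(listFiles = tmpFile, kList = NULL)
unlink(tmpFile)
```

Here resArgs$kList holds the fallback sequence and resArgs$algorithm is "exact", which matches the defaults listed for the function.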

Value

a list containing 4 entries:

pcaSample

a list containing the information related to the eigenvectors. The list contains these 3 entries:

sample.id

a character string representing the unique identifier of the current profile.

eigenvector.ref

a matrix of numeric containing the eigenvectors for the reference profiles.

eigenvector

a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.

paraSample

a list containing the results obtained with the different D and K values that lead to the optimal parameter selection. The list contains these entries:

dfPCA

a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains these columns:

D

a numeric representing the value of D (the number of dimensions).

median

a numeric representing the median of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.

mad

a numeric representing the MAD (median absolute deviation) of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.

upQuartile

a numeric representing the upper quartile of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.

k

a numeric representing the optimal K value (the number of neighbors) for a fixed D value.

dfPop

a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains these columns:

D

a numeric representing the value of D (the number of dimensions).

K

a numeric representing the value of K (the number of neighbors).

AUROC.min

a numeric representing the minimum accuracy obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.

AUROC

a numeric representing the accuracy obtained by grouping all the synthetic results for the specified values of D and K.

Accu.CM

a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.

dfAUROC

a data.frame containing the summary of the results by super-population. The data.frame contains these columns:

pcaD

a numeric representing the value of D (the number of dimensions).

K

a numeric representing the value of K (the number of neighbors).

Call

a character string representing the super-population.

L

a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.

AUR

a numeric representing the AUROC obtained for the fixed values of super-population, D and K.

H

a numeric representing the higher value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.

D

a numeric representing the optimal D value (the number of dimensions) for the specific profile.

K

a numeric representing the optimal K value (the number of neighbors) for the specific profile.

listD

a numeric vector representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.

KNNSample

a list containing the inferred ancestry using different D and K values. The list contains these entries:

sample.id

a character string representing the unique identifier of the current profile.

matKNN

a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains these columns:

sample.id

a character string representing the unique identifier of the current profile.

D

a numeric representing the value of D (the number of dimensions) used to infer the ancestry.

K

a numeric representing the value of K (the number of neighbors) used to infer the ancestry.

SuperPop

a character string representing the inferred ancestry for the specified D and K values.

Ancestry

a data.frame containing the inferred ancestry for the current profile. The data.frame contains these columns:

sample.id

a character string representing the unique identifier of the current profile.

D

a numeric representing the value of D (the number of dimensions) used to infer the ancestry.

K

a numeric representing the value of K (the number of neighbors) used to infer the ancestry.

SuperPop

a character string representing the inferred ancestry.
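Because the returned list is nested, it can help to see the access paths spelled out. The sketch below builds a mock object with the same shape (only a few of the fields, with placeholder ancestry values) and shows how the optimal parameters relate to matKNN. The relation used here, that Ancestry is the matKNN row at the optimal D and K, is an assumption inferred from the field descriptions rather than something this page states.

```r
## Mock result with the documented shape; values are placeholders
grid <- expand.grid(K = 2:15, D = 2:15)
matKNN <- data.frame(sample.id = "ex1", D = grid$D, K = grid$K,
                     SuperPop = "EAS", stringsAsFactors = FALSE)
resMock <- list(
    paraSample = list(D = 7, K = 8, listD = 7),
    KNNSample = list(sample.id = "ex1", matKNN = matKNN))

## Optimal parameters live under paraSample
optD <- resMock$paraSample$D
optK <- resMock$paraSample$K

## Assumed relation: the final call is the matKNN row at the optimal D and K
knn <- resMock$KNNSample$matKNN
resMock$Ancestry <- knn[knn$D == optD & knn$K == optK, ]
resMock$Ancestry
```

On a real result, the same paths (paraSample$D, paraSample$K, KNNSample$matKNN and Ancestry) apply to the object returned by the function.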

References

Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples



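## Required libraries
## snpgdsOpen() comes from SNPRelate and closefn.gds() from gdsfmt
library(RAIDS)
library(gdsfmt)
library(SNPRelate)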

## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)

## The Reference GDS file
path1KG <- system.file("extdata/tests", package="RAIDS")

## Open the Reference GDS file
gdsRef <- snpgdsOpen(file.path(path1KG, "ex1_good_small_1KG.gds"))

## Path to the demo synthetic results files
## List of the KNN result files from PCA run on synthetic data
dataDirRes <- system.file("extdata/demoAncestryCall/ex1", package="RAIDS")
## Anchor the pattern so that only files ending in ".rds" are listed
listFilesName <- dir(dataDirRes, pattern="\\.rds$")
listFiles <- file.path(dataDirRes, listFilesName)

## The name of the synthetic study
studyID <- "MYDATA.Synthetic"

## The demo Profile GDS file is located in this package
dataDir <- system.file("extdata/demoAncestryCall", package="RAIDS")

## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))

## Run the ancestry inference on one profile called 'ex1'
## The values of K and D used for the inference are selected using the
## synthetic results
resCall <- computeAncestryFromSyntheticFile(gdsReference=gdsRef,
                            gdsProfile=gdsProfile,
                            listFiles=listFiles,
                            currentProfile="ex1",
                            spRef=demoKnownSuperPop1KG,
                            studyIDSyn=studyID, np=1L)

## The ancestry called with the optimal D and K values
resCall$Ancestry
#>    sample.id D K SuperPop
#> 77       ex1 7 8      EAS

## Close the GDS files (important)
closefn.gds(gdsProfile)
closefn.gds(gdsRef)