The function runs a k-nearest neighbors analysis on a subset of the synthetic data set. The classification relies on the knn() function from the 'class' package.

computeKNNRefSynthetic(
  gdsProfile,
  listEigenvector,
  listCatPop = c("EAS", "EUR", "AFR", "AMR", "SAS"),
  studyIDSyn,
  spRef,
  fieldPopInfAnc = "SuperPop",
  kList = seq(2, 15, 1),
  pcaList = seq(2, 15, 1)
)

Arguments

gdsProfile

an object of class SNPRelate::SNPGDSFileClass, the opened Profile GDS file.

listEigenvector

a list with 3 entries: 'sample.id', 'eigenvector.ref' and 'eigenvector'. The list represents the PCA done on the 1KG reference profiles and the synthetic profiles projected onto it.

listCatPop

a vector of character strings representing the possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").

studyIDSyn

a character string corresponding to the study identifier. The study identifier must be present in the Profile GDS file.

spRef

a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.

fieldPopInfAnc

a character string representing the name of the column that will contain the inferred ancestry for the specified data set. Default: "SuperPop".

kList

a vector of integers representing the values tested for the K parameter. The K parameter is the number of neighbors used in the k-nearest neighbors analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).

pcaList

a vector of integers representing the values tested for the D parameter. The D parameter is the number of PCA dimensions used in the analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
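
To make the size of the search concrete, the sketch below enumerates the (D, K) pairs implied by the default values of pcaList and kList. It assumes that every combination of the two vectors is evaluated for each synthetic profile, which is consistent with the D and K columns of the returned matKNN but is not confirmed here. The object name grid is only illustrative.

```r
## Default values of the two tuning parameters
kList <- seq(2, 15, 1)
pcaList <- seq(2, 15, 1)

## Every (D, K) pair, assuming a full grid is evaluated (assumption)
grid <- expand.grid(K=kList, D=pcaList)[, c("D", "K")]

## Number of classifications per synthetic profile under that assumption
nrow(grid)
#> [1] 196
```

Under this assumption, narrowing either vector reduces the number of rows per profile in matKNN proportionally.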

Value

a list containing 4 entries:

sample.id

a vector of character strings representing the identifiers of the synthetic profiles analysed.

sample1Kg

a vector of character strings representing the identifiers of the 1KG reference profiles used to generate the synthetic profiles.

sp

a vector of character strings representing the known super population ancestry of the 1KG reference profiles used to generate the synthetic profiles.

matKNN

a data.frame containing the super population inference for each synthetic profile, for each tested number of PCA dimensions D and each tested number of neighbors K. The title of the fourth column corresponds to the fieldPopInfAnc parameter. The data.frame contains 4 columns:

sample.id

a character string representing the identifier of the synthetic profile analysed.

D

a numeric representing the number of PCA dimensions used to infer the super population.

K

a numeric representing the number of neighbors used to infer the super population.

fieldPopInfAnc value

a character string representing the inferred ancestry.
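
As an illustration of how the returned list might be used, the sketch below compares the inferred ancestry with the known ancestry for each (D, K) pair. It works on a small made-up list named toy that only mimics the structure described above. It also assumes that the sp entry is in the same order as the sample.id entry, which is not stated in this page. The correct column and the acc object are hypothetical names.

```r
## Toy stand-in for the returned list (all values are made up)
toy <- list(
    sample.id=c("syn1", "syn2"),
    sample1Kg=c("ref1", "ref2"),
    sp=c("EUR", "AFR"),
    matKNN=data.frame(
        sample.id=rep(c("syn1", "syn2"), each=4),
        D=rep(c(2, 2, 3, 3), times=2),
        K=rep(c(2, 3), times=4),
        SuperPop=c("EUR", "EUR", "EUR", "AFR",
                   "AFR", "EUR", "AFR", "AFR"),
        stringsAsFactors=FALSE))

## Known ancestry for each row, assuming 'sp' follows the order
## of 'sample.id' (assumption)
known <- toy$sp[match(toy$matKNN$sample.id, toy$sample.id)]
toy$matKNN$correct <- toy$matKNN$SuperPop == known

## Proportion of correct calls for each (D, K) pair
acc <- aggregate(correct ~ D + K, data=toy$matKNN, FUN=mean)
acc[order(acc$D, acc$K), ]
```

A summary of this kind is one way to compare (D, K) pairs; the function itself only returns the calls.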

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples


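## Required libraries
## (SNPRelate provides snpgdsOpen(), gdsfmt provides closefn.gds())
library(RAIDS)
library(gdsfmt)
library(SNPRelate)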

## Load the demo PCA on the synthetic profiles projected on the
## demo 1KG reference PCA
data(demoPCASyntheticProfiles)

## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)

## The demo Profile GDS file is located in this package
dataDir <- system.file("extdata/demoKNNSynthetic", package="RAIDS")

## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))

## The name of the synthetic study
studyID <- "MYDATA.Synthetic"

## Run the k-nearest neighbors analysis on the synthetic profiles,
## using the 1KG reference profiles as the training set
results <- computeKNNRefSynthetic(gdsProfile=gdsProfile,
    listEigenvector=demoPCASyntheticProfiles,
    listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"), studyIDSyn=studyID,
    spRef=demoKnownSuperPop1KG)

## The inferred ancestry for the synthetic profiles for different values
## of D and K
head(results$matKNN)
#>         sample.id D K SuperPop
#> 1 1.ex1.HG00246.1 2 2      SAS
#> 2 1.ex1.HG00246.1 2 3      EAS
#> 3 1.ex1.HG00246.1 2 4      AMR
#> 4 1.ex1.HG00246.1 2 5      EUR
#> 5 1.ex1.HG00246.1 2 6      EUR
#> 6 1.ex1.HG00246.1 2 7      EAS

## Close Profile GDS file (important)
closefn.gds(gdsProfile)
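
For readers unfamiliar with the underlying technique, the following self-contained sketch classifies made-up profiles by k-nearest neighbors for several values of D and K. It uses knn() from the 'class' package on invented coordinates and is not the RAIDS implementation. The names refPCA, refPop, synPCA and res are hypothetical.

```r
library(class)

## Made-up reference coordinates (rows = profiles, columns = PCA dimensions)
refPCA <- rbind(c(0, 0), c(0, 1), c(1, 0),
                c(10, 10), c(10, 11), c(11, 10))
rownames(refPCA) <- paste0("ref", 1:6)

## Made-up known ancestry of the reference profiles
refPop <- factor(c("EUR", "EUR", "EUR", "AFR", "AFR", "AFR"))

## Made-up coordinates of two profiles to classify
synPCA <- rbind(syn1=c(0.5, 0.5), syn2=c(10.5, 10.5))

## One call per (D, K) pair; drop=FALSE keeps a matrix when D is 1
res <- do.call(rbind, lapply(c(1, 2), function(D) {
    do.call(rbind, lapply(c(1, 3), function(K) {
        pred <- knn(train=refPCA[, seq_len(D), drop=FALSE],
                    test=synPCA[, seq_len(D), drop=FALSE],
                    cl=refPop, k=K)
        data.frame(sample.id=rownames(synPCA), D=D, K=K,
                   SuperPop=as.character(pred),
                   stringsAsFactors=FALSE)
    }))
}))
res
```

The resulting data.frame has the same four-column layout as matKNN, with one row per profile and per (D, K) pair.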