R/processStudy.R
computeKNNRefSynthetic.Rd
The function runs a k-nearest neighbors analysis on a subset of the synthetic data set. The function uses the knn function from the 'class' package.
an object of class SNPRelate::SNPGDSFileClass, the opened Profile GDS file.
a list with 3 entries: 'sample.id', 'eigenvector.ref' and 'eigenvector'. The list represents the PCA done on the 1KG reference profiles and the synthetic profiles projected onto it.
a vector of character strings representing the list of possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").
a character string corresponding to the study identifier. The study identifier must be present in the Profile GDS file.
a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.
a character string representing the name of the column that will contain the inferred ancestry for the specified data set. Default: "SuperPop".
a vector of integers representing the list of values tested for the K parameter. The K parameter represents the number of neighbors used in the k-nearest neighbors analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
a vector of integers representing the list of values tested for the D parameter. The D parameter represents the number of dimensions used in the PCA analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
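The way the K and D values combine can be illustrated with a minimal sketch. It assumes toy PCA coordinates and made-up ancestry labels, and it calls knn from the 'class' package directly; the names eigenvectorRef, eigenvectorSyn, spToy, kValues and dValues are hypothetical. It is not the RAIDS implementation, only an illustration of running one k-nearest neighbors classification per (D, K) pair.

```r
## Illustrative sketch only: toy data, not the RAIDS implementation
library(class)

set.seed(42)

## Hypothetical PCA coordinates: 20 reference profiles and 3 synthetic
## profiles described by 4 principal components
eigenvectorRef <- matrix(rnorm(20 * 4), nrow=20,
    dimnames=list(paste0("ref", seq_len(20)), NULL))
eigenvectorSyn <- matrix(rnorm(3 * 4), nrow=3,
    dimnames=list(paste0("syn", seq_len(3)), NULL))

## Hypothetical known super populations for the reference profiles
spToy <- factor(rep(c("EUR", "AFR"), each=10), levels=c("EUR", "AFR"))

## Values tested for K (neighbors) and D (PCA dimensions)
kValues <- seq(2, 4, 1)
dValues <- seq(2, 4, 1)

## Every (D, K) combination, with K varying fastest
grid <- expand.grid(K=kValues, D=dValues)

## One classification per combination, using only the first D dimensions
resList <- lapply(seq_len(nrow(grid)), function(i) {
    d <- grid$D[i]
    k <- grid$K[i]
    pred <- knn(train=eigenvectorRef[, seq_len(d), drop=FALSE],
        test=eigenvectorSyn[, seq_len(d), drop=FALSE],
        cl=spToy, k=k)
    data.frame(sample.id=rownames(eigenvectorSyn), D=d, K=k,
        SuperPop=as.character(pred), stringsAsFactors=FALSE)
})

## One row per synthetic profile for every (D, K) combination
matToy <- do.call(rbind, resList)
head(matToy)
```

With 3 values of D, 3 values of K and 3 synthetic profiles, the toy table has 27 rows. The default ranges seq(2, 15, 1) would give 14 x 14 combinations per profile.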
a list containing 4 entries:
sample.id
a vector of character strings representing the identifiers of the synthetic profiles analysed.
sample1Kg
a vector of character strings representing the identifiers of the 1KG reference profiles used to generate the synthetic profiles.
sp
a vector of character strings representing the known super population ancestry of the 1KG reference profiles used to generate the synthetic profiles.
matKNN
a data.frame containing the super population inference for each synthetic profile for different values of PCA dimensions D and k-neighbors values K. The fourth column title corresponds to the fieldPopInfAnc parameter. The data.frame contains 4 columns:
sample.id
a character string representing the identifier of the synthetic profile analysed.
D
a numeric representing the value of the PCA dimension used to infer the super population.
K
a numeric representing the value of the k-neighbors used to infer the super population.
fieldPopInfAnc value
a character string representing the inferred ancestry.
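As a rough illustration of how a table shaped like matKNN might be summarised, the sketch below compares inferred and known ancestry for every (D, K) combination. All values are made up, and the names matToy, knownToy, isMatch and agreement are hypothetical; in particular, the named vector of known ancestry is an assumption for this example and is not the structure of the sp entry.

```r
## Toy data.frame shaped like matKNN (values are made up for illustration)
matToy <- data.frame(
    sample.id=rep(c("syn1", "syn2"), each=4),
    D=rep(c(2, 2, 3, 3), times=2),
    K=rep(c(2, 3), times=4),
    SuperPop=c("EUR", "EUR", "AFR", "EUR", "AFR", "AFR", "AFR", "EUR"),
    stringsAsFactors=FALSE)

## Hypothetical known ancestry, named by synthetic profile identifier
knownToy <- c(syn1="EUR", syn2="AFR")

## Flag each inference as matching the known ancestry or not
matToy$isMatch <- matToy$SuperPop == unname(knownToy[matToy$sample.id])

## Proportion of matching inferences for every (D, K) combination
agreement <- aggregate(isMatch ~ D + K, data=matToy, FUN=mean)
agreement
```

A summary of this kind shows which combinations of D and K recover the known ancestry of the synthetic profiles most often.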
## Required libraries
library(RAIDS)
library(gdsfmt)
library(SNPRelate)
## Load the demo PCA on the synthetic profiles projected on the
## demo 1KG reference PCA
data(demoPCASyntheticProfiles)
## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)
## Path to the directory containing the demo Profile GDS file in this package
dataDir <- system.file("extdata/demoKNNSynthetic", package="RAIDS")
## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))
## The name of the synthetic study
studyID <- "MYDATA.Synthetic"
## Run the k-nearest neighbors analysis on the synthetic profiles
## projected on the 1KG PCA
results <- computeKNNRefSynthetic(gdsProfile=gdsProfile,
    listEigenvector=demoPCASyntheticProfiles,
    listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"), studyIDSyn=studyID,
    spRef=demoKnownSuperPop1KG)
## The inferred ancestry for the synthetic profiles for different values
## of D and K
head(results$matKNN)
#> sample.id D K SuperPop
#> 1 1.ex1.HG00246.1 2 2 SAS
#> 2 1.ex1.HG00246.1 2 3 EAS
#> 3 1.ex1.HG00246.1 2 4 AMR
#> 4 1.ex1.HG00246.1 2 5 EUR
#> 5 1.ex1.HG00246.1 2 6 EUR
#> 6 1.ex1.HG00246.1 2 7 EAS
## Close Profile GDS file (important)
closefn.gds(gdsProfile)