R/processStudy.R
inferAncestry.Rd
This function runs most steps leading to the ancestry inference call on a specific RNA profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population Reference GDS file.
inferAncestry(
profileFile,
pathProfileGDS,
fileReferenceGDS,
fileReferenceAnnotGDS,
chrInfo,
syntheticRefDF,
genoSource = c("snp-pileup", "generic", "VCF", "bam"),
np = 1L,
verbose = FALSE
)
profileFile
a character string representing the path and the file name of the genotype file or the BAM file. If genoSource is 'snp-pileup', the file extension must be .txt.gz; if genoSource is 'VCF', the extension must be .vcf.gz.
pathProfileGDS
a character string representing the path to the directory where the GDS Profile files will be created. Default: NULL.
fileReferenceGDS
a character string representing the file name of the Population Reference GDS file. The file must exist.
fileReferenceAnnotGDS
a character string representing the file name of the Population Reference GDS Annotation file. The file must exist.
chrInfo
a vector of positive integer values representing the length of the chromosomes. See 'details' section.
syntheticRefDF
a data.frame containing a subset of reference profiles for each sub-population present in the Reference GDS file. The data.frame must have those columns:
a character string representing the sample identifier.
a character string representing the subcontinental population assigned to the sample.
a character string representing the super-population assigned to the sample.
genoSource
a character string with four possible values: 'snp-pileup', 'generic', 'VCF' or 'bam'. It specifies if the genotype files are generated by snp-pileup (Facets) or are a generic format CSV file with at least those columns: 'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'. The 'Count' is the depth at the specified position; 'File1R' is the depth of the reference allele and 'File1A' is the depth of the specific alternative allele. Finally, the file can be a VCF file with at least those genotype fields: GT, AD, DP.
np
a single positive integer specifying the number of threads to be used. Default: 1L.
verbose
a logical indicating if messages should be printed to show the different steps in the function. Default: FALSE.
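The 'generic' input format described above can be illustrated with a small table. The following is a minimal sketch in base R, not code from the package: the helper `checkGenericTable()` is hypothetical, the genotype values and the chromosome naming are invented, and the check that the two allele depths do not exceed 'Count' is an assumed sanity check rather than a documented requirement.

```r
## Columns the generic format is documented to contain (at least)
requiredCols <- c("Chromosome", "Position", "Ref", "Alt",
                    "Count", "File1R", "File1A")

## Toy genotype table; all values are invented for illustration
geno <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position = c(10583L, 13116L, 45895L),
    Ref = c("G", "T", "A"),
    Alt = c("A", "G", "C"),
    Count = c(20L, 15L, 8L),     # depth at the position
    File1R = c(12L, 15L, 0L),    # depth of the reference allele
    File1A = c(8L, 0L, 8L),      # depth of the alternative allele
    stringsAsFactors = FALSE
)

## Hypothetical helper (not part of RAIDS): stops when a required column
## is missing or when the allele depths exceed the total depth
checkGenericTable <- function(df) {
    missingCols <- setdiff(requiredCols, colnames(df))
    if (length(missingCols) > 0) {
        stop("Missing column(s): ", paste(missingCols, collapse = ", "))
    }
    if (any(df$File1R + df$File1A > df$Count)) {
        stop("Allele depths exceed 'Count' for at least one position.")
    }
    invisible(TRUE)
}

## Write the table as CSV and read it back; forcing Ref and Alt to
## character prevents a column of "T" alleles from being read as logical
csvFile <- tempfile(fileext = ".csv")
write.csv(geno, csvFile, row.names = FALSE, quote = FALSE)
readBack <- read.csv(csvFile, stringsAsFactors = FALSE,
                        colClasses = c(Ref = "character", Alt = "character"))
checkGenericTable(readBack)
unlink(csvFile)
```

A file validated this way would presumably be passed as profileFile together with genoSource="generic"; whether the package expects the file to be compressed is not stated here.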
a list containing 5 entries:
pcaSample
a list containing the information related to the eigenvectors. The list contains those 3 entries:
sample.id
a character string representing the unique identifier of the current profile.
eigenvector.ref
a matrix of numeric containing the eigenvectors for the reference profiles.
eigenvector
a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.
paraSample
a list containing the results with different D and K values that lead to optimal parameter selection. The list contains those entries:
dfPCA
a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains those columns:
D
a numeric representing the value of D (the number of dimensions).
median
a numeric representing the median of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
mad
a numeric representing the MAD of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
upQuartile
a numeric representing the upper quartile of the minimum AUROC obtained (within super populations) for all combinations of the fixed D value and all tested K values.
k
a numeric representing the optimal K value (the number of neighbors) for a fixed D value.
dfPop
a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains those columns:
D
a numeric representing the value of D (the number of dimensions).
K
a numeric representing the value of K (the number of neighbors).
AUROC.min
a numeric representing the minimum accuracy obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.
AUROC
a numeric representing the accuracy obtained by grouping all the synthetic results for the specified values of D and K.
Accu.CM
a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.
dfAUROC
a data.frame containing the summary of the results by super-population. The data.frame contains those columns:
D
a numeric representing the value of D (the number of dimensions).
K
a numeric representing the value of K (the number of neighbors).
Call
a character string representing the super-population.
L
a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
AUROC
a numeric representing the AUROC obtained for the fixed values of super-population, D and K.
H
a numeric representing the higher value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
D
a numeric representing the optimal D value (the number of dimensions) for the specific profile.
K
a numeric representing the optimal K value (the number of neighbors) for the specific profile.
listD
a numeric representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.
KNNSample
a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains those columns:
sample.id
a character string representing the unique identifier of the current profile.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop
a character string representing the inferred ancestry for the specified D and K values.
KNNSynthetic
a data.frame containing the inferred ancestry for each synthetic data for different values of K and D. The data.frame contains those columns: "sample.id", "D", "K", "infer.superPop" and "ref.superPop".
sample.id
a character string representing the unique identifier of the current synthetic data.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
infer.superPop
a character string representing the inferred ancestry for the specified D and K values.
ref.superPop
a character string representing the known ancestry from the reference.
Ancestry
a data.frame containing the inferred ancestry for the current profile. The data.frame contains those columns:
sample.id
a character string representing the unique identifier of the current profile.
D
a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K
a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop
a character string representing the inferred ancestry.
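The relation between the dfPop and dfPCA tables described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the helper `summarizeByD()` is hypothetical, the AUROC.min values are invented, and the rule used to choose the optimal k for each D is not described here, so it is left out.

```r
## Toy stand-in for the dfPop table; all values are invented
dfPop <- data.frame(
    D = rep(c(2, 3), each = 3),
    K = rep(c(5, 7, 9), times = 2),
    AUROC.min = c(0.80, 0.85, 0.90, 0.70, 0.95, 0.90)
)

## Hypothetical helper: one row per D with the median, the MAD and the
## upper quartile of AUROC.min taken over all tested K values
summarizeByD <- function(df) {
    parts <- split(df$AUROC.min, df$D)
    data.frame(
        D = as.numeric(names(parts)),
        median = vapply(parts, median, numeric(1)),
        mad = vapply(parts, mad, numeric(1)),
        upQuartile = vapply(parts, quantile, numeric(1),
                            probs = 0.75, names = FALSE),
        row.names = NULL
    )
}

dfPCA <- summarizeByD(dfPop)
print(dfPCA)
```

The sketch relies on the defaults of `stats::mad()` (scaled by the constant 1.4826) and `stats::quantile()` (type 7); the package may use different settings.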
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
## Required libraries (RAIDS provides inferAncestry and select1KGPop)
library(RAIDS)
library(SNPRelate)
## The demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")
fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")
#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
demoProfileEx1 <- file.path(dataDir, "example", "snpPileup", "ex1.txt.gz")
#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")
####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)
gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)
## Required library for this example to run correctly
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
        requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {

    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chrM is 25
    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]

    # \donttest{
    res <- inferAncestry(profileFile=demoProfileEx1,
        pathProfileGDS=pathProfileGDS,
        fileReferenceGDS=fileReferenceGDS,
        fileReferenceAnnotGDS=fileAnnotGDS,
        chrInfo=chrInfo,
        syntheticRefDF=dataRef,
        genoSource="snp-pileup")

    unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
    # }
}
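The object returned by the call above is a nested list, so the entries can be reached with the usual `$` accessors. The sketch below uses a mock object that only mimics the documented shape, so it runs without the package or the demo files. Every value in `mockRes` is invented, and the agreement between the Ancestry entry and the KNNSample row at the optimal D and K holds here by construction only.

```r
## Mock result with the documented entry names; all values are invented
mockRes <- list(
    paraSample = list(D = 3, K = 7, listD = c(3, 4)),
    KNNSample = data.frame(
        sample.id = "ex1",
        D = c(3, 3, 4),
        K = c(5, 7, 7),
        SuperPop = c("EUR", "AFR", "AFR"),
        stringsAsFactors = FALSE
    ),
    Ancestry = data.frame(
        sample.id = "ex1", D = 3, K = 7, SuperPop = "AFR",
        stringsAsFactors = FALSE
    )
)

## Inferred super-population for the profile
finalCall <- mockRes$Ancestry$SuperPop

## Row of KNNSample obtained with the optimal D and K values
atOptimum <- subset(mockRes$KNNSample,
                    D == mockRes$paraSample$D & K == mockRes$paraSample$K)

print(finalCall)
print(atOptimum)
```

With a real result, replacing `mockRes` by `res` should give the same kind of access, assuming the returned list follows the structure documented in this page.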