R/processStudy.R
    inferAncestry.Rd
This function runs most steps leading to the ancestry inference call on a specific RNA profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population Reference GDS file.
inferAncestry(
  profileFile,
  pathProfileGDS,
  fileReferenceGDS,
  fileReferenceAnnotGDS,
  chrInfo,
  syntheticRefDF,
  genoSource = c("snp-pileup", "generic", "VCF", "bam"),
  np = 1L,
  verbose = FALSE
)
profileFile: a character string representing the path and the
file name of the genotype file or the BAM file. If genoSource is 'snp-pileup', the
file extension must be .txt.gz; if 'VCF', the extension must be .vcf.gz.
pathProfileGDS: a character string representing the path to
the directory where the GDS Profile files will be created.
Default: NULL.
fileReferenceGDS: a character string representing the file
name of the Population Reference GDS file. The file must exist.
fileReferenceAnnotGDS: a character string representing the
file name of the Population Reference GDS Annotation file. The file
must exist.
chrInfo: a vector of positive integer values
representing the length of the chromosomes. See 'details' section.
syntheticRefDF: a data.frame containing a subset of
reference profiles for each sub-population present in the Reference GDS
file. The data.frame must have these columns:
    a character string representing the sample
    identifier.
    a character string representing the
    subcontinental population assigned to the sample.
    a character string representing the
    super-population assigned to the sample.
genoSource: a character string with four possible values:
'snp-pileup', 'generic', 'VCF' or 'bam'. It specifies whether the genotype files
are generated by snp-pileup (Facets) or are generic-format CSV files
with at least these columns:
'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'.
'Count' is the depth at the specified position;
'File1R' is the depth of the reference allele and
'File1A' is the depth of the specific alternative allele.
Finally, the file can be a VCF file with at least these genotype
fields: GT, AD, DP.
np: a single positive integer specifying the number of
threads to be used. Default: 1L.
verbose: a logical indicating if messages should be printed
to show the progress of the different steps in the function. Default: FALSE.
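The 'generic' layout described for genoSource can be illustrated with a small hand-made table. This is only a sketch: the values are invented, the "chr1"-style chromosome labels and the output file name are assumptions, and 'Count' is taken here as the sum of the two allele depths, which the text does not require.

```r
## Toy genotype table in the 'generic' layout; all values are invented
generic <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),   # label style is an assumption
    Position   = c(10583L, 13116L, 10180L),
    Ref        = c("G", "T", "T"),
    Alt        = c("A", "G", "C"),
    File1R     = c(12L, 0L, 7L),              # depth of the reference allele
    File1A     = c(3L, 9L, 7L),               # depth of the alternative allele
    stringsAsFactors = FALSE
)

## 'Count' is the depth at the position; assumed here to be ref + alt depth
generic$Count <- generic$File1R + generic$File1A

## Put the columns in the order listed in the documentation
generic <- generic[, c("Chromosome", "Position", "Ref", "Alt",
                       "Count", "File1R", "File1A")]

## Write a gzip-compressed CSV (the file name is arbitrary)
outFile <- file.path(tempdir(), "toy_generic.csv.gz")
con <- gzfile(outFile, open = "wt")
write.csv(generic, con, row.names = FALSE, quote = FALSE)
close(con)

## Read it back to confirm the header matches the expected columns
back <- read.csv(outFile, stringsAsFactors = FALSE)
```

A file of this shape would be passed as profileFile together with genoSource="generic".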
a list containing 5 entries:
pcaSample: a list containing the information related
to the eigenvectors. The list contains these 3 entries:
    sample.id: a character string representing the unique
    identifier of the current profile.
    eigenvector.ref: a matrix of numeric containing
    the eigenvectors for the reference profiles.
    eigenvector: a matrix of numeric containing the
    eigenvectors for the current profile projected on the PCA from the
    reference profiles.
paraSample: a list containing the results with
different D and K values that lead to optimal parameter
selection. The list contains these entries:
    dfPCA: a data.frame containing statistical results
    on all combined synthetic results done with a fixed value of D (the
    number of dimensions). The data.frame contains these columns:
        D: a numeric representing the value of D (the
        number of dimensions).
        median: a numeric representing the median of the
        minimum AUROC obtained (within super-populations) for all combinations of
        the fixed D value and all tested K values.
        mad: a numeric representing the MAD of the minimum
        AUROC obtained (within super-populations) for all combinations of the fixed
        D value and all tested K values.
        upQuartile: a numeric representing the upper quartile
        of the minimum AUROC obtained (within super-populations) for all
        combinations of the fixed D value and all tested K values.
        k: a numeric representing the optimal K value
        (the number of neighbors) for a fixed D value.
    dfPop: a data.frame containing statistical results on
    all combined synthetic results done with different values of D (the
    number of dimensions) and K (the number of neighbors).
    The data.frame contains these columns:
        D: a numeric representing the value of D (the
        number of dimensions).
        K: a numeric representing the value of K (the
        number of neighbors).
        AUROC.min: a numeric representing the minimum accuracy
        obtained by grouping all the synthetic results by super-populations, for
        the specified values of D and K.
        AUROC: a numeric representing the accuracy obtained
        by grouping all the synthetic results for the specified values of D
        and K.
        Accu.CM: a numeric representing the value of accuracy
        of the confusion matrix obtained by grouping all the synthetic results for
        the specified values of D and K.
    dfAUROC: a data.frame containing the summary of the results by
    super-population. The data.frame contains
    these columns:
        D: a numeric representing the value of D (the
        number of dimensions).
        K: a numeric representing the value of K (the
        number of neighbors).
        Call: a character string representing the
        super-population.
        L: a numeric representing the lower value of the 95%
        confidence interval for the AUROC obtained for the fixed values of
        super-population, D and K.
        AUROC: a numeric representing the AUROC obtained for the
        fixed values of super-population, D and K.
        H: a numeric representing the higher value of the 95%
        confidence interval for the AUROC obtained for the fixed values of
        super-population, D and K.
    D: a numeric representing the optimal D value
    (the number of dimensions) for the specific profile.
    K: a numeric representing the optimal K value
    (the number of neighbors) for the specific profile.
    listD: a numeric representing the optimal D
    values (the number of dimensions) for the specific profile. More than one
    D is possible.
KNNSample: a data.frame containing the inferred ancestry
for different values of K and D. The data.frame
contains these columns:
    sample.id: a character string representing the unique
    identifier of the current profile.
    D: a numeric representing the value of D (the
    number of dimensions) used to infer the ancestry.
    K: a numeric representing the value of K (the
    number of neighbors) used to infer the ancestry.
    SuperPop: a character string representing the inferred
    ancestry for the specified D and K values.
KNNSynthetic: a data.frame containing the inferred ancestry
for each synthetic data for different values of K and D.
The data.frame
contains these columns: "sample.id", "D", "K", "infer.superPop", "ref.superPop"
    sample.id: a character string representing the unique
    identifier of the current synthetic data.
    D: a numeric representing the value of D (the
    number of dimensions) used to infer the ancestry.
    K: a numeric representing the value of K (the
    number of neighbors) used to infer the ancestry.
    infer.superPop: a character string representing the inferred
    ancestry for the specified D and K values.
    ref.superPop: a character string representing the known
    ancestry from the reference.
Ancestry: a data.frame containing the inferred
ancestry for the current profile. The data.frame contains these
columns:
    sample.id: a character string representing the unique
    identifier of the current profile.
    D: a numeric representing the value of D (the
    number of dimensions) used to infer the ancestry.
    K: a numeric representing the value of K (the
    number of neighbors) used to infer the ancestry.
    SuperPop: a character string representing the inferred
    ancestry.
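The dfPCA columns summarise, for each fixed D, the minimum AUROC over all tested K values. The sketch below shows one way such a per-D summary could be derived from a dfPop-like table with base R. It is not necessarily how the package computes these values: the toy numbers are invented, `summariseByD` is a hypothetical helper, and the defaults of `mad()` (scaled by 1.4826) and `quantile()` (type 7) are assumptions.

```r
## Toy dfPop-like table; the AUROC.min values are invented for illustration
dfPop <- data.frame(
    D = rep(c(2, 3), each = 3),
    K = rep(c(3, 5, 7), times = 2),
    AUROC.min = c(0.80, 0.90, 0.85, 0.70, 0.75, 0.95)
)

## Hypothetical helper: one summary row per D value,
## computed over all K values tested for that D
summariseByD <- function(df) {
    perD <- lapply(split(df, df$D), function(x) {
        data.frame(
            D = x$D[1],
            median = median(x$AUROC.min),
            mad = mad(x$AUROC.min),
            upQuartile = unname(quantile(x$AUROC.min, 0.75))
        )
    })
    do.call(rbind, perD)
}

dfPCA <- summariseByD(dfPop)
print(dfPCA)
```

Summarising across K for each D gives a view of how stable a given number of dimensions is, independent of any single choice of neighbors.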
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
## Required library for GDS
library(SNPRelate)
## Path to the demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")
fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")
#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
demoProfileEx1 <- file.path(dataDir, "example", "snpPileup", "ex1.txt.gz")
#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")
####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)
gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)
## Required library for this example to run correctly
if (requireNamespace("Seqinfo", quietly=TRUE) &&
     requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chrM is 25
    chrInfo <- Seqinfo::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]
    res <- inferAncestry(profileFile=demoProfileEx1,
        pathProfileGDS=pathProfileGDS,
        fileReferenceGDS=fileReferenceGDS,
        fileReferenceAnnotGDS=fileAnnotGDS,
        chrInfo=chrInfo,
        syntheticRefDF=dataRef,
        genoSource="snp-pileup")
    unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
}
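Once inferAncestry() has returned, the final call can be read from the Ancestry entry of the result. The sketch below uses a hand-built list that mimics the documented structure, so it runs without the package; every value in it is a placeholder, not real output. It also assumes that the Ancestry row matches the KNNSample row at the optimal D and K, which the documentation suggests but does not state.

```r
## Placeholder list mimicking the documented return structure;
## the values are invented for illustration, not real output
res <- list(
    paraSample = list(D = 3, K = 5),
    KNNSample = data.frame(
        sample.id = "ex1",
        D = c(2, 3, 3),
        K = c(5, 5, 7),
        SuperPop = c("AFR", "EUR", "EUR"),
        stringsAsFactors = FALSE
    ),
    Ancestry = data.frame(
        sample.id = "ex1", D = 3, K = 5, SuperPop = "EUR",
        stringsAsFactors = FALSE
    )
)

## Final inferred super-population for the profile
finalCall <- res$Ancestry$SuperPop

## Calls made for every tested (D, K) pair, restricted to the optimal pair
knn <- res$KNNSample
selected <- knn[knn$D == res$paraSample$D & knn$K == res$paraSample$K, ]
```

Comparing `selected$SuperPop` with `finalCall` is a quick sanity check, and the full KNNSample table shows how sensitive the call is to the choice of D and K.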