Run most steps leading to the ancestry inference call on a specific DNA profile

This function runs most steps leading to the ancestry inference call on a specific DNA profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population Reference GDS file.

inferAncestry(
  profileFile,
  pathProfileGDS,
  fileReferenceGDS,
  fileReferenceAnnotGDS,
  chrInfo,
  syntheticRefDF,
  genoSource = c("snp-pileup", "generic", "VCF", "bam"),
  np = 1L,
  verbose = FALSE
)

Arguments

profileFile

a character string representing the path and the file name of the genotype file or the BAM file. If genoSource is 'snp-pileup', the file extension must be .txt.gz; if genoSource is 'VCF', the extension must be .vcf.gz.

pathProfileGDS

a character string representing the path to the directory where the GDS Profile files will be created.

fileReferenceGDS

a character string representing the file name of the Population Reference GDS file. The file must exist.

fileReferenceAnnotGDS

a character string representing the file name of the Population Reference GDS Annotation file. The file must exist.

chrInfo

a vector of positive integer values representing the length of the chromosomes. See 'details' section.

syntheticRefDF

a data.frame containing a subset of reference profiles for each sub-population present in the Reference GDS file. The data.frame must have those columns:

sample.id: a character string representing the sample identifier.
pop.group: a character string representing the subcontinental population assigned to the sample.
superPop: a character string representing the super-population assigned to the sample.
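
The expected layout of syntheticRefDF can be sketched in base R. The sample identifiers below are hypothetical placeholders, and hasRequiredColumns() is an illustrative helper, not a function of the package. In practice the data.frame would typically come from select1KGPop(), as in the Examples section.

```r
## Hypothetical sample identifiers; real values come from the Reference GDS
syntheticRefDF <- data.frame(
    sample.id = c("sample01", "sample02", "sample03", "sample04"),
    pop.group = c("GBR", "GBR", "YRI", "YRI"),
    superPop = c("EUR", "EUR", "AFR", "AFR"),
    stringsAsFactors = FALSE
)

## Illustrative helper (not a RAIDS function): TRUE when the three
## documented columns are present in a data.frame
hasRequiredColumns <- function(df) {
    required <- c("sample.id", "pop.group", "superPop")
    is.data.frame(df) && all(required %in% colnames(df))
}

hasRequiredColumns(syntheticRefDF)
```

A check of this kind only verifies the column names; it does not confirm that the identifiers exist in the Reference GDS file.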

genoSource

a character string with four possible values: 'snp-pileup', 'generic', 'VCF' or 'bam'. It specifies whether the genotype file was generated by snp-pileup (Facets) or is a generic-format CSV file with at least those columns: 'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'. 'Count' is the depth at the specified position; 'File1R' is the depth of the reference allele and 'File1A' is the depth of the specific alternative allele. Finally, the file can be a VCF file with at least those genotype fields: GT, AD, DP.
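
As a rough illustration of the 'generic' layout, the sketch below builds a small table with the seven documented columns, writes it to a temporary CSV file and reads it back. All positions and depths are hypothetical, and the chromosome naming ("chr1"), the plain .csv extension and the absence of compression are assumptions, since this page does not state them.

```r
## Columns documented for the 'generic' format
required <- c("Chromosome", "Position", "Ref", "Alt",
    "Count", "File1R", "File1A")

## Hypothetical read depths for three positions
generic <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position = c(10583L, 13116L, 10180L),
    Ref = c("G", "T", "T"),
    Alt = c("A", "G", "C"),
    Count = c(20L, 15L, 30L),
    File1R = c(12L, 15L, 18L),
    File1A = c(8L, 0L, 11L),
    stringsAsFactors = FALSE
)

## Write the table to a temporary CSV file and read it back
genericFile <- tempfile(fileext = ".csv")
write.csv(generic, genericFile, row.names = FALSE, quote = FALSE)
readBack <- read.csv(genericFile, stringsAsFactors = FALSE,
    colClasses = c(Ref = "character", Alt = "character"))

## Sanity checks: the columns are present, and the reference plus
## alternative depths do not exceed the total depth
hasColumns <- all(required %in% colnames(readBack))
depthsConsistent <- all(readBack$File1R + readBack$File1A <= readBack$Count)

unlink(genericFile)
```

Forcing 'Ref' and 'Alt' to character avoids a column made only of "T" being read as logical.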

np

a single positive integer specifying the number of threads to be used. Default: 1L.

verbose

a logical indicating if messages should be printed to show the progress of the different steps in the function. Default: FALSE.

Value

a list containing 5 entries:

pcaSample

a list containing the information related to the eigenvectors. The list contains those 3 entries:

sample.id: a character string representing the unique identifier of the current profile.
eigenvector.ref: a matrix of numeric containing the eigenvectors for the reference profiles.
eigenvector: a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.

paraSample

a list containing the results with different D and K values that lead to optimal parameter selection. The list contains those entries:

dfPCA

a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains those columns:

D: a numeric representing the value of D (the number of dimensions).
median: a numeric representing the median of the minimum AUROC obtained (within super populations) for all combination of the fixed D value and all tested K values.
mad: a numeric representing the MAD of the minimum AUROC obtained (within super populations) for all combination of the fixed D value and all tested K values.
upQuartile: a numeric representing the upper quartile of the minimum AUROC obtained (within super populations) for all combination of the fixed D value and all tested K values.
k: a numeric representing the optimal K value (the number of neighbors) for a fixed D value.

dfPop

a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains those columns:

D: a numeric representing the value of D (the number of dimensions).
K: a numeric representing the value of K (the number of neighbors).
AUROC.min: a numeric representing the minimum AUROC obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.
AUROC: a numeric representing the AUROC obtained by grouping all the synthetic results for the specified values of D and K.
Accu.CM: a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.

dfAUROC

a data.frame containing the summary of the results by super-population. The data.frame contains those columns:

D: a numeric representing the value of D (the number of dimensions).
K: a numeric representing the value of K (the number of neighbors).
Call: a character string representing the super-population.
L: a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.
AUROC: a numeric representing the AUROC obtained for the fixed values of super-population, D and K.
H: a numeric representing the higher value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.

D

a numeric representing the optimal D value (the number of dimensions) for the specific profile.

K

a numeric representing the optimal K value (the number of neighbors) for the specific profile.

listD

a numeric vector representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.

KNNSample

a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains those columns:

sample.id: a character string representing the unique identifier of the current profile.
D: a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K: a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop: a character string representing the inferred ancestry for the specified D and K values.

KNNSynthetic

a data.frame containing the inferred ancestry for each synthetic profile for different values of K and D. The data.frame contains those columns:

sample.id: a character string representing the unique identifier of the current synthetic data.
D: a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K: a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
infer.superPop: a character string representing the inferred ancestry for the specified D and K values.
ref.superPop: a character string representing the known ancestry from the reference.
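
The infer.superPop and ref.superPop columns lend themselves to a confusion matrix for a given D and K. The sketch below uses a small hypothetical data.frame with the documented columns; it is only one way to summarise such a table and is not presented as the computation the package performs internally.

```r
## Hypothetical KNNSynthetic-like data.frame for a single (D, K) combination
knnSynthetic <- data.frame(
    sample.id = paste0("synthetic", 1:6),
    D = 6,
    K = 5,
    infer.superPop = c("AFR", "AFR", "EUR", "EUR", "EUR", "EAS"),
    ref.superPop = c("AFR", "AFR", "EUR", "EUR", "EAS", "EAS"),
    stringsAsFactors = FALSE
)

## Use a shared set of levels so that the table is square
pops <- sort(unique(c(knnSynthetic$infer.superPop,
    knnSynthetic$ref.superPop)))

## Rows: inferred super-population; columns: known super-population
confusion <- table(
    inferred = factor(knnSynthetic$infer.superPop, levels = pops),
    reference = factor(knnSynthetic$ref.superPop, levels = pops)
)

## Proportion of synthetic profiles whose inferred call matches the reference
accuracy <- sum(diag(confusion)) / sum(confusion)
```

With real output, the data.frame would first be subset to one D and K pair, because KNNSynthetic holds results for several combinations.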

Ancestry

a data.frame containing the inferred ancestry for the current profile. The data.frame contains those columns:

sample.id: a character string representing the unique identifier of the current profile.
D: a numeric representing the value of D (the number of dimensions) used to infer the ancestry.
K: a numeric representing the value of K (the number of neighbors) used to infer the ancestry.
SuperPop: a character string representing the inferred ancestry.
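
To show how the entries described above fit together, the sketch below builds a mock list with part of the documented structure and extracts the inferred super-population along with the matching row of dfAUROC. Every value in the mock object is hypothetical; it is not real output of inferAncestry().

```r
## Mock object mimicking part of the documented return value;
## all values are hypothetical
res <- list(
    paraSample = list(
        dfAUROC = data.frame(
            D = c(4, 4, 6, 6),
            K = c(5, 5, 5, 5),
            Call = c("AFR", "EUR", "AFR", "EUR"),
            L = c(0.90, 0.88, 0.93, 0.91),
            AUROC = c(0.95, 0.93, 0.97, 0.96),
            H = c(1.00, 0.98, 1.00, 1.00),
            stringsAsFactors = FALSE
        ),
        D = 6,
        K = 5
    ),
    Ancestry = data.frame(sample.id = "ex1", D = 6, K = 5,
        SuperPop = "EUR", stringsAsFactors = FALSE)
)

## Inferred super-population for the profile
inferredPop <- res$Ancestry$SuperPop

## AUROC and its 95% confidence interval bounds for that super-population
## at the selected D and K
auroc <- res$paraSample$dfAUROC
support <- auroc[auroc$D == res$paraSample$D &
    auroc$K == res$paraSample$K &
    auroc$Call == inferredPop, c("L", "AUROC", "H")]
```

Reading the call together with the AUROC interval gives a sense of how reliably that super-population was recovered on the synthetic profiles.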

References

Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples


## Required library for GDS
library(SNPRelate)

## Path to the demo 1KG GDS file located in this package
dataDir <- system.file("extdata", package="RAIDS")

#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")

fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")

#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
demoProfileEx1 <- file.path(dataDir, "example", "snpPileup", "ex1.txt.gz")

#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")

####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)

gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)

## Required library for this example to run correctly
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
     requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {

    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chrM is 25
    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]

    # \donttest{

        res <- inferAncestry(profileFile=demoProfileEx1,
            pathProfileGDS=pathProfileGDS,
            fileReferenceGDS=fileReferenceGDS,
            fileReferenceAnnotGDS=fileAnnotGDS,
            chrInfo=chrInfo,
            syntheticRefDF=dataRef,
            genoSource="snp-pileup")

        unlink(pathProfileGDS, recursive=TRUE, force=TRUE)

    # }
}