This function runs most of the steps leading to the ancestry inference call on a specific RNA profile. First, the function creates the Profile GDS file for the specific profile using the information from an RDS Sample description file and the Population Reference GDS file.

inferAncestryGeneAware(
  profileFile,
  pathProfileGDS,
  fileReferenceGDS,
  fileReferenceAnnotGDS,
  chrInfo,
  syntheticRefDF,
  genoSource = c("snp-pileup", "generic", "VCF", "bam"),
  np = 1L,
  blockTypeID,
  verbose = FALSE
)

Arguments

profileFile

a character string representing the path and the file name of the genotype file or the BAM file. If genoSource is 'snp-pileup', the file extension must be .txt.gz; if genoSource is 'VCF', the extension must be .vcf.gz.

pathProfileGDS

a character string representing the path to the directory where the GDS Profile files will be created.

fileReferenceGDS

a character string representing the file name of the Population Reference GDS file. The file must exist.

fileReferenceAnnotGDS

a character string representing the file name of the Population Reference GDS Annotation file. The file must exist.

chrInfo

a vector of positive integer values representing the length of the chromosomes. See 'details' section.

syntheticRefDF

a data.frame containing a subset of reference profiles for each sub-population present in the Reference GDS file. The data.frame must have the following columns:

sample.id

a character string representing the sample identifier.

pop.group

a character string representing the subcontinental population assigned to the sample.

superPop

a character string representing the super-population assigned to the sample.
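As a minimal sketch of the expected layout, the data.frame below is built by hand with the three required columns. The sample identifiers are placeholders rather than real reference profiles; in the Examples section, the equivalent object is obtained from select1KGPop().

```r
## Hypothetical reference subset: two profiles per sub-population.
## The sample identifiers are placeholders, not real 1KG profiles.
syntheticRefDF <- data.frame(
    sample.id = c("ref_001", "ref_002", "ref_003", "ref_004"),
    pop.group = c("GBR", "GBR", "YRI", "YRI"),
    superPop  = c("EUR", "EUR", "AFR", "AFR"),
    stringsAsFactors = FALSE
)

## Check that the required columns are present and hold character strings
requiredCols <- c("sample.id", "pop.group", "superPop")
hasAllCols <- all(requiredCols %in% colnames(syntheticRefDF))
allCharacter <- all(vapply(syntheticRefDF[requiredCols], is.character,
                            logical(1)))
```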

genoSource

a character string with four possible values: 'snp-pileup', 'generic', 'VCF' or 'bam'. It specifies whether the genotype files are generated by snp-pileup (Facets) or are generic format CSV files with at least the following columns: 'Chromosome', 'Position', 'Ref', 'Alt', 'Count', 'File1R' and 'File1A'. 'Count' is the depth at the specified position; 'File1R' is the depth of the reference allele and 'File1A' is the depth of the specific alternative allele. Finally, the file can be a VCF file with at least the following genotype fields: GT, AD, DP.
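The column layout of the 'generic' format can be sketched with a small hand-made table. All values below are made up, and the chromosome naming style ("chr1") and the plain .csv extension are assumptions of this sketch, not requirements stated here.

```r
## Illustrative genotype table in the 'generic' layout; all values are made up
genoDF <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position   = c(100500L, 230710L, 48200L),
    Ref        = c("A", "G", "C"),
    Alt        = c("G", "T", "T"),
    Count      = c(15L, 10L, 14L),   # total depth at the position
    File1R     = c(12L, 0L, 7L),     # depth of the reference allele
    File1A     = c(3L, 9L, 7L),      # depth of the alternative allele
    stringsAsFactors = FALSE
)

## Write the table as CSV to a temporary file and read it back
genoFile <- tempfile(fileext = ".csv")
write.csv(genoDF, file = genoFile, row.names = FALSE, quote = FALSE)
readBack <- read.csv(genoFile, stringsAsFactors = FALSE)
unlink(genoFile)

## Verify the expected columns and that the total depth covers both alleles
expectedCols <- c("Chromosome", "Position", "Ref", "Alt",
                  "Count", "File1R", "File1A")
missingCols <- setdiff(expectedCols, colnames(readBack))
depthConsistent <- all(readBack$Count >= readBack$File1R + readBack$File1A)
```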

np

a single positive integer specifying the number of threads to be used. Default: 1L.

blockTypeID

a character string corresponding to the block type used to extract the block identifiers. The block type must be present in the GDS Reference Annotation file.

verbose

a logical indicating if messages should be printed to show the progress of the different steps in the function. Default: FALSE.

Value

a list containing 5 entries:

pcaSample

a list containing the information related to the eigenvectors. The list contains the following 3 entries:

sample.id

a character string representing the unique identifier of the current profile.

eigenvector.ref

a matrix of numeric containing the eigenvectors for the reference profiles.

eigenvector

a matrix of numeric containing the eigenvectors for the current profile projected on the PCA from the reference profiles.

paraSample

a list containing the results with different D and K values that lead to optimal parameter selection. The list contains those entries:

dfPCA

a data.frame containing statistical results on all combined synthetic results done with a fixed value of D (the number of dimensions). The data.frame contains those columns:

D

a numeric representing the value of D (the number of dimensions).

median

a numeric representing the median of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.

mad

a numeric representing the MAD of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.

upQuartile

a numeric representing the upper quartile of the minimum AUROC obtained (within super-populations) for all combinations of the fixed D value and all tested K values.

k

a numeric representing the optimal K value (the number of neighbors) for a fixed D value.

dfPop

a data.frame containing statistical results on all combined synthetic results done with different values of D (the number of dimensions) and K (the number of neighbors). The data.frame contains those columns:

D

a numeric representing the value of D (the number of dimensions).

K

a numeric representing the value of K (the number of neighbors).

AUROC.min

a numeric representing the minimum AUROC obtained by grouping all the synthetic results by super-populations, for the specified values of D and K.

AUROC

a numeric representing the AUROC obtained by grouping all the synthetic results for the specified values of D and K.

Accu.CM

a numeric representing the value of accuracy of the confusion matrix obtained by grouping all the synthetic results for the specified values of D and K.

dfAUROC

a data.frame containing the summary of the results by super-population. The data.frame contains those columns:

D

a numeric representing the value of D (the number of dimensions).

K

a numeric representing the value of K (the number of neighbors).

Call

a character string representing the super-population.

L

a numeric representing the lower value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.

AUROC

a numeric representing the AUROC obtained for the fixed values of super-population, D and K.

H

a numeric representing the higher value of the 95% confidence interval for the AUROC obtained for the fixed values of super-population, D and K.

D

a numeric representing the optimal D value (the number of dimensions) for the specific profile.

K

a numeric representing the optimal K value (the number of neighbors) for the specific profile.

listD

a numeric representing the optimal D values (the number of dimensions) for the specific profile. More than one D is possible.
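The per-D statistics described for dfPCA can be sketched from a dfPop-like table with base R. The AUROC.min values are made up. Two choices are assumptions of this sketch rather than documented behaviour: mad() is called with its default scaling constant, and the K reported for each D is the one with the highest AUROC.min.

```r
## Made-up AUROC.min values for two D values and three K values
dfPop <- data.frame(
    D = rep(c(3, 4), each = 3),
    K = rep(c(5, 7, 9), times = 2),
    AUROC.min = c(0.80, 0.90, 0.85, 0.70, 0.75, 0.95)
)

## One summary row per D value
dfPCA <- do.call(rbind, lapply(split(dfPop, dfPop$D), function(df) {
    data.frame(
        D = df$D[1],
        median = median(df$AUROC.min),
        mad = mad(df$AUROC.min),   # default constant 1.4826 (assumption)
        upQuartile = unname(quantile(df$AUROC.min, 0.75)),
        k = df$K[which.max(df$AUROC.min)]   # selection rule is an assumption
    )
}))
rownames(dfPCA) <- NULL
```

With these values, the summary has one row for D = 3 and one for D = 4, each carrying the spread of AUROC.min across the tested K values.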

KNNSample

a data.frame containing the inferred ancestry for different values of K and D. The data.frame contains those columns:

sample.id

a character string representing the unique identifier of the current profile.

D

a numeric representing the value of D (the number of dimensions) used to infer the ancestry.

K

a numeric representing the value of K (the number of neighbors) used to infer the ancestry.

SuperPop

a character string representing the inferred ancestry for the specified D and K values.

KNNSynthetic

a data.frame containing the inferred ancestry for each synthetic data for different values of K and D. The data.frame contains those columns:

sample.id

a character string representing the unique identifier of the current synthetic data.

D

a numeric representing the value of D (the number of dimensions) used to infer the ancestry.

K

a numeric representing the value of K (the number of neighbors) used to infer the ancestry.

infer.superPop

a character string representing the inferred ancestry for the specified D and K values.

ref.superPop

a character string representing the known ancestry from the reference.
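The infer.superPop and ref.superPop columns make it possible to build a confusion matrix for a given pair of D and K values. The sketch below uses made-up rows shaped like KNNSynthetic and computes the overall accuracy with base R; it illustrates the idea behind a confusion-matrix accuracy and is not presented as the package's internal computation.

```r
## Made-up KNNSynthetic-like rows for a single (D, K) pair
KNNSynthetic <- data.frame(
    sample.id = paste0("synthetic_", 1:6),   # placeholder identifiers
    D = 4,
    K = 7,
    infer.superPop = c("EUR", "EUR", "AFR", "AFR", "EAS", "EUR"),
    ref.superPop   = c("EUR", "EUR", "AFR", "AFR", "EAS", "EAS"),
    stringsAsFactors = FALSE
)

## Use the same factor levels on both axes so the diagonal holds the matches
superPops <- sort(unique(c(KNNSynthetic$infer.superPop,
                           KNNSynthetic$ref.superPop)))
confMat <- table(
    reference = factor(KNNSynthetic$ref.superPop, levels = superPops),
    inferred  = factor(KNNSynthetic$infer.superPop, levels = superPops)
)

## Proportion of synthetic profiles whose inferred call matches the reference
accuracy <- sum(diag(confMat)) / sum(confMat)
```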

Ancestry

a data.frame containing the inferred ancestry for the current profile. The data.frame contains those columns:

sample.id

a character string representing the unique identifier of the current profile.

D

a numeric representing the value of D (the number of dimensions) used to infer the ancestry.

K

a numeric representing the value of K (the number of neighbors) used to infer the ancestry.

SuperPop

a character string representing the inferred ancestry.
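As a rough sketch of how the entries relate, the mock list below contains only paraSample and KNNSample, with made-up values. It assumes that the final call corresponds to the KNNSample row matching the optimal D and K stored in paraSample; that correspondence is an assumption of this sketch.

```r
## Mock result holding only the entries needed here; all values are made up
res <- list(
    paraSample = list(D = 4, K = 7),
    KNNSample = data.frame(
        sample.id = "ex1",
        D = c(3, 3, 4, 4),
        K = c(5, 7, 5, 7),
        SuperPop = c("EUR", "EUR", "AFR", "EUR"),
        stringsAsFactors = FALSE
    )
)

## Keep the row whose D and K match the optimal parameters
isOptimal <- res$KNNSample$D == res$paraSample$D &
    res$KNNSample$K == res$paraSample$K
ancestry <- res$KNNSample[isOptimal, ]
```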

Details

The runExomeAncestry() function generates 3 types of files in the OUTPUT directory.

Ancestry Inference

The ancestry inference CSV file (".Ancestry.csv" file)

Inference Information

The inference information RDS file (".infoCall.rds" file)

Synthetic Information

The parameter information RDS files from the synthetic inference ("KNN.synt.*.rds" files in a sub-directory)

In addition, a sub-directory (named using the profile ID) is also created.

References

Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples


## Required library for GDS
library(SNPRelate)

## Path to the demo 1KG GDS file located in this package
dataDir <- system.file("extdata", package="RAIDS")


#################################################################
## The 1KG GDS file and the 1KG SNV Annotation GDS file
## need to be located in the same directory
## Note that the 1KG GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#################################################################
path1KG <- file.path(dataDir, "tests")

fileReferenceGDS <- file.path(path1KG, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(path1KG, "ex1_good_small_1KG_Annot.gds")

#################################################################
## The Sample SNP pileup files (one per sample) need
## to be located in the same directory.
#################################################################
demoProfileEx1 <- file.path(dataDir, "example", "snpPileup", "ex1.txt.gz")

#################################################################
## The path where the Profile GDS Files (one per sample)
## will be created needs to be specified.
#################################################################
pathProfileGDS <- file.path(tempdir(), "out.tmp")

####################################################################
## Fix seed to ensure reproducible results
####################################################################
set.seed(3043)

gds1KG <- snpgdsOpen(fileReferenceGDS)
dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
closefn.gds(gds1KG)

## Required library for this example to run correctly
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
     requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {

    ## Chromosome length information
    ## chr23 is chrX, chr24 is chrY and chr25 is chrM
    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]

    # \donttest{

        res <- inferAncestryGeneAware(profileFile=demoProfileEx1,
            pathProfileGDS=pathProfileGDS,
            fileReferenceGDS=fileReferenceGDS,
            fileReferenceAnnotGDS=fileAnnotGDS,
            chrInfo=chrInfo,
            syntheticRefDF=dataRef,
            blockTypeID="GeneS.Ensembl.Hsapiens.v86",
            genoSource="snp-pileup")

        unlink(pathProfileGDS, recursive=TRUE, force=TRUE)

    # }
}