Pre-process DNA copy number (CN) data for detection of CN events.

The package evaluates DNA copy number data, using both their initial form (copy number as a noisy function of genomic position) and their approximation by a piecewise-constant function (segmentation), for the purpose of identifying genomic regions where the copy number differs from the norm.

CNpreprocessing(
  segall,
  ratall = NULL,
  idCol = NULL,
  startCol = NULL,
  endCol = NULL,
  medCol = NULL,
  madCol = NULL,
  errorCol = NULL,
  chromCol = NULL,
  bpStartCol = NULL,
  bpEndCol = NULL,
  annot = NULL,
  annotStartCol = NULL,
  annotEndCol = NULL,
  annotChromCol = NULL,
  useEnd = FALSE,
  blsize = NULL,
  minJoin = NULL,
  nTrial = 10,
  bestBIC = -1e+07,
  modelNames = "E",
  cWeight = NULL,
  bsTimes = NULL,
  chromRange = NULL,
  nJobs = 1,
  normalLength = NULL,
  normalMedian = NULL,
  normalMad = NULL,
  normalError = NULL
)

Arguments

segall	a `matrix` or a `data.frame` for segmented copy number profiles. It may have a character column, with a name specified by `idCol`, and/or numeric columns with names specified by `startCol`, `endCol`, `medCol`, `madCol`, `errorCol`, `chromCol`, `bpStartCol` and `bpEndCol`. Each row of `segall` corresponds to a segment belonging to one of the profiles to be pre-processed.
ratall	a `matrix` whose rows correspond to genomic positions and columns to copy number profiles. The elements of this matrix are functions of copy number, most often log ratios of copy number to the expected standard value, such as 2 in diploid genomes.
idCol	a `character` string specifying the name for the column in `segall` tabulating the profile IDs. When not specified, the numerical column of the `ratall` object will be used as the profile IDs. Default: `NULL`.
startCol	a `character` string specifying the name of the column in `segall` that tabulates the (integer) start position of each segment in internal units, such as probe numbers for data of CGH microarray origin. Default: `NULL`.
endCol	a `character` string specifying the name of the column in `segall` that tabulates the (integer) end position of each segment in internal units, such as probe numbers for data of CGH microarray origin. Default: `NULL`.
medCol	a `character` string specifying the name of the column in `segall` that tabulates, for each segment, the (numeric) value of the function of copy number used in the study (typically log ratios). Default: `NULL`.
madCol	a `character` string specifying the name of the column in `segall` that tabulates, for each segment, a (numeric) measure of the spread of the function of copy number used in the study (typically log ratios). Default: `NULL`.
errorCol	a `character` string specifying the name of the column in `segall` that tabulates, for each segment, the (numeric) error of the function of copy number used in the study (typically log ratios). Default: `NULL`.
chromCol	a `character` string specifying the name of the column in `segall` that tabulates the (integer) chromosome number for each segment.
bpStartCol	a `character` string specifying the name of the column in `segall` that tabulates the (integer) genomic start coordinate of each segment.
bpEndCol	a `character` string specifying the name of the column in `segall` that tabulates the (integer) genomic end coordinate of each segment.
annot	a `matrix` or a `data.frame` that contains the annotation for the copy number measurement platform in the study. It is generally expected to contain columns with names specified by `annotStartCol, annotEndCol, annotChromCol`.
annotStartCol	a `character` string specifying the name of the column in `annot` that tabulates the (integer) genomic start coordinate of each copy number measuring unit, such as a probe in the case of CGH microarrays.
annotEndCol	a `character` string specifying the name of the column in `annot` that tabulates the (integer) genomic end coordinate of each copy number measuring unit, such as a probe in the case of CGH microarrays.
annotChromCol	a `character` string specifying the name of the column in `annot` that tabulates the chromosome number of each copy number measuring unit, such as a probe in the case of CGH microarrays.
useEnd	a single logical value specifying whether the segment end positions as given by the `bpEndCol` of `segall` are to be looked up in the `annotEndCol` column of `annot` (if `useEnd=TRUE`) or in the `annotStartCol` column (default). Default: `FALSE`.
blsize	a single `integer` specifying the bootstrap sampling rate of segment medians to generate input for model-based clustering. The number of times a segment is sampled is then given by the (integer) division of the segment length in internal units by `blsize`.
minJoin	a single `numeric` value between `0` and `1` specifying the degree of overlap above which two clusters will be joined into one. Default: `NULL`. TODO: select a default value that is not `NULL`.
nTrial	a single positive `integer` specifying the number of times a model-based clustering is attempted for each profile in order to achieve the highest Bayesian information criterion (BIC). Default: `10`.
bestBIC	a single `numeric` value for initializing the Bayesian information criterion (BIC) maximization. A large negative value is recommended. Default: `-1e7`.
modelNames	a `vector` of `character` strings specifying the names of models to be used in model-based clustering (see package `mclust` for further details). The default is `"E"`.
cWeight	a single `numeric` value between `0` and `1` specifying the minimal share of the central cluster in each profile.
bsTimes	a single positive `double` value specifying the number of times the median of each segment is sampled in order to predict the cluster assignment for the segment. Default: `NULL`. TODO: select a default value that is not `NULL`.
chromRange	an `integer` `vector` enumerating the chromosomes from which segments are to be used for initial model-based clustering. Default: `NULL`.
nJobs	a single positive `integer` specifying the number of worker jobs to create in case of distributed computation. Default: `1`. Always `1` on Windows.
normalLength	an integer `vector` specifying the genomic lengths of segments in the normal reference data. Default: `NULL`.
normalMedian	a numeric `vector`, of the same length as `normalLength`, specifying the segment values of the normal reference segments. Default: `NULL`.
normalMad	a numeric `vector`, of the same length as `normalLength`, specifying the value spreads of the normal reference segments. Default: `NULL`.
normalError	a numeric `vector`, of the same length as `normalLength`, specifying the error values of the normal reference segments. Default: `NULL`.
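
As a rough illustration of the expected input layout, the sketch below builds a toy `segall` table and `ratall` matrix in base R and shows how `blsize` translates into per-segment sampling counts. The column names (`ID`, `start`, `end`), the profile name `P1` and all values are hypothetical, and counting both end points in the segment length is an assumption of this sketch; the package's internal computation may differ.

```r
# Toy inputs; all names and values here are illustrative assumptions.
set.seed(1)
ratall <- matrix(rnorm(300, sd = 0.1), ncol = 1,
                 dimnames = list(NULL, "P1"))   # 300 probes, one profile

segall <- data.frame(
  ID    = "P1",
  start = c(1L, 101L, 251L),     # segment start, in probe numbers
  end   = c(100L, 250L, 300L),   # segment end, in probe numbers
  stringsAsFactors = FALSE
)

blsize <- 50L
# Integer division of the segment length (in internal units) by blsize.
# Counting both end points in the length is an assumption of this sketch.
segLength <- segall$end - segall$start + 1L
nSamples  <- segLength %/% blsize
nSamples
```

With these toy values, longer segments contribute proportionally more sampled medians to the clustering input, and a segment shorter than `blsize` would contribute none.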

Value

The input `segall` `data.frame`, to which some or all of the following columns may be bound, depending on the availability of input:

segmedian a numeric, the median function of copy number
segmad a numeric, the MAD for the function of copy number
mediandev a numeric, the median function of copy number relative to its central value
segerr a numeric, the error estimate for the function of copy number
centerz a numeric between 0 and 1, the probability that the segment is in the central cluster
marginalprob a numeric, the marginal probability for the segment in the central cluster
maxz TODO
maxzmean TODO
maxzsigma TODO
samplesize TODO
negtail a numeric, the probability of finding the deviation as observed or larger in a collection of central segments
negtailnormad a numeric, the probability of finding the deviation/MAD as observed or larger in a collection of central segments
negtailnormerror a numeric, the probability of finding the deviation/error as observed or larger in a collection of central segments

Details

Depending on the availability of input, the function will perform the following operations for each copy number profile.

If raw data are available in addition to segment start and end positions, the median and MAD of each segment will be computed. For each profile, bootstrap sampling of the segment median values will be performed, and the sample will be used to estimate the error in the median for each segment. Model-based clustering (fitting to a Gaussian mixture) of the sample will be performed. The central cluster (the one nearest the expected unaltered value) will be identified and, if necessary, merged with adjacent clusters so that it comprises the minimal required fraction of the data. For each segment, the deviation from the center, the probability of belonging to the central cluster and the marginal probability in the central cluster will be computed.
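
The first two steps of this paragraph can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy profile, the helper `bootMedianError` and its `nBoot` parameter are hypothetical, and the error is taken here as the standard deviation of medians over resamples drawn with replacement. Note that R's `mad()` applies a scaling constant of 1.4826 by default.

```r
set.seed(42)
# Toy log-ratio profile: 200 probes, the last 80 shifted upward (a gain).
ratio <- c(rnorm(120, mean = 0, sd = 0.1), rnorm(80, mean = 0.5, sd = 0.1))
segs  <- data.frame(start = c(1L, 121L), end = c(120L, 200L))

# Median and MAD of the raw values within each segment.
segStats <- t(mapply(function(s, e) {
  x <- ratio[s:e]
  c(segmedian = median(x), segmad = mad(x))
}, segs$start, segs$end))

# Bootstrap estimate of the error in each segment median (hypothetical
# helper): resample the segment's values with replacement and take the
# standard deviation of the resampled medians.
bootMedianError <- function(x, nBoot = 200) {
  sd(replicate(nBoot, median(sample(x, replace = TRUE))))
}
segerr <- mapply(function(s, e) bootMedianError(ratio[s:e]),
                 segs$start, segs$end)

cbind(segs, segStats, segerr)
```

The clustering step is left out of the sketch; per the `modelNames` argument, it relies on the `mclust` package.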

If segment medians or median deviations are available or have been computed, and, in addition, genomic lengths and average values are given for a collection of segments with unaltered copy number, additional estimates will be performed. If median values are available for the unaltered segments, the marginal probability of the observed median or median deviation in the unaltered set will be computed for each segment. Likewise, marginal probabilities for median/MAD and/or median/error will be computed if these statistics are available.
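
One plausible way to read the comparison with the unaltered set is as a length-weighted empirical tail probability. The sketch below assumes that each reference segment is weighted by its genomic length and that the tail is taken toward larger values; the helper `upperTail` and all numbers are hypothetical, and the package may compute these quantities differently.

```r
# Hypothetical normal reference: genomic lengths and median values.
normalLength <- c(5e6, 2e6, 1e7, 3e6, 8e6)
normalMedian <- c(-0.05, 0.02, 0.00, 0.10, -0.01)

# Length-weighted share of reference segments whose value is at least
# as large as the observed one (assumption: weights = genomic lengths).
upperTail <- function(obs, refValue, refLength) {
  vapply(obs, function(v) sum(refLength[refValue >= v]) / sum(refLength),
         numeric(1))
}

# Observed median deviations for three segments of a profile.
mediandev <- c(-0.2, 0.01, 0.4)
tailProb  <- upperTail(mediandev, normalMedian, normalLength)
tailProb
```

A value near 0 indicates a deviation rarely seen among unaltered segments. The same helper could be applied to deviation/MAD or deviation/error ratios when `normalMad` or `normalError` are supplied.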

Author

Alexander Krasnitz
