Usage
This program should be used only for inference on case-control designs with independent (i.e., unrelated) subjects of homogeneous ancestry and biallelic variants.
Input Files
The program takes as input a tab-delimited text file containing column names in the first row and data for independent cases and controls in the following rows. The first column must contain a case indicator (1 = case/0 = control), and all subsequent columns must contain as many tab-separated genotype indicators (coded as 0, 1, or 2 minor alleles, as determined by the combined sample allele frequency) as there are names for columns other than the case indicator. Missing genotypes must be coded as -99. Below is an example of an input file:
caseind geno1 geno2 geno3 geno4 geno5 geno6 geno7 geno8
0 0 0 0 0 0 0 0 2
0 0 0 0 0 -99 0 0 0
1 0 -99 0 0 0 0 1 1
1 -99 0 -99 0 2 1 0 -99
0 0 1 0 0 0 1 0 0
1 1 0 0 2 0 0 0 1
1 0 1 2 0 0 0 0 1
0 2 1 1 0 0 0 -99 1
0 0 0 0 1 0 0 0 0
See the example*.dat files in the examples subdirectory of the tarfile for additional illustrations.
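The format rules above can be checked before a run. The following Python sketch is not part of RVASSOC: the function name read_rvassoc_input and its error messages are hypothetical, and skipping blank lines is an assumption. It only checks the rules stated in this section (tab delimiters, a 0/1 case indicator in the first column, and genotypes coded 0, 1, 2, or -99).

```python
import io

MISSING = -99  # missing-genotype code required by the input format


def read_rvassoc_input(handle):
    """Parse a tab-delimited case-control genotype file.

    Returns (genotype column names, case indicators, genotype rows) and
    raises ValueError when a line violates the documented format.
    """
    header = handle.readline().rstrip("\r\n").split("\t")
    if len(header) < 2:
        raise ValueError("need a case indicator column and at least one genotype column")
    names = header[1:]
    case, geno = [], []
    for lineno, line in enumerate(handle, start=2):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != len(header):
            raise ValueError(f"line {lineno}: expected {len(header)} fields, found {len(fields)}")
        values = [int(field) for field in fields]  # int() raises ValueError on non-integers
        if values[0] not in (0, 1):
            raise ValueError(f"line {lineno}: case indicator must be 0 or 1")
        for g in values[1:]:
            if g not in (0, 1, 2, MISSING):
                raise ValueError(f"line {lineno}: genotype must be 0, 1, 2, or {MISSING}")
        case.append(values[0])
        geno.append(values[1:])
    return names, case, geno


example = "caseind\tgeno1\tgeno2\n0\t0\t2\n1\t-99\t1\n"
names, case, geno = read_rvassoc_input(io.StringIO(example))
```

For a real file, the same function can be called on an open file handle, for example with open("example1.dat") as handle.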
The observations for cases and controls can appear in any order in the input file, but changing this order between runs on the same data set may change the permutation p-values slightly even when the same random number seed is used. None of the observed test statistics or CMC results will change, but, because the permutation procedure shuffles the case status vector from its inputted order, it generates a unique set of shufflings (and sample from the permutation null distribution) for each random number seed and initial ordering of the case status vector.
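The dependence on input order can be illustrated with a small Python sketch. It is only an analogy: Python's random module stands in for whatever generator RVASSOC uses, and the individual labels a through f are hypothetical (the input file has no ID column). With a fixed seed, the shuffle applies the same rearrangement of positions, so which individual receives which permuted status depends on the order in which individuals were supplied.

```python
import random


def permuted_status_by_id(ids, case, seed):
    """Shuffle the case vector from its input order and pair it with the individuals."""
    shuffled = list(case)
    random.Random(seed).shuffle(shuffled)  # stand-in RNG, not RVASSOC's generator
    return dict(zip(ids, shuffled))


ids = ["a", "b", "c", "d", "e", "f"]   # hypothetical labels for six individuals
case = [1, 1, 1, 0, 0, 0]

# Same data and same seed, but the rows are supplied in reverse order.
same_order = permuted_status_by_id(ids, case, seed=314159265)
reversed_order = permuted_status_by_id(ids[::-1], case[::-1], seed=314159265)
```

Both results contain three cases and three controls, but the assignment of permuted status to individuals will generally differ between the two orderings, which is why permutation p-values can change slightly while observed statistics do not.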
All columns in the input file are treated as a single group of variants over which the user wishes to perform inference. Thus, if a user wishes to perform inference on a single gene, the input file must contain variants from only that gene. Alternatively, if a user wants to perform inference over an entire pathway, the input file must contain variants from only that pathway.
For all tests, any variant with only a single genotype in the sample or completely missing genotypes is eliminated from consideration. The number of polymorphic variants, defined as those having 2 or more different genotypes in the sample, is reported by the software with the output of each test for convenience. Although RVASSOC will automatically exclude variants that are monomorphic in the input file from analysis, we highly recommend that the input file include only variants that are polymorphic (i.e., have two or three distinct genotypes in the sample) and individuals with genotype data for at least one of these variants. When performing inference on small groups of variants, not following this recommendation increases the risk of including individuals with observed genotypes at only monomorphic variants. Such individuals will have completely missing genotypes after monomorphic variants are excluded from analysis, which may cause the results given by the CA max, CA sum, and WSS tests to differ from what they would have been had these individuals been excluded a priori. However, because they have some genotype data in the input file, these individuals will not trigger a warning message from the program.
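The recommended preprocessing can be sketched in Python as follows. This is a minimal illustration rather than code from RVASSOC; the function name drop_uninformative is hypothetical, and the data are assumed to be held as a list of column names, a list of case indicators, and a list of genotype rows.

```python
MISSING = -99  # missing-genotype code


def drop_uninformative(names, case, geno):
    """Keep polymorphic variants, then individuals observed at one or more of them."""
    # A variant is polymorphic if it shows 2 or more distinct non-missing genotypes.
    keep_cols = [
        j for j in range(len(names))
        if len({row[j] for row in geno} - {MISSING}) >= 2
    ]
    kept_names = [names[j] for j in keep_cols]
    kept_case, kept_geno = [], []
    for status, row in zip(case, geno):
        sub = [row[j] for j in keep_cols]
        if any(g != MISSING for g in sub):  # at least one observed genotype remains
            kept_case.append(status)
            kept_geno.append(sub)
    return kept_names, kept_case, kept_geno


names = ["geno1", "geno2", "geno3"]
case = [0, 1, 1, 0]
geno = [
    [0, 0, MISSING],
    [1, 0, MISSING],
    [MISSING, 0, MISSING],  # observed only at the monomorphic variant geno2
    [0, 0, MISSING],
]
kept_names, kept_case, kept_geno = drop_uninformative(names, case, geno)
```

In this toy data set only geno1 is polymorphic, so the third individual, who has data only at the monomorphic geno2, is removed before analysis instead of silently becoming completely missing.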
Command Line
The RVASSOC command line has the following format, where <argument> denotes a required argument supplied by the user and each [-flag <value>] denotes an optional flag, with a required user-specified value, that alters the operation of the program:
rvassoc <filename> [-seed <value>] [-perms <value>] [-cmccolmaf <value>]
[-cmcswptol <value>] [-buffsize <value>] [-fceps <value>]
Not specifying an optional flag has the same effect as specifying it with its default value. Brief descriptions of the arguments and flags, including default values, are provided below and in the online usage message printed when calling rvassoc without any arguments. Additional information is provided in Details.
Required Arguments
filename
: The path/name of the input file.
Optional Flags [Default Values]
-seed [314159265]
: Positive integer random number seed.
-perms [10000]
: Number of permutations.
-cmccolmaf [0.01]
: Variants with MAF ≤ cmccolmaf are collapsed in the CMC analysis.
-cmcswptol [1e-8]
: Tolerance for diagonal elements in the covariance matrix below which the corresponding column is declared linearly dependent on previously swept columns by G2SWEEP in the CMC analysis.
-buffsize [1048576]
: Length of line buffer for input file in bytes (i.e., number of characters).
-fceps [1e-10]
: Relative epsilon for comparing two floating-point numbers using Knuth’s algorithm.
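For scripted runs, the command line can be assembled programmatically. The Python sketch below is a hypothetical wrapper, not something shipped with RVASSOC: build_command and run_rvassoc are illustrative names, and run_rvassoc assumes the rvassoc executable is on the PATH. Only the flags and default values listed above are used.

```python
import subprocess

# Optional flags and their documented default values.
DEFAULTS = {
    "seed": 314159265,
    "perms": 10000,
    "cmccolmaf": 0.01,
    "cmcswptol": 1e-8,
    "buffsize": 1048576,
    "fceps": 1e-10,
}


def build_command(filename, **flags):
    """Return the argument list for an rvassoc call, rejecting undocumented flags."""
    unknown = set(flags) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown flag(s): {sorted(unknown)}")
    command = ["rvassoc", filename]
    for name, value in flags.items():
        command += [f"-{name}", str(value)]
    return command


def run_rvassoc(filename, **flags):
    """Run rvassoc (assumed to be on the PATH) and return (stdout, stderr)."""
    done = subprocess.run(build_command(filename, **flags),
                          capture_output=True, text=True)
    return done.stdout, done.stderr


command = build_command("example1.dat", cmccolmaf=0.02)
```

Returning STDERR alongside STDOUT keeps any warning or error messages together with the results they refer to.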
Examples
Three examples are provided so that users can test their compilations against our original compilation and familiarize themselves with the software. The example data files are located in the examples subdirectory of the distribution tarfile and are named example*.dat. When they are analyzed with the following RVASSOC calls, output identical to that in the example*.rvassoc.out files should be printed to STDOUT:
rvassoc example1.dat -cmccolmaf 0.02
rvassoc example2.dat -cmccolmaf 0.005
rvassoc example3.dat
The resulting output for a new compilation can be redirected to a file and compared to the output from our original compilation using a utility such as diff. The following is an example of the output produced by the call rvassoc example2.dat -cmccolmaf 0.005; its contents are explained in greater detail in the Details section.
-------------------------------------------------------------------
RARE VARIANT ASSOCIATION ANALYSIS
Version: 1.12
Copyright (C) 2010-12 University of Miami Miller School of Medicine
This program is distributed under the GNU General Public
License version 3. It comes with ABSOLUTELY NO WARRANTY.
-------------------------------------------------------------------
Input File: example2.dat
Case Indicator Variable: caseind
Genotype Variables:
geno1 geno2 geno3 geno4 geno5 geno6 geno7 geno8 geno9 geno10 geno11 geno12
geno13 geno14 geno15 geno16 geno17 geno18 geno19 geno20 geno21 geno22
Total Variants: 22
--------------------------------
RARE VARIANT ASSOCIATION RESULTS
--------------------------------
Random Number Seed: 314159265
Cases/Controls: 97/412
Average Per-Individual Call Rate: 0.897446
FP Equality Relative Epsilon: 1e-10
Cochran-Armitage Permutation Max X2 Test:
Number of Polymorphic Variants: 20
Permutations: 10000
Observed Max X2: 8.1914 P: 0.151285 P(tied): 0.00309969
Cochran-Armitage Permutation Sum X2 Test:
Number of Polymorphic Variants: 20
Permutations: 10000
Observed Sum X2: 24.7015 P: 0.213279 P(tied): 9.999e-05
WSS Permutation Test:
Number of Polymorphic Variants: 20
Permutations: 10000
Observed W: 25413.5 P: 0.510549 P(tied): 0.00049995
CMC Hotelling's T2 Test:
Number of Polymorphic Variants: 20
Rare Variants Collapsed if MAF <= 0.005
Number of Common Variants: 10
Number of Complete Individuals: 64
G2SWEEP Tolerance: 1e-08
F(7,56): 0.841933 P: 0.557553
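Where the diff utility is not available, the comparison of a new compilation's output against the distributed example*.rvassoc.out files can be approximated with Python's standard difflib module. This is a sketch only; the function name output_differences is hypothetical, and the two short strings below are made-up stand-ins for real output files.

```python
import difflib


def output_differences(reference_text, new_text):
    """Return unified-diff lines between two outputs; an empty list means identical."""
    return list(difflib.unified_diff(
        reference_text.splitlines(),
        new_text.splitlines(),
        fromfile="reference",
        tofile="new",
        lineterm="",
    ))


reference = "Permutations: 10000\nObserved Max X2: 8.1914\n"
identical = output_differences(reference, reference)
changed = output_differences(reference, "Permutations: 10000\nObserved Max X2: 8.2000\n")
```

In practice the two arguments would be the contents of, for example, example2.rvassoc.out and the file to which the new compilation's STDOUT was redirected.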
Details
We recommend the CA max and sum tests over the others implemented in RVASSOC.
The CA max/sum tests base inferences on the permutation null distributions of the maximum/sum over single-variant Cochran-Armitage chi-square statistics in the group of variants under consideration. Kinnamon et al. (2012) found that these tests had power comparable to or greater than the CMC and WSS tests in the presence of extensive neutral variation and missing genotypes and showed that the CA sum test was closely related to the Sequence Kernel Association Test (SKAT). Therefore, we recommend the use of these tests over the others implemented in RVASSOC.
Letting $Q_t$ denote the value of the summary statistic in permutation $t$ and $Q_{obs}$ denote the observed value in the sample, the two-sided p-value for the CA max/sum test is estimated from $m$ permutations by:
$$\hat{p}=\frac{\#(Q_t \ge Q_{obs})+1}{m+1}$$
This p-value follows P: in the output. The program also estimates the permutation null distribution probability mass function at $Q_{obs}$ by:
$$\hat{p}_{tied}=\frac{\#(Q_t = Q_{obs})+1}{m+1}$$
This probability mass follows P(tied): in the output. High values of $\hat{p}_{tied}$ are indicative of a highly discrete permutation null distribution.
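The two estimators above can be written as a short Python sketch. It uses exact floating-point comparisons for simplicity, which is an assumption made for illustration; RVASSOC itself compares statistics with a tolerance. The function name permutation_p_values and the statistic values are made up.

```python
def permutation_p_values(q_perm, q_obs):
    """Return (p_hat, p_tied_hat) from m permuted statistics and the observed one."""
    m = len(q_perm)
    n_ge = sum(1 for q in q_perm if q >= q_obs)  # permutations at least as extreme
    n_eq = sum(1 for q in q_perm if q == q_obs)  # permutations tied with the observed value
    return (n_ge + 1) / (m + 1), (n_eq + 1) / (m + 1)


# Nine made-up permuted statistics; five are >= 3.4 and three are equal to 3.4.
q_perm = [1.2, 3.4, 0.7, 5.1, 3.4, 2.2, 0.1, 4.0, 3.4]
p_hat, p_tied_hat = permutation_p_values(q_perm, q_obs=3.4)
```

Because 1 is added to both numerator and denominator, neither estimate can be 0. With the default 10000 permutations, the smallest possible value is 1/10001, which is approximately 9.999e-05 and matches the P(tied) value shown for the CA sum test in the example output.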
Because two floating-point representations of the same number may differ due to rounding and truncation errors, all comparisons of test statistics in the above calculations are performed using Knuth’s algorithm, as implemented in the GSL function gsl_fcmp. Briefly, Knuth’s algorithm returns:
$$\begin{array}{rcl} Q_t \lt Q_{obs} & \mathrm{if} & Q_t-Q_{obs} \lt -2^{k} \epsilon \\
Q_t \gt Q_{obs} & \mathrm{if} & Q_t-Q_{obs} \gt 2^{k} \epsilon \\
Q_t = Q_{obs} & \mathrm{if} & |Q_t-Q_{obs}| \le 2^{k} \epsilon \end{array}$$
In these formulas, $k$ is the larger of the base-2 exponents of $Q_t$ and $Q_{obs}$, and $\epsilon$ is the floating-point equality relative epsilon set by the flag -fceps.
The CA max/sum tests use all available genotype data at every variant. The average per-individual proportion of available genotypes is reported by the software as the Average Per-Individual Call Rate. If the number of individuals with missing genotypes at a variant in the entire sample equals or exceeds the number of individuals in either the case or the control group, the CA statistic may not be calculable at that variant in the observed sample or in one or more permutations. If this occurs, RVASSOC issues a warning message to STDERR, and the analysis should be rerun with the offending variants excluded. Make sure to capture STDERR so that important error and warning messages are not lost!
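The tolerance-based comparison rule used for the p-value counts above can be sketched in Python. This is an illustrative re-implementation modeled on the behavior described for gsl_fcmp, not the GSL source; it assumes the base-2 exponent $k$ is the one returned by frexp (mantissa in [0.5, 1)) for the argument with the larger absolute value.

```python
import math


def fcmp(x1, x2, epsilon=1e-10):
    """Return -1, 0, or 1 for x1 < x2, x1 approximately equal to x2, or x1 > x2."""
    # Base-2 exponent k of the argument with the larger absolute value.
    larger = x1 if abs(x1) > abs(x2) else x2
    _, exponent = math.frexp(larger)
    delta = math.ldexp(epsilon, exponent)  # 2**k * epsilon
    difference = x1 - x2
    if difference > delta:
        return 1
    if difference < -delta:
        return -1
    return 0


tolerant = fcmp(0.1 + 0.2, 0.3)            # rounding noise is treated as equality
strict = fcmp(0.1 + 0.2, 0.3, epsilon=0.0)  # zero tolerance exposes the rounding error
```

With the default relative epsilon of 1e-10, the sum 0.1 + 0.2 compares equal to 0.3, whereas an exact comparison would count it as greater and could miscount ties in the permutation distribution.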
The CMC test (Li and Leal 2008) collapses variants having overall sample MAF (i.e., MAF estimate from both cases and controls) ≤ the value specified by -cmccolmaf into a single presence/absence indicator in the analysis. An approximate $F$ test based on Hotelling’s $T^2$ is used for inference on the vector comprising this indicator and minor allele counts at the common variants. This vector has a number of elements equaling the number of common variants plus 1.
One issue not considered in the original paper is that linkage disequilibrium (LD) among common variants can reduce the number of linearly independent elements in this random vector, which leads to a singular covariance matrix for which a standard inverse does not exist. However, calculating Hotelling’s $T^2$ statistic with any generalized inverse is equivalent to calculating the statistic with a standard inverse on a full-rank subset of linearly independent common variants (Kinnamon et al. 2012). Goodnight (1979) provides an algorithm for automatically calculating a $g2$ generalized inverse and the dimension of the full-rank subset without any prior knowledge of the full-rank subset. The algorithm involves applying the G2SWEEP operator once to each of the columns of the covariance matrix in succession. This operator zeros the rows and columns corresponding to common variants that are linearly dependent on the previous common variant minor allele counts and/or the rare variant indicator. Linear dependence is determined numerically using a relative tolerance specified by the flag -cmcswptol. The effective number of linearly independent vector elements, $v$, is then automatically obtained by subtracting the number of columns that are zeroed from the total number of columns in the covariance matrix. The p-value is then calculated using the approximate $F(v,N-v-1)$ distribution of the appropriately scaled Hotelling’s $T^2$ statistic calculated using the $g2$ generalized inverse of the covariance matrix, where $N$ is the total number of individuals (cases and controls) used in the test.
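The column-by-column sweeping described above can be sketched in plain Python. This is a minimal illustration of a G2SWEEP-style operator in the spirit of Goodnight (1979), not RVASSOC's implementation; in particular, scaling the tolerance by the original diagonal element is an assumption about what "relative tolerance" means, and the function name g2sweep_all is hypothetical.

```python
def g2sweep_all(matrix, tol=1e-8):
    """Sweep every column in order; return (g2 generalized inverse, independent columns v)."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    diag = [a[k][k] for k in range(n)]  # original diagonal, used to scale the tolerance
    v = 0
    for k in range(n):
        d = a[k][k]
        if abs(d) <= tol * abs(diag[k]):
            # Column k depends linearly on previously swept columns: zero row and column.
            for i in range(n):
                a[k][i] = 0.0
                a[i][k] = 0.0
            continue
        v += 1
        a[k] = [x / d for x in a[k]]            # scale the pivot row
        for i in range(n):
            if i == k:
                continue
            b = a[i][k]
            a[i] = [x - b * y for x, y in zip(a[i], a[k])]  # eliminate column k from row i
            a[i][k] = -b / d
        a[k][k] = 1.0 / d
    return a, v


# Full-rank case: the result is the ordinary inverse.
inverse, v_full = g2sweep_all([[2, 1], [1, 2]])

# The third element is the sum of the first two, so one column is zeroed.
g2, v_reduced = g2sweep_all([[1, 0, 1], [0, 1, 1], [1, 1, 2]])
```

In the second example the third row and column are zeroed and $v$ is 2, so the numerator degrees of freedom would be 2 rather than 3.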
Only individuals with complete genotype data at common variants can be used in calculating Hotelling’s $T^2$. Provided genotype data at all common variants are complete, individuals with missing genotype data at rare variants can be used if at least one minor allele is present for a rare variant with a non-missing genotype because the coding of the rare variant indicator would be 1 regardless of the other rare variant genotypes. However, if no minor allele is present at any rare variant with a non-missing genotype, the coding of the rare variant indicator is ambiguous because it would depend on the values of the unobserved genotypes. Therefore, such individuals also have to be excluded from calculating Hotelling’s $T^2$. The software outputs the number of individuals with complete data who were used in the CMC. If there are not at least 2 cases, 2 controls, and 1 denominator degree of freedom (ddf), the CMC test cannot be performed and the results will read F(-99,-99): -99 P: -99. With ddf≤4, results are reported, but they may not be reliable based on our experiments and the fact that the variance of the $F$ distribution exists only with ddf>4.
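The inclusion rules for individuals with missing genotypes can be made concrete with a small Python sketch. It is an illustration of the rules as stated above, not code from RVASSOC; the function name cmc_row is hypothetical, and the rare/common classification of each variant is taken as given rather than computed from the MAF.

```python
MISSING = -99  # missing-genotype code


def cmc_row(genotypes, is_rare):
    """Return [rare indicator, common minor allele counts...] or None if the individual is excluded."""
    common = [g for g, rare in zip(genotypes, is_rare) if not rare]
    if any(g == MISSING for g in common):
        return None  # incomplete data at a common variant
    rare = [g for g, r in zip(genotypes, is_rare) if r]
    if any(g != MISSING and g > 0 for g in rare):
        indicator = 1  # a minor allele was observed, so missing rare genotypes do not matter
    elif any(g == MISSING for g in rare):
        return None    # indicator would depend on unobserved genotypes
    else:
        indicator = 0
    return [indicator] + common


is_rare = [True, True, False]  # two rare variants followed by one common variant
rows = [
    cmc_row([1, MISSING, 2], is_rare),   # usable: indicator is 1 despite the missing genotype
    cmc_row([0, MISSING, 1], is_rare),   # excluded: indicator is ambiguous
    cmc_row([0, 0, 0], is_rare),         # usable: complete data, no rare minor alleles
    cmc_row([2, 0, MISSING], is_rare),   # excluded: missing common genotype
]
```

Only the rows that are not None would count toward the Number of Complete Individuals reported in the output.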
In cases where there are as many common variants as polymorphic variants (i.e., no rare variants below MAF cutoff), the CMC yields the same result as Hotelling’s $T^2$ test on all variants because the column of zeros for the rare variant indicator is eliminated as linearly dependent by the G2SWEEP operator.
The implementation of the WSS test follows the description in Madsen and Browning (2009) with four modifications. First, midranks are used to break ties in genetic scores when calculating the rank-sum statistic, $W$. Second, a two-sided p-value is used. Third, the two-sided p-value is estimated directly from the permutation distribution of $W$. Letting $W_t$ denote the value in permutation $t$, $W_{obs}$ denote the observed value in the sample, and $\overline{W}$ denote the mean of $W$ over all $m$ permutations, the two-sided p-value is estimated as:
$$\hat{p}^{WSS}=\frac{\#(|W_t - \overline{W}| \ge |W_{obs} - \overline{W}|) + 1}{m + 1}$$
This p-value follows P: in the output. The program also estimates the permutation null distribution probability mass function at $|W_{obs} - \overline{W}|$ by:
$$\hat{p}^{WSS}_{tied}=\frac{\#(|W_t - \overline{W}| = |W_{obs} - \overline{W}|) + 1}{m + 1}$$
This probability mass follows P(tied): in the output. High values of $\hat{p}^{WSS}_{tied}$ are indicative of a highly discrete permutation null distribution. Again, all comparisons in the above calculations are performed using the gsl_fcmp implementation of Knuth’s algorithm. Finally, missing single-variant genotypes, which were not considered in the original paper, are not used in estimation of the MAF in controls and are assigned values of 0 so as not to contribute to the WSS in an individual. This procedure is equivalent to calculating the genetic score over only nonmissing genotypes in each individual.
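The midrank and two-sided p-value modifications can be sketched in Python. This is an illustration under stated assumptions, not RVASSOC's code: the function names are hypothetical, $W$ is taken to be the sum of midranks over cases, exact comparisons replace the tolerance-based ones, and the genetic scores and permuted $W$ values are made up.

```python
def midranks(scores):
    """Rank scores from 1 upward, giving tied scores the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def rank_sum(scores, case):
    """Sum of midranks over cases (assumed definition of W)."""
    return sum(r for r, c in zip(midranks(scores), case) if c == 1)


def wss_p_values(w_perm, w_obs):
    """Two-sided p-value and tied-mass estimates centered at the permutation mean."""
    m = len(w_perm)
    w_bar = sum(w_perm) / m
    dev_obs = abs(w_obs - w_bar)
    n_ge = sum(1 for w in w_perm if abs(w - w_bar) >= dev_obs)
    n_eq = sum(1 for w in w_perm if abs(w - w_bar) == dev_obs)
    return (n_ge + 1) / (m + 1), (n_eq + 1) / (m + 1)


w_obs = rank_sum([0.0, 1.5, 1.5, 3.0], case=[0, 1, 0, 1])
p_hat, p_tied_hat = wss_p_values([10.0, 12.0, 14.0, 16.0, 18.0], w_obs=17.0)
```

Midranks are the reason the observed $W$ can take half-integer values, as in the example output (Observed W: 25413.5).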
References
CA Max/Sum Tests
Kinnamon DD, Hershberger RE, Martin ER. Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS ONE. 2012;7(2):e30238. Epub 2012 Feb 17. PMID: 22363423.
Knuth’s Floating-Point Comparison Algorithm
Knuth DE. The Art of Computer Programming. 3rd ed. Vol. 2, Seminumerical Algorithms. Upper Saddle River, NJ: Addison-Wesley; 1998. Section 4.2.2; p. 233.
CMC Test
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008 Sep 12;83(3):311-21. Epub 2008 Aug 7. PMID: 18691683.
G2SWEEP Operator
Goodnight JH. A tutorial on the SWEEP operator. Am Stat. 1979;33:149-158.
WSS Test
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009 Feb;5(2):e1000384. Epub 2009 Feb 13. PMID: 19214210.