Calculation of P-Values using Monte Carlo

Importance Sampling for the Affected Sib Pair Test


A program for calculation of p-values for the ASP (Affected sib pair) test based on
a manuscript by James Malley, Daniel Naiman and Joan Bailey-Wilson can be freely
downloaded here.

Given a collection of m markers on a single chromosome, with known inter-marker
physical distance, it is assumed that allele sharing for each marker is measured for
a collection of  p affected sib pairs (0=no sharing, 1=1 allele shared, 2=2 alleles shared) .
The total  allele sharing y[i] at each marker i is standardized to give a z-score z[i]=(y[i]-p)/sqrt(p/2)
at each marker.  To test whether some marker among the m of them is significantly linked to
the disease, we reject the null hypothesis of no linkage if max {z[i],i=1,...,m} is sufficiently
large.

The program computes the p-value for this test using naive monte carlo simulation,
and an importance sampling algorithm based on an algorithm introduced in a paper by
Naiman and Priebe (2001).The program takes the following as command inputs:

    number of times to replicate the monte carlo simulation  (this can be taken to be 1)
    monte carlo sample size
    maximal z-score from the experiment
    the number of markers in a fine grid that the actual markers are to be embedded in
        (best to use a highly composite integer, e.g. a power of 2 for this)
    actual number of markers used
    seed for the monte carlo simulation (a positive integer)
    indicator of whether to use unequally spaced markers given as standard input
        or equispaced markers
        (0 means use equispaced markers, 1 means specify markers in standard input)

If the option of specifying unequally spaced markers is used (last argument = 1) then
the positions of the markers (in kbp from one end of the chromosome) are given as standard input
to the program with one position per line. The actual positions don't matter, what really matters
is the inter-marker distances (in kbp).

The reason for the first argument is that the user might wish to compare naive simulation
to imporance sampling. The program repeats the monte carlo experiment a prescribed
number of times and computes the relative efficiency of importance sampling, which is
defined to be the ratio of computation times that leads to importance sampling and naive
sampling giving the same variances.
 

Examples.

./asp 1 10000 2.50 64 6 27482 0

means run the program a single time using a sample size of 10000. The maximal z-statistic
is taken to be 2.50. There are 6 equispaced markers in a grid of 64. The seed is 27482.
Here is the output:

number of trials = 1
mc_sample_size = 10000
maximum z-statistic = 2.570000
Bonferroni p-value = 0.03050956
total number of markers in fine grid = 64
total number of markers used in test = 6
summary statistics
naive time =    0.77000 importance time =    0.87000
average naive p-value estimate =         0.0060000000
average importance p-value estimate =         0.0050849261
 

./asp 10 10000 2.50 64 6 27482 0

means do the above 10 times. Here is the output:

number of trials = 10
mc_sample_size = 10000
maximum z-statistic = 2.570000
Bonferroni p-value = 0.03050956
total number of markers in fine grid = 64
total number of markers used in test = 6
summary statistics
naive time =    7.84000 importance time =    9.22000
average naive p-value estimate =         0.0052100000
std dev naive p-value estimates =        11.5895262752
average importance p-value estimate =         0.0050849261
std dev importance p-value estimates =        11.5896581046
relative efficiency =      0.850

Finally,  here is an example where the markers are specified. A file "marker_list"
contains 6 lines as shown here:

5424
5733
5893
6217
6313
6710
 

giving 6 marker positions along the chromosome in kbp.
To input these positions to the program and do a single run, we use

./asp 1 10000 2.50 64 6 27482 0 < marker_list

and the output looks like this:

number of trials = 1
mc_sample_size = 10000
maximum z-statistic = 2.570000
Bonferroni p-value = 0.03050956
total number of markers in fine grid = 64
total number of markers used in test = 6
summary statistics
naive time =    0.87000 importance time =    0.85000
average naive p-value estimate =         0.0093000000
average importance p-value estimate =         0.0080437429
 

The program runs in a linux x86 environment.  The binary executable can be
downloaded.  click here for download
 

References

Malley, J., Naiman, D.Q., Bailey-Wilson, J..
``A Comprehensive Method for Genome Scans''  (2002).
Submitted for publication.

Naiman, D.Q. and Priebe, C.E. ``Computing Scan Statistic p-Values using Importance
Sampling with Applications to Genetics and Medical Image Analysis'' (2001) Journal of
Computational and Graphical Statistics 10 296-328.