Optimal Pattern Matching
The C program runSequenceAlignment is derived from the Matlab program RUN_sequenceAlignment that was written by Bill Mann (formerly of MIT Lincoln Labs) and distributed as version 0.6 of DARPA SSCA #1. Whereas the Matlab code is serial, the C code has been modified for parallel execution under either OpenMP or MPI, following the suggestions given in the "parallelization.txt" file that is included in the version 0.6 distribution. In the present implementation, Kernels 1 and 2 of the 5-kernel Matlab code have been converted to C and parallelized.
The program operates as follows. A similarity or "scoring" matrix is generated by genSimMatrix.c. Two random sequences of amino acid codons (which will be referred to as the "main" and "match" sequences) are generated by genScalData.c, and then six pre-determined verification sequences are embedded therein. Then in Kernel 1 each OpenMP thread or MPI process compares sub-sequences of the two sequences via the local-affine Smith-Waterman algorithm, and builds a list of the best alignments and their endpoints. These lists may be optionally printed for each OpenMP thread or MPI process as directed by several compile-time parameters. Next, in Kernel 2A each OpenMP thread or MPI process begins at each endpoint and follows each alignment back to its start point, and outputs a list of the best alignments and their start points, endpoints and codon sequences. Kernel 2B merges the results of Kernel 2A from all of the OpenMP threads or MPI processes, and outputs a final list of the best alignments.
The optional scale argument is interpreted as 2*log2(sequence length). For example, setting scale to 40 results in sequences that comprise 2^20 codons, and a sequence alignment problem that performs 2^40 character comparisons. The actual sequence length is a bit longer than 2^[(sequence length)/2] because short verification sequences are embedded at random points into the generated sequences. The default value of scale is 30. For parallel execution, a value of scale less than 22 may give erroneous results because each processor may not consider a sufficiently large sub-sequence to find the verification sequences.
For the reference data set, scale is set at 41, for train, scale is 32 and for test, scale is 30.
Correct operation of the Smith-Waterman algorithm may be verified by the presence of the following six verification sequences at the top of the list of alignments that is output following execution of Kernel 2B. The starting and ending positions are unimportant. The amino acids column (*SIMILARTESTS*, *PARTIALCASES*, *IDENTICAL*, *MISQRMATCHES*, *STARTGAPMIDST---END* and *EVENLESSWKDPALIGNS*) and scores (52 to 57) are important: Starting Amino Codon Ending position acids bases position verifyMergeAlignment 1, succeeded; score 57 14827 *SIMILARTESTS* tgaagcataatgatactggcgaggacggagagcacgagctga 14840 10295 *SIMILARTESTS* taatctattatgattttggctcgtactgaatctacttcttaa 10308 verifyMergeAlignment 2, succeeded; score 56 1734 *PARTIALCASES* tgaccggcgaggacgatagcgctgtgcgcgagcgagagctga 1747 5346 *PARTIALCASES* taacctgctcgtactattgctttgtgtgcttctgaatcttaa 5359 verifyMergeAlignment 3, succeeded; score 55 6877 *IDENTICAL* tgaatagacgagaacacgatatgcgcgctgtga 6887 4187 *IDENTICAL* tgaatagacgagaacacgatatgcgcgctgtga 4197 verifyMergeAlignment 4, succeeded; score 54 4250 *MISQRMATCHES* tgaatgataagccagaggatggcgacgtgccacgagagctga 4263 1293 *MISRQMATCHES* tgaatgataagcaggcagatggcgacgtgccacgagagctga 1306 verifyMergeAlignment 5, succeeded; score 53 998 *STARTGAPMIDST---END* tgaagcacggcgaggacgggggcgccgatgatagacagcacg---------gagaacgactga 1015 13243 *START---MIDSTGAPEND* tgaagcacggcgaggacg---------atgatagacagcacgggggcgccggagaacgactga 13260 verifyMergeAlignment 6, succeeded; score 52 16509 *EVENLESSWKDPALIGNS* tgagaggtggagaacctggagagcagctggaaggacccggcgctgatagggaacagctga 16528 2328 *EVENLESSTVMFALIGNS* taagaagttgaaaatttggaatcttctactgttatgtttgctttgattggtaattcttaa 2347 Note that these verification sequences have alignment scores of 52 to 57. Any other alignment sequences that may arise in the random sequences ought to have alignment scores that are considerably less than 52 and no larger than about 45 for random sequences less than 2^20 in length that correspond to a scale variable of 2^40. In general, the alignment scores for random sequences will increase with increasing lengths of random sequences.