Algorithms in Bioinformatics (CSI 5126) Review and presentation of a scientific paper

Algorithms in Bioinformatics (CSI 5126)
Review and presentation of a scientific paper — Fall 2010

Instructor: Lucia Moura

Version of October 12, 2010

1 Deadline

October 17: Email professor your choice of article (see Section 3.2 below)
November 2: 15–20 minutes presentation; summary of the publication

2 Objectives

Thoroughly study of a specific topic in exact or approximate string matching
Familiarity with the modes of communicating research
Develop your presentation skills

3 Directives

Papers in (refereed) journals and conference proceedings are the main vehicles for communicating scientific information. November 2 (11:30-2:30) will be dedicated to the presentation of scientific publications in the field of exact and approximate string matching. The foundations of string matching algorithms has been presented in class, in the weeks preceding the presentations. You must select a publication that presents either a specialized application or a more efficient algorithm.

3.1 Deliverable

One or two pages summary of the publication
15–20 minutes presentation

3.2 Selected publications

You must select a publication in a distinct area than that of your project. Below you will find a list of publications. If you intend to propose a publication outside of the list, you must contact me by October 15 (E-mail me the details of the publications and a pdf, if available) and wait for my approval to make sure that your choice meets the didactic needs of the course.

Short read alignment
1. Langmead et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol (2009) vol. 10 (3) pp. R25
2. Li and Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) vol. 25 (14) pp. 1754-60
3. Rumble et al. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol (2009) vol. 5 (5) pp. e1000386
Faster sequence alignment
1. Newberg. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics (2008) vol. 24 (16) pp. 1772-8
2. M. Cameron, Y. Bernstein, and H. E. Williams. (2007) Clustered sequence representation for fast homology search. J Comput Biol, 14(5):594–614.
3. M. Cameron and H. E. Williams. (2007) Comparing compressed sequences for faster nucleotide blast searches. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 4(3):349–64.
4. Cameron M, Williams HE, Cannane A. (2006) A deterministic finite automaton for faster protein hit detection in BLAST. J Comput Biol. 2006 May;13(4):965-78.
5. Rasmussen et al. Efficient q-gram filters for finding all epsilon-matches over a given length. J Comput Biol (2006) vol. 13 (2) pp. 296-308
6. X. Cui, T. Vinar, B. Brejová, D. Shasha, and M. Li. (2007) Homology search for genes. Bioinformatics, 23(13):i97–103.
7. Ning Z, Cox AJ, Mullikin JC. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11(10):1725-9.
8. Myers G, Durbin R. (2003) A table-driven, full-sensitivity similarity search algorithm. J Comput Biol. 2003;10(2):103-17.
9. Itoh M, Goto S, Akutsu T, Kanehisa M. (2005) Fast and accurate database homology search using upper bounds of local alignment scores. Bioinformatics. 2005 Apr 1;21(7):912-21.
10. Zhang H. (2003) Alignment of BLAST high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics. 2003 Jul 22;19(11):1391-6.
11. W. J. Kent. (2002) Blat–the blast-like alignment tool. Genome Res, 12(4):656–64.
Pairwise alignment of genomic sequences
1. Abouelhoda et al. CoCoNUT: an efficient system for the comparison and analysis of genomes. BMC Bioinformatics (2008) vol. 9 (1) pp. 476
2. A. C.-C. Shih and W.-H. Li. (2003) GS-Aligner: a novel tool for aligning genomic sequences using bit-level operations. Mol Biol Evol, 20(8):1299–309.
3. U. Schulze, B. Hepp, C. S. Ong, and G. Rätsch. (2007) Palma: mRNA to genome alignments using large margin algorithms. Bioinformatics, 23(15):1892–900, Aug 2007.
4. Delcher AL, Phillippy A, Carlton J, Salzberg SL. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.
5. A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. (1999) Alignment of whole genomes. Nucleic Acids Res, 27(11):2369–76, Jun 1999.
6. Kahveci T, Ljosa V, Singh AK. (2004) Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics. 2004 Sep 1;20(13):2122-34.
7. Kalafus KJ, Jackson AR, Milosavljevic A. (2004) Pash: efficient genome-scale sequence anchoring by Positional Hashing. Genome Res. 2004 Apr;14(4):672-8.
8. Huang W, Umbach DM, Li L. (2006) Accurate anchoring alignment of divergent sequences. Bioinformatics. 2006 Jan 1;22(1):29-34.
9. C. N. Dewey, P. M. Huggins, K. Woods, B. Sturmfels, and L. Pachter. (2006) Parametric alignment of drosophila genomes. PLoS Comput Biol, 2(6):e73.
Median string problem
1. Yue and Tang. A Divide-and-Conquer Implementation of Three Sequence Alignment and Ancestor Inference. Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference on (2007) pp. 143 - 150
Multiple sequence alignment
1. Bradley et al. Fast statistical alignment. PLoS Comput Biol (2009) vol. 5 (5) pp. e1000392
2. C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–40.
3. A. S. Konagurthu and P. J. Stuckey. (2006) Optimal sum-of-pairs multiple sequence alignment using incremental carrillo and lipman bounds. J Comput Biol, 13(3):668–85.
4. T. J. Wheeler and J. D. Kececioglu. (2007) Multiple alignment by aligning alignments. Bioinformatics, 23(13):i559–68.
5. M. Kruspe and P. F. Stadler. (2007) Progressive multiple sequence alignments from triplets. BMC Bioinformatics, 8:254, Jul 2007.
6. I. Elias. (2006) Settling the intractability of multiple alignment. J Comput Biol, 13(7):1323–39.
7. Higgins DG, Blackshields G, Wallace IM. (2005) Mind the gaps: progress in progressive alignment. Proc Natl Acad Sci U S A. 2005 Jul 26;102(30):10411-2. 18.
8. Loytynoja A, Goldman N. (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A. 2005 Jul 26;102(30):10557-62.
9. Pei J, Sadreyev R, Grishin NV. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics. 2003 Feb 12;19(3):427-8.
10. Hudek AK, Brown DG. (2005) Ancestral sequence alignment under optimal conditions. BMC Bioinformatics. 2005 Nov 17;6:273.
11. Notredame C, Higgins DG, Heringa J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17.
12. Lassmann T, Sonnhammer EL. (2005) Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 2005 Dec 12;6:298.
13. Edgar RC. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.
14. Grasso C, Lee C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics. 2004 Jul 10;20(10):1546-56.
Multiple sequence alignment of genomic sequence
1. Dubchak et al. Multiple whole-genome alignments without a reference organism. Genome Res (2009) vol. 19 (4) pp. 682-9
2. Paten et al. Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics (2009) vol. 25 (3) pp. 295-301
3. C. Wang and E. J. Lefkowitz. (2005) Genomic multiple sequence alignments: refinement using a genetic algorithm. BMC Bioinformatics, 6:200.
4. M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res, 14(4):708–15.
5. Bray N, Pachter L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 2004 Apr;14(4):693-9.
6. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E; NISC Comparative Sequencing Program; Green ED, Sidow A, Batzoglou S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003 Apr;13(4):721-31.
7. M. Brudno, C. B. Do, G. M. Cooper, M. F. Kim, E. Davydov, N. C. S. Program, E. D. Green, A. Sidow, and S. Batzoglou. (2003) Lagan and multi-lagan: efficient tools for large-scale multiple alignment of genomic dna. Genome Res, 13(4):721–31.
8. Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics. 2003 Dec 23;4:66.
9. Wang C, Lefkowitz EJ. (2005) Genomic multiple sequence alignments: refinement using a genetic algorithm. BMC Bioinformatics. 2005 Aug 8;6:200.
10. Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics. 2003 Dec 23;4:66.
11. Hohl M, Kurtz S, Ohlebusch E. (2002) Efficient multiple genome alignment. Bioinformatics. 2002;18 Suppl 1:S312-20.
12. Zhang Y, Waterman MS. (2003) An Eulerian path approach to global multiple alignment for DNA sequences. J Comput Biol. 2003;10(6):803-19.
Seeded alignment methods
1. Flannick J, Batzoglou S. (2005) Using multiple alignments to improve seeded local alignment algorithms. Nucleic Acids Res. 2005 Aug 12;33(14):4563-77.
2. Ma B, Tromp J, Li M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002 Mar;18(3):440-5.
3. Li M, Ma B, Kisman D, Tromp J. (2003) PatternHunter II: highly sensitive and fast homology search. Genome Inform. 2003;14:164-75.
4. Gotea V, Veeramachaneni V, Makalowski W. (2003) Mastering seeds for genomic size nucleotide BLAST searches. Nucleic Acids Res. 2003 Dec 1;31(23):6935-41.
Parallel, quantum or hardware implementations
1. Trapnell and Schatz. Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing (2009) vol. 35 (8-9) pp. 429-440
2. Saebo PE, Andersen SM, Myrseth J, Laerdahl JK, Rognes T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W535-9.
3. Rognes T, Seeberg E. (2000) Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000 Aug;16(8):699-706.
4. Rognes T. (2001) ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res. 2001 Apr 1;29(7):1647-52.
5. Qi Y, Lin F. (2005) Parallelisation of the blast algorithm. Cell Mol Biol Lett. 2005;10(2):281-5.
6. P. E. Saebř, S. M. Andersen, J. Myrseth, J. K. Laerdahl, and T. Rognes. (2005) Paralign: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res, 33(Web Server issue):W535–9.
7. Hollenberg LC. (2000) Fast quantum search algorithms in protein sequence comparisons: quantum bioinformatics. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2000 Nov;62(5 Pt B):7532-5.
8. R. P. Jacobi, M. Ayala-Rincón, L. G. A. Carvalho, C. H. Llanos, and R. W. Hartenstein. (2005) Reconfigurable systems for sequence alignment and for general dynamic programming. Genet Mol Res, 4(3):543–52.
Statistics
1. Bastien and Marchal. Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores. BMC Bioinformatics (2008) vol. 9 pp. 332
2. Newberg. Significance of gapped sequence alignments. J Comput Biol (2008) vol. 15 (9) pp. 1187-94
3. S. Wolfsheimer, B. Burghardt, and A. K. Hartmann. (2007) Local sequence alignments statistics: deviations from gumbel statistics in the rare-event tail. Algorithms for molecular biology : AMB, 2:9.
4. G. Landan and D. Graur. (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol, 24(6):1380–3, Jun 2007.
5. D. Metzler. (2006) Robust e-values for gapped local alignments. J Comput Biol, 13(4):882–96, May 2006.
6. A. Y. Mitrophanov and M. Borodovsky. (2006) Statistical significance in biological sequence analysis. Brief Bioinform, 7(1):2–24, Mar 2006.
Applications of suffix trees and other indexing techniques
1. Phoophakdee and Zaki. TRELLIS+: an effective approach for indexing genome-scale sequences using suffix trees. Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing (2008) pp. 90-101
2. Khan et al. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics (2009) vol. 25 (13) pp. 1609-16
3. Ohlebusch and Kurtz. Space efficient computation of rare maximal exact matches between multiple sequences. J Comput Biol (2008) vol. 15 (4) pp. 357-77
4. Herold et al. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) vol. 9 pp. 167
5. Hampikian and Andersen. Absent sequences: nullomers and primes. Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing (2007) pp. 355-66
Profile alignment methods
1. Pizzi and Ukkonen. Fast profile matching algorithms - A survey. Theoretical Computer Science (2008) vol. 395 (2-3) pp. 137-157
2. M. Beckstette, R. Homann, R. Giegerich, and S. Kurtz. (2006) Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7:389.
3. Yona G, Levitt M. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol. 2002 Feb 1;315(5):1257-75.
Rearrangements
1. M. Brudno, S. Malde, A. Poliakov, C. B. Do, O. Couronne, I. Dubchak, and S. Batzoglou. (2003) Glocal alignment: finding rearrangements during alignment. Bioinformatics, 19 Suppl 1:i54–62.
2. T. M. Phuong, C. B. Do, R. C. Edgar, and S. Batzoglou. Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Res, 34(20):5932–42, Jan 2006.
Repetitive elements
1. Reneker J, Shyu CR. (2005) Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals. BMC Bioinformatics. 2005 May 3;6:111.
2. Edgar RC, Myers EW. (2005) PILER: identification and classification of genomic repeats. Bioinformatics. 2005 Jun;21 Suppl 1:i152-8.
3. Price AL, Jones NC, Pevzner PA. (2005) De novo identification of repeat families in large genomes. Bioinformatics. 2005 Jun;21 Suppl 1:i351-8.
4. Sagot MF, Myers EW. (1998) Identifying satellites and periodic repetitions in biological sequences. J Comput Biol. 1998 Fall;5(3):539-53.
5. Martin DE. (2006) The Exact Joint Distribution of the Sum of Heads and Apparent Size Statistics of a “Tandem Repeats Finder” Algorithm. Bull Math Biol. 2006 Aug
6. Delgrange O, Rivals E. (2004) STAR: an algorithm to Search for Tandem Approximate Repeats. Bioinformatics. 2004 Nov 1;20(16):2812-20. Epub 2004 Jun 4.
7. Krishnan A, Tang F. (2004) Exhaustive whole-genome tandem repeats search. Bioinformatics. 2004 Nov 1;20(16):2702-10. Epub 2004 May 14.
8. Parisi V, De Fonzo V, Aluffi-Pentini F. (2003) STRING: finding tandem repeats in DNA sequences. Bioinformatics. 2003 Sep 22;19(14):1733-8.
9. Hauth AM, Joseph DA. (2002) Beyond tandem repeats: complex pattern structures and distant regions of similarity. Bioinformatics. 2002;18 Suppl 1:S31-7.
10. Adebiyi EF, Jiang T, Kaufmann M. (2001) An efficient algorithm for finding short approximate non-tandem repeats. Bioinformatics. 2001;17 Suppl 1:S5-S12.
11. Landau GM, Schmidt JP, Sokol D. (2001) An algorithm for approximate tandem repeats. J Comput Biol. 2001;8(1):1-18.
12. Apostolico A, Bock ME, Lonardi S, Xu X. (2001) Efficient detection of unusual words. J Comput Biol. 2000 Feb-Apr;7(1-2):71-94.
13. Allison L, Stern L, Edgoose T, Dix TI. (2000) Sequence complexity for biological sequence analysis. Comput Chem. 2000 Jan;24(1):43-55.

Suitable topics include: software or hardware acceleration of the sequence alignment problem, heuristics for accelerating alignment methods, improved substitutions or gap models, repeat finding methods, including tandem repeats. Avoid subjects such as motifs detections, which in general involves coupling approximate matching algorithms with statistical inference tools — we will specifically visit these topics toward the end of the semester.

6 Evaluation criteria

TBA