Algorithms in Bioinformatics (CSI 5126)
Review and presentation of a scientific paper — Fall 2010
Instructor: Lucia Moura
Version of October 12, 2010
1 Deadline
- October 17: Email professor your choice of article (see Section 3.2 below)
- November 2: 15–20 minutes presentation; summary of the publication
2 Objectives
- Thoroughly study of a specific topic in exact or approximate string matching
- Familiarity with the modes of communicating research
- Develop your presentation skills
3 Directives
Papers in (refereed) journals and conference proceedings are the main vehicles for communicating scientific
information. November 2 (11:30-2:30) will be dedicated to the presentation of scientific publications in the field of exact and
approximate string matching. The foundations of string matching algorithms has been presented in class, in the
weeks preceding the presentations. You must select a publication that presents either a specialized application or a
more efficient algorithm.
3.1 Deliverable
- One or two pages summary of the publication
- 15–20 minutes presentation
3.2 Selected publications
You must select a publication in a distinct area than that of your project. Below you will find a list of publications.
If you intend to propose a publication outside of the list, you must contact me by October 15 (E-mail me the
details of the publications and a pdf, if available) and wait for my approval to make sure that your choice meets the didactic needs of the
course.
- Short read alignment
- Langmead et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human
genome. Genome Biol (2009) vol. 10 (3) pp. R25
- Li and Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics (2009) vol. 25 (14) pp. 1754-60
- Rumble et al. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol (2009)
vol. 5 (5) pp. e1000386
- Faster sequence alignment
- Newberg. Memory-efficient dynamic programming backtrace and pairwise local sequence
alignment. Bioinformatics (2008) vol. 24 (16) pp. 1772-8
- M. Cameron, Y. Bernstein, and H. E. Williams. (2007) Clustered sequence representation for
fast homology search. J Comput Biol, 14(5):594–614.
- M. Cameron and H. E. Williams. (2007) Comparing compressed sequences for faster nucleotide
blast searches. IEEE/ACM transactions on computational biology and bioinformatics / IEEE,
ACM, 4(3):349–64.
- Cameron M, Williams HE, Cannane A. (2006) A deterministic finite automaton for faster protein
hit detection in BLAST. J Comput Biol. 2006 May;13(4):965-78.
- Rasmussen et al. Efficient q-gram filters for finding all epsilon-matches over a given length. J
Comput Biol (2006) vol. 13 (2) pp. 296-308
- X. Cui, T. Vinar, B. Brejová, D. Shasha, and M. Li. (2007) Homology search for genes.
Bioinformatics, 23(13):i97–103.
- Ning Z, Cox AJ, Mullikin JC. (2001) SSAHA: a fast search method for large DNA databases.
Genome Res. 11(10):1725-9.
- Myers G, Durbin R. (2003) A table-driven, full-sensitivity similarity search algorithm. J Comput
Biol. 2003;10(2):103-17.
- Itoh M, Goto S, Akutsu T, Kanehisa M. (2005) Fast and accurate database homology search using
upper bounds of local alignment scores. Bioinformatics. 2005 Apr 1;21(7):912-21.
- Zhang H. (2003) Alignment of BLAST high-scoring segment pairs based on the longest increasing
subsequence algorithm. Bioinformatics. 2003 Jul 22;19(11):1391-6.
- W. J. Kent. (2002) Blat–the blast-like alignment tool. Genome Res, 12(4):656–64.
- Pairwise alignment of genomic sequences
- Abouelhoda et al. CoCoNUT: an efficient system for the comparison and analysis of genomes.
BMC Bioinformatics (2008) vol. 9 (1) pp. 476
- A. C.-C. Shih and W.-H. Li. (2003) GS-Aligner: a novel tool for aligning genomic sequences using
bit-level operations. Mol Biol Evol, 20(8):1299–309.
- U. Schulze, B. Hepp, C. S. Ong, and G. Rätsch. (2007) Palma: mRNA to genome alignments
using large margin algorithms. Bioinformatics, 23(15):1892–900, Aug 2007.
- Delcher AL, Phillippy A, Carlton J, Salzberg SL. (2002) Fast algorithms for large-scale genome
alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.
- A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. (1999)
Alignment of whole genomes. Nucleic Acids Res, 27(11):2369–76, Jun 1999.
- Kahveci T, Ljosa V, Singh AK. (2004) Speeding up whole-genome alignment by indexing frequency
vectors. Bioinformatics. 2004 Sep 1;20(13):2122-34.
- Kalafus KJ, Jackson AR, Milosavljevic A. (2004) Pash: efficient genome-scale sequence anchoring
by Positional Hashing. Genome Res. 2004 Apr;14(4):672-8.
- Huang W, Umbach DM, Li L. (2006) Accurate anchoring alignment of divergent sequences.
Bioinformatics. 2006 Jan 1;22(1):29-34.
- C. N. Dewey, P. M. Huggins, K. Woods, B. Sturmfels, and L. Pachter. (2006) Parametric
alignment of drosophila genomes. PLoS Comput Biol, 2(6):e73.
- Median string problem
- Yue and Tang. A Divide-and-Conquer Implementation of Three Sequence Alignment and Ancestor
Inference. Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference
on (2007) pp. 143 - 150
- Multiple sequence alignment
- Bradley et al. Fast statistical alignment. PLoS Comput Biol (2009) vol. 5 (5) pp. e1000392
- C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. (2005) ProbCons:
Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–40.
- A. S. Konagurthu and P. J. Stuckey. (2006) Optimal sum-of-pairs multiple sequence alignment
using incremental carrillo and lipman bounds. J Comput Biol, 13(3):668–85.
- T. J. Wheeler and J. D. Kececioglu. (2007) Multiple alignment by aligning alignments.
Bioinformatics, 23(13):i559–68.
- M. Kruspe and P. F. Stadler. (2007) Progressive multiple sequence alignments from triplets.
BMC Bioinformatics, 8:254, Jul 2007.
- I. Elias. (2006) Settling the intractability of multiple alignment. J Comput Biol, 13(7):1323–39.
- Higgins DG, Blackshields G, Wallace IM. (2005) Mind the gaps: progress in progressive alignment.
Proc Natl Acad Sci U S A. 2005 Jul 26;102(30):10411-2. 18.
- Loytynoja A, Goldman N. (2005) An algorithm for progressive multiple alignment of sequences
with insertions. Proc Natl Acad Sci U S A. 2005 Jul 26;102(30):10557-62.
- Pei J, Sadreyev R, Grishin NV. (2003) PCMA: fast and accurate multiple sequence alignment
based on profile consistency. Bioinformatics. 2003 Feb 12;19(3):427-8.
- Hudek AK, Brown DG. (2005) Ancestral sequence alignment under optimal conditions. BMC
Bioinformatics. 2005 Nov 17;6:273.
- Notredame C, Higgins DG, Heringa J. (2000) T-Coffee: A novel method for fast and accurate
multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17.
- Lassmann T, Sonnhammer EL. (2005) Kalign–an accurate and fast multiple sequence alignment
algorithm. BMC Bioinformatics. 2005 Dec 12;6:298.
- Edgar RC. (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.
- Grasso C, Lee C. (2004) Combining partial order alignment and progressive multiple sequence
alignment increases alignment speed and scalability to very large alignment problems.
Bioinformatics. 2004 Jul 10;20(10):1546-56.
- Multiple sequence alignment of genomic sequence
- Dubchak et al. Multiple whole-genome alignments without a reference organism. Genome Res
(2009) vol. 19 (4) pp. 682-9
- Paten et al. Sequence progressive alignment, a framework for practical large-scale probabilistic
consistency alignment. Bioinformatics (2009) vol. 25 (3) pp. 295-301
- C. Wang and E. J. Lefkowitz. (2005) Genomic multiple sequence alignments: refinement using a
genetic algorithm. BMC Bioinformatics, 6:200.
- M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit, K. M. Roskin, R. Baertsch,
K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller. (2004) Aligning multiple
genomic sequences with the threaded blockset aligner. Genome Res, 14(4):708–15.
- Bray N, Pachter L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome
Res. 2004 Apr;14(4):693-9.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E; NISC Comparative Sequencing Program;
Green ED, Sidow A, Batzoglou S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale
multiple alignment of genomic DNA. Genome Res. 2003 Apr;13(4):721-31.
- M. Brudno, C. B. Do, G. M. Cooper, M. F. Kim, E. Davydov, N. C. S. Program, E. D. Green,
A. Sidow, and S. Batzoglou. (2003) Lagan and multi-lagan: efficient tools for large-scale multiple
alignment of genomic dna. Genome Res, 13(4):721–31.
- Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive
multiple alignment of large genomic sequences. BMC Bioinformatics. 2003 Dec 23;4:66.
- Wang C, Lefkowitz EJ. (2005) Genomic multiple sequence alignments: refinement using a genetic
algorithm. BMC Bioinformatics. 2005 Aug 8;6:200.
- Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive
multiple alignment of large genomic sequences. BMC Bioinformatics. 2003 Dec 23;4:66.
- Hohl M, Kurtz S, Ohlebusch E. (2002) Efficient multiple genome alignment. Bioinformatics.
2002;18 Suppl 1:S312-20.
- Zhang Y, Waterman MS. (2003) An Eulerian path approach to global multiple alignment for DNA
sequences. J Comput Biol. 2003;10(6):803-19.
- Seeded alignment methods
- Flannick J, Batzoglou S. (2005) Using multiple alignments to improve seeded local alignment
algorithms. Nucleic Acids Res. 2005 Aug 12;33(14):4563-77.
- Ma B, Tromp J, Li M. (2002) PatternHunter: faster and more sensitive homology search.
Bioinformatics. 2002 Mar;18(3):440-5.
- Li M, Ma B, Kisman D, Tromp J. (2003) PatternHunter II: highly sensitive and fast homology
search. Genome Inform. 2003;14:164-75.
- Gotea V, Veeramachaneni V, Makalowski W. (2003) Mastering seeds for genomic size nucleotide
BLAST searches. Nucleic Acids Res. 2003 Dec 1;31(23):6935-41.
- Parallel, quantum or hardware implementations
- Trapnell and Schatz. Optimizing data intensive GPGPU computations for DNA sequence
alignment. Parallel Computing (2009) vol. 35 (8-9) pp. 429-440
- Saebo PE, Andersen SM, Myrseth J, Laerdahl JK, Rognes T. (2005) PARALIGN: rapid and
sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids
Res. 2005 Jul 1;33(Web Server issue):W535-9.
- Rognes T, Seeberg E. (2000) Six-fold speed-up of Smith-Waterman sequence database searches
using parallel processing on common microprocessors. Bioinformatics. 2000 Aug;16(8):699-706.
- Rognes T. (2001) ParAlign: a parallel sequence alignment algorithm for rapid and sensitive
database searches. Nucleic Acids Res. 2001 Apr 1;29(7):1647-52.
- Qi Y, Lin F. (2005) Parallelisation of the blast algorithm. Cell Mol Biol Lett. 2005;10(2):281-5.
- P. E. Saebø, S. M. Andersen, J. Myrseth, J. K. Laerdahl, and T. Rognes. (2005) Paralign: rapid
and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids
Res, 33(Web Server issue):W535–9.
- Hollenberg LC. (2000) Fast quantum search algorithms in protein sequence comparisons: quantum
bioinformatics. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2000 Nov;62(5 Pt
B):7532-5.
- R. P. Jacobi, M. Ayala-Rincón, L. G. A. Carvalho, C. H. Llanos, and R. W. Hartenstein. (2005)
Reconfigurable systems for sequence alignment and for general dynamic programming. Genet Mol
Res, 4(3):543–52.
- Statistics
- Bastien and Marchal. Evolution of biological sequences implies an extreme value distribution of
type I for both global and local pairwise alignment scores. BMC Bioinformatics (2008) vol. 9 pp.
332
- Newberg. Significance of gapped sequence alignments. J Comput Biol (2008) vol. 15 (9) pp. 1187-94
- S. Wolfsheimer, B. Burghardt, and A. K. Hartmann. (2007) Local sequence alignments statistics:
deviations from gumbel statistics in the rare-event tail. Algorithms for molecular biology : AMB,
2:9.
- G. Landan and D. Graur. (2007) Heads or tails: a simple reliability check for multiple sequence
alignments. Mol Biol Evol, 24(6):1380–3, Jun 2007.
- D. Metzler. (2006) Robust e-values for gapped local alignments. J Comput Biol, 13(4):882–96,
May 2006.
- A. Y. Mitrophanov and M. Borodovsky. (2006) Statistical significance in biological sequence
analysis. Brief Bioinform, 7(1):2–24, Mar 2006.
- Applications of suffix trees and other indexing techniques
- Phoophakdee and Zaki. TRELLIS+: an effective approach for indexing genome-scale sequences
using suffix trees. Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing (2008)
pp. 90-101
- Khan et al. A practical algorithm for finding maximal exact matches in large sequence datasets
using sparse suffix arrays. Bioinformatics (2009) vol. 25 (13) pp. 1609-16
- Ohlebusch and Kurtz. Space efficient computation of rare maximal exact matches between multiple
sequences. J Comput Biol (2008) vol. 15 (4) pp. 357-77
- Herold et al. Efficient computation of absent words in genomic sequences. BMC Bioinformatics
(2008) vol. 9 pp. 167
- Hampikian and Andersen. Absent sequences: nullomers and primes. Pacific Symposium on
Biocomputing Pacific Symposium on Biocomputing (2007) pp. 355-66
- Profile alignment methods
- Pizzi and Ukkonen. Fast profile matching algorithms - A survey. Theoretical Computer Science
(2008) vol. 395 (2-3) pp. 137-157
- M. Beckstette, R. Homann, R. Giegerich, and S. Kurtz. (2006) Fast index based algorithms and
software for matching position specific scoring matrices. BMC Bioinformatics, 7:389.
- Yona G, Levitt M. (2002) Within the twilight zone: a sensitive profile-profile comparison tool
based on information theory. J Mol Biol. 2002 Feb 1;315(5):1257-75.
- Rearrangements
- M. Brudno, S. Malde, A. Poliakov, C. B. Do, O. Couronne, I. Dubchak, and S. Batzoglou.
(2003) Glocal alignment: finding rearrangements during alignment. Bioinformatics, 19 Suppl
1:i54–62.
- T. M. Phuong, C. B. Do, R. C. Edgar, and S. Batzoglou. Multiple alignment of protein sequences
with repeats and rearrangements. Nucleic Acids Res, 34(20):5932–42, Jan 2006.
- Repetitive elements
- Reneker J, Shyu CR. (2005) Refined repetitive sequence searches utilizing a fast hash function
and cross species information retrievals. BMC Bioinformatics. 2005 May 3;6:111.
- Edgar RC, Myers EW. (2005) PILER: identification and classification of genomic repeats.
Bioinformatics. 2005 Jun;21 Suppl 1:i152-8.
- Price AL, Jones NC, Pevzner PA. (2005) De novo identification of repeat families in large genomes.
Bioinformatics. 2005 Jun;21 Suppl 1:i351-8.
- Sagot MF, Myers EW. (1998) Identifying satellites and periodic repetitions in biological sequences.
J Comput Biol. 1998 Fall;5(3):539-53.
- Martin DE. (2006) The Exact Joint Distribution of the Sum of Heads and Apparent Size Statistics
of a “Tandem Repeats Finder” Algorithm. Bull Math Biol. 2006 Aug
- Delgrange O, Rivals E. (2004) STAR: an algorithm to Search for Tandem Approximate Repeats.
Bioinformatics. 2004 Nov 1;20(16):2812-20. Epub 2004 Jun 4.
- Krishnan A, Tang F. (2004) Exhaustive whole-genome tandem repeats search. Bioinformatics.
2004 Nov 1;20(16):2702-10. Epub 2004 May 14.
- Parisi V, De Fonzo V, Aluffi-Pentini F. (2003) STRING: finding tandem repeats in DNA sequences.
Bioinformatics. 2003 Sep 22;19(14):1733-8.
- Hauth AM, Joseph DA. (2002) Beyond tandem repeats: complex pattern structures and distant
regions of similarity. Bioinformatics. 2002;18 Suppl 1:S31-7.
- Adebiyi EF, Jiang T, Kaufmann M. (2001) An efficient algorithm for finding short approximate
non-tandem repeats. Bioinformatics. 2001;17 Suppl 1:S5-S12.
- Landau GM, Schmidt JP, Sokol D. (2001) An algorithm for approximate tandem repeats. J
Comput Biol. 2001;8(1):1-18.
- Apostolico A, Bock ME, Lonardi S, Xu X. (2001) Efficient detection of unusual words. J Comput
Biol. 2000 Feb-Apr;7(1-2):71-94.
- Allison L, Stern L, Edgoose T, Dix TI. (2000) Sequence complexity for biological sequence analysis.
Comput Chem. 2000 Jan;24(1):43-55.
Suitable topics include: software or hardware acceleration of the sequence alignment problem, heuristics for
accelerating alignment methods, improved substitutions or gap models, repeat finding methods, including tandem
repeats. Avoid subjects such as motifs detections, which in general involves coupling approximate matching
algorithms with statistical inference tools — we will specifically visit these topics toward the end of the
semester.
6 Evaluation criteria
TBA