Sketching and Sublinear Data Structures in Genomics

Posted: 25 Jul 2019

Guillaume Marçais

Carnegie Mellon University - Computational Biology Department

Brad Solomon

Johns Hopkins University - Department of Computer Science

Rob Patro

State University of New York (SUNY) - Department of Computer Science

Carl Kingsford

Carnegie Mellon University - Computational Biology Department

Date Written: July 2019

Abstract

Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.

Suggested Citation

Marçais, Guillaume and Solomon, Brad and Patro, Rob and Kingsford, Carl, Sketching and Sublinear Data Structures in Genomics (July 2019). Annual Review of Biomedical Data Science, Vol. 2, pp. 93-118, 2019, Available at SSRN: https://ssrn.com/abstract=3425714 or http://dx.doi.org/10.1146/annurev-biodatasci-072018-021156

Guillaume Marçais (Contact Author)

Carnegie Mellon University - Computational Biology Department

Pittsburgh, PA 15213
United States

Brad Solomon

Johns Hopkins University - Department of Computer Science

United States

Rob Patro

State University of New York (SUNY) - Department of Computer Science

United States

Carl Kingsford

Carnegie Mellon University - Computational Biology Department

Pittsburgh, PA 15213
United States
