Faculty Articles

Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey

Muhammad Usman Tariq, Kennesaw State University
Muhammad Haseeb, School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA.
Mohammed Aledhari, Kennesaw State University
Rehma Razzak, Kennesaw State University
Reza M. Parizi, College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA.
Fahad Saeed, School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA.

Department

Computer Science

Document Type

Article

Publication Date

1-1-2021

Abstract

Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues inherent in proteogenomic data analysis, where the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, with a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and offer recommendations that can be incorporated in the future design of these workflows to ensure scalability as proteogenomic data grows. Lastly, we make the case that high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic analysis.

Journal Title

IEEE Access: Practical Innovations, Open Solutions

Journal ISSN

2169-3536

Volume

First Page

5497

Last Page

5516

Digital Object Identifier (DOI)

10.1109/ACCESS.2020.3047588
