Genome assembly composition of the string "ACGT" array: a review of data structure accuracy and performance challenges

Raja Farhana Raja Khairuddin

Type :	Article
Subject :	Q Science (General)
ISSN :	2376-5992
Main Author :	Raja Farhana Raja Khairuddin
Title :	Genome assembly composition of the string "ACGT" array: a review of data structure accuracy and performance challenges

Place of Production :	Tanjung Malim
Publisher :	Fakulti Sains dan Matematik
Year of Publication :	2023
Notes :	PeerJ Computer Science
Corporate Name :	Universiti Pendidikan Sultan Idris

Abstract :

Background. Advances in sequencing technology have increased the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must piece together a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using one of two approaches. The de novo approach concatenates reads based on exact matches between their suffixes and prefixes (overlapping). The reference-guided approach orders reads based on their offsets in a well-known reference genome (read alignment). Repeats add ambiguity: the algorithm is unable to distinguish the reads, which results in misassembly and reduces the accuracy of the assembly approach. In addition, the massive number of reads poses a major performance challenge.

Method. Repeat identification methods were introduced to address misassembly by identifying repetitive sequences beforehand and building a repeat knowledge base that reduces ambiguity during the assembly process, thus enhancing the accuracy of the assembled genome. Hybridization of the assembly approaches also lowers the degree of misassembly with the aid of the reference genome. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization.

Results. Our findings show the limitations of the available repeat identification methods, which can only detect repeats of specific lengths and may not perform well when various types of repeats are present in a genome. We also found that most hybrid assembly approaches, whether they start with de novo or reference-guided assembly, have limitations in handling repetitive sequences, as doing so is more computationally costly and time-intensive. Although the hybrid approach was found to outperform the individual assembly approaches, optimizing its performance remains a challenge. Moreover, parallelization of overlapping and read alignment is yet to be fully implemented in the hybrid assembly approach.

Conclusion. We suggest combining multiple repeat identification methods to identify repeats more accurately as an initial step of the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.

Copyright 2023 Magdy Mohamed Abdelaziz Barakat et al.
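The two assembly approaches named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the reviewed article: the function names (`overlap`, `build_kmer_index`, `align_read`), the k-mer index, and the toy sequences are all assumptions chosen for illustration. The first function finds an exact suffix-prefix overlap between two reads, as in the de novo approach; the other two locate a read's offsets in a reference through a simple index, as in the reference-guided approach.

```python
from collections import defaultdict


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    (Hypothetical helper, not the article's algorithm.)
    """
    start = 0
    while True:
        # Look for the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # The overlap is valid only if the rest of a is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read: str, reference: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Return every offset at which `read` matches the reference exactly."""
    hits = []
    # Use the read's first k-mer as a seed, then verify the full read.
    for offset in index.get(read[:k], []):
        if reference.startswith(read, offset):
            hits.append(offset)
    return hits


# Toy data (assumed): the reference contains the repeat "ACGT" twice.
reference = "TTACGTGGACGTCC"
index = build_kmer_index(reference, k=3)

print(overlap("TTACGT", "ACGTGG"))              # 4: suffix "ACGT" equals prefix "ACGT"
print(align_read("ACGTG", reference, index, 3))  # [2]: unique placement
print(align_read("ACGT", reference, index, 3))   # [2, 8]: ambiguous placement
```

The last call shows the ambiguity the abstract attributes to repeats: a read that lies entirely inside a repeated sequence matches more than one offset, so matching alone cannot decide where it belongs. The index stands in for the data structure indexing mentioned under Method; because each read is aligned independently of the others, the per-read lookups are also a natural candidate for parallelization.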

