Tackling the Challenges of FASTQ Referential Compression

ABSTRACT: The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve a much higher compression than their nonreferential counterparts;...

Authors:
Guerra Soler, Aníbal José
Lotero García, Jaime Andrés
Aedo Cobo, José Édinson
Isaza Ramírez, Sebastián
Resource type:
Research article
Publication date:
2019
Institution:
Universidad de Antioquia
Repository:
Repositorio UdeA
Language:
eng
OAI Identifier:
oai:bibliotecadigital.udea.edu.co:10495/35246
Online access:
https://hdl.handle.net/10495/35246
Keywords:
Bioinformatics
Data compression (computer science)
Coding theory
Rights
openAccess
License
http://creativecommons.org/licenses/by-nc/2.5/co/
id UDEA2_cd769a6de9e43f10f67cdddfb9c26796
oai_identifier_str oai:bibliotecadigital.udea.edu.co:10495/35246
network_acronym_str UDEA2
network_name_str Repositorio UdeA
repository_id_str
dc.title.spa.fl_str_mv Tackling the Challenges of FASTQ Referential Compression
dc.creator.fl_str_mv Guerra Soler, Aníbal José
Lotero García, Jaime Andrés
Aedo Cobo, José Édinson
Isaza Ramírez, Sebastián
dc.contributor.author.none.fl_str_mv Guerra Soler, Aníbal José
Lotero García, Jaime Andrés
Aedo Cobo, José Édinson
Isaza Ramírez, Sebastián
dc.contributor.researchgroup.spa.fl_str_mv Sistemas Embebidos e Inteligencia Computacional (SISTEMIC)
dc.subject.lemb.none.fl_str_mv Bioinformática
Bioinformatics
Compresión de datos (computadores)
Data compression (computer science)
Teoría de la codificación
Coding theory
description ABSTRACT: The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their nonreferential counterparts; however, the latest tools have not yet been able to harness that potential. To reach this goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our reads compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads when compared to the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
publishDate 2019
dc.date.issued.none.fl_str_mv 2019
dc.date.accessioned.none.fl_str_mv 2023-06-01T18:27:29Z
dc.date.available.none.fl_str_mv 2023-06-01T18:27:29Z
dc.type.spa.fl_str_mv Research article
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.redcol.spa.fl_str_mv https://purl.org/redcol/resource_type/ART
dc.type.coarversion.spa.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/article
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/publishedVersion
format http://purl.org/coar/resource_type/c_2df8fbb1
status_str publishedVersion
dc.identifier.citation.spa.fl_str_mv Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. Erratum in: Bioinform Biol Insights. 2019 Sep 17;13:1177932219876803.
dc.identifier.issn.none.fl_str_mv 1177-9322
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/10495/35246
dc.identifier.doi.none.fl_str_mv 10.1177/1177932218821373
dc.language.iso.spa.fl_str_mv eng
language eng
dc.relation.ispartofjournalabbrev.spa.fl_str_mv Bioinform. Biol. Insights.
dc.relation.citationendpage.spa.fl_str_mv 19
dc.relation.citationstartpage.spa.fl_str_mv 1
dc.relation.citationvolume.spa.fl_str_mv 13
dc.relation.ispartofjournal.spa.fl_str_mv Bioinformatics and Biology Insights
dc.rights.uri.*.fl_str_mv http://creativecommons.org/licenses/by-nc/2.5/co/
dc.rights.uri.spa.fl_str_mv https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.rights.coar.spa.fl_str_mv http://purl.org/coar/access_right/c_abf2
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc/2.5/co/
https://creativecommons.org/licenses/by-nc/4.0/
http://purl.org/coar/access_right/c_abf2
eu_rights_str_mv openAccess
dc.format.extent.spa.fl_str_mv 19
dc.format.mimetype.spa.fl_str_mv application/pdf
dc.publisher.spa.fl_str_mv SAGE Publications
dc.publisher.place.spa.fl_str_mv Thousand Oaks, United States
institution Universidad de Antioquia
bitstream.url.fl_str_mv https://bibliotecadigital.udea.edu.co/bitstreams/f55cfd59-6944-490e-b872-aabff66d9b21/download
https://bibliotecadigital.udea.edu.co/bitstreams/e672735c-8c4f-4e8a-bbca-06641095276a/download
https://bibliotecadigital.udea.edu.co/bitstreams/dd0b5890-0231-46ce-9aae-6a276c5ab287/download
https://bibliotecadigital.udea.edu.co/bitstreams/b7cd0dc9-819e-4c83-ad9f-83601bab1686/download
https://bibliotecadigital.udea.edu.co/bitstreams/d4f0b34d-f68f-487d-bc88-4a08516183e8/download
bitstream.checksum.fl_str_mv f1a9e48a62c0e9b196eca13fdbb0d04e
c0c92b0ffc8b7d22d9cf56754a416a76
8a4605be74aa9ea9d79846c1fba20a33
c68f02f067fdd49f7c85fd3e56152e98
cea13b9353c74f1a574942199bbf6e8e
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositorio Institucional de la Universidad de Antioquia
repository.mail.fl_str_mv aplicacionbibliotecadigitalbiblioteca@udea.edu.co
_version_ 1851052503046553600