Tackling the Challenges of FASTQ Referential Compression
ABSTRACT: The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their nonreferential counterparts;...
- Authors:
- Guerra Soler, Aníbal José
Lotero García, Jaime Andrés
Aedo Cobo, José Édinson
Isaza Ramírez, Sebastián
- Resource type:
- Research article
- Publication date:
- 2019
- Institution:
- Universidad de Antioquia
- Repository:
- Repositorio UdeA
- Language:
- eng
- OAI Identifier:
- oai:bibliotecadigital.udea.edu.co:10495/35246
- Online access:
- https://hdl.handle.net/10495/35246
- Keywords:
- Bioinformática
Bioinformatics
Compresión de datos (computadores)
Data compression (computer science)
Teoría de la codificación
Coding theory
- Rights
- openAccess
- License
- http://creativecommons.org/licenses/by-nc/2.5/co/
| Field | Value |
|---|---|
| id | UDEA2_cd769a6de9e43f10f67cdddfb9c26796 |
| oai_identifier_str | oai:bibliotecadigital.udea.edu.co:10495/35246 |
| network_acronym_str | UDEA2 |
| network_name_str | Repositorio UdeA |
| repository_id_str | |
| dc.title.spa.fl_str_mv | Tackling the Challenges of FASTQ Referential Compression |
| title | Tackling the Challenges of FASTQ Referential Compression |
| spellingShingle | Tackling the Challenges of FASTQ Referential Compression Bioinformática Bioinformatics Compresión de datos (computadores) Data compression (computer science) Teoría de la codificación Coding theory |
| title_short | Tackling the Challenges of FASTQ Referential Compression |
| title_full | Tackling the Challenges of FASTQ Referential Compression |
| title_fullStr | Tackling the Challenges of FASTQ Referential Compression |
| title_full_unstemmed | Tackling the Challenges of FASTQ Referential Compression |
| title_sort | Tackling the Challenges of FASTQ Referential Compression |
| dc.creator.fl_str_mv | Guerra Soler, Aníbal José; Lotero García, Jaime Andrés; Aedo Cobo, José Édinson; Isaza Ramírez, Sebastián |
| dc.contributor.author.none.fl_str_mv | Guerra Soler, Aníbal José; Lotero García, Jaime Andrés; Aedo Cobo, José Édinson; Isaza Ramírez, Sebastián |
| dc.contributor.researchgroup.spa.fl_str_mv | Sistemas Embebidos e Inteligencia Computacional (SISTEMIC) |
| dc.subject.lemb.none.fl_str_mv | Bioinformática; Bioinformatics; Compresión de datos (computadores); Data compression (computer science); Teoría de la codificación; Coding theory |
| topic | Bioinformática; Bioinformatics; Compresión de datos (computadores); Data compression (computer science); Teoría de la codificación; Coding theory |
| description | ABSTRACT: The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their nonreferential counterparts; however, the latest tools have not yet been able to harness this potential. To reach this goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads when compared to the best state-of-the-art programs. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools. |
| publishDate | 2019 |
| dc.date.issued.none.fl_str_mv | 2019 |
| dc.date.accessioned.none.fl_str_mv | 2023-06-01T18:27:29Z |
| dc.date.available.none.fl_str_mv | 2023-06-01T18:27:29Z |
| dc.type.spa.fl_str_mv | Research article |
| dc.type.coar.spa.fl_str_mv | http://purl.org/coar/resource_type/c_2df8fbb1 |
| dc.type.redcol.spa.fl_str_mv | https://purl.org/redcol/resource_type/ART |
| dc.type.coarversion.spa.fl_str_mv | http://purl.org/coar/version/c_970fb48d4fbd8a85 |
| dc.type.driver.spa.fl_str_mv | info:eu-repo/semantics/article |
| dc.type.version.spa.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| format | http://purl.org/coar/resource_type/c_2df8fbb1 |
| status_str | publishedVersion |
| dc.identifier.citation.spa.fl_str_mv | Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. Erratum in: Bioinform Biol Insights. 2019 Sep 17;13:1177932219876803. |
| dc.identifier.issn.none.fl_str_mv | 1177-9322 |
| dc.identifier.uri.none.fl_str_mv | https://hdl.handle.net/10495/35246 |
| dc.identifier.doi.none.fl_str_mv | 10.1177/1177932218821373 |
| identifier_str_mv | Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. Erratum in: Bioinform Biol Insights. 2019 Sep 17;13:1177932219876803. 1177-9322 10.1177/1177932218821373 |
| url | https://hdl.handle.net/10495/35246 |
| dc.language.iso.spa.fl_str_mv | eng |
| language | eng |
| dc.relation.ispartofjournalabbrev.spa.fl_str_mv | Bioinform. Biol. Insights. |
| dc.relation.citationendpage.spa.fl_str_mv | 19 |
| dc.relation.citationstartpage.spa.fl_str_mv | 1 |
| dc.relation.citationvolume.spa.fl_str_mv | 13 |
| dc.relation.ispartofjournal.spa.fl_str_mv | Bioinformatics and Biology Insights |
| dc.rights.uri.*.fl_str_mv | http://creativecommons.org/licenses/by-nc/2.5/co/ |
| dc.rights.uri.spa.fl_str_mv | https://creativecommons.org/licenses/by-nc/4.0/ |
| dc.rights.accessrights.spa.fl_str_mv | info:eu-repo/semantics/openAccess |
| dc.rights.coar.spa.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc/2.5/co/ https://creativecommons.org/licenses/by-nc/4.0/ http://purl.org/coar/access_right/c_abf2 |
| eu_rights_str_mv | openAccess |
| dc.format.extent.spa.fl_str_mv | 19 |
| dc.format.mimetype.spa.fl_str_mv | application/pdf |
| dc.publisher.spa.fl_str_mv | SAGE Publications |
| dc.publisher.place.spa.fl_str_mv | Thousand Oaks, United States |
| institution | Universidad de Antioquia |
| bitstream.url.fl_str_mv | https://bibliotecadigital.udea.edu.co/bitstreams/f55cfd59-6944-490e-b872-aabff66d9b21/download https://bibliotecadigital.udea.edu.co/bitstreams/e672735c-8c4f-4e8a-bbca-06641095276a/download https://bibliotecadigital.udea.edu.co/bitstreams/dd0b5890-0231-46ce-9aae-6a276c5ab287/download https://bibliotecadigital.udea.edu.co/bitstreams/b7cd0dc9-819e-4c83-ad9f-83601bab1686/download https://bibliotecadigital.udea.edu.co/bitstreams/d4f0b34d-f68f-487d-bc88-4a08516183e8/download |
| bitstream.checksum.fl_str_mv | f1a9e48a62c0e9b196eca13fdbb0d04e c0c92b0ffc8b7d22d9cf56754a416a76 8a4605be74aa9ea9d79846c1fba20a33 c68f02f067fdd49f7c85fd3e56152e98 cea13b9353c74f1a574942199bbf6e8e |
| bitstream.checksumAlgorithm.fl_str_mv | MD5 MD5 MD5 MD5 MD5 |
| repository.name.fl_str_mv | Repositorio Institucional de la Universidad de Antioquia |
| repository.mail.fl_str_mv | aplicacionbibliotecadigitalbiblioteca@udea.edu.co |
| _version_ | 1851052503046553600 |
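
The abstract's central idea, representing a read by its differences from a reference, can be illustrated with a toy sketch. This is not the UdeACompress algorithm, whose details this record does not describe. It assumes the alignment position is already known, handles substitutions only (no insertions or deletions), and uses hypothetical names (`encode_read`, `decode_read`) and a hypothetical tuple layout chosen purely for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption, not taken from the article):
# (start position in the reference, read length, list of substitutions)
Mismatch = Tuple[int, str]  # (offset within the read, base found in the read)
Encoded = Tuple[int, int, List[Mismatch]]


def encode_read(reference: str, read: str, position: int) -> Encoded:
    """Encode a read as its alignment position plus its substitutions."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Keep only the offsets where the read differs from the reference.
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference window and patching it."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCA"
read = "ACGATAGC"  # differs from reference[4:12] at one base
encoded = encode_read(reference, read, position=4)
restored = decode_read(reference, encoded)
```

When a read matches the reference closely, the mismatch list is short, which is where the potential gain over nonreferential compression comes from; a real compressor would also need to find the alignment and encode the record compactly.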
