Proteomics

Dataset Information

Aird: A computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time


ABSTRACT: We describe "Aird", an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib alone, the m/z data size is about 55% lower in Aird on average. With the high-speed encoding and decoding performance brought by the Single Instruction Multiple Data (SIMD) technology used in ZDPD, Aird takes only 33% of the decoding time of Zlib. We used the open dataset HYE, which contains 48 raw files from SCIEX TripleTOF 5600 and TripleTOF 6600 instruments. The total file size is 206 GB in the vendor format. It increases to 854 GB after conversion to mzML with 32-bit encoding precision, but is only 189 GB in Aird. Aird uses JavaScript Object Notation (JSON) for metadata storage. Aird-SDK is written in Java, and AirdPro, a GUI client for vendor file conversion, is written in C#. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro
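The general idea behind a compressor that combines controllable precision, differencing, and Zlib can be illustrated with a minimal Python sketch. This is not the ZDPD implementation from Aird-SDK: the PforDelta bit-packing stage and the SIMD acceleration are omitted, and the function names `encode_mz` and `decode_mz`, the `decimals` parameter, and the 32-bit little-endian layout are assumptions made only for illustration.

```python
import struct
import zlib


def encode_mz(mz_values, decimals=4):
    """Round m/z values to fixed-point integers, delta-encode, then Zlib-compress.

    Hypothetical helper: only the "Zlib" and "Diff" ideas are shown here,
    the PforDelta integer packing used by ZDPD is not implemented.
    """
    scale = 10 ** decimals
    # Controllable precision: keep `decimals` digits after the decimal point.
    ints = [round(mz * scale) for mz in mz_values]
    # Diff step: store the first value, then the successive differences.
    deltas = []
    previous = 0
    for value in ints:
        deltas.append(value - previous)
        previous = value
    # Pack as signed 32-bit little-endian integers (an assumed layout).
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return zlib.compress(raw)


def decode_mz(blob, decimals=4):
    """Invert encode_mz: decompress, undo the differences, rescale to floats."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    scale = 10 ** decimals
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total / scale)
    return values
```

Because m/z values within a spectrum are typically stored in ascending order, the differences are small integers that compress better than the raw values, which is the motivation for placing a differencing step before the general-purpose compressor. The round trip is lossy only up to the chosen number of decimals.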

ORGANISM(S): Homo sapiens

SUBMITTER: Miaoshan Lu  

PROVIDER: PXD025310 | iProX | 2021-04-12

REPOSITORIES: iProX

Publications

Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time.

Miaoshan Lu, Shaowei An, Ruimin Wang, Jinyin Wang, Changbin Yu

BMC Bioinformatics, 2022-01-12, issue 1


Background: As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The data precision is often related to the instrument and the subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression rate, computation-oriented formats concentrate as much ...

Similar Datasets

2021-04-12 | PXD025142 | Pride
| PRJNA874807 | ENA
2007-09-19 | GSE5594 | GEO
2007-09-19 | GSE5595 | GEO
| PRJNA602526 | ENA
2025-09-22 | PXD062853 | Pride
| PRJNA675630 | ENA
| PRJNA1132232 | ENA
| PRJEB38829 | ENA
| PRJNA926914 | ENA