Saturday, February 26, 2011

Sustainable Scientific Data Archiving Model

As many researchers may have noticed, NCBI plans to discontinue the Short Read Archive (SRA) service due to budget constraints. This news surprised me and, I believe, concerns the biomedical research community at large. As biomedical research enters the -omics era and becomes more and more data-driven, the sudden closure of SRA raises a question: how can scientific data be archived under a sustainable model? Here I discuss two strategies for preserving scientific data sustainably. The first is a central data repository that charges a data deposition fee. The second stores the data in a P2P manner, with a central gateway that gathers metadata and tracks links to the P2P seeds.

A sustainable data archiving model should cover the following aspects. First, the data should include necessary and accurate metadata. Second, the data should be stored securely and remain authentic and correct for a long time. Third, the data should be accompanied by the essential software and scripts needed to analyze it. Finally, the data should be easy for the broader research community to search and access, now and for a considerable period in the future, so that researchers can use the dataset from different perspectives and even re-analyze it when new hypotheses and analytic methods emerge.

However, as has been noted elsewhere, there is a disconnect between the effort to produce data and the effort to preserve it. Simply put, funding agencies provide the money to produce the data but not the money to maintain it. Grants for producing data run on a rather short time scale, usually two to five years. Once the grant is over, the project is done, and the original researchers have moved on to other projects, the data is in danger of being lost. Fortunately, the biomedical research community has a pretty good record of depositing biological datasets for public research, as exemplified by GenBank and GEO. The Short Read Archive was designed to meet the requirements of massively parallel sequencing read data. However, the discontinuation of this service demonstrates how uncertain the current data sharing model is without dedicated funding. Therefore I am considering the following two strategies for sustaining scientific data.

In the first strategy, we still rely on a central data repository like SRA that curates, stores, and distributes biological datasets. To meet its financial needs, the repository charges a fee for the data it hosts. It works as follows. When the original data producers finish their research and submit a paper to a journal, the journal requires that their data be deposited in a designated repository and charges a data deposition fee. The journal then passes the major part of that fee on to the central data repository. The proposed fee is charged only once, so it can be covered by the original data producer's initial grant. With the ever-decreasing cost of data storage, the continual influx of one-time deposition fees should keep the central data repository working.
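To see why a one-time fee is at least plausible, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the yearly cost of storing a terabyte falls by a constant fraction every year, so the lifetime cost is a geometric series with a finite sum. The function name, parameters, and numbers are all hypothetical, and the sketch ignores curation, bandwidth, and staff costs.

```python
def lifetime_storage_cost(size_tb, cost_per_tb_year, annual_decline, years=None):
    """Total cost of keeping `size_tb` terabytes, summed year by year.

    Assumption (not from any real price list): year t costs
    size_tb * cost_per_tb_year * (1 - annual_decline) ** t.
    With years=None the geometric series is summed to infinity.
    """
    if not 0 < annual_decline < 1:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    first_year = size_tb * cost_per_tb_year
    ratio = 1 - annual_decline  # cost multiplier from one year to the next
    if years is None:
        # Infinite geometric series: first_year / (1 - ratio)
        return first_year / annual_decline
    # Finite geometric series over `years` terms
    return first_year * (1 - ratio ** years) / annual_decline


# Hypothetical numbers: 1 TB, 100 currency units per TB-year, cost halving yearly.
print(lifetime_storage_cost(1, 100, 0.5))           # storing forever
print(lifetime_storage_cost(1, 100, 0.5, years=2))  # storing for two years
```

Under this assumption, storing a dataset forever costs only a fixed multiple of the first year's cost, which is what would make a single up-front fee workable. If storage prices stop falling, the sum no longer converges and the model breaks down.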

The second strategy was first suggested to me by my friend Li Xia and was further inspired by Morgan Langille, the creator of BioTorrents. In this strategy, the dataset is stored by multiple hosts who have the resources and interest to keep it. A central gateway keeps track of the BitTorrent seeds for the raw data and also stores the metadata associated with each dataset, such as the data producer's contact information, experimental protocols, and descriptions of the raw data. In particular, the central gateway stores the version of the raw dataset and its MD5 or SHA checksum, so that data users can make sure they are obtaining an up-to-date and authentic dataset from the essentially unreliable and untrusted hosts in a P2P network. Since the central gateway only needs to track this metadata, its running cost is significantly lower than that of a central data repository, and it could therefore operate simply as a new section within the NCBI infrastructure.
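The checksum step can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of any existing gateway: the `record` dictionary stands in for a hypothetical metadata entry fetched from the central gateway, and its field names (`algorithm`, `checksum`, `version`) are my own assumptions. I default to SHA-256 here because MD5 is no longer considered collision resistant, although the same code accepts `"md5"` as well.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large sequencing files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path, record):
    """Return True if the file at `path` matches the gateway's checksum.

    `record` is a hypothetical gateway metadata entry, e.g.
    {"version": "1", "algorithm": "sha256", "checksum": "<hex digest>"}.
    """
    actual = file_digest(path, record["algorithm"])
    return actual == record["checksum"].lower()
```

A data user would download the dataset from whichever P2P hosts are available, fetch the record from the gateway over a trusted channel, and keep the file only if `verify_download` returns `True`. The trust then rests on the small, cheap-to-run gateway rather than on the hosts.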

I hope this discussion publicizes the urgency of sustainable scientific data archiving, so that the biomedical research community can work out a way forward after SRA ends.