How can one share data for open science?

+16 votes
881 views
asked Aug 4, 2015 in Open Science by m0nhawk (270 points)

For sharing code, open scientists can use GitHub, Bitbucket, or another source control system.

In contrast, data sets can be pretty large and have various schemas and formats. How can I ensure that open science research has both open code and open data?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by HDE 226868 (320 points)
@dendragon See [this](http://meta.openscience.stackexchange.com/questions/3/overlapping-with-opendata-se/6#6).

commented Aug 18, 2015 by Alexander Konovalov (135 points)
If you think that this thread should be migrated to OpenData or another SE site because the OpenScience beta is closing, please edit the list of questions shortlisted for the migration [here](http://meta.openscience.stackexchange.com/questions/73/).

commented Aug 18, 2015 by dendragon (0 points)
Does this question not belong to http://opendata.stackexchange.com/?


10 Answers

+8 votes
answered Aug 4, 2015 by Tom Hardwicke (155 points)

There are many options for sharing general scientific data and code, such as GitHub, Figshare, and Dataverse. There are also some discipline-specific services, such as the OpenfMRI project for neuroimaging data.

If you want to share both data and code, and keep them together, then from personal experience I can recommend The Open Science Framework. You create a 'project' for your study with a series of 'components' that can represent code, data, manuscripts, protocols, and pretty much anything else you can think of.

Among other features, the site has built-in version control, and you can plug in various external services, including those mentioned above. Another nice feature is the ability to keep your project private initially and share it via view-only links (e.g., with collaborators or reviewers prior to publication). It is also entirely free and run by a non-profit (COS). There is a preservation fund to ensure that your data, code, etc. will survive even if the organization does not.



commented Aug 18, 2015 by jaipel (65 points)
It's also important to note that OpenfMRI and OSF are both specific deployments of open source applications. If you have the resources at your institution you could also roll your own deployment of XNAT/SciTran or the OSF.

+4 votes
answered Aug 4, 2015 by jojo (285 points)

If you don't want to rely on companies or institutions (like DataHub, figshare, Dryad, and many more), one way would be to simply include a torrent file in the Git (or other version control) project.

To ensure data integrity, a checksum file can be added.
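
The checksum idea can be sketched in a few lines of Python. This is only a minimal illustration, not part of the original answer: the function names and the manifest file name `SHA256SUMS` are assumptions chosen here, and SHA-256 is just one reasonable choice of hash. The manifest uses a "digest, two spaces, relative path" layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(data_dir, manifest_name="SHA256SUMS"):
    """Write one '<digest>  <relative path>' line per file under data_dir.

    The manifest name is an assumption made for this sketch.
    """
    data_dir = Path(data_dir)
    manifest = data_dir / manifest_name
    lines = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path != manifest:
            relative = path.relative_to(data_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def verify_checksums(data_dir, manifest_name="SHA256SUMS"):
    """Return the files whose current digest differs from the manifest."""
    data_dir = Path(data_dir)
    mismatches = []
    for line in (data_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the manifest
        expected, name = line.split("  ", 1)
        if sha256_of(data_dir / name) != expected:
            mismatches.append(name)
    return mismatches
```

The data publisher would call `write_checksums("data")` once and commit the manifest next to the torrent file; anyone who downloads the data can then call `verify_checksums("data")` and expect an empty list.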



commented Aug 18, 2015 by Jure Triglav (110 points)
If you go the torrents route, there exists a public tracker dedicated to research datasets: http://academictorrents.com/

commented Aug 18, 2015 by jojo (285 points)
@GavinSimpson Thanks for pointing it out (the 6 chars rule is not always very useful indeed).

+3 votes
answered Aug 4, 2015 by HDE 226868 (320 points)

The Royal Society has an open science journal, part of which is about sharing open data. They suggest:

Datasets should be deposited in an appropriate, recognized, publicly available repository.

DataCite has a list of repositories for different fields. It is, however, simply a list.

An example of an open science data repository is GenBank, operated by the NIH to store publicly available DNA sequences. Government repositories like GenBank are generally well maintained because the data they contain is quite valuable scientifically. They are certainly good choices, provided that you can find one that specializes in the topic you are working on. The NIH also runs other repositories for different subjects, so it is a good place to look.



+3 votes
answered Aug 4, 2015 by Daniel Standage (420 points)

Aside from domain-specific databases like GenBank, generic data repositories like figshare and Dryad are a great choice for persistent storage of open science research outputs. Both allow anyone to create accounts and upload/manage large data files.



+3 votes
answered Aug 4, 2015 by m0nhawk (270 points)

There is a project called DataHub, powered by CKAN, that can be used to share and publish data online.



commented Aug 18, 2015 by Gram (185 points)
DataHub is amazing!

+2 votes
answered Aug 4, 2015 by Karsten 7. (40 points)

For large datasets, the Open Science Data Cloud (OSDC) provides resources for storing, sharing, and analyzing scientific datasets. One has to fill out a short proposal to get an OSDC resource allocation for one's project. Allocations start at 16 dedicated cores and 1 TB of storage.



+2 votes
answered Aug 4, 2015 by Franck Dernoncourt (540 points)

This question has been addressed on OpenData SE, which might give interesting pointers:


Excerpt from the links (cc by-sa 3.0 with attribution required, user http://opendata.stackexchange.com/users/881/badroit):

Suppose that I have some sort of specialized data, perhaps that I've collected myself or been a part of the collection. And suppose that nothing prevents me from handing this data out to people. In what method should I go about distributing/storing this data so that others will be able to find it and use it, whenever this time may be?

Targeting specialised repositories as per @Joe's answer is indeed an excellent way to go about disseminating data, but what if no such specialised repository exists or you do not wish to target only one specific community in particular?

A methodology to expose Open Data using generic principles is the 5-star Open Data scheme originally proposed by Tim Berners-Lee here.

The core rationale of 5-star Open Data is that you make your data more easily accessible, processable and interoperable with each successive star:

★ Put your data on the Web in some format with an Open Licence. People can access it through their browsers and spend some time to figure out how they can download/access/process/use it. (Avoid problems for your client like this.)

★★ Put your data in a machine-processable format. For example, having a table in Excel is better than having a snapshot printed in PDFs or images because people can download it and start running experiments over it. (Avoid problems like this.)

★★★ Use non-proprietary formats. For example, providing data as a CSV is often better than as an Excel file because CSV can be directly processed by a wider range of (free/open source) tools and programming languages. (Can't find anyone complaining about Excel on here yet but, e.g., this is a similar problem.)

★★★★ Use URIs to denote things. For example, let's say you provide a bunch of pollution measures for cities and somebody would like to specifically reference the pollution measure for London. Assigning a URI for London in your local data provides a global unique identifier for that city that people can reference and point to. There are, for example, related proposals for embedding URI fragment identifiers in CSV files. (Avoid problems like this or this.)

★★★★★ Link your data to other data to provide context. So you have created a URI for London in your data and people can point to it. However, which London are you referring to? London, England or London, Ontario? If you link your local URI for London to the Wikipedia page about the London to which you refer (or, even better, to the DBpedia URI for the specific place to which you refer), this provides context as to what you mean. (Avoid problems like this.)

The shift from ★★★ to ★★★★(★) is quite an ambitious one and technical proposals are still being made on how best to achieve this, but five star Open Data is great because now your data are available on the Web under open licences with open structured formats where everything of importance is given a URI that can be referenced and linked across the Web, allowing for future discovery and re-use. A common methodology to create five star Open Data (again proposed by Tim Berners-Lee) is Linked Data, which assumes RDF as a common interoperable data format. But if that all sounds too much, getting as far as ★★★ data is still great.
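
As a rough illustration of the ★★★ to ★★★★★ steps described above, the following sketch writes a CSV in which every city gets its own URI and a link to a DBpedia resource. It is not from the original answer: the base URI `http://example.org/pollution/city/`, the column names, and the `measure` values are placeholders invented for the example, and a plain CSV column is only one simple way to carry such links (full Linked Data would typically use RDF).

```python
import csv
import io

# Hypothetical base URI for this dataset; use a domain you control.
BASE_URI = "http://example.org/pollution/city/"

# Placeholder rows: the 'measure' values are dummy numbers, not real data.
ROWS = [
    {"city": "London", "measure": 1.0,
     "same_as": "http://dbpedia.org/resource/London"},
    {"city": "London, Ontario", "measure": 2.0,
     "same_as": "http://dbpedia.org/resource/London,_Ontario"},
]


def to_linked_csv(records):
    """Serialise records as CSV, adding a local URI for each city."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["uri", "city", "measure", "same_as"],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        # Build a URL-friendly slug, e.g. "London, Ontario" -> "london-ontario".
        slug = record["city"].lower().replace(",", "").replace(" ", "-")
        writer.writerow({"uri": BASE_URI + slug, **record})
    return out.getvalue()


if __name__ == "__main__":
    print(to_linked_csv(ROWS))
```

The `uri` column gives each row something others can reference (★★★★), and the `same_as` column disambiguates which London is meant by pointing at an external resource (★★★★★).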

Again, you can check out this description of 5 Star Open Data for more information and a related question here.

A useful resource for the generic cataloguing of Open Datasets is the CKAN project, where the related DataHub repository is a great place to list and publicise your dataset. You can check out a bunch of 5-star Open Datasets here.



commented Aug 18, 2015 by HDE 226868 (320 points)
This isn't really an answer, just a collection of links *to* answers.

commented Aug 18, 2015 by Franck Dernoncourt (540 points)
@HDE226868 Do you want me to copy paste the content of the links?

commented Aug 18, 2015 by HDE 226868 (320 points)
If you want to, then yes. Otherwise, it's a link-only answer.

commented Aug 18, 2015 by Franck Dernoncourt (540 points)
@HDE226868 Done

+1 vote
answered Aug 7, 2015 by kenorb (430 points)

CKAN

If you want to make your data open and available, consider CKAN, an open source data portal platform that makes your data discoverable and presentable. Each dataset is given its own page with a rich collection of metadata, making it a valuable and easily searchable resource. Check the demo.
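
To give a feel for what "easily searchable" can mean in practice, here is a small sketch built around CKAN's Action API `package_search` endpoint. It is an addition for illustration, not part of the original answer: the site URL `https://ckan.example.org` is a placeholder, the helper names are invented here, and the sample response is hand-written to mimic the general shape of a CKAN reply rather than copied from a real portal. No network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(site_url, query, rows=5):
    """Build a package_search URL for a CKAN site (Action API, version 3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{site_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [d.get("title") or d.get("name")
            for d in payload["result"]["results"]]


# Hand-written sample response, used only to exercise the parser offline.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "air-quality", "title": "Air quality"}],
    },
})

if __name__ == "__main__":
    # ckan.example.org is a placeholder host, not a real portal.
    print(package_search_url("https://ckan.example.org", "air quality"))
    print(dataset_titles(SAMPLE_RESPONSE))
```

To query a real portal, the URL from `package_search_url` could be fetched with any HTTP client and the response body passed to `dataset_titles`.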

This solution is already used by private and government organisations, as shown in its case studies.

GitHub

If you are used to code repositories and want to keep both open code and open data in the same place, consider GitHub and its Git extension for versioning large files, Git Large File Storage (LFS). This way you can version large files (even those as large as a couple of GB) with Git.

On OS X you can easily install it via: brew install git-lfs.



+1 vote
answered Aug 8, 2015 by fsolt (10 points)

@Tom Hardwicke has already mentioned Dataverse in passing, but it deserves a real plug. I'd guess it especially deserves your consideration if you're working in the social sciences--it definitely led the way in open data in political science (my discipline) a decade or so ago--but plenty of others use it too (it's among the repositories recommended by PLOS, for example). It provides persistent identifiers (Handle, DOI) for each dataset, archives old versions, and provides an easy way to track which files have changed across versions.

I've had a data-sharing project on Dataverse since 2008, and I've found it to be a very good platform for getting my work, both data and code, into the hands of other researchers.



+1 vote
answered Aug 11, 2015 by Benteh (10 points)

I would like to add a Mellon-funded project being developed as we speak: vega publishing

It is in early development, and they are open to suggestions. I think you/we should engage them here and/or send them suggestions for how it can and should be made possible.


