Why should I avoid sharing data in proprietary formats?

+8 votes
78 views
asked Aug 17, 2015 in Open Science by Thomas (915 points)

One often hears that CSV (or a similar delimited text format) is the best choice for sharing tabular data. Yet other formats, such as the proprietary formats used by Stata, SAS, SPSS, etc., can carry much more information (variable and value labels and other forms of metadata). If these formats provide these additional features, and most contemporary statistical software can read and write them, why should we rely on CSV for data sharing?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by Simon W (155 points)
Just a note that there are open formats that are not text - e.g. NetCDF, and probably others, which are binary formats with published specifications that are free for all to use. They have pros and cons compared to text, just like closed proprietary formats.

commented Aug 18, 2015 by Zizouz212 (320 points)
I wouldn't consider sharing in a proprietary format to be an issue in itself. It's a lot like the criteria for Open Knowledge in that sense: (If I remember correctly) data can be considered "open" if it is released in a format that can be read by at least one Free/Libre/Open Source software tool. I'll check on that.


2 Answers

+4 votes
answered Aug 17, 2015 by Zizouz212 (320 points)
 
Best answer

Proprietary data formats aren't the issue: the issue is how easily the data they hold can be accessed.

In simple terms, a file format is basically any way of storing data so that it can be used for its intended purposes. How can storing data in such a way be an issue? It isn't. The real issue is the ability to access the data contained in the file:

Let's take a look at what our trusty Open Definition says: (Emphasis mine)

An Open Format for Data - Definition 2

An Open Format is a format that, “can be processed with at least one free/libre/open-source software tool”.

The concern that people have with proprietary formats is that there are obstacles to accessing the data stored in the format. For the sake of argument, let's say that I create a data format, the .ziz format. It's a fantastic format, capable of storing hundreds of rows of data across multiple categories, and it compresses the data losslessly, so it doesn't take up loads of space on a computer. I create a program to access the data, the Ziz Reader, and put it on sale for 50 dollars.

What's the issue with this? That people need to pay in order to access the data stored in the .ziz format. The key issue is not the format itself, but the lack of a suitable, readily available program to access the data contained in the format.

While the lack of free, available programs is an issue, don't blame the format. There is a connection between proprietary formats and ease of access, but the problem is access. My friend Bill has to pay 50 dollars just to see what 1 plus 1 equals.

Should I discourage the use of proprietary formats?

If there is an open format that does what you would like, then by all means you should use it. Don't go crazy about it, though. If the best way of storing and sharing the data is a proprietary format, it's not that big of an issue. Generally, if you need to use a proprietary format, try your best to use one that satisfies the second definition of an open format, as quoted above: the data can be accessed with a free/libre/open-source software tool.



commented Aug 18, 2015 by mgoeminne (0 points)
I would say that a file format is a specification of anything that can store data… A proprietary specification can sometimes be implemented in an open source library, as you stated.

+4 votes
answered Aug 17, 2015 by Michael (140 points)

Because you want other people to be able to use your data. You are not sharing just to put a check mark next to "I shared the data". You actually want people to be able to open the data file, modify it and adapt it to their own needs. If your data file is in SPSS format and I do not have this software, that becomes much harder. This is especially true when the software is not free, because then I have to buy it just to view your file.

There are file conversion tools to assist, but usually the result is less than optimal. Commonly, you will lose the metadata, which is the reason you used the proprietary format in the first place. This kind of misses the point then, so why bother?

In some cases a certain commercial software package is used by everyone in a specific field. In that case you can be pretty sure that whoever wants to use the data already has the software. That is not an excuse for using the proprietary format, though. One of the things you want to be sure of is that the data will be accessible perpetually. Proprietary formats change over time, and there is no promise that your file will be readable in 5, 10 or 20 years. CSV will always work - it's just text.
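
As a rough illustration of the "it's just text" point, the sketch below parses a small CSV string using only Python's standard library, with no statistics package involved. The column names and values are made up for the example and do not come from any real dataset.

```python
import csv
import io

# Hypothetical sample data; in practice this would be the contents of a shared .csv file.
raw_text = "id,age,smoker\n1,34,0\n2,51,1\n"

# CSV is plain text, so the standard library's csv module is enough to parse it.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Every value arrives as a string; the reader has to decide how to interpret it.
ages = [int(row["age"]) for row in rows]
mean_age = sum(ages) / len(ages)

print(rows[0])
print(mean_age)  # 42.5
```

Note that every value comes back as a string: the file itself does not say that "age" is a number or what the 0 and 1 in "smoker" mean, which is exactly the metadata a proprietary format would have carried.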

A possible compromise is to include both: the proprietary format and the same data in CSV format. Then you get the best of both worlds. However, this may not be practical, depending on the size of your file.
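
A related option, sketched here as an assumption rather than something proposed in this thread, is to ship the CSV together with a small plain-text "sidecar" file that holds the variable and value labels. The file names, the JSON layout and the helper functions `write_with_sidecar` and `read_labelled` are all hypothetical, and the data is invented for the example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical dataset: coded values, plus the kind of labels a Stata/SPSS file would carry.
rows = [
    {"id": 1, "sex": 1, "smoker": 0},
    {"id": 2, "sex": 2, "smoker": 1},
]
metadata = {
    "variable_labels": {
        "id": "Respondent ID",
        "sex": "Sex of respondent",
        "smoker": "Currently smokes",
    },
    # JSON object keys are strings, so the codes are stored as strings here.
    "value_labels": {
        "sex": {"1": "male", "2": "female"},
        "smoker": {"0": "no", "1": "yes"},
    },
}


def write_with_sidecar(rows, metadata, csv_path):
    """Write rows to csv_path and the labels to a JSON file next to it."""
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    sidecar_path = csv_path.with_name(csv_path.stem + ".labels.json")
    sidecar_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar_path


def read_labelled(csv_path, sidecar_path):
    """Read the CSV and replace coded values with their labels where one is defined."""
    labels = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))["value_labels"]
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return [
            {col: labels.get(col, {}).get(val, val) for col, val in row.items()}
            for row in csv.DictReader(f)
        ]


with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "survey.csv"
    sidecar = write_with_sidecar(rows, metadata, csv_file)
    labelled = read_labelled(csv_file, sidecar)

print(labelled[1])
```

Both files are plain text, so the labels survive without depending on any particular package. The catch is that the JSON layout here is ad hoc, so it would need to be documented alongside the data.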



commented Aug 18, 2015 by Simon W (155 points)
+1 for the fact that formats change and shift over time - also ubiquitous proprietary packages eventually get superseded and die out, and so forth.

commented Aug 18, 2015 by Wolf (0 points)
From my perspective, the *modify* option is not as important as filtering and "selecting" (columns). Maybe this is also true for you?

