Why do many researchers choose not to publish code and data alongside their articles?

+13 votes
707 views
asked Aug 4, 2015 in Open Science by tomp (215 points)

What are the core reasons why researchers choose not to publish code and data alongside traditional research articles?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by tomp (215 points)
Thanks for your comments, and I agree more specific would be better. Sorry, I'm a new poster to StackExchange. What is the best response given there are already comments? Edit or delete?

commented Aug 18, 2015 by Simon W (155 points)
this is in danger of being a "list" question, I fear. But, maybe it won't develop that way :)

commented Aug 18, 2015 by Gram (185 points)
It is up to you to choose the best response; however, if both answers are helpful you can vote them up. If an answer is incomplete or not detailed enough, leave a comment and the poster (such as me, with my mostly-a-list answer) can elaborate. :)

commented Aug 18, 2015 by tomp (215 points)
Apologies, I changed the question somewhat in an attempt to be more specific. Considered deleting and starting again, which would have made things easier, but didn't want to lose your answers...

commented Aug 18, 2015 by Gavin Simpson (720 points)
This is pretty broad; do you have specifics? Perhaps ask as separate questions or rephrase this.


8 Answers

+9 votes
answered Aug 11, 2015 by Neil Chue Hong (155 points)
 
Best answer

Victoria Stodden has undertaken research on this topic. In particular, in a 2010 survey of the machine learning community described in her talk What is Reproducible Research? The Practice of Science Today and the Scientific Method, she notes that the top reasons given were:

  • 77% Time to document and clean up
  • 52% Dealing with questions from users
  • 44% Not receiving attribution
  • 40% Possibility of patents
  • 34% Legal barriers (i.e. copyright)
  • 30% Potential loss of future publications
  • 30% Competitors may get an advantage
  • 20% Web/disk space limitations


commented Aug 18, 2015 by tomp (215 points)
Great answer, thanks Neil. Perhaps related to the way the survey was conducted, but I'm surprised that "Scared of someone spotting a problem" didn't make it to the list.

commented Aug 18, 2015 by Richard Smith-Unna (0 points)
@tomp anecdotally, I have heard many colleagues imply that releasing code and data invites a level of scrutiny with which they are not comfortable.

commented Aug 18, 2015 by Neil Chue Hong (155 points)
@tomp not 100% certain, but I believe those kind of answers would fall under "Time to document and clean up" and "Dealing with questions from users", which were the top two categories.

commented Aug 18, 2015 by Neil Chue Hong (155 points)
I also wrote about the fear that being called out in public for publishing messy code might affect future job prospects in this article on a [prominent case of code shaming](http://www.software.ac.uk/blog/2013-01-25-haters-gonna-hate-why-you-shouldnt-be-ashamed-releasing-your-code).

+12 votes
answered Aug 4, 2015 by Gram (185 points)

I am a big fan of open science. Here is a list of fears I have seen at my workplace:

  • Trade Secrets (Someone will put us out of business if they know this)
  • Hosting Costs (It is cheaper to keep this internal than to pay to have it exposed)
  • If it works, why fix it? (Fear of change is certainly not a new thing)
  • Licensing Inexperience (It will take a lot of work, and possibly outside expertise, to figure this out)
  • Too many bosses (The amount of red tape to get this approved is not worth the effort)
  • Plagiarism (We don't want our work "Stolen" and if we have unwittingly used someone else's work, at least a limited set of eyes will see it so we should be safe.)

Edit: My answer was meant as a point-form summary; if you would like, I can go into detail on any of the topics listed.



commented Aug 18, 2015 by Gram (185 points)
Sounds good, it is always nice to have a study to point to. :)

commented Aug 18, 2015 by tomp (215 points)
Many thanks Gram. Close call but I ended up accepting Neil's answer because the survey provides helpful evidence to support the points.

+5 votes
answered Aug 4, 2015 by Guido Jorg (90 points)

As Hebb wrote, not all issues are psychological, but most are. I agree with Gram's answer, and Simon's answer is true not only for code but for mathematical proofs too. However, there is one other major issue, and it is, I suggest, psychological.

Many researchers, tenured or not, falsely think: open, therefore free, therefore worthless. Too many imagine they lose face in front of colleagues who publish in journals that are not open... (For the apparently valid reason that if somebody pays to read an article, it means they value it more and it has more value...)

This argument isn't applicable here. Most paywalled journals are read because a university subscribes to them; the target readers don't usually pay for them themselves. Furthermore, in science, value is not judged by willingness to pay anyway.

Indeed, tenured colleagues really have no reason to prefer publications with more prestige; some would even reply that they don't know what that means. At that point it is articles that are looked up and read, not journals, as readers become aware of their possible utility.

So the reason, one can suggest, is often an irrational one. PNAS, for instance, becomes free after a year, but that does not mean it's somehow less important or valuable.

(For illustration, my university library subscribes to virtually all journals in any field. But that means the library spent the money, not any scientific peers of the authors publishing there. The administrators who allocated the funds don't read the journals, although they did us researchers a great favor. At another university, this one in Europe, very few journals were accessible, so people cited mostly books, or cited a paper citing another paper if they couldn't find the other online for ...)

This also answers this question: Why do tenured professors still publish in pay-walled venues?



commented Aug 18, 2015 by Gram (185 points)
+1 That is a good read, several things I did not consider, perhaps we should adapt this to a wiki QA as it seems there are many pieces to this puzzle. :)

+4 votes
answered Aug 10, 2015 by Daniel Standage (420 points)

The other answers provide some good insight into why scientists might publish the way they do today, but I think all of them miss a pretty obvious and important point: the history of publishing.

Scientific findings have been published in print for hundreds of years, even if the concept of peer review is more recent[1]. Over the majority of this time period, scientists did not work with large data sets with the frequency and ease that we do today, and publishing "code" was certainly not common. A small data set could simply be published within an article, probably in the form of a table or figure, which could be distributed by photocopy or transcribed by hand.

Fast-forward to the 2010s: software is a critical intellectual and technical component in most areas of scientific research, and huge data sets can be disseminated openly, with ease, and with little to no cost. Distributing data and code inside a print article is rarely realistic these days, and even though journals typically publish online versions of all articles now, how to integrate supporting code and data is a big challenge—or at least publishers make it out to be a big challenge.

I would attribute the rapid advance in computing and networking technology as a primary cause in many (most) cases of "closed" thinking when it comes to publication. Many publishers and senior scientists are simply struggling to find their feet in this brave new world, and holding on to decades-old practices and values: the practices and values under which they were trained, and their mentors were trained.


[1] Baldwin M (2015) Credibility, peer review, and Nature, 1945–1990. Notes and Records of the Royal Society, 69, 337–352. doi:10.1098/rsnr.2015.0029



+3 votes
answered Aug 11, 2015 by Michael (140 points)

The two main reasons I can think of are:

  1. Laziness or lack of time. Most of the time, code in science is in a form that is not readable by someone who did not write it. Making it publishable in a proper form would take quite a lot of time, and scientists have better things to do. It is possible to write the code "readable" from the beginning, but this requires planning in advance. That rarely happens in science, since most scientists were not trained in programming and whatever they know was learned on the job.

  2. Not realising it's important enough. This is not restricted to code. How many times have you read a paper with a poorly written Materials and Methods section? On the one hand, it is quite boring to read "we used a 1 ml pipette..." in papers, but on the other hand this information is crucial when trying to reproduce the work. Code sits in the same spot as pipettes. It is a method, and scientists don't really care about it. They care about the results and the conclusions. How the results were actually achieved is usually less important, in the eyes of most authors.



+3 votes
answered Aug 13, 2015 by Franck Dernoncourt (540 points)

An interesting list of reasons explaining why authors refused to send code when asked can be found in section 4.3 So, What Were Their Excuses? (Or, The Dog Ate My Program) of the paper "Measuring Reproducibility in Computer Systems Research." Christian Collberg, Todd Proebsting, Gina Moraila, Akash Shankaran, Zuoming Shi, Alex M Warren. March 21, 2014.

(The paper was mentioned to me when I asked for a reference on availability of source code used in computer science research articles).



+3 votes
answered Aug 16, 2015 by Carlisle Rainey (70 points)

In a field experiment on sociologists, Cristobal Young investigated a slightly different question. It is not the norm in sociology to publicly post reproduction files, so he wanted to know how often these would be provided upon request. The results are summarized in this blog post.

Only 28% of the 53 researchers contacted released their data; 72% did not release their data, even upon request. Here are the justifications of the 38 researchers who did not provide their data:

  • 32% - IRB/legal/confidentiality issue
  • 26% - no response
  • 16% - don't have data
  • 14% - don't have time/too complicated
  • 5% - still using the data
  • 5% - "see the article and figure it out"

Perhaps most informatively, one researcher acknowledged their true feelings about making data publicly available:

I don’t keep or produce "replication packages"… Data takes a significant amount of human capital and financial resources, and serves as a barrier-to-entry against other researchers… they can do it themselves.

It would seem that this barrier-to-entry sentiment would be highly underreported in most measures of why researchers don't provide their data. It might explain a lot of the resistance to making data publicly available.



+1 vote
answered Aug 4, 2015 by Simon W (155 points)

One factor is that many people are comfortable publishing results based on messy, hurriedly written code, but are embarrassed about sharing that same code, and it can be a considerable time investment to clean it up. There is often little incentive to put in that time.



commented Aug 18, 2015 by Neil Chue Hong (155 points)
In fact, it can be worse than this, with people being publicly shamed for releasing their messy code. I wrote about [one prominent case](http://www.software.ac.uk/blog/2013-01-25-haters-gonna-hate-why-you-shouldnt-be-ashamed-releasing-your-code) where a piece of code which was useful to others was criticised openly, something which could affect a researcher's future job prospects.

