What is the most prominent example of reuse of open scientific data?

+6 votes
98 views
asked Aug 15, 2015 in Open Science by Thomas (915 points)

A major part of open science is data sharing, and an increasingly large amount of scientific data is now being shared publicly and freely. A question remains, however, about the extent to which such data are actually used in further scientific work (as opposed to mere verification or reproduction of the original results). What is the most prominent example of new scientific work completed using data reused from earlier open science projects?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by Alexander Konovalov (135 points)
If you think that this thread should be migrated to OpenData (or Academia or another SE site) because the OpenScience beta is closing, please edit the list of questions shortlisted for the migration [here](http://meta.openscience.stackexchange.com/questions/73/).

commented Aug 18, 2015 by rmounce (110 points)
Human Genome Project has definitely gotta be up there as a biggy. GenBank / Nucleotide sequence sharing in general too.


1 Answer

+4 votes
answered Aug 16, 2015 by Daniel Mietchen (1,200 points)

I don't know of a ranking that would answer your question, but I would be interested in seeing one, so off we go.

I agree that the platforms mentioned by Ross are amongst the most likely contenders, and I recently read "Analyzing data citation practices using the Data Citation Index", which found

that the repository with more citations is specialized in Crystallography (Crystallography Open Database), followed by the Protein Data Bank (Biochemistry & Molecular Biology) and the Inter-university Consortium for Political and Social Research (Social Sciences, Interdisciplinary).

Other resources that are widely used include PubMed/MEDLINE (which are not open in the sense of the Open Definition but useful for comparison), the Sloan Digital Sky Survey and the Database of Genotypes and Phenotypes.

So, to get a first rough idea of the scale of use we are talking about here, let's throw all of these (either using complete titles or abbreviations) into Google Scholar and record the number of hits:

  1. "PubMed" 5,430,000
  2. "MEDLINE" 1,250,000
  3. "Genbank" 720,000
  4. "Protein Data Bank" 182,000
  5. protein PDB 142,000
  6. "Human Genome Project" 86,600
  7. "Sloan Digital Sky Survey" 38,000
  8. "UniProt" 47,400
  9. crystallography COD 49,800
  10. dbGaP 44,200
  11. "database of Genotypes and Phenotypes" 40,700
  12. "Inter-university Consortium for Political and Social Research" 23,800
  13. ICPSR data 20,800
  14. HGP genome 11,900
  15. "Crystallography Open Database" 596

Of these, the PubMed one is an outlier, because many articles simply carry a PubMed ID rather than actually using PubMed. The results for MEDLINE include things like papers by people named Medline, but most are probably legit for our purposes, whereas the difference between crystallography COD and "Crystallography Open Database" (49,800 versus 596 hits, a factor of roughly 80) illustrates that this method may well be off target by orders of magnitude.
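
To make the size of that discrepancy concrete, here is a minimal sketch that compares the full-title query with the abbreviation-based query for each resource that appears twice in the list. The hit counts are copied from the list; the pairing of queries and the names `QUERY_PAIRS` and `magnitude_gap` are my own assumptions for illustration, not part of any established method.

```python
import math

# Google Scholar hit counts as recorded in the list above.
# Each value is (hits for the quoted full title, hits for the abbreviation query).
# The pairing of queries is an assumption made for this illustration.
QUERY_PAIRS = {
    "Protein Data Bank": (182_000, 142_000),
    "Human Genome Project": (86_600, 11_900),
    "Database of Genotypes and Phenotypes": (40_700, 44_200),
    "Inter-university Consortium for Political and Social Research": (23_800, 20_800),
    "Crystallography Open Database": (596, 49_800),
}


def magnitude_gap(full_title_hits: int, abbreviation_hits: int) -> float:
    """Return the absolute gap between two hit counts, in orders of magnitude."""
    return abs(math.log10(abbreviation_hits / full_title_hits))


for resource, (full_hits, abbr_hits) in QUERY_PAIRS.items():
    gap = magnitude_gap(full_hits, abbr_hits)
    print(f"{resource}: {gap:.2f} orders of magnitude apart")
```

On these numbers, the two Crystallography Open Database queries sit just under two orders of magnitude apart, the Human Genome Project pair a little under one, and the remaining pairs well below that, so the choice of search term alone can change a resource's position in the ranking.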


