Provenance and standards for figure metadata

+5 votes
63 views
asked Aug 8, 2015 in Open Science by Ian (25 points)

I'm interested in linking the figures (PNG, SVG, PDF) that I embed in my papers, talks, and webpages to the scripts, code, and data that produce them. The hope is that, given just the figure and a simple script, the entire environment that produced the results can be recreated (probably via a VM, despite their issues) and the process for replicating the results made clear (or as clear as the code and pipeline are).

To do this I need to embed metadata in the figure files. That's easy enough to do, but there are many standards for what metadata to include and how. The more standards-compliant this metadata is, the more likely it is to be preserved if/when the figures are processed by other tools.

Which standard(s) are best to follow when embedding provenance-related metadata in figure files?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by Alexander Konovalov (135 points)
If you think that this thread should be migrated to Academia or another SE site because the OpenScience beta is closing, please edit the list of questions shortlisted for the migration [here](http://meta.openscience.stackexchange.com/questions/73/).


2 Answers

+3 votes
answered Aug 12, 2015 by Robin Berjon (55 points)
 
Best answer

If your metadata must be embedded inside the files themselves (as opposed to residing in, for instance, an external manifest), then the only option I am aware of that works across the range of document types you list is XMP.

I am not up to date on the level and quality of tooling available for it, but I believe it is reasonably good. At the very least, if memory serves, it is rather simple to implement, because a reader basically needs to look for a standard marker (the `<?xpacket` packet header), which works even in arbitrary binary streams. In PNG the packet is stored in a text chunk and in PDF in a metadata stream; in SVG I reckon you can just embed it in a `<metadata>` element.
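To illustrate the marker-scanning idea, here is a minimal sketch in Python. It assumes the XMP packet is stored uncompressed and in UTF-8; the function name `find_xmp_packet` and the sample bytes are made up for illustration, and a real toolkit handles many more cases (other encodings, compressed streams, padding).

```python
import re

# An XMP packet is wrapped in two processing instructions:
# "<?xpacket begin=... ?>" at the start and "<?xpacket end=... ?>" at the end.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packet(data: bytes):
    """Return the bytes between the xpacket markers, or None if absent."""
    match = _PACKET.search(data)
    return match.group(1).strip() if match else None


# Hypothetical stand-in for a figure file: arbitrary bytes around a packet.
sample = (
    b"\x89PNG...binary..."
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><!-- provenance --></x:xmpmeta>'
    b'<?xpacket end="w"?>'
    b"...more binary..."
)

packet = find_xmp_packet(sample)
print(packet.decode("utf-8"))
```

The same scan can be run over the raw bytes of a PNG, PDF, or SVG file read with `open(path, "rb").read()`, without parsing the container format at all.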



commented Aug 18, 2015 by Ian (25 points)
This is the best approach I've seen so far; I'm currently investigating the Python XMP Toolkit as a good abstraction layer.

0 votes
answered Aug 15, 2015 by Rex Kerr (95 points)

I think you're going about this backwards because it doesn't scale to all use cases. Suppose you do an analysis of 500 days of mouse behavioral video. Do you really want to embed your entire analysis code plus the entire video in your figure? I don't think so.

Instead, you should consider embedding a reference within the figure to a uniform identifier specifying what created it (DOI or URL or something). Most image formats contain comment fields that can easily contain something like this.
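As a rough sketch of what such a comment field looks like in practice, the following Python uses only the standard library to add a `tEXt` chunk to a PNG and read it back. The keyword `Provenance` and the helper names are assumptions made for illustration (PNG allows custom keywords), and the tiny image is generated in the code so the example is self-contained.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def add_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of `png` with a tEXt chunk inserted right after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR always comes first: 8-byte signature, then 4 + 4 + 13 + 4 bytes.
    ihdr_end = 8 + 4 + 4 + 13 + 4
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + _chunk(b"tEXt", payload) + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict:
    """Collect all tEXt keyword/value pairs from a PNG byte string."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        ctype = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found


def tiny_png() -> bytes:
    """A 1x1 8-bit grayscale PNG, built here to avoid needing a file."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


url = "https://github.com/IanHawke/im2sim"  # repository mentioned in this thread
tagged = add_text_chunk(tiny_png(), "Provenance", url)
print(read_text_chunks(tagged))
```

Whether a given tool preserves such a chunk when it re-saves the image is not guaranteed, so it is worth checking with the tools in your own pipeline.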

You might worry about the figure and analysis getting out of date, but that can be easily enough verified by binary comparison (or other more sophisticated image comparison methods, if e.g. you're using lossy compression at different levels).
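A minimal sketch of the binary-comparison check, assuming the pipeline is deterministic (no timestamps or other run-specific bytes written into the output) and that any embedded identifier is the same in both files; the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a figure's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def figure_is_current(published: bytes, regenerated: bytes) -> bool:
    """True if the regenerated figure is byte-identical to the published one."""
    return digest(published) == digest(regenerated)


# Stand-in byte strings; in practice these would come from
# open(path, "rb").read() on the published and regenerated files.
print(figure_is_current(b"figure-bytes", b"figure-bytes"))
print(figure_is_current(b"figure-bytes", b"figure-bytes-v2"))
```

Storing the digest alongside the referenced code or data lets the check run without keeping a second copy of the figure.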



commented Aug 18, 2015 by Rex Kerr (95 points)
@Ian - Unfortunately, while I seriously doubt that _this_ is a good solution, I don't have a particularly good alternative to suggest, especially if you've used cutting-edge tools. But I will note that Matlab, C++, and Java that I wrote 20 years ago all still work just fine, and I can see URLs I embedded in TIFF file "image description" fields from the same time in practically any image viewer (when viewing "image info" or the like). Do not discount the value of simplicity.

commented Aug 18, 2015 by Ian (25 points)
When the simulation relies on packages from >100 GitHub repos, with a dependency list of dozens of "standard" libraries, *before* we get to visualization, simplicity is already gone. As this is the use case I'm aiming for, I've gone for this type of approach: https://github.com/IanHawke/im2sim

commented Aug 18, 2015 by Ian (25 points)
XMP is not a figure format; it is metadata embedded in the figure. I've tested it on PNG and SVG and it works, and http://python-xmp-toolkit.readthedocs.org/en/latest/introduction.html claims it works with most interesting file formats. As for the repository, a direct link to the code doesn't give replicability without the platform on which it ran. Docker is the weak point, but the Dockerfile format can (and in my case does) transparently show how to get the code from GitHub. Please suggest a more robust alternative for archiving the whole workflow, including platform, if possible.

commented Aug 18, 2015 by Rex Kerr (95 points)
@Ian - Well, I still think you're going about this the wrong way. You're considering an exotic and little-used format, XMP, to link to a cutting-edge technology that is already under fire from multiple competitors. This seems to be setting yourself up for unreadable files that link to unreadable repositories. Instead, you should use the most common image formats (PNG, SVG, etc.) and link to something that is likely to be around for a very long time (e.g. GitHub repository, or one of the most stable data repositories). That is, if you care that it be available for more than a couple years.

commented Aug 18, 2015 by Ian (25 points)
That's precisely what I'm intending to do, via a reference to a Docker container.

