Which of JSON and XML is the better format for releasing annotated texts?

+3 votes
60 views
asked Aug 7, 2015 in Open Science by Franck Dernoncourt (540 points)

I have some annotated texts in both JSON and XML formats. What considerations should I take into account to decide whether I should release the data as JSON or XML?



This post has been migrated from the Open Science private beta at StackExchange (A51.SE)
commented Aug 18, 2015 by Franck Dernoncourt (540 points)
OpenData is indeed the best destination I believe

commented Aug 18, 2015 by Alexander Konovalov (135 points)
Should this be migrated to OpenData? Or, if the text deals with a specific subject area, to the Q&A site for that area, if one exists?

commented Aug 18, 2015 by Daniel Mietchen (1,170 points)
I'm with @bsmith89 here but wondering why - if you have it in both formats already - you don't share both?

commented Aug 18, 2015 by Scott Chamberlain (410 points)
Can you share an example, or link to one at least? I sort of prefer JSON these days, but XML is fine as well.

commented Aug 18, 2015 by bsmith89 (0 points)
What do you mean by "texts"? It seems to me that something monolithic like a web-page would suggest XML; a series of more atomic items, JSON.


2 Answers

+1 vote
answered Aug 9, 2015 by Rex Kerr (95 points)

If the structure of the data is not exceedingly complex, you should favor JSON because JSON is faster and easier to parse automatically, and also easier to read as a human. In principle, XML schema could be used to automatically identify parts of your data, but in practice there are so many different ideas for how to present data that it doesn't really work.

Thus, if you have to choose one, JSON. If you want to be super-nice, you could provide both.

There is one problem with JSON: it doesn't understand infinite and not-a-number floating point values. If you have lots of these in your data, you need some way to deal with this. Your favorite tools probably already have their own workaround. Alas, it's not standard yet. But XML doesn't even have a standard for how to present a number, so you're still modestly ahead with JSON.
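To make the non-finite value problem concrete, here is a minimal sketch using Python's standard `json` module. Mapping NaN and the infinities to `null` is only one possible convention, and the helper name `encode_special` is made up for this illustration; whatever convention you pick should be documented alongside the released data.

```python
import json
import math

values = [1.5, float("nan"), float("inf"), float("-inf")]

# By default Python writes NaN/Infinity tokens, which the JSON
# specification does not allow, so other parsers may reject them.
lenient = json.dumps(values)

# With allow_nan=False the encoder refuses non-finite floats instead.
try:
    json.dumps(values, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def encode_special(x):
    """Hypothetical convention: replace non-finite floats with None (JSON null)."""
    if isinstance(x, float) and not math.isfinite(x):
        return None
    return x


# The result is strictly valid JSON, at the cost of losing the
# distinction between NaN, +inf and -inf.
portable = json.dumps([encode_special(v) for v in values], allow_nan=False)

print(lenient)
print(strict_ok)
print(portable)
```

If the distinction between NaN and the infinities matters for your data, a string encoding such as `"NaN"` or `"Infinity"` would preserve it, but consumers then need to know to decode it.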



+1 vote
answered Aug 10, 2015 by Robin Berjon (55 points)

JSON is generally poorly suited for markup. Unless the structure of the documents is extremely simple, it will be difficult for humans to read the source and make much sense of it. To the best of my knowledge, JSON is used in some text-oriented systems, but only to save an internal representation that the application makes use of, not for interchange purposes.

For document-oriented content, XML will work much better. You can interleave content and structure more readably. Don't bother with anything like XML Schema; it's absolutely useless for documents. But if you can reuse an existing language, even better, as your users might be able to reuse tooling they already have.

Finally, have you considered HTML? It has a lot of tooling (you're likely reading this in one) and a lot of users. It has become pretty good at capturing the core structure of documents, and has the extension points that make it possible to overlay richer semantics onto a document. I would certainly recommend looking at it twice; it's usually a very good choice.



commented Aug 18, 2015 by Rex Kerr (95 points)
I guess we have very different ideas about the text-to-annotation ratio! I wonder if the OP will clarify? I'd agree with your answer if the text has most of the content, and the annotations are things like font changes and footnotes. My answer was written assuming something like a statistical analysis of that text vs. others, pointing out stuff like changes in word frequency, statistically distinguishing N-grams, etc.

commented Aug 18, 2015 by Robin Berjon (55 points)
@RexKerr Indeed, if the content is basically a dump of textual analysis, JSON is better. But if the primary goal is to overlay structure on text, then that's what markup was invented for! Things get notably awkward in JSON with mixed content, i.e. runs of text and elements interleaved. The XML or HTML encoding of such data will be a lot more readable.

commented Aug 18, 2015 by Rex Kerr (95 points)
Well, there's a clear isomorphism between `a<b>c</b>d` and `["a",{"+": "b"},"c",{"-": "b"},"d"]` and `{"acd", {"tag": "b", "range":[1,2]}}`, but if I'm mostly reading by eye rather than using a computer to load the data, yeah, the first is better.
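
The correspondence can be sketched in a few lines of Python, assuming the first form is inline markup such as `a<b>c</b>d`. The function names are invented for this illustration, the range end is assumed to be exclusive, and the sketch handles only flat, non-overlapping annotations; nested or overlapping tags would need more work.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def standoff_to_inline(text, annotations):
    """Render standoff annotations ({"tag", "range": [start, end)}) as inline markup."""
    out = []
    pos = 0
    for ann in sorted(annotations, key=lambda a: a["range"][0]):
        start, end = ann["range"]
        out.append(escape(text[pos:start]))          # plain text before the tag
        out.append("<{0}>{1}</{0}>".format(ann["tag"], escape(text[start:end])))
        pos = end
    out.append(escape(text[pos:]))                   # trailing plain text
    return "".join(out)


def inline_to_events(fragment):
    """Turn flat inline markup into the ["a", {"+": "b"}, ...] event list."""
    root = ET.fromstring("<root>" + fragment + "</root>")
    events = []
    if root.text:
        events.append(root.text)
    for child in root:
        events.append({"+": child.tag})              # tag opens
        if child.text:
            events.append(child.text)
        events.append({"-": child.tag})              # tag closes
        if child.tail:
            events.append(child.tail)                # text after the closing tag
    return events


inline = standoff_to_inline("acd", [{"tag": "b", "range": [1, 2]}])
print(inline)
print(inline_to_events(inline))
```

The `text`/`tail` handling in `inline_to_events` is exactly the mixed-content bookkeeping that markup gives you for free when reading by eye.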

