No Namespaces in JSON, Please

At first, I meant to just comment on Norman Walsh’s post “Namespaces in JSON?”, but the comment ballooned into a post of its own.

In some ways, this whole thing feels very old and everything has been said already. But, still, it’s so hard to shut up and worry about other things after all the unhappiness Namespaces in XML has caused me over the years…

The TL;DR is that I think experience from Namespaces in XML should lead to the conclusion not to repeat the same (or almost the same) thing with JSON. I think the developer community as a whole should not pay the cost of the use cases of the part of the developer community that believes (whether rightly or wrongly is outside the scope of this post) that identifiers in data formats should fit into a global naming scheme and, more specifically, that the naming scheme should make every identifier a URI. Instead, I think that the part of the developer community that believes it needs to be able to merge data thanks to identifiers being URIs should bear the cost of doing whatever name mangling it needs upon data ingest, given the knowledge of which format a given ingested piece of JSON was in.

The Scenario

The basic scenario is that someone who doesn’t feel that they need global naming creates a JSON format where the keys in objects are short and relevant to the format at hand but not guaranteed to be globally unique. Perhaps they aren’t even unique within the format itself but contextual. For example, objects representing one kind of thing and objects representing another kind of thing can use a key of the same name within the same data format if the way you reach a given object from the root disambiguates which kind of object it is.
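A minimal sketch of what contextual keys look like, using a made-up format (the library, books, authors and name keys are hypothetical and exist only for this illustration). The key name means a title in one place and a person’s name in another, and the path from the root is what tells them apart:

```python
import json

# Hypothetical format: "name" is reused for two different kinds of objects.
DOC = json.loads("""
{
  "library": {
    "books":   [{"name": "Dubliners", "author": 0}],
    "authors": [{"name": "James Joyce"}]
  }
}
""")


def book_titles(doc):
    # Reached via library -> books, so "name" is a book title here.
    return [book["name"] for book in doc["library"]["books"]]


def author_names(doc):
    # Reached via library -> authors, so "name" is a person's name here.
    return [author["name"] for author in doc["library"]["authors"]]
```

Neither function needs the key to be globally unique; the traversal already carries the context.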

Now along comes someone else who says that they have use cases where data published in this data format needs to be merged with data from other JSON formats or perhaps even with the data in non-JSON formats and asks for the data format to be redone in a way that makes the object keys of the format participate in a global naming scheme.

Why can’t everyone just get along, and why isn’t it merrier to address more use cases?

Complexity and Ergonomics

The fundamental problem with global naming schemes is that the global names are too unwieldy to use all the time. For example, Java uses reverse DNS names to leverage Internet naming as part of a global naming scheme for identifiers. Like this: nu.validator.htmlparser.dom.HtmlDocumentBuilder. The fundamental meme of the Semantic Web / Linked Data is that identifiers are URIs. Like this: http://purl.org/dc/elements/1.1/title.

Always using these fully qualified names is an ergonomic problem simply because the names are too long to type all the time. (Note how SemWeb/LD-compliant names are even worse than Java names in the sense that they carry the http:// boilerplate.)

Trying to solve this ergonomic problem creates complexity. Generally, some mechanism ends up introduced that allows producers of the format to make some upfront declarations about the long stuff and then go on using just the short last component of the name.

The Java way of doing this really allows producers of Java source code (Java programmers) to refer to just the last component of the name, in this case HtmlDocumentBuilder, after some up-front incantation. Namespaces in XML allows this approach for element names but not for attribute names. For attribute names, the long stuff needs to be bound to a shorter intermediate symbol. For example, you’d bind http://purl.org/dc/elements/1.1/ to, e.g., dc and then refer to dc:title. This introduced a confusion that Java doesn’t suffer from: people believing that the intermediate symbol has some significance and that it’s OK to look for a common prefix without expanding it.
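The prefix confusion can be sketched with Python’s standard xml.etree.ElementTree module, which exposes expanded names in the form {namespace}local. The two sample documents and the helper names below are made up for illustration; they bind the same namespace to different prefixes:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Two hypothetical documents: same namespace, different prefixes.
XML_A = f'<r xmlns:dc="{DC}"><dc:title>A</dc:title></r>'
XML_B = f'<r xmlns:x="{DC}"><x:title>B</x:title></r>'


def titles(xml_text):
    # Correct: match on the expanded name, which does not depend on the prefix.
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{DC}}}title")]


def has_title_by_prefix(xml_text):
    # Broken on purpose: treats the prefix "dc" as if it were significant.
    return "<dc:title>" in xml_text
```

The prefix-matching shortcut finds the title in the first document and misses it in the second, even though both say the same thing once the names are expanded.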

One option, mentioned (which is rather refreshing on this topic) in Norman Walsh’s post linked to at the beginning, is to avoid the indirection complexity by just using the long names in the data format and not having any data format-level shortening mechanism. This moves the complexity of addressing the ergonomic problems elsewhere. For example, a decade ago, when I was dealing with Namespaces in XML more often than I am these days, I had a utility that hooked into the accessibility APIs of the operating system to allow me to type a combination of a few letters that normally doesn’t occur in English, Finnish or Web formats and have that expanded to the XHTML namespace in any text editing application. Maybe the complexity isn’t in the syntax but moves to some tooling level.

And syntax-based solutions are about addressing the ergonomics on the data format producer side. Correctly-behaved consumers need to expand the names to their fully-qualified form up front and then potentially invent application-specific abstractions for hiding the too long forms away again. Cost all around.

Aside: In the case of Namespaces in XML, the consumer-side problem is even worse than with Java or SemWeb naming: each name doesn’t expand into one long string. Instead, it expands to a string pair. This creates additional consumer-side ergonomic problems, especially when the programming environment insists on allocating everything on the heap and/or has an interning mechanism for strings only. Programmers then end up dealing with two strings everywhere in order to avoid the performance cost of an added string pair abstraction.
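To make the string pair concrete, here is a small sketch using Python’s standard xml.dom.minidom, picked only as one convenient DOM-style API. The is_xhtml_p helper is hypothetical:

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"

root = minidom.parseString(f'<p xmlns="{XHTML_NS}">hi</p>').documentElement

# The expanded name is a (namespace, local name) pair, not a single string.
expanded = (root.namespaceURI, root.localName)


def is_xhtml_p(el):
    # Every comparison has to carry both halves of the pair.
    return el.namespaceURI == XHTML_NS and el.localName == "p"
```

Any code that wants to recognize the element has to pass around and compare two values instead of one.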

Bearing the Cost

I think it’s not cool socially to impose the ergonomic problem, or the complexity and costs arising from trying to alleviate it, onto people who wouldn’t have that problem with JSON to begin with, because they would just happily use short object keys, which have the ergonomics of short-enough-to-type string literals in whatever programming language.

I think the people who do have the use cases that require cross-format unique naming should be the ones to deal with the cost of their use cases and write format-specific converters that run at the ingest boundary of their multi-format mashing systems.
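Such a converter can be quite small. The following is only a sketch under assumptions: the short keys, the mapping table and the ingest function are hypothetical, and a real format with contextual keys would need a mapping that also looks at the path from the root.

```python
# Hypothetical per-format table, maintained by the party that wants
# global names, not by the publisher of the format.
BOOK_FORMAT_KEYS = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "creator": "http://purl.org/dc/elements/1.1/creator",
}


def ingest(record, key_map):
    """Rename known short keys to global names; leave unknown keys alone."""
    return {key_map.get(key, key): value for key, value in record.items()}
```

For example, ingest({"title": "Dubliners", "pages": 152}, BOOK_FORMAT_KEYS) renames title to its URI form and passes pages through unchanged. The publisher of the format never sees any of this.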

So I think the appropriate response is not to complicate JSON-based formats with global naming. Saying “No” to e.g. JSON-LD should not be viewed as not getting along with the LD community. Instead, it should be OK to view asking others to take on ergonomic problems and complexity as not cool.

C++ has the meme that “you don’t pay for what you don’t use”. Global naming is the opposite of that. If it’s baked into the format, you pay for it even if it’s YAGNI for you.

Epilog

(To be clear, this epilog isn’t in response to anything Norman Walsh’s post said.)

But It Works for Programming Languages! Explain that!

There are programming languages, such as Java but by no means limited to Java, that have the concept of things having fully-qualified names and a mechanism for importing the last component of the fully-qualified name into use so that the programmer doesn’t need to type the fully-qualified name every time. If this works for programming languages why would anyone be against it for data formats?

The superficial reason is that there are fewer people writing compilers or interpreters than people writing software that consumes various data formats, but that’s not actually the key reason.

The second, slightly less superficial reason is that the naming scheme would fall down immediately in the case of compilers and interpreters if the implementors cheated and tried to avoid the name expansion step. Meanwhile, when consuming data formats, if Namespaces in XML as used for RDF-looking purposes is any indication, it is common (even encouraged; see the next section) for people to try to get away with not expanding the names. Once enough systems need to interoperate, the shortcut shows, but the software may well have made it to deployment first.

Aside: An anecdote superficially to the contrary, but not really: I’ve written software that consumes Java code without expanding the names, but I got away with it only because the piece of software only works with a specific set of .java files and isn’t suitable for general use.

I think the profound reason why the concept of local short identifiers expanding to long fully-qualified names in the processing model works for programming languages is that compilers generally don’t do work specific to a particular name. A Java compiler has no code for doing some specific thing just for nu.validator.htmlparser.dom.HtmlDocumentBuilder. The fully-qualified names go into a hash table somewhere, but the compiler developer doesn’t need to write string literals to pull specific things out of the hash table by a manually-written specific name. In contrast, while people who advocate for global naming value generic processing of data, typical applications that consume data formats need to do specific processing for particular parts of the format.

An application that processes a JSON-based data format is likely to have code written for dealing with each kind of object key specifically, which means that the programmer has to actually refer to specific names, which means that the consumer-side view of the ergonomic issues is present all the time. This is very different from how compilers and linkers deal with the named things.

But You Can Ignore the Complexity!

One way of selling why an XML-based format should have become an RDF-based format or why a JSON-based format should become a JSON-LD-based format is to say that making things syntactically RDF then or LD now doesn’t impose any cost onto consuming software that isn’t doing RDF/LD processing, because you can simply ignore the RDF/LD bits if you don’t need them but let the people who do want them benefit from them.

This simply isn’t how format compatibility works at scale. You can’t just make a processing layer optional and have everything be happy for everyone.

One of three things happens:

The optional processing layer introduces enough syntactic sugar that some producers start relying on the quasi-optional layer (JSON-LD in the JSON case) being there and consumers that didn’t want to buy into the quasi-optional layer are forced to implement the layer that was sold as optional.
Producers don’t test with software that uses the quasi-optional layer, so what they output is broken for the purposes of the quasi-optional layer. The people who wanted the quasi-optional layer to be there can’t get the benefits anyway.
A messy mixed state of the two above options without clearly converging on either.
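A sketch of the first outcome, assuming JSON-LD’s context mechanism, in which either a whole term or just a prefix can be mapped to a URI. The two documents are hand-written for illustration, and plain_json_title is a hypothetical consumer that took the “just ignore the LD bits” advice:

```python
import json

# Two hand-written JSON-LD spellings of the same statement.
DOC_TERM = json.loads("""
{"@context": {"title": "http://purl.org/dc/elements/1.1/title"},
 "title": "Dubliners"}
""")
DOC_PREFIX = json.loads("""
{"@context": {"dc": "http://purl.org/dc/elements/1.1/"},
 "dc:title": "Dubliners"}
""")


def plain_json_title(doc):
    # Reads the short key directly and never looks at "@context".
    return doc.get("title")
```

The consumer works on the first spelling and silently finds nothing in the second. As soon as some producer emits the prefixed form, this consumer has to implement the context processing that was supposed to be optional.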