Xqueeze Evolution and Feature Details

XML is a meta-language for defining SGML-compliant languages. It has found widespread use in the definition of file formats and some communication protocols, most notably the Web Services protocols. Among the design objectives of XML were easy human readability and minimal importance of terseness. Now that XML is being used in high-volume machine-to-machine interactions, its verbosity becomes a detriment to efficiency while nothing is gained from its readability. Xqueeze is an alternative to XML for more efficient machine-to-machine interaction with minimum loss of flexibility.

Introduction

XML is a meta-language that allows for quick and convenient definition of SGML-compliant document types. XML is now a format of choice for a large category of applications including XML databases, messaging and communication protocols, document formats and even a format for movies (the upcoming MPEG 4 format), among many others.

There are several applications where an XML document is generated by one piece of software and consumed by another, without human intervention in between. Examples of such applications are Web Services brokers and agents, XML-based instant messaging clients and servers, and productivity applications that use XML-based document formats, such as OpenOffice.org.

The readability of XML, which is the cause of much of its verbosity, is of no use to deployments where XML is handled solely by machines from its point of generation to its point of consumption, without human involvement. Such applications pay a size overhead for a feature, readability, that they do not use.

XML Compression

XML documents are easily compressed using traditional as well as XML-specific compression algorithms. This offers a way to restrict the size of XML documents where it becomes a problem. There are, however, some shortcomings of XML compression that stand in the way of universal acceptance of this scheme:

XML Compaction

Independent of the attempts at XML compression, attempts have been made at reducing the verbosity of XML by modifying its structure. XML carries a lot of redundancy as a result of its support for human readability and its lack of concern for terseness. For example, in a well-formed XML document, it is unnecessary for a closing tag to contain the name of the tag it closes. This is very similar to the scope delimiting achieved by the '{' and '}' characters in several programming languages.
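The point about closing tags can be illustrated with a minimal Python sketch. It is not taken from any particular compaction scheme: the event tuples and the function names strip_close_names and restore_close_names are hypothetical, and the input is assumed to be well formed. Because elements nest strictly, a stack of open element names is enough to recover the name that each anonymous closing tag refers to.

```python
def strip_close_names(events):
    """Replace every named close event with an anonymous close marker."""
    return [("close", None) if kind == "close" else (kind, value)
            for kind, value in events]


def restore_close_names(events):
    """Recover close-tag names by tracking the open elements on a stack."""
    stack, restored = [], []
    for kind, value in events:
        if kind == "open":
            stack.append(value)              # remember which element is open
            restored.append((kind, value))
        elif kind == "close":
            restored.append(("close", stack.pop()))  # innermost open element
        else:
            restored.append((kind, value))   # text passes through unchanged
    return restored


# Hypothetical event stream for <order><item>pen</item></order>
events = [
    ("open", "order"), ("open", "item"), ("text", "pen"),
    ("close", "item"), ("close", "order"),
]
compact = strip_close_names(events)
roundtrip = restore_close_names(compact)
```

Here roundtrip is equal to events, so dropping the names from closing tags loses no information as long as the document is well formed.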

Other attempts include building a numbered index of the element names encountered in the document, adding an element name to the index on its first use and referring to it by its corresponding number in subsequent uses. Here's an example:
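A minimal sketch of this indexing idea in Python, not the output of any specific tool: the function names index_names and resolve_names are hypothetical, and the element names are represented as a plain list rather than as real markup. The first use of a name emits the name itself and registers it; later uses emit only its number.

```python
def index_names(names):
    """First use of a name emits the name; later uses emit its index number."""
    index, encoded = {}, []
    for name in names:
        if name in index:
            encoded.append(index[name])   # already seen: refer to it by number
        else:
            index[name] = len(index)      # first use: assign the next number
            encoded.append(name)
    return encoded


def resolve_names(encoded):
    """Rebuild the original names by replaying the same index construction."""
    table, names = [], []
    for token in encoded:
        if isinstance(token, int):
            names.append(table[token])
        else:
            table.append(token)
            names.append(token)
    return names


names = ["order", "item", "item", "note", "item", "order"]
encoded = index_names(names)
# encoded == ["order", "item", 1, "note", 1, 0]
decoded = resolve_names(encoded)
```

The index is rebuilt by the consumer while it reads the document, so it never has to be transmitted separately, but it must be reconstructed for every document.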

Though these compaction techniques help in reducing the size of the document, the reduction factor (size(original) / size(compacted)) is not very high. Also, some of these techniques may increase the complexity of the parser, thereby offsetting any performance gains obtained from smaller file sizes.

There have been attempts at more aggressive compaction by "binarization" of XML, the most notable example being WAP Binary XML (WBXML). The shortcoming of WBXML is that it is defined for a few well-known document types only. A few other binarization attempts were found but they were limited in their applicability.

A few attempts have also been made at creating alternative markup schemes that do not comply with SGML but they have met with limited success.

Xqueeze

Xqueeze attempts to tackle XML compaction with a combination of several techniques, some of which have been mentioned above. The objectives for Xqueeze are:

Xqueeze has a combination of features that sets it apart from the XML compaction attempts mentioned above: a directly parsable format, an independent data dictionary, a structure similar to XML, and non-redundant markup. Let us examine these aspects in detail.

Directly parsable format

xqML is a directly parsable format that can be resolved into valid XML structures without the need for XML anywhere between generation and consumption. The xqML format is simple enough to allow small and efficient parsers. An xqML parser can also be written as a plugin for pre-existing DOM or SAX based XML parsers, since the only thing that differs is the method of extracting the various structural XML units.

Independent Data Dictionary

Unlike many other attempts involving data dictionaries, the Xqueeze data dictionary is not per document. Rather, it is per document type. This makes it a direct counterpart of XML DTDs and Schemas. A data dictionary is constructed out of the DTD/Schema that a document complies with, and that data dictionary is used to generate xqML, the Xqueeze Markup Language.

The Xqueeze Association is generated using an algorithm that guarantees the same associations for the same specification (DTD/Schema). Thus, it need not be passed on by the generator to the consumer. The latter is capable of creating its own copy of the Xqueeze Association if it knows the specification (DTD/Schema) that a given xqML document complies with.

Thus, small xqML documents save the space needed to carry a data dictionary. The parser also need not reconstruct a data dictionary in its memory for each and every document, as long as the documents comply with the same specification (DTD/Schema).

Similar Structure

xqML is structurally very similar to XML. The reduction in size is achieved through the following means:

Structural similarity enables construction of parsers that can easily provide the functionality of DOM or SAX based XML parsers. Structural similarity also leads to similar gains using compression. In one experiment we compressed a 12 kB XML file and its corresponding xqML representation. The latter turned out to be ~ 50% of the former in size.

Non-redundant Markup

xqML does not contain any redundant markup. It may be argued that this would reduce the syntactical error detection capabilities of xqML. While this is true, it should not be a major detriment to its applicability, since xqML (or XML) generation software is expected to be thoroughly tested for correct syntax before being deployed.

As far as possible, the various structural units (start tags, attribute values etc.) are delimited without using end-delimiters. The delimiters used are mostly single characters. No markup is used for representing information that would already be available to the parser during the course of parsing (e.g., the name of the element to close when an end tag is encountered).

xqML Grammar and Notation

As mentioned earlier, xqML is structurally very similar to XML. The greatest contributors to xqML's compact nature are the elimination of redundant information and representation of XML identifiers (NMTOKENs) with binary symbols.

Xqueeze Symbol Association

Xqueeze uses an association between symbols and their corresponding XML identifiers and types as defined in a specification (DTD/Schema). This enables representation of known identifiers in the markup with symbols. Associating the type of an identifier along with its name also makes it easy to identify the various structural units of the document without having to use too many special characters and character combinations.

Xqueeze Association Algorithm

This simple algorithm assures that the assignments would remain the same even if a particular specification (DTD/Schema) has slight variations in the way it's written in the generator's and consumer's copies, as long as both define the same things.
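The order-independence property described here can be sketched in Python. This is not the Xqueeze algorithm itself, whose steps are not given in this text; it is an assumed approach that sorts the declared identifiers before numbering them. The function name build_association, the "element" and "attribute" kinds, and the use of small integers as symbols are all hypothetical.

```python
def build_association(declarations):
    """Map each (kind, name) pair to a numeric symbol.

    Assumption: sorting the de-duplicated pairs makes the numbering depend
    only on what is declared, not on the order or repetition of declarations.
    """
    unique = sorted(set(declarations))
    return {pair: symbol for symbol, pair in enumerate(unique)}


# Two copies of the "same" specification, written in different orders.
generator_spec = [("element", "order"), ("element", "item"), ("attribute", "id")]
consumer_spec = [("attribute", "id"), ("element", "item"),
                 ("element", "order"), ("element", "item")]

generator_assoc = build_association(generator_spec)
consumer_assoc = build_association(consumer_spec)
```

Both sides arrive at the same mapping without exchanging it, which is the property that lets the consumer rebuild the association from its own copy of the specification.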
