Why not plain XML for Web Services?
In this write-up, I try to explain some of the reservations
that crop up when one thinks of using a markup language like XML
in resource-intensive applications like Web Services.
Introduction
In the second phase of revitalization of commercial Internet
applications, web services are going to play a major role. With
web services comes the need for disparate systems to talk in a
mutually understandable language. This need is nicely fulfilled by
XML, since it is an open standard and powerful enough to convey the
kind of business information we need to exchange. But there are
problems that crop up when we try to use markup languages for
heavy-duty jobs, and the use of XML for web services has its share
too [1]. Have a look:
The Problems:
- Low information to data ratio: Markup languages typically carry
a lot of overhead in their tags. As a simple measure of how much
overhead this causes, just save any webpage in "HTML only" format
and in plain-text format and compare their sizes (the first sketch
after this list shows one way to automate the measurement). As an
example, my homepage on Konark (NCST's main WWW server) weighs in
at 4263 bytes in HTML. The same page in plain text measures only
1463 bytes! What this means is that the markup occupies 191%
(nearly twice) as much data as the real information (the plain
text). Keep this 2:1 ratio in mind, as we're going to need it in
the arguments below.
- Wastage of Computing Resources: Millions (billions?) of page
hits are recorded daily around the world. Each page hit involves
the webserver locating and dishing out the requested page to the
client browser. The browser decodes the markup to display the page
for the end-user. Does the end user do anything with that HTML?
S/he wouldn't care if the page were written in some gibberish
collection of characters, as long as the browser was able to
decipher it correctly. The question I ask is: does the web client
need such explicitly elaborate markup tags? The simple answer is
NO. The browser wouldn't complain if the <table> tag were instead
written as, say, @23. Rather, it takes the browser more time to
parse and decode the 7 characters that make up the <table> tag
than the three characters of @23 (the second sketch after this
list plays with this idea).
This is one part of the story. With the type of information
we're going to send down the wire while using web services, privacy
is a major concern. To protect sensitive information from
malicious sniffers, we're going to need encryption. Encryption
itself requires a lot of computing resources, and it also adds to
the data overhead transferred between the peers. On my homepage,
only one part of the data is sensitive information that I might
want to protect; the remaining two parts are markup. Now, when I
send my homepage over an encrypted channel, I'm only concerned
about concealing the one part of real information. Let the
sniffers make merry with the markup. Is it possible to encrypt
just the information and send it mixed with unencrypted markup?
As of now, NO (the third sketch after this list shows what that
costs). Moral of the story: we are going to spend most of our
computing resources decoding, encrypting and decrypting the
elaborate markup.
- Wastage of bandwidth: Needless to say, if I'm sending two
parts of markup to convey one part of information, I'm majorly
wasting bandwidth. Some experts have suggested end-to-end
compression for sending data across. It's quite effective too,
because compression can bring down the size of clear text up to
thirty-fold, which solves the bandwidth problem all right. But
who's got the computing resources that it takes to compress and
decompress the terabytes of data exchanged across Internet
connections daily? (The last sketch after this list puts rough
numbers on this trade-off.)
- Who are we doing this for? Markups are wonderful tools for
programmers, with their explicitly obvious terms and tags. So
when you have to draw a table in HTML, you start with the
<table> tag and move on to specifying attributes like
cellspacing=0. Good for the lazy bums that programmers are
(and should be, according to Larry Wall), but how much does a
human being interact with HTML? Only during the development -
debugging - maintenance phase. The machines, on the other hand,
have to deal with the HTML day in and day out. Does the webserver
need the entire word #include to be told about an SSI request?
No, the #include is for us. Does the web browser need to be told
the whole background-repeat-y story? No, it just makes things
easier for us developers. And we developers only interact with
the markup tags while developing the pages. Why then, after the
page is done, do we make the machines do the grind of processing
the oh-so-long markup tags for the entire lifetime of the page?
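
To make the first problem concrete, here is a minimal Python sketch
of the size measurement, assuming the page has been saved as
page.html (a hypothetical file name). It strips the tags with the
standard html.parser module and reports the overhead ratio; run on
the homepage above, it would report the 191% figure.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects only the character data, discarding every tag."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    html = open("page.html", encoding="latin-1").read()
    extractor = TextExtractor()
    extractor.feed(html)
    text = "".join(extractor.chunks)

    markup = len(html) - len(text)
    print("HTML: %d bytes, text: %d bytes" % (len(html), len(text)))
    print("Markup is %.0f%% the size of the real information"
          % (100.0 * markup / len(text)))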
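
The @23 idea from the second problem is just as easy to play with.
The code table below is invented purely for illustration (no browser
understands it), but it shows how many bytes the verbose tags alone
cost:

    SHORT_CODES = {            # hypothetical tag-to-code table
        "<table>": "@23",
        "</table>": "@24",
        "<td>": "@25",
        "</td>": "@26",
    }

    def shorten(html):
        """Substitute each verbose tag with its short code."""
        for tag, code in SHORT_CODES.items():
            html = html.replace(tag, code)
        return html

    page = "<table><td>Quarterly totals</td></table>"
    packed = shorten(page)
    print("%d bytes with tags, %d with codes" % (len(page), len(packed)))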
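
On the encryption point: the cipher has to chew through markup and
payload alike, so the wasted work follows directly from the 2:1
ratio. A sketch using the third-party cryptography package (an
assumption on my part; any symmetric cipher would show the same),
with the sizes measured from my homepage:

    from cryptography.fernet import Fernet

    cipher = Fernet(Fernet.generate_key())

    text_only = b"x" * 1463   # the one part I actually want to hide
    full_page = b"x" * 4263   # the same part plus two parts of markup

    print("ciphertext, text only:", len(cipher.encrypt(text_only)), "bytes")
    print("ciphertext, full page:", len(cipher.encrypt(full_page)), "bytes")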
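
And on the compression trade-off: zlib from Python's standard
library recovers most of the wasted bandwidth precisely because
markup is so repetitive, but it charges CPU time on both ends of
the connection. A rough sketch with artificially repetitive HTML:

    import time
    import zlib

    page = b"<table><td>cell</td></table>" * 2000

    start = time.perf_counter()
    packed = zlib.compress(page, 9)   # maximum compression level
    elapsed = time.perf_counter() - start

    print("original:   %d bytes" % len(page))
    print("compressed: %d bytes (%.0fx smaller) in %.4f s"
          % (len(packed), len(page) / float(len(packed)), elapsed))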
Unicode
Unicode is another inevitability in the world of web services. In
the beginning there was ASCII - the 7-bit coding used to represent
the English alphabet along with some special characters [2]. 7 bits
allow 2^7 = 128 permutations, enough to accommodate 127 characters
and the indispensable NULL. As the development of computers grew
beyond the borders of English-speaking societies, the need to
incorporate more characters in the basic encoding scheme led to the
development of the ISO 8-bit encoding schemes [3]. An 8-bit
encoding doubles the permutations to 256, making space for 255
characters and the NULL. There is a whole series of ISO 8-bit
character sets; ISO-8859-1, for instance, is the Latin character
set. Very soon, however, ISO encoding reached its limits too, as
the need to accommodate Asian character sets arose and it became
painful to keep adding ISO character sets while conformance proved
difficult. This led to the development of Unicode [4]. Unicode is
a 16-bit encoding, allowing us the luxury of 65535 characters and
the NULL. Assuming the average number of characters in an alphabet
to be 52, Unicode can support over 1250 distinct character sets and
still leave space for more.
As we move towards computing at a global scale, the need for
native support of multiple character sets is driving more and more
systems towards the adoption of Unicode. Already the major DBMSes
have built-in support for Unicode, and most programming languages
now support it; Perl even uses Unicode as its default encoding
scheme.
Given the multi-lingual support that is required of Web Services,
it is inevitable that the communication will be in Unicode. But
think about it: if we start using Unicode for exchanging documents
in markup languages like XML, what becomes of the problems
mentioned above? The existing ISO character sets are 8 bits wide,
and Unicode will double that to 16 bits, aggravating all of them!
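
A quick way to check the doubling claim, assuming the 16-bit UTF-16
encoding this write-up has in mind (a later alternative, UTF-8,
behaves differently and stays at one byte per character for plain
ASCII):

    text = "<table>Hello</table>"

    # 20 bytes in an 8-bit ISO encoding...
    print("ISO-8859-1:", len(text.encode("iso-8859-1")), "bytes")
    # ...40 bytes in UTF-16: every character now takes two
    print("UTF-16:    ", len(text.encode("utf-16-le")), "bytes")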
Is there a choice? The prospect of using XML with Unicode is very
appealing when it comes to the integration that Web Services
require. But the more markup-intensive we get, the more
computation and bandwidth problems we face from the deadly duo of
XML and Unicode. So do we go along with the combination, starting
out with a low QoS and hoping that, as we move on, faster systems
will come to the rescue? Will those systems come soon enough? Or
do we discard the whole idea as unviable?
References
[1]: Take a look at the ZDNet article "Fat protocols slow Web
services" and the talkback discussion.
[2]: "American Standard Code for Information Interchange", from
FOLDOC.
[3]: "ISO 8859", from FOLDOC.
[4]: "Unicode", from FOLDOC; also "The Unicode Standard: A
Technical Introduction".
[5]: "New in Unicode 3.0".
- Write-up by:
Tahir Hashmi (VSE, DAKE)
Dated: 22 Jan, 2002