Why not plain XML for Web Services?
In this write-up, I try to explain some of the reservations
that crop up when one thinks of using a markup language like XML
in resource-intensive applications like Web Services.
Introduction
In the second phase of revitalization of commercial Internet
applications, web services are going to play a major role. With
web services comes the need for disparate systems to talk in a
mutually understandable language. This need is nicely fulfilled by
XML, since it is an open standard and powerful enough to convey the
kind of business information we need to exchange. But there are
problems that crop up when we try to use markup languages for
heavy-duty jobs, and the use of XML for web services has its share
too [1]. Have a look:
The Problems:
- Low information to data ratio: Markup languages typically carry
a lot of overhead in their tags. As a simple measure of how much
overhead this causes, just save any webpage in "HTML only" format
and in plain-text format and compare their sizes (the first sketch
after this list shows one way to automate the measurement). As an
example, my homepage on Konark (NCST's main WWW server) weighs in
at 4263 bytes in HTML. The same page in plain text measures only
1463 bytes! What this means is that the markup occupies 191%
(nearly twice) as much data as the real information (the plain
text). Keep this 2:1 ratio in mind, as we're going to need it in
the arguments below.
- Wastage of Computing Resources: Millions (billions?) of page
hits are recorded daily around the world. Each page hit involves
the webserver locating and dishing out the requested page to the
client browser. The browser decodes the markup to display the page
for the end-user. Does the end user do anything with that HTML?
S/he wouldn't care if the page were written in some gibberish
collection of characters, as long as the browser was able to
decipher it correctly. The question I ask is: does the web client
need such explicitly elaborate markup tags? The simple answer is
NO. The browser wouldn't complain if the <table> tag were instead
written as, say, @23. Rather, it takes the browser more time to
parse and decode the 7 characters that make up the <table> tag
than the three characters of @23 (the second sketch after this
list plays with this idea).
This is one part of the story. With the type of information
we're going to send down the wire while using web services, privacy
is a major concern. To protect sensitive information from
malicious sniffers, we're going to need encryption. Encryption
itself requires a lot of computing resources, and it also adds to
the data overhead transferred between the peers. On my homepage,
only one part of the data is sensitive information that I might
want to protect; the remaining two parts are markup. Now, when I
send my homepage over an encrypted channel, I'm only concerned
about concealing the one part of real information. Let the
sniffers make merry with the markup. Is it possible to encrypt
just the information and send it mixed with unencrypted markup?
As of now, NO (the third sketch after this list shows what that
costs). Moral of the story: we are going to spend most of our
computing resources decoding, encrypting and decrypting the
elaborate markup.
- Wastage of bandwidth: Needless to say, if I'm sending two
parts of markup to convey one part of information, I'm majorly
wasting bandwidth. Some experts have suggested end-to-end
compression for sending data across. It's quite effective too,
because compression can bring down the size of clear text up to
thirty-fold, which solves the bandwidth problem all right. But
who's got the computing resources that it takes to compress and
decompress the terabytes of data exchanged across Internet
connections daily? (The last sketch after this list puts rough
numbers on this trade-off.)
- Who are we doing this for? Markups are wonderful tools for
programmers, with their explicitly obvious terms and tags. So
when you have to draw a table in HTML, you start with the
<table> tag and move on to specifying attributes like
cellspacing=0. Good for the lazy bums that programmers are
(and should be, according to Larry Wall), but how much does a
human being interact with HTML? Only during the development -
debugging - maintenance phase. The machines, on the other hand,
have to deal with the HTML day in and day out. Does the webserver
need the entire word #include to be told about an SSI request?
No, the #include is for us. Does the web browser need to be told
the whole background-repeat-y story? No, it just makes things
easier for us developers. And we developers only interact with
the markup tags while developing the pages. Why then, after the
page is done, do we make the machines do the grind of processing
the oh-so-long markup tags for the entire lifetime of the page?
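
To make the first problem concrete, here is a minimal Python sketch
of the size measurement, assuming the page has been saved as
page.html (a hypothetical file name). It strips the tags with the
standard html.parser module and reports the overhead ratio; run on
the homepage above, it would report the 191% figure.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects only the character data, discarding every tag."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    html = open("page.html", encoding="latin-1").read()
    extractor = TextExtractor()
    extractor.feed(html)
    text = "".join(extractor.chunks)

    markup = len(html) - len(text)
    print("HTML: %d bytes, text: %d bytes" % (len(html), len(text)))
    print("Markup is %.0f%% the size of the real information"
          % (100.0 * markup / len(text)))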
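
The @23 idea from the second problem is just as easy to play with.
The code table below is invented purely for illustration (no browser
understands it), but it shows how many bytes the verbose tags alone
cost:

    SHORT_CODES = {            # hypothetical tag-to-code table
        "<table>": "@23",
        "</table>": "@24",
        "<td>": "@25",
        "</td>": "@26",
    }

    def shorten(html):
        """Substitute each verbose tag with its short code."""
        for tag, code in SHORT_CODES.items():
            html = html.replace(tag, code)
        return html

    page = "<table><td>Quarterly totals</td></table>"
    packed = shorten(page)
    print("%d bytes with tags, %d with codes" % (len(page), len(packed)))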
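
On the encryption point: the cipher has to chew through markup and
payload alike, so the wasted work follows directly from the 2:1
ratio. A sketch using the third-party cryptography package (an
assumption on my part; any symmetric cipher would show the same),
with the sizes measured from my homepage:

    from cryptography.fernet import Fernet

    cipher = Fernet(Fernet.generate_key())

    text_only = b"x" * 1463   # the one part I actually want to hide
    full_page = b"x" * 4263   # the same part plus two parts of markup

    print("ciphertext, text only:", len(cipher.encrypt(text_only)), "bytes")
    print("ciphertext, full page:", len(cipher.encrypt(full_page)), "bytes")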
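
And on the compression trade-off: zlib from Python's standard
library recovers most of the wasted bandwidth precisely because
markup is so repetitive, but it charges CPU time on both ends of
the connection. A rough sketch with artificially repetitive HTML:

    import time
    import zlib

    page = b"<table><td>cell</td></table>" * 2000

    start = time.perf_counter()
    packed = zlib.compress(page, 9)   # maximum compression level
    elapsed = time.perf_counter() - start

    print("original:   %d bytes" % len(page))
    print("compressed: %d bytes (%.0fx smaller) in %.4f s"
          % (len(packed), len(page) / float(len(packed)), elapsed))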
Unicode
Unicode is another inevitability in the world of web services. In
the beginning there was ASCII - the 7-bit coding used to represent
the English alphabet along with some special characters [2]. 7 bits
allow 2^7 = 128 permutations, enough to accommodate 127 characters
and the indispensable NULL. As the development of computers grew
beyond the borders of English-speaking societies, the need to
incorporate more characters in the basic encoding scheme led to the
development of the ISO 8-bit encoding schemes [3]. An 8-bit
encoding doubles the permutations to 256, making space for 255
characters and the NULL. There is a whole series of ISO 8-bit
character sets; ISO-8859-1, for instance, is the Latin character
set. Very soon, however, ISO encoding reached its limits too, as
the need to accommodate Asian character sets arose and it became
painful to keep adding ISO character sets while conformance proved
difficult. This led to the development of Unicode [4]. Unicode is
a 16-bit encoding, allowing us the luxury of 65535 characters and
the NULL. Assuming the average number of characters in an alphabet
to be 52, Unicode can support over 1250 distinct character sets and
still leave space for more.
As we move towards computing at a global scale, the need for
native support of multiple character sets is driving more and more
systems towards the adoption of Unicode. Already the major DBMSes
have built-in support for Unicode, and most programming languages
now support it; Perl even uses Unicode as its default encoding
scheme.
Given the multi-lingual support that is required of Web Services,
it is inevitable that the communication will be in Unicode. But
think about it: if we start using Unicode for exchanging documents
in markup languages like XML, what becomes of the problems
mentioned above? The existing ISO character sets are 8 bits wide,
and Unicode will double that to 16 bits, aggravating all of them!
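
A quick way to check the doubling claim, assuming the 16-bit UTF-16
encoding this write-up has in mind (a later alternative, UTF-8,
behaves differently and stays at one byte per character for plain
ASCII):

    text = "<table>Hello</table>"

    # 20 bytes in an 8-bit ISO encoding...
    print("ISO-8859-1:", len(text.encode("iso-8859-1")), "bytes")
    # ...40 bytes in UTF-16: every character now takes two
    print("UTF-16:    ", len(text.encode("utf-16-le")), "bytes")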
Is there a choice? The prospect of using XML with Unicode is very
appealing when it comes to the integration that Web Services
require. But the more markup-intensive we get, the more
computation and bandwidth problems we face from the deadly duo of
XML and Unicode. So do we go along with the combination, starting
out with a low QoS and hoping that, as we move on, faster systems
will come to the rescue? Will those systems come soon enough? Or
do we discard the whole idea as unviable?
References
[1]: Take a look at the ZDNet article "Fat protocols slow Web
services" and the talkback discussion.
[2]: "American Standard Code for Information Interchange", from
FOLDOC.
[3]: "ISO 8859", from FOLDOC.
[4]: "Unicode", from FOLDOC; also "The Unicode Standard: A
Technical Introduction".
[5]: "New in Unicode 3.0".
- Write-up by:
Tahir Hashmi (VSE, DAKE)
Dated: 22 Jan, 2002