HTTP Compression Speeds up the Web: Page 2: WebReference.com
HTTP Compression Speeds up the Web
What is IETF Content-Encoding (or HTTP Compression)?
In a nutshell... it is simply a publicly defined way to compress HTTP content being transferred from Web servers down to browsers using nothing more than public domain compression algorithms that are freely available.
"Content-Encoding" and "Transfer-Encoding" are both clearly defined in the public IETF Internet RFCs that govern the development and improvement of the HTTP protocol, which is the 'language' of the World Wide Web. "Content-Encoding" applies to methods of encoding and/or compression that have been applied to documents before they are requested. This is also known as "pre-compressing pages." The concept never really caught on because of the complex file maintenance burden it represents, and few Internet sites use pre-compressed pages of any description. "Transfer-Encoding" applies to methods of encoding and/or compression used during the actual transmission of the data itself.
In modern practice, however, the two have become hard to tell apart. Since most HTTP content from major online sites is now dynamically generated, the line has blurred between what happens before a document is requested and what happens while it is being transmitted. Essentially, a dynamically generated HTML page doesn't even exist until someone asks for it. The original concept of all pages being "static" and already present on the disk has quickly become an 'older' concept, and the originally well-defined separation between "Content-Encoding" and "Transfer-Encoding" has turned into a rather pale shade of gray. Unfortunately, support for "Transfer-Encoding" compression in modern Web and proxy servers is even rarer than the spotty support for "Content-Encoding."
Suffice it to say that if the goal is to compress the requested content (static or dynamic), it really doesn't matter which of the two publicly defined encoding methods is used; the result is the same. The user receives far fewer bytes than normal and everything happens much faster on the client side. The publicly defined exchange goes like this:
- A browser that is capable of receiving compressed content indicates this in all of its requests for documents by supplying the following request header field when it asks for something:
- "Accept-Encoding: " and a comma-separated list of encoding names, including (hopefully) "gzip". There are other compressions out there, like "deflate" and "compress," but only "gzip" is supported by most modern browsers. Some very new browsers even allow the user to configure which HTTP headers to send: Opera 6 allows you to explicitly set the HTTP level, and Mozilla 0.9.9 allows you to set the "Accept-Encoding" string (which may be problematic, as Mozilla doesn't understand each and every fancy encoding scheme).
- When the Web server sees that request field, it knows that the browser is able to receive compressed data in one of two formats, either standard GZIP or the UNIX "compress" format. It is up to the server to compress the response data using one of these methods (if it is capable of doing so).
- If a compressed static version of the requested document is found on the Web server's hard drive, and it matches one of the formats the browser says it can handle, then the server can simply send the pre-compressed version of the document instead of the much larger uncompressed original.
- If no static document is found on the disk that matches any of the compressed formats the browser says it can "Accept," then the server can either send the original uncompressed version of the document or attempt to compress it in "real-time" and send the newly compressed and much smaller version back to the browser.
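The exchange above can be sketched in a few lines of Python. This is only an illustration of the decision steps, not the code of any particular server: the function names, the ".gz" sibling-file naming for pre-compressed documents, and the choice to handle only "gzip" are assumptions made for the example.

```python
import gzip
from pathlib import Path


def accepted_encodings(header_value):
    """Parse an Accept-Encoding value into a set of lowercase encoding names.

    Entries carrying an explicit q=0 are treated as refused and left out.
    """
    names = set()
    for item in header_value.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        params = params.strip().lower().replace(" ", "")
        if params.startswith("q="):
            try:
                if float(params[2:]) == 0:
                    continue
            except ValueError:
                pass  # malformed q-value: keep the encoding
        names.add(name)
    return names


def choose_response(path, accept_encoding):
    """Return (body, content_encoding) for a document on disk.

    content_encoding is "gzip" or None. The "<name>.gz" sibling file is a
    hypothetical convention for pre-compressed pages.
    """
    if "gzip" in accepted_encodings(accept_encoding):
        precompressed = path.with_name(path.name + ".gz")
        if precompressed.is_file():
            # A pre-compressed static copy exists: send it as-is.
            return precompressed.read_bytes(), "gzip"
        # No static copy: compress in "real-time".
        return gzip.compress(path.read_bytes()), "gzip"
    # The browser did not ask for gzip: send the uncompressed original.
    return path.read_bytes(), None
```

A caller would pass the value of the request's "Accept-Encoding" field and, when the second return value is "gzip", add a "Content-Encoding: gzip" field to the response.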
Most popular Web servers are still unable to do this final step.
- The Apache Web Server, which has over 50 percent of the Web server market, is still incapable of providing any real-time compression of requested documents, even though all modern browsers have been requesting compressed documents and have been capable of receiving them for more than two years.
- Microsoft's Internet Information Server is nearly as deficient. If it finds a pre-compressed version of a requested document it might send it, but it has no real-time compression capability. It will, however, use precompressed files if they are available.
IIS 5.0 uses an ISAPI filter to support GZIP compression. It works as follows. The user requests a page; the server sends the page and then stores a compressed copy of it in a temporary folder. The next time a user requests the page, the server sends the copy stored in the temporary folder.
It then keeps checking that the pages in the temporary folder are current; if one is not, it fetches the current page and compresses it again.
- IBM's WebSphere Server has some limited support for real-time compression but it has "appeared" and "disappeared" from the marketplace through various release versions of WebSphere.
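The compress-once-and-reuse idea described for IIS 5.0 above can be sketched as follows. This is not how the ISAPI filter is implemented; it is a minimal illustration that assumes freshness is judged by file modification time, and the function name and cache layout are hypothetical. Unlike the behaviour described above, the sketch compresses on the first request rather than after it.

```python
import gzip
from pathlib import Path


def cached_gzip(source, cache_dir):
    """Return gzip-compressed bytes for source, reusing a cached copy.

    The cached copy is rebuilt when it is missing or when it is not newer
    than the source file (modification-time comparison is an assumption).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (source.name + ".gz")
    stale = (
        not cached.is_file()
        or cached.stat().st_mtime <= source.stat().st_mtime
    )
    if stale:
        # Compress the current page and store it for later requests.
        cached.write_bytes(gzip.compress(source.read_bytes()))
    return cached.read_bytes()
```

Using "<=" rather than "<" means a copy with the same timestamp as its source is recompressed, which costs a little work but avoids serving an out-of-date page on file systems with coarse timestamps.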
The original designers of the HTTP protocol really did not foresee the current reality, with so many people using the protocol that every single byte counts. The heavy use of pre-compressed graphics formats such as GIF, and the relative difficulty of reducing graphics content any further, make it even more important that all other exchange formats be optimized as much as possible. The same designers also did not foresee that most HTTP content from major online vendors would be generated dynamically, so there is little chance of there ever being a "static" compressed version of the requested document(s). However, it is possible to cache even dynamic content, as long as you know something about it, for example that it changes only on certain occasions rather than in real time. Public IETF Content-Encoding is still not a "complete" specification for the reduction of Internet content, but it does work, and the performance benefits achieved by using it are both obvious and dramatic.
What is GZIP?
It's a lossless compressed data format. The deflation algorithm used by GZIP (also zip and zlib) is an open-source, patent-free variation of LZ77 (Lempel-Ziv 1977). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance, length). Distances are limited to 32K bytes, and lengths are limited to 258 bytes. When a string does not occur anywhere in the previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, "string" must be taken as an arbitrary sequence of bytes, and is not restricted to printable characters.)
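To make the (distance, length) idea concrete, here is a toy Python sketch of the string-matching step only. It is not the GZIP implementation: real deflate uses hash chains to find matches and then Huffman-codes the result, while this version does a naive search and leaves the tokens uncoded. The function names are hypothetical, and the minimum match length of 3 is the value deflate uses.

```python
WINDOW = 32 * 1024   # how far back a match may start
MAX_LEN = 258        # longest match that may be emitted
MIN_LEN = 3          # shorter repeats are emitted as literals


def lz77_tokens(data):
    """Turn bytes into a list of literal byte values and (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Naive search: try every start position in the previous 32K bytes.
        for j in range(max(0, i - WINDOW), i):
            length = 0
            while (length < MAX_LEN
                   and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_LEN:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_expand(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                # Copy byte by byte so a match may overlap its own output.
                out.append(out[-distance])
        else:
            out.append(token)
    return bytes(out)
```

For the input b"abcabcabcabc" the encoder emits three literals followed by the single pair (3, 9): "go back 3 bytes and copy 9," a copy that overlaps the bytes it is producing.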
What about Benchmarking Software?
Most standard benchmarking tools are not fully HTTP 1.1 compliant, and almost none of them are capable of handling IETF Content-Encoding. If you use a standard HTTP benchmarking program that does not include the "Accept-Encoding:" header with at least the gzip operand, then the server will not (as per RFC standards) actually send any compressed data. Some benchmarking programs do not supply the "Accept-Encoding:" request field by default but do allow you to add it yourself via a command line parameter or special configuration file. Check the documentation for the benchmarking program itself. Everything will still work without the "Accept-Encoding:" field in the request, but the benchmark won't tell you much since it won't actually be receiving anything compressed. If you need a benchmarking or testing tool to measure the compression performance on your system and you don't have one that is capable of doing so... contact Hyperspace Communications Inc. They have developed custom versions of just about all major load generating and HTTP benchmarking tools that are capable of requesting and receiving standard IETF Content-Encoding(s).
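For a quick check without a full benchmarking tool, a small Python script can send the "Accept-Encoding:" field itself and compare the bytes received with the decoded size. This is a rough sketch for spot checks, not a load generator; the function names are hypothetical, only "gzip" is decoded, and any URL you pass in is your own.

```python
import gzip
import urllib.request


def make_request(url):
    """Build a GET request that advertises gzip support."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Undo gzip Content-Encoding; other encodings are returned untouched."""
    if content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch_sizes(url, timeout=10.0):
    """Return (bytes_received, bytes_after_decoding) for one GET of url."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as resp:
        wire = resp.read()  # urllib does not decompress automatically
        encoding = resp.headers.get("Content-Encoding", "")
    return len(wire), len(decode_body(wire, encoding))
```

If the two numbers returned by fetch_sizes are equal, the server did not compress the response, either because it cannot or because the document type is excluded from compression.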
Download the Free Apache mod_gzip Module
You can try HTTP compression on your site with Hyperspace Communications' Apache gzip module! mod_gzip was originally authored by a company named Remote Communications, Inc. RCI was purchased by HyperSpace Communications Inc., and HCI is responsible for maintaining the websites. Contact HCI for more details about mod_gzip. Remote Communications released the code into the public domain; it was the first ever module for the Apache Web Server to accelerate/compress data on the fly. It is available for Windows, Linux, and Solaris, with full source code included. The current version compresses dynamic output (from PHP, CGI, Perl, SSI, EXE files, etc.).
- mod_gzip home page (currently mod_gzip 1.3.19a for Apache 1.3)
- mod_gzip 2.0.26a - (experimental, for Apache 2.0)
Revised: April 20, 2002