(CONVERTED FROM HTML)
---------------------
Who Killed Gopher?
An Extensible Murder Mystery
By Rohit Khare // December 23, 1998
-----------------------------------
As best our forensics team can reconstruct, a serial killer first surfaced in
the beginning of 1991, born of an academic's midnight hack gone awry. NeXTstep
users, you see -- precisely the kind of object-oriented revolutionaries that
would think they could get away with inventing a brand new protocol for a
mature Internet.
The first victim was the campus phonebook: it dismembered the whole corpus
into tiny bits of `hypertext', left hanging together by search strings alone.
Here we see its modus operandi: not content to slice information into neat
lists or trees, it left behind completely unstructured graphs littered with
redirects hither and yon across the Internet to co-conspirators' servers.
Hegemonic fantasies drove it to assimilate more and more multimedia formats at
each node. Crude at first, shoveling any old file through an 8-bit clean
channel, it grew more explicit until it could label text, images, audio, even
compound documents. By now, it can even masquerade behind several renderings,
negotiating whatever face -- or language! -- pleases the victim.
As a protocol, it was a blunt weapon: aim a socket at the target, send a
one-line demand, and slurp down the response until the connection died. The
crazy way it grew from there is proof itself that the killer dropped out of
Application Layer Protocol design school at an early age it shows no
understanding of the basics of TCP, botching handshakes, slow-start,
server-sends-first-byte, even tripping over the Nagle timer. Instead, in a
stroke of streetwise punk genius, it stayed 'simple' enough that any two-bit
Telnet client could take a hit.
And what an addictive product it was! Workgroupies were seduced by the ease of
installing new servers and authoring new content -- and pushing free software
helped the lock-in cycle along. The killer preyed on individual users'
self-esteem by promising instant fame*, in the form of a backlink from the
master list of servers.
*Anyone have firm attribution for the famous quote,
"On the Internet, everyone will be famous for 15 people"?
Soon, it was assimilating more than bags of bits: it was a gateway drug to
harder services. Scripts, database lookups, even long-established news, file,
and email access transactions were reduced to mere documents in a menu. The
address sent in the 'demand' grew into a miniature RPC.
All the while, with every new user, every new server, every nugget of
information it wrested from 'legacy' protocols, the killer bled off a little
more of the Internet's bandwidth -- getting away with murder with every
handshake!
Such profligacy is no idle threat: for a while, the killer was the largest
share of application packet traffic on the national NSFNET backbone.
Or was that the victim?
Did the killer come knocking on port 70 or port 80? From boombox.micro.umn.edu
or info.cern.ch? Bearing the imprimatur of campus administrators or
physicists? Are we hunting gophers or spiders?
Dial 'H' For Murder
-------------------
In fact, the profile in Table 1 fits both suspects to a 'T'. In the early
'90s, an interesting drama played out between two contenders for the first
integrated, interactive internet information retrieval protocols -- neither
of which, arguably, had any technical reason to exist.
That's not to say the systems weren't inevitable: everyone suspected there'd
have to be an easier way to use the Internet in the '90s than UNIX shell
experience. That encompasses innovations in user interface, authoring formats,
and above all, resource identification. It does not, however, motivate the
original editions of the Gopher and HyperText Transfer Protocols. They were
painfully trivial hacks that did nothing FTP didn't already handle.
So here begins our mysterious tale: why either protocol ever rose to
prominence in the first place; and how the fratricidial drama eventually
played out. Understanding how HTTP killed gopher may lead us to the forces
which may, in turn, topple HTTP.
Science proves a blind alley, though. Their fates were decided not on
technical merits, but on economic and psychological advantages. The
'Postellian' school of protocol design focused on engineering 'right'
solutions for core applications (batch file transfer, interactive terminals,
mail and news relays) anchored in unique transport layer adaptations
(slow-start, Nagle timers, and routing as respective examples). Our two
specimens are 'post-Postel', in their details and in their adoption dynamics.
They are stateless; they don't have (Gopher) or dilute (HTTP) the theory of
reply codes; they scale poorly, imperiling the health of the Internet; and
they are 'luxuries' for publishing discretionary information, not Host
Requirements which must be compiled into every node.
Burrowing into Gopher
---------------------
While the Web was arguably invented a few months before Gopher, the latter was
the first to garner public notice. By the fall of 1994, RFC 1689's census of
information retrieval tools estimated there were 4,800 Gophers, and 1,200
anonymous FTP archives, but only 600 Webs. Early and "explosive" adoption had
a lot to do with Gopher's "fiercely simple" design, and hence its wide
availabilty on a number of students' (DOS, Windows, Mac, NeXT) and campus IS
departments' (UNIXes, VMS, VM/CMS, MVS) platforms.
It was designed by the microcomputer support folks at the University of
Minnesota for a teletext-like Campus-Wide Information System (CWIS). The
original prototype was a multicolumn tree-browser pioneered for the NeXTstep
file browser. Each scrolling column had a list of titles; clicking on an entry
would either display that file in the pane below, or load the child directory
listing in the column to the right.
Naturally, the messages on the wire follow directly: open a connection to the
target server on port 70; send a one-line 'selector' string; get back either
the file itself, or a list of further selector lines, each prefixed by a
character indicating the type of file it points to. Table 2 lists the
rudimentary type system Gopher shipped with, illustrating on one hand its
ambitions to assimilate new media types and on the other what a thin skein it
was around protocol- and platform-specific data formats.
The flip side of 'thin skein' is 'ease of implementation', though. In effect,
Gopher was a (very!) poor man's version of FTP, sans authentication,
navigation, authoring, or bulk/restartable transfers. It only used TCP as a
half-duplex fopen() call. It just happened that some of the "files" it
transferred were menus, in a Gopher-specific, tab-delimited, US-ASCII format.
So a Gopher server did little more than read files out of a specified
directory tree; and clients could be hacked out of spare Telnet client code.
A little further up the evolutionary tree, servers allowed scripts and other
programs to be served as well. Rather than copying a file back across the
socket, it connected it to some process's output. Similarly, multiprotocol
clients could recognize a non-Gopher selector and switch to another language
(as for phonebook queries). "Gateways exist to seamlessly access a variety of
non-Gopher services such as ftp, WAIS, USENET news, Archie, Z39.50 (1992 rev),
X.500 directories, Sybase and Oracle SQL servers, etc."
The base protocol didn't leverage TCP well, though. For example, TCP's
three-way handshake ends with a packet from the server to the client -- so
even when the client opens the connection, the server can send data first for
free. Gopher did not have a server-hello message, and thus no version number
to key upgrades off of. To signal the end of a particular transmission, Gopher
borrowed the SMTP/NNTP convention of .. This only worked for
its own menu format, though -- as soon as an actual file was transferred, TCP
FIN had to be used as the EOF.
This left lots of socket buffers in the FIN_WAIT state at the server side,
just like the Web. At the same time, the Gopher Team was planning to scale to
Internet-wide use. They did plan for redundant backup servers: a '+' line was
an alternate for the selector on the previous line. They maintained a master
list of publically-accessible servers, which was in turn indexed regularly by
Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives,
by analogy to the Archie FTP server index).
In the long-run, though, no one ever plastered Gopher selectors on shirts and
lunch trucks and golf tees...
Spinning the Web
----------------
"The key insight I credit Tim Berners-Lee with, is the URL: the idea that
there's a Uniform Resource Locator that says I can point at any bit of
information on the Internet." -- some obscure Web scholar in Stephen
Segaller's new book Nerds 2.01: A Brief History of the Internet
If anything, Tim's original Web manifesto was even more ambitious than
Gopher's. Its hook for assimilating "the entire universe of network-accessible
information" was the URI, which is able to subsume entire namespaces as merely
one more scheme (telnet:, mail:, and so on). URIs reduced operational
instructions on how to access a resource to a declarative address. For
example, contrast the one-line ftp: URL with the MIME message/external-body
part for FTP access -- logins, passwords, alternate sites, alternate
directories, alternate formats. As RFC 1630 intended "URIs may, if necessary,
be passed using pen and ink."
But while URIs were the innovative essence of the Web, the other two legs of
the triad were less steady. HTML is merely a markup language which makes it
natural to embed URIs as hyperlinks. URIs could just as well have been
popularized as a pointer format within Rich Text Format, PostScript, LaTeX, or
any other language. HTTP is merely a lookup protocol which makes it natural to
resolve http: URLs or gateway to other schemes. URIs could just as well have
been popularized as a naming interface to existing clients for FTP, Gopher,
News, etc.
Somehow, though, HTTP/0.9 took root. It is a famously 'simple' protocol,
arguably even less useful than Gopher. Here is its entire grammar:
Simple-Request = "GET" SP Request-URI CRLF
Simple-Response = [ Entity-Body ]
That's right: it only supported one method, GET, and required a new TCP
connection for every transaction. Even the 1.0 edition, which eventually added
authentication and navigation and simultaneous transfers of several files for
a single "page", had no excuse to be born when FTP was on the shelf. Except
that HTTP had broken authentication, nonstandard conventions for navigating
directories and welcome/index/home pages, and ran spectacularly afoul of
slow-start for each of its typically-small file transfers.
If HTTP/0.9 was really shipped for no better reason than that programming
separate FTP-Control and FTP-Data channels seemed too hard, why did it
survive? Many of the earliest "web" sites were, in fact, just ftp: URLs. The
saving grace was 1.0's new metadata headers. It augmented the one-line request
with additional fields for authentication, arguments modifying the method, and
information about the user-agent's preferred languages and formats. The entity
bodies it returned were prefixed by MIME Content-Types, last-modified dates,
base-URI, and more. This information, finally, was what took HTTP beyond the
realm of file transfer to hypertext transfer.
The headers are how HTTP's adoption cycle diverged from Gopher's. Both servers
were simple to install and author content for -- there was even a hybrid gn
server which could vend files over both port 70 and 80. There were kudos for
new sites, in the form of the Web Virtual Library project at CERN, and later
W3C, and early search engines like Brian Pinkerton's NeXTstep
DigitalLibrarian-derived WebCrawler. The client was the defining wedge.
Graphical web browsers leveraged rich HTML text with MIME media type headers
to display graphical images, to launch helper applications, and present
multilingual content.
Extensibility: the fingerprint of a killer
------------------------------------------
The straw that broke the Gopher's back wasn't even a feature of HTTP per se.
The synergy of HTML and HTTP in the graphical browser began with FORMs,
through refresh META tags, up through scripting and applet OBJECTs. While one
protocol stayed lean, the other opened up so much headroom that it grew into a
beast so complex even today's base 1.1 specification takes 160+ pages.
An avalanche of new headers modified GET to only retrieve the full entity
if-modified-since an existing copy, or only a specified byte-range; enhanced
security with keyed-digest passwords; even reverse-engineered state back into
HTTP using 'cookies.' HTTP reply codes were added for payments, for cache
validation, for upgrading to new protocols.
Entirely new functionality could be grafted on HTTP using new methods. WebDAV
added LOCK and PUT; collections which allowed MOVE and COPY on entire swaths
of URLspace; and even went beyond new headers to encoded additional parameters
in XML. Conversely, HTTP was borrowed wholesale for new protocols for printing
(IPP), multiparty call setup (SIP), and event notification (RVP).
Politically, any server or client could inaugurate a new method; third-parties
could even thread themselves into the proxy chain and filter messages in
transit. Proxies exist to add public annotations (Crit-Link Mediator),
anonymize access (Lucent Personal Web Assistant), convert file formats
(ProxiNet for PalmIIIs), language translation (AltaVista's BabelFish can be so
hacked), and content filtering (PICS). Decentralized extensiblity was key to
HTTP's adoption. Rather than waiting for the IETF standards process to revise
a canonical table of one-letter content codes, for example, MIME headers could
deploy new content-types on the spot.
At the same time, uncontrolled divergence reduces the value of the Web
platform as the probability of compatible extensions matching up declines. New
versions of HTTP would appear to be one solution, but there is little
political will to empanel a standing working group to review a grab-bag of
extensions for 1.2, 1.3, ... 39.7 and so on. What is neccessary? Some hook for
identifying extensions, the level of support required, and which parts of
URLspace are so extended.
W3C has been promoting such mechanisms for years -- three calendar years! --
to better "Lead the [Controlled] Evolution of the World Wide Web." At first,
the motivating scenarios were complex economic and social negotiations
(payments, content selection, privacy, cryptographic algorithms, &c). Early
revisions of PEP advertised preference lists of alternative extensions,
applied in pipelined sequence. PEP could also transport metadata about which
extensions applied to other resources and required orders.
Though PEP remained a W3C experiment, the basic ideas of 1) packaging an
entire extension -- methods, headers, encodings and all -- behind a single
name and 2) indicating clearly if an extensions was required or optional for
3) the next hop or end-to-end were reincarnated as the Mandatory proposal. The
current draft defines four header fields enumerating the URIs of the
extensions which apply (Man: and Opt: for client and server; C-Man: and C-Opt:
for the next hop). To interoperate with the installed base, any request with
mandatory extensions prefixes the method name with 'M-', which forces 1.0 and
1.1 servers to fail with "501 Not Implemented."
Transport: the Achilles' heel
-----------------------------
By April 1995, Web usage passed Gopher's share as well as every other Internet
application protocol as a fraction of NSFnet backbone usage. Even patching its
bandwidth leaks with 1.1's persistent connections can't mask HTTP's
inefficient waste of Internet resources. Its content base has grown so immense
that caching alone can't reduce bandwidth demand by a significant margin for
the average user or ISP. Former strengths have now become liabilities: full
'English' headers, stateless transactions, full enumeration of client
capabilities once critical for debugging now inflate request packets while
progressive rendering of composite web pages places a premium on simultaneous
downloads (aggravating the share of packets wasted on handshakes and delayed
by slow-start).
Individually, these phenomena have appeared in application protocols before --
and were nipped in the bud by active involvement by transport protocol folks,
or coevolved with TCP. When Telnet generated a packet per keystroke, even on
links where you could type faster than the ACKs returned, the Nagle timer
throttled TCP down to batching keystrokes into a single packet per round-trip.
When file transfers immediately swamped LANs, slow-start eased up to limit
congestion.
So when HTTP presents its demand for many small transfers in parallel
immediately, there are solutions on the shelf. Some vendors are already
marketing "accelerators" which rely on matched encoders at server and client,
like Sitara's SpeedServer. Unfortunately, these approaches aren't backward
compatible with HTTP as-is -- nor as simple to code.
W3C's HTTP-NG (Next Generation) project sees it as a three-layer problem:
message transport, remote invocation, and "The Classic Web Application"
(TCWA). At the bottom, they propose a multiplexing protocol which combines all
the traffic destined for a single web server into a single TCP connection.
This allows the conversations regarding individual transfers to be diced up
independently: an intelligent server might respond with the first few critical
bytes of all the images in the first packet. A fully specified W3Mux layer
still needs to address flow control (gating the relative priority of the
subchannels without starving them), simultaneous open (which TCP does allow),
and lots of interoperability testing.
The middle layer 'compresses' the data to be transferred by marshalling
arguments in native data types rather than English. So numbers would be sent
as integers; dates as binary fields; and strings as reusable tokens. The tip
of their proposal (reaching into the cloudy heights of improbability, some
critics might say) is reengineering the Web as a distributed object system,
modeled as operations on data structures rather than documents.
Building the Perfect Beast
--------------------------
Having slain Gopher (and a host of other competitors), the Web appears to be
expanding in all directions today. Just as the SMTP state machine defined a
style or protocol design for a generation, every new application these days
seems to ape HTTP's state-less style -- if only to piggyback a free ride
through firewalls at port 80. The Internet Architecture Board is on the verge
of issuing guidelines to rein in this tendency. First, the HTTP port should be
reserved for protocols which manipulate the kind of information already in web
servers. WebDAV is OK, since it extends documents; IPP would not, since it
uses POST as a barn door for its printing procedure calls. Second, HTTP should
not be referenced by-copy. Lots of other efforts want to 'borrow' a few
headers for authentication or navigation and end up copying text into other
specs, raising the scepter of divergent versions. Third, HTTP is not a valid
precedent for MIME users -- it plays fast and loose with certain rules which
enable absolute interoperablity through mail gateways. HTTP allows bare
and as well as inventing a new Content-Encoding: process.
The IAB guidelines are a reminder that HTTP still occupies a limited niche in
the space of possible Transfer Protocols. True, its message format can encode
any MIME entity, even live data streams -- but only from the server to the
client. True, its address format can point at anything -- but even then it
can't encode whether the resource is expected to be secured, paid for, or
otherwise reprocessed. And it is most certainly restricted to one-to-one,
synchronous, request-response (client-pull) interaction. All the various hacks
to 'push' data to groups of subscribers are, well, hacks: automatic polling,
shared caches, and "we will mail your response within two working days".
The moral, then, for would-be sucessors to HTTP is to pick another niche and
follow the network effects. Internet-scale event notification, for example,
requires true push and subscriber management. Current efforts to shove
presence information through HTTP may succeed, but any larger shift to
event-based integration could catalyze a new, symmetric protocol which allows
servers to initiate connections back to clients. Simply compressing existing
HTTP traffic without adding new functionality faces a powerful entrenched
competitor. Adding programmability, whether betting on HTTP-NG or the
Mandatory scheme, is an attempt to harness network effects between all the
communities that have been angling to extend HTTP to their application. Yet,
the rewards for cooperation are spread thin: only the server which
simultaneously supports HTTP, WebDAV, and Internet Printing Protocol would
gain from a common syntax for those extensions.
And now we are left back where we started: the mystery of protocol adoption.
Reuse, extensiblity, simplicity are all tricks to reduce the cost of
implementing a new protocol -- but that doesn't mean the market as a whole
will make the most efficient decision. If you're a Gopher, it may just bury
you :-)
The morbid conceit of this article was inspired by the first-rate book Aramis
by French historian of technology Bruno Latour. It told the tale of two
decades of R&D on a Parisian peoplemover system in the first person -- the
train itself speaking of its struggle to become, and its death.
Table 1. Fingerprints of a murderer -- Spider or Gopher?
--------------------------------------------------------
Vitals
Academic NeXTstep hack, about 8 years old; birthmark at port 70 or 80
First Victim
Notorious primary use was for mere phonebooks; gatewayed queries to
existing campus database and represented results as hypertext nodes
Disguise
Began without any meaningful type system at all, but soon grew past
file extensions to vend multimedia information, even negotiating
formats
Modus Operandi
Open a new connection, send a one-line demand for the information at a
given address, then receive the response until connection dies
Come-on
Infiltrate small workgroups with small, easy, and free tools,
promising acceptance with a simple `backlink' from a master list of
servers
Serial Victims
Reaching beyond the degenerate case of static information, its
addresses became entry points to scripts, databases, and other
`stateless' servers
Motive
Even its first manifesto laid out hegemonic plans: to incorporate, by
reference or by proxy, every other Internet information service.
Table 2. Gopher's rudimentary type system used character codes to prefix
selector lines. Gopher+ later added content negotiation.
------------------------------------------------------------------------
0 File
1 Directory
2 CSO phone-book server
3 Error
4 BinHexed Mac file
5 DOS binary file
6 Uuencoded Unix file
7 Index-Search server
8 Text-based telnet session
9 Binary file
+ Redundant server
T Text-based tn3270 session
g GIF image file
I Any image file
Table 3. RFCs and Internet-Drafts discussed in this issue.
----------------------------------------------------------
_RFC _Date _Title and Comments
1436 Mar 1993 The Internet Gopher Protocol
Described a `fiercely simple' request-response
protocol and menu format
1630 June 1994 Universal Resource Identifiers
Raised the stakes from Gopher's own selectors
to any namespace
1689 Aug 1994 Status Report on Networked Information Retrieval:
Tools and Groups
A wonderful archaeological census of competing
protocols, many now extinct
1945 May 1996 HyperText Transfer Protocol 1.0
Also defined the more primitive 0.9 GET. Even
1.0 only added HEAD and POST
2068 Jan 1997 HyperText Transfer Protocol 1.1
A vastly expanded contender with better
caching, security, and extensibility
2227 Oct 1997 Simple Hit-Metering and Usage-Limiting for HTTP
Entire RFC defined a single Meter:
2396 Aug 1998 Uniform Resource Identifiers: Generic Syntax
The deceptively small definitive grammar took
years to hammer out
TBA Dec 1998 Extensions to HTTP for Distributed Authoring
Adding locks, collection-resources, and
namespace operations took major reengineering
Draft Nov 1997 PEP -- an Extension Mechanism for HTTP
`Protocol Extension Protocol' had negotiation
policies for pipelining message processors
Draft July 1998 HTTP State Management Mechanism
Cookies took three years from a quick, risky
hack to a reasonably secure mechanism
Draft Aug 1998 On the use of HTTP as a Substrate for Other Protocols
Official -- conservative -- IESG and IAB
guidance on port assignment, URI schemes, and
HTTP security
Draft Sep 1998 HTTP Authentication: Basic and Digest
A companion specification for nonexistent
(UUencoded passwords!) and new security.
Draft Nov 1998 HTTP Extension Framework
Pared down to headers which label `Mandatory'
and optional extensions of each message
Draft Nov 1998 Upgrading to TLS Within HTTP/1.1
Using the Upgrade: for Transport Layer
Security, in lieu of a dedicated port like
https: at 443