Wide Area Information Servers (WAIS) over Z39.50-1988 and Beyond

by Margaret St. Pierre, WAIS Incorporated
February 1994, ConneXions, The Interoperability Report

Introduction

The network publishing system, Wide Area Information Servers (WAIS) [1], is designed to help users find information over a computer network. The principles guiding the design of the WAIS system are:

* A wide-area, network-based information system for searching, browsing, and publishing.
* Based on standards.
* Easy to use.
* Flexible and growth oriented.

From the systems that have grown out of these principles, a large group of developers, publishers, standards bodies, libraries, government agencies, educational institutions, and users have been enjoying the benefits of shared information.

WAIS development began in October 1989, with the first Internet release occurring in April 1991. From the beginning, the goal of the WAIS network publishing system was to create an open architecture for information retrieval by using a *standard* computer-to-computer protocol. The underlying protocol was based on the 1988 Version of the NISO (National Information Standards Organization) American National Standard Z39.50 Information Retrieval Service Definition and Protocol Specifications [2]. The WAIS implementation is still in use today, with over 50,000 users of Z39.50-1988 on the Internet and an even greater number of users gaining access through a suite of tools and services such as Gopher, World Wide Web, and America OnLine gateways.

The Z39.50-1988 standard originally grew out of the library community as a search and retrieval protocol for bibliographic data. The design goals for the WAIS system required a more general information search and retrieval protocol. By working with the Z39.50 Implementor's Group (ZIG), WAIS developers used a recommended subset of Z39.50-1988 and a specific set of assumptions to fulfill these requirements.
Over time, many of these requirements have since gone into the definition of subsequent versions of Z39.50 [3,4].

This article describes the subset of Z39.50-1988 used and the additional assumptions made to meet the design goals of the WAIS network publishing system [5]. The new development activity taking place on the next generation WAIS systems is also presented.

Concepts

The WAIS architecture has four main components: the client, the server, the database, and the protocol. The WAIS client is a user-interface program that sends requests for information to local or remote servers. Clients are available for most popular desktop environments. The WAIS server is a program that services client requests, and is available on a variety of UNIX platforms. The server generally runs on a machine containing one or more information sources, or WAIS databases. The protocol, based on Z39.50, is used to communicate between WAIS clients and servers.

The WAIS system performs two basic operations: search and retrieval. Each operation is a two-step transaction made up of a request sent by the client to the server followed by a response sent by the server back to the client.

A search request primarily contains information on what database to search and the corresponding query, where the query may contain a natural language question, a Boolean expression, and a set of documents for use in relevance feedback. Relevance feedback is the ability to select a document, or portion of a document, and search for a set of documents similar to the selection. A search response is composed of a relevance-ranked list of WAIS Citations, where each citation contains summary information associated with the document. A WAIS Citation provides enough descriptive information on the document for a user to determine if a full retrieval is desirable.

A retrieval request mainly contains a document identifier [6], which fully specifies the name and the location of the requested document.
Finally, the corresponding retrieval response returns the desired document.

Design Goals

As an aid to understanding the original WAIS implementation and its use of Z39.50-1988, the historical design goals of WAIS are presented in this section. Each design goal is accompanied by a brief description of the Z39.50 assumptions, the details of which are described in the next section.

* *Provide users with access to bibliographic and non-bibliographic information, including full-text and images.* Because Z39.50-1988 grew out of the bibliographic community, additional assumptions about the protocol were required to serve non-bibliographic information. These assumptions were also necessary to serve documents existing in multiple formats (e.g., rtf, postscript, gif).
* *Keep the search query simple and independent of changes in server functionality.* Most client implementations of Z39.50 parse the user's query into a Type-1 RPN (reverse Polish notation) query where each term in the query is associated with a set of bibliographic attributes. For WAIS queries, a new Type-3 query type was assumed, which eliminated the need for the client to parse the query. The client was also freed from the responsibility of knowing what bibliographic attributes were supported by the server.
* *Provide relevance feedback capability.* Since relevance feedback is specified in the search query, relevant documents were assumed to be part of the new Type-3 query type.
* *Permit the server to operate in a stateless manner.* In Z39.50, a search results in the creation and maintenance of a Result Set on the server, where subsequent retrieval requests are made with respect to this Result Set. For the server to operate statelessly in a WAIS system, an alternative approach was required that eliminates the need to maintain Result Sets.
* *Provide the ability for a client to retrieve documents in pieces.* Because retrieval of a portion of a document could be done in several ways with Z39.50-1988, assumptions were made to implement this functionality. Accessing a portion of a document was required for both retrieval and relevance feedback.
* *Run over TCP.* The Z39.50-1988 standard was designed to run in the application layer using the presentation services provided by the OSI (Open Systems Interconnection) Reference Model. Due to the popularity of TCP/IP and the Internet, WAIS was designed to run over TCP. Use of Z39.50 over TCP is described in [7].

The Protocol

WAIS supports the Init and Search Services of Z39.50-1988, where each Service is made up of a request from the client followed by a response from the server. To meet the stated design goal of maintaining a stateless server, both the WAIS search and retrieval functions are implemented using the Z39.50 Search Service. The Z39.50 Present Service is not used for retrieval, since it requires that the server maintain state, or Result Sets, between operations.

Because the Z39.50 Search Service request contains a query and a corresponding query type, a WAIS search is distinguished from a WAIS retrieval by the query type. A WAIS search is implemented using the newly defined Type-3 query, and a WAIS retrieval is implemented using a Z39.50 Type-1 query.

A WAIS search is initiated by the client with a Z39.50 Search Service Request APDU (Application Protocol Data Unit) using a Type-3 query. The query contains two main fields: the seed words and a list of document objects. The *seed words* contain the text typed by the user. A *document object* refers to a full document, or portion thereof, to be used in relevance feedback. Each document object contains a document identifier, type, chunk-code, and start and end locations. The document identifier and type specify the location and format, respectively, of the document.
The chunk-code determines the unit of measure for the start and end locations. Examples of chunk-codes used include byte, line, paragraph, and full document. If the chunk-code is a full document, the start and end locations are ignored.

A Search Service Response APDU returned by the server contains a relevance-ranked list of records, or WAIS Citations. A WAIS Citation refers to a document on the server. Each WAIS Citation contains the following fields:

* Headline - a set of words that convey the main idea of the document.
* Rank - the numerical score of the document based on its relevance to the query, normalized to a top score of 1000.
* List of available formats - e.g., text, postscript, gif.
* Doc-ID - the document identifier for the document.
* Length - the length of the document in bytes.

The number of WAIS Citations returned is limited by the preferred message size negotiated during the Init Service.

A WAIS retrieval of a document is initiated by the client with a Search Service Request APDU using a Type-1 query. The query contains up to four terms: the Doc-ID, a document format, the start location, and the end location. The Doc-ID is obtained from the WAIS Citation sent to the client during a previous WAIS search, and the document format is selected by the user from the list of available formats supplied in the WAIS Citation. Because full-text documents and images are often larger than the receive buffer of the client, clients are designed to optionally retrieve documents in chunks, specifying the start and end positions of the chunk in the query.

The Z39.50 Use and Relation bibliographic attributes taken from the Bib-1 Attribute Set are used to distinguish each term in the Type-1 query.
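
The chunked retrieval described above can be illustrated with a minimal Python sketch. It is not taken from any WAIS implementation: the `Citation` class, the `byte_chunks` helper, and the sample values are hypothetical names introduced here, and the sketch assumes that chunks are measured in bytes with an inclusive start position and an exclusive end position. It shows how a client might use the Length field of a WAIS Citation to plan a series of retrieval requests that each fit within its receive buffer.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Citation:
    """Hypothetical container for the WAIS Citation fields listed above."""
    headline: str
    rank: int            # relevance score, top score normalized to 1000
    formats: List[str]   # available formats, e.g. ["text", "postscript"]
    doc_id: str          # document identifier returned by the server
    length: int          # document length in bytes


def byte_chunks(length: int, buffer_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte positions covering a document of `length` bytes.

    Assumption: start is inclusive and end is exclusive, so consecutive
    chunks neither overlap nor leave gaps.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    start = 0
    while start < length:
        end = min(start + buffer_size, length)
        yield (start, end)
        start = end


# Illustrative values only; a real Doc-ID would come from a prior search.
citation = Citation("Example headline", 1000, ["text", "postscript"], "doc-1", 4500)
ranges = list(byte_chunks(citation.length, 2000))
print(ranges)  # [(0, 2000), (2000, 4000), (4000, 4500)]
```

Each (start, end) pair would become the start and end location terms of one retrieval query, so the client never asks for more data than its buffer can hold.
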
The Use and Relation Attributes associated with the terms in the WAIS Type-1 query are specified as follows:

* Doc-ID - Use: system-control-number, Relation: equal
* Document format - Use: data-type, Relation: equal
* Start location - Use: paragraph, line, or byte, Relation: greater-than-or-equal
* End location - Use: paragraph, line, or byte, Relation: less-than

The Use Attributes of data-type, paragraph, line, and byte are not part of the Z39.50 Bib-1 Attribute Set, and are assigned the unique codes "wt", "wp", "wl", and "wb", respectively. An example of a fully specified retrieval query is:

    query = ( ( Use = system-control-number, Relation = equal, term = ) AND
              ( Use = data-type, Relation = equal, term = postscript ) AND
              ( Use = byte, Relation = greater-than-or-equal, term = 0 ) AND
              ( Use = byte, Relation = less-than, term = 2000 ) )

A retrieval response is issued by the server with a Search Service Response APDU. In this case, a single record corresponding to the requested document, or portion thereof, is returned in the specified format.

The Next Generation

Since the first release of the WAIS system, the Z39.50 standard has been significantly enriched and its popularity has increased considerably, as evidenced by the growing numbers of registered implementors, nationally and internationally. Participation in Z39.50 has expanded to include not only representatives from the library community, but also representatives from government agencies, educational institutions, and the commercial sector.

A number of new standards have emerged to meet the evolving needs of the networked information retrieval world. These new standards serve to complement Z39.50. For example, document identifiers can be specified by the URI (Universal Resource Identifier) standards [8,9]. Document formats could be given using MIME (Multipurpose Internet Mail Extensions) Content Types [10]. Languages and character sets also have corresponding standards [11,12].
Commercial systems require additional standards, such as security and authentication [13]. For wide-area acceptance by the masses, a complete information retrieval system should make use of a number of these open standards.

Activity is underway to develop the next generation WAIS systems based on these new standards. At the core of the next generation WAIS systems is the WAIS Profile of Z39.50-1992, which was approved by the OIW (Open Systems Environment Implementors Workshop) SIGLA (Special Interest Group in Library Applications) in December 1993. It specifies full conformance with Z39.50-1992, and requires use of the versatile Z39.50 Generic Record Syntax [14]. Also built into the Profile is the flexibility to use many of the newly emerging standards.

This next generation work is based on the same guiding principles as the original WAIS network publishing system: a wide-area, network-based information system for searching, browsing, and publishing, based on standards, easy to use, and flexible and growth oriented.

References

[1] "Information Service for Corporate Users: WAIS", Brewster Kahle and Art Medlar, ConneXions, The Interoperability Report, Volume 5, Number 11, November 1991.

[2] National Information Standards Organization (NISO). American National Standard Z39.50, Information Retrieval Service Definition and Protocol Specifications for Library Applications, New Brunswick, NJ, Transaction Publishers; 1988.

[3] ANSI/NISO Z39.50-1992 (version 2) Information Retrieval Service and Protocol: American National Standard, Information Retrieval Application Service Definition and Protocol Specification for Open Systems Interconnection, 1992.

[4] "Z39.50 Version 3: Draft 8", October 1993. Maintenance Agency Reference: Z39.50MA-034.

[5] Internet Draft, "WAIS over Z39.50-1988", Margaret St. Pierre, Jim Fullton, Kevin Gamiel, Jonathan Goldman, Brewster Kahle, John A. Kunze, Harry Morris, and Francois Schiettecatte, November 1993.
[6] "Document Identifiers, or International Standard Book Numbers for the Electronic Age", Brewster Kahle, Thinking Machines Corporation, see URL=, September 1991.

[7] Internet Draft, "Using the Z39.50 Information Retrieval Protocol in the Internet Environment", Clifford Lynch, November 1993.

[8] Internet Draft, "Uniform Resource Locators", Tim Berners-Lee, July 1993.

[9] Internet Draft, "Uniform Resource Names", Chris Weider and Peter Deutsch, October 1993.

[10] RFC 1521, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", N. Borenstein, N. Freed, September 1993.

[11] Library of Congress, USMARC Code List for Languages, 1989.

[12] RFC 1345, "Character Mnemonics & Character Sets", K. Simonsen, June 1992.

[13] RFC 1510, "The Kerberos Network Authentication Service (V5)", J. Kohl and C. Neuman, September 1993.

[14] Special thanks to John A. Kunze of the University of California, Berkeley, for his pioneering efforts on the development of the Generic Record Syntax.

For more information on WAIS, see URL=. Internet mailing lists on WAIS include wais-talk@wais.com and wais-discussion@wais.com; there is also the newsgroup comp.infosystems.wais.

For freeware WAIS clients and servers, see URL=. For additional information on freeware WAIS, contact CNIDR (Clearinghouse for Networked Information Discovery and Retrieval), 3021 Cornwallis Road, Research Triangle Park, NC, 27709, (919) 248-1499, or send questions to freewais@cnidr.org.

For information on commercial WAIS products and services, contact WAIS Incorporated, 1040 Noel Drive, Menlo Park, CA, 94025, (415) 327-WAIS, or send questions to info@wais.com.