
1 Introduction

Information storage and retrieval systems (catalogs, commercial online services, bibliographic utilities, CD-ROM products, etc.) have developed rather empirically as demand and enabling technology have evolved. Filtering systems (selective dissemination of information (SDI) systems, e-mail filters, etc.) have also developed pragmatically.

Hitherto, these two types of selection systems were usually developed and built in relative isolation: a local retrieval or filtering system designed to work with a specific dataset in a particular domain for specific users and interests, and employing or assuming various kinds of external knowledge (thesauri, etc.). Perhaps the practical consideration that such systems have to be complete in order to work at all has encouraged an emphasis on the construction and comparison of complete systems rather than on the individual components (or subsystems) of such systems.

1.1 Purpose

Three considerations now encourage detailed analysis of the components of selection systems:

Academic curiosity: Can all information storage and retrieval systems (or, better, all selection systems) be viewed as composed of a common set of components? If so, what are they and how many are there? Which are necessary and which are sufficient?
The recent Text REtrieval Conferences (TREC) [Harman 93] have provided a welcome revival of interest in the comparative evaluation of retrieval and filtering systems. We suggest, however, that there are significant limits to the benefits that can be derived from comparing whole, complete systems. Sooner or later, the advanced design and evaluation of selection system performance also requires the systematic comparative evaluation of alternatives at the level of individual components within complete systems.
In the emerging network environment, selection systems have moved away from the traditional ``unitary'' model of one retrieval (or filtering) engine operating on one dataset. We now have a situation which we have called ``extended retrieval'' [Buckland n.d.]. It is easy to think of multiple retrieval engines connected to each other and to multiple databases over networks. But so simple a view begins to break down as soon as one begins to examine how extended retrieval might work: Where are the indexes, for example? Are they part of their respective databases on the server or part of the client retrieval engine? In the NISO Z39.50 Search and Retrieval protocol (cf. ISO 10162 & 10163) an EXPLAIN function is being developed to enable the client to ascertain the available options and constraints of the server. What, in principle, could the server explain about itself that might be useful to the client?

In brief, a general conceptual framework and vocabulary for the components of selection systems is needed. This paper seeks to analyze the ``anatomy'' of selection systems. Such analysis should advance the theory of selection systems: What are the components of retrieval and filtering systems? Which are the necessary and sufficient components and which are optional? What different types of components are there? Which functionally similar techniques might be substitutable within any of the components? Which might be substitutable across different, but similar components? In what different ways can the components be combined to design more sophisticated systems? Our hope is that a functional analysis of components will stimulate the design of improved selection systems.

We will first propose a basic functional model of information storage and retrieval systems and discuss its components in some detail. Next, we reduce the non-data components of the system model to two functional types, transformers and partitioners. This is followed by a generalization of the model to other similar selection tasks. Finally, we comment on some of the implications for the study, use and design of selection systems. We approach this in the context of bibliographic and text systems, but believe the approach to be of general applicability.

1.2 Terminology

Throughout this paper we will be using several words and phrases in specific technical ways.

System boundaries define what is considered the ``system'' rather than the ``environment''. Inputs flow into the system, are processed, and eventually emerge as output. If the scope of the system is expanded, i.e. additional processes become incorporated into the system, then the system boundaries are moved to include more of what was previously part of the environment.

In examining the decomposition of selection systems into their functional components, a series of processes is found: objects are processed into modified objects, which are, in turn, affected by other processes to become further modified objects. The granularity of the analysis is somewhat arbitrary: processes can typically be broken down into finer and finer subprocesses. Hence the level of analysis (the extent to which subsystems are defined) can reasonably depend on the purpose of the analysis.

A transforming operation in this context is the mapping of some procedure across each of the members of one set in order to derive a new transformed set of objects. It is necessarily a one-to-one mapping from the original set to the new set, where each member of the new set is a (possibly) modified copy of its corresponding member of the original set. A simple example of such an operation is copying. Each member of the original set is copied into a new derived set.
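A transforming operation of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transform` and the sample titles are hypothetical and do not come from the model itself. Lists stand in for sets so that each derived object stays aligned with its original.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def transform(procedure: Callable[[A], B], originals: List[A]) -> List[B]:
    """Map a procedure across every member; one derived object per original."""
    return [procedure(member) for member in originals]


# Hypothetical sample data: three document titles.
titles = ["Information Retrieval", "Selective Dissemination", "Text Filtering"]

# Copying: each member of the derived set is an unmodified copy.
copies = transform(lambda title: title, titles)

# A modifying transformation: each title becomes a list of lowercased words.
representations = transform(lambda title: title.lower().split(), titles)
```

In both cases the derived collection has exactly as many members as the original, which is the one-to-one property described above.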

At some level of generality all information selection system processes can be thought of as transformations from one state to another, but, for the present purposes, the distinction between two types of transformation appears useful:

Representation Making. Using rules to derive from a datum a corresponding, modified datum: a representation (a copy or a version) of it. Data are changed or at least copied.
Partitioning (sorting, selecting). Dividing a set of data objects into subsets according to some criterion, expressed as a query for a matching process or as an ordering rule. Data are reorganized rather than changed.

The term ``retrieval'' tends to subsume three meanings: selecting (identifying); locating (lookup); and fetching (delivery). The first meaning -- selecting (identifying) -- is what interests us here. We follow [Belkin & Croft 87], who provide a useful classification of retrieval techniques, in regarding the retrieval process itself as a matter of comparing and matching, either exactly or partially. The variety of retrieval techniques -- the form and degree of acceptable comparability -- is very large: exact match; partial match; match using truncation; fuzzy, positional, and other relationships; Boolean matches; etc. Multiple techniques can be combined and there are limitless degrees of progressively weaker matching. However, the purpose or function (as distinguished from the procedures) of this matching is to partition the stored representations into a set of subsets.
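A few of the matching techniques listed above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names and the choice of "fraction of query terms present" as a partial-match score are our own assumptions and are not drawn from [Belkin & Croft 87].

```python
from typing import Set


def exact_match(query: str, term: str) -> bool:
    """Exact match: the term equals the query."""
    return term == query


def truncation_match(stem: str, term: str) -> bool:
    """Match using right truncation: the term begins with the stem."""
    return term.startswith(stem)


def boolean_and_match(query_terms: Set[str], doc_terms: Set[str]) -> bool:
    """Boolean AND match: every query term occurs in the document."""
    return query_terms <= doc_terms


def partial_match(query_terms: Set[str], doc_terms: Set[str]) -> float:
    """Partial match (assumed scoring): fraction of query terms present."""
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)
```

For instance, `truncation_match("retriev", "retrieval")` holds although the exact match fails, and `partial_match({"text", "retrieval"}, {"text", "filtering"})` gives one half, a weaker degree of match than the Boolean AND would accept.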

In the simplest case, representations are partitioned into two subsets, retrieved and not-retrieved, as in basic Boolean systems. But there can be degrees of matching and each different degree of matching can be used to create another partition. The limit is reached in document-ranking systems in which, at least in principle, each representation is placed in a separate subset with one member. We can, therefore, while accepting that the process is a matching procedure, emphasize that it is functionally a partitioning activity. With this in mind, we can regard the formal query as being a partitioning instruction. It may sound odd to refer to information retrieval as ``partitioning with respect to relevance'' but that is an accurate statement of the intent.
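The view of a query as a partitioning instruction can be sketched in Python. The sketch rests on assumptions of our own: representations are modelled as sets of terms, the degree of match is taken to be the number of shared terms, and the identifiers `d1` to `d3` are hypothetical sample data.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Representations = Dict[str, Set[str]]


def boolean_partition(query: Set[str], reps: Representations) -> Tuple[List[str], List[str]]:
    """Two subsets: retrieved (all query terms present) and not-retrieved."""
    retrieved = [doc for doc, terms in reps.items() if query <= terms]
    not_retrieved = [doc for doc in reps if doc not in retrieved]
    return retrieved, not_retrieved


def partition_by_degree(query: Set[str], reps: Representations) -> Dict[int, List[str]]:
    """One subset per degree of match (assumed: count of shared terms)."""
    subsets: Dict[int, List[str]] = defaultdict(list)
    for doc, terms in reps.items():
        subsets[len(query & terms)].append(doc)
    return dict(subsets)


# Hypothetical stored representations and query.
reps = {
    "d1": {"text", "retrieval", "systems"},
    "d2": {"text", "filtering"},
    "d3": {"library", "catalogs"},
}
query = {"text", "retrieval"}

two_way = boolean_partition(query, reps)
by_degree = partition_by_degree(query, reps)
```

With this sample data the Boolean query yields the two subsets `["d1"]` and `["d2", "d3"]`, while partitioning by degree places each representation in a subset of its own, which is the limiting case that a ranking system reaches in principle.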

In brief, while the process may be one of matching, the function is one of partitioning and we can conclude that this is a different kind of operation from, say, copying. Sorting is logically the same as partitioning: To sort into categories is to partition into categories.
