SEARCHES IN A COLLECTION. Draft of February 29, 2000.

Michael Buckland

TAXONOMY OF SEARCHES
We assume that the searcher desires one or more documents, but does not know what, if any, suitable documents exist and, even if the existence and identity were known, the searcher does not know where copies can be found. The problem of deciding how to look can be reduced to three operational questions: How to formulate the initial query; How to re-formulate the query; and, When to stop searching? How to formulate the initial query (i) and how to re-formulate a query (ii) can both be considered to be the same task in slightly different circumstances.

We categorize searches in the following ways:

Types of search I: Instances.

(a) An "every instance" search is one in which a search is deemed satisfied when every different document with the specified characteristics have been found. Since any collection might contain an instance, an "every instance" search requires that every library be searched. In this discussion we are concerned with instances of different document (types) rather than with duplicate copies (tokens) of the same document. Ordinarily duplicate copies are not of interest, after the first, because redundant. However, seeking every copy (token) of each different document (type) may sometimes be needed. We call this a "census search". A traditional example is the compilation of a census of all known copies of particular kinds of early printed works. An example that becomes more feasible with digital libraries (and more needed because of version control problems) would be locating all known copies of document in order to make correct or harmonized the different copies.

(b) An "any single instance" search is a search that is deemed satisfied when any instance of a document with the specified characteristics has been found.

(c) An "any N instances" search is a search that is deemed satisfied when any N instances of documents with the specified characteristics have been found. The previous search (b) "any single instance" can be treated as a special case of "any N instances" searching where N equals 1.

Although an "every instance" search is different in kind from "any N instances" searches, it, too, could also be treated as an example of an "any N instances" search if N is set high enough to ensure that all sources will be searched before N instances have been found.

(d) An "extreme instance" search is a search for whatever document has an extreme value for one or more attributes, e.g. the most recent document of some type.

Types of search II: "Good" (or "preferred") documents.

Conventional Boolean search systems in operate on primitive, unambiguous binary distinctions: Documents have been categorized with predetermined attribute values, such as being "about information retrieval" or "by Boccaccio." Of all possible values of all defined attributes, the presence of one or more is specified as a requirement for retrieval. The presence of other attribute values is, formally, a matter of indifference so far as the formal search specification is concerned. In real life searchers would like one or a few "good" documents, rather than just any one or few instances of documents with the specified characteristics. In other words, although a few randomly selected documents that satisfy the search specifications might sometimes be satisfactory, it is likely that a few good books on, say, information retrieval theory, are desired. So support for searching for "good" documents is a desirable design requirement.

We regard searches for "good" books as differing from searches for "any" one or more instances by being characterized by preferences, i.e. "I want some books about information retrieval and, given a choice, I would prefer them to be recent, in English, and by a well-reputed author." Of course, in the design of formal systems we can only be concerned with situations in which preferences can be operationally specified.

Specification.

Mechanical information systems require that a search has to be expressed formally, but, in practice, searches more complex than simple "look-up" are unlikely to be completely specified for several reasons. Not all attributes are present in the data; some attributes that are present may not be searchable; and only a limited number of forms of search are supported (e.g., right-hand truncation but not term adjacency). Complex specifications are inconvenient to formulate and searchers have a low tolerance for complexity in search specification. Empirical studies of online library catalog usage have consistently shown that functionality for specifying complex searches is little used (e.g. Norgard et al. 1993).

The final selection is ordinarily not done by the information retrieval system, but by the human searcher, picking over, with criteria that are probably subjective, ill-defined, and, possibly, unconscious, the set eventually generated by the interaction with the retrieval system. In this way limitations of mechanical retrieval systems are alleviated. Exploring how far support for this final human sorting can be incorporated into the formal retrieval system is itself a valid part of the agenda for research and development in selection systems. However, we conclude that the design task is not to try to automate this final choice, but, rather, to empower the human searcher to understand the options better.

For these reasons, we believe that the design of retrieval systems should be include support searches for "some good" documents. Searches for "some good" documents are characterized by preferences, e.g. "Given a choice, prefer a document that is recent". Hence, in addition to the binary values of conventional Boolean searching (i.e. the specification of attribute values that are either mandatory or else a matter of indifference), we assume that systems designed to support searching for "good" documents need to be able to handle a three-fold approach to attribute values:

-- Required (i.e. mandatory);
-- Conditional ("Given a choice,..."); and
-- Indifference.

Specifying a preference, a conditional requirement, means that the attribute itself is situationally significant, taking effect if and when prescribed circumstances are encountered. There may well be multiple preferences, e.g. "I'd prefer an English version and, if possible, the 1934 London edition." So a conditional, adaptive approach appears to be particular useful:

The population of (more or less) desirable documents in a collection is ordinarily unknown and difficult to predict. If prediction were easy, then it might not be difficult to formulate a more or less restricted binary Boolean search that would yield a set of approximately the desired size of retrieval set. There is substantial empirical evidence that difficulty in predicting the size of retrieved sets is a major practical problem in searching bibliographies and online catalogs.

ADAPTIVE SEARCHES FOR "N GOOD" INSTANCES

Simply stated, preferences are conditional. They take effect situationally, adaptively. The more difficult it is to predict the outcome of searches, the more desirable it is to develop adaptive search strategies that are adaptive in the sense of incorporating conditional specifications as noted above. The larger the collection the greater the difficulty of predicting what would be found and the greater the need to support adaptive, heuristic searching able to incorporate preferences.

Retrieval systems generate a more or less weak ordering of documents. At one extreme, simple Boolean techniques yield a binary partitioning: Retrieved and Not-retrieved. The problem is that with such a weak ordering it is not practical to predict the absolute size of the retrieved set. Therefore, if, as is usual, one does not want every instance retrieved, it is not realistic to expect to be able to formulate a search statement that yields any specified number of documents. Hence, with retrieval techniques with weak ordering, one has to modify the search statement iteratively until a suitably-sized retrieved set results. The "FEWER" command of the OASIS experimental prototype provides computer-assisted support for this within a single database environment (Buckland, Butler, Norgard & Plaunt 1992). Generalizing this approach to multiple database environment would provide the basis in an environment of simple Boolean services for an intelligent "knowbotic" agent capable of adapting search statements adaptively in response to what is found.

Retrieval techniques with stronger ordering capability, such as Boolean systems supporting weighted search queries, have less need for adaptiveness during the search. Probabilistic retrieval systems ordinarily yield rather strictly ordered retrieved sets, but the first few listed are not necessarily the preferred few.

The search for "N good" instances of the form "given a choice, I would prefer" implies an adaptive retrieval strategy that can respond dynamically to the results encountered, either during the search or in post-search processing of retrieval results. A search for "N good" instances depends on an explicit specification of preferred values for one or more attribute-values as the basis for "better" to be determined.

In terms of the present analysis this can be viewed as an elaboration of the data presented in Table 2. There would need to be an additional dimension to the table for a disaggregation of the values of each attribute (e.g. date of publication) upon which preferences would operate. Some empirical examples of such data can be found in Buckland, Butler, Norgard and Plaunt (1992).

SAMENESS AND SUBSTITUTABILITY

The question of whether two instances of a document are really "same" or not becomes important. A rigorous position is that, strictly speaking, there is no such thing as "the same" with respect to two documents. Two copies of the same document are just that: two different instances and not, therefore, the same. They may, however, be very similar and, in practice, "the same" has a pragmatic interpretation. It means that two or more objects are equally acceptable for some purpose. A photocopy of a document using only one side of the paper is distinguishable from another one made using both sides of the paper. If we wish only to read the text, the different versions are "all the same to us" and one could be a substitute for the other. However, if we intended to cut and paste the paper text, the difference would matter. Similarly, we might find ascii and bit-mapped images of a text equally acceptable for some purposes, but not for others. A low-resolution version of a high-resolution image may be as acceptable in some circumstances but not in others. It follows that the determination of sameness, being situational, can only be done during the selection process, not at the time of storage. The most that can be done during storage and indexing (i.e. prior to retrieval) is to categorize the documents so as facilitate future decisions concerning substitutability or, as an economy, impose some conventional threshold of sameness. For printed documents on paper the conventional technical threshold for "same" is the "edition", with the terms "impression", "variant" and "copy" (or "exemplar") being used for finer distinctions when needed.

SEARCH THEORY

What follows briefly summarizes the most relevant elements of search theory.

(a) Likelihood of finding. If the cost of searching different sources were identical or irrelevant, the optimal search strategy is to look first at the source most likely to contain what is sought, if we happened to have such a priori knowledge, then the second most likely, and so on. In other words the sources should be ranked and searched in decreasing likelihood of finding what is sought.

(b) Cost of searching. If the expected likelihood of search success were the same for all sources, then the least expensive search strategy would be to start with the source that is the least expensive to search, then (if need be) move on to the least expensive remaining source, and so on. In other words, the sources should be ranked and searched in order of the increasing search cost.

(c) Cost-effective searching. Where the sources vary with respect to both the likelihood of finding and to the cost of searching, the optimal search strategy is achieved by establishing the expected cost-benefit ratio for each source, expressed as the likelihood of finding what is sought divided by the expected cost of searching that source. One then starts with the source with the highest value for this ratio, then moves on that with the highest ratio among the remaining sources, and so on. In other words, the sources should be ranked and searched in decreasing order of the expected search success / search cost ratio.

(d) Stopping the search. While these rules tell one where to look, they do not tell one whether or when to stop searching before all sources have been searched. For this it is necessary to compare the marginal cost-benefit with some other, alternative use of resources. [Note also: Satisficing: Mooers' Law; and elasticity of demand.]

Search diseconomy. If there is a cost in, for example, time, effort, or money, associated with searching, searching all sources may not be affordable or warranted. In particular, since libraries tend to have at least some degree of overlap in contents, diminishing returns in terms of search results are to be expected as the number of sources searched increases. The marginal benefit will tend to diminish but the marginal cost can be expected to remain stable or to increase. The combined effect is an increase in the marginal cost per relevant instance found as well as a steady increase in overall cost for the entire search and, therefore, an increase in average cost per relevant instance found. At some stage the cost of continuing the search is likely to become uneconomical, depending on the circumstances and assumed alternative uses of constrained resources.