+ Page 1 + ----------------------------------------------------------------- The Public-Access Computer Systems Review Volume 3, Number 1 (1992) ISSN 1048-6542 ----------------------------------------------------------------- To retrieve an article file as an e-mail message, send the GET command given after the article information to LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet). To retrieve the article as a file, omit "F=MAIL" from the end of the GET command. CONTENTS REFEREED ARTICLES Analysis of Search Failures in Document Retrieval Systems: A Review By Yasar Tonta (pp. 4-53) To retrieve this file: GET TONTA PRV3N1 F=MAIL 1.0 Introduction 2.0 Overview of a Document Retrieval System 3.0 Search Failure Analysis 3.1 Measures of Retrieval Effectiveness 3.2 Methods of Analyzing Search Failures 3.2.1 Analysis of Search Failures Utilizing Retrieval Effectiveness Measures 3.2.2 Analysis of Search Failures Utilizing User Satisfaction Measures 3.2.3 Analysis of Search Failures Utilizing Transaction Logs 3.2.4 Analysis of Search Failures Utilizing the Critical Incident Technique 3.3 Summary 4.0 Review of Studies Analyzing Search Failures 4.1 Studies Utilizing Precision and Recall Measures 4.1.1 The Cranfield Studies 4.1.2 Lancaster's MEDLARS Studies 4.1.3 Blair and Maron's Full-Text Retrieval System Study 4.1.4 Markey and Demeyer's Dewey Decimal Classification Online Project 4.2 Studies Utilizing User Satisfaction Measures 4.3 Studies Utilizing Transaction Logs 4.4 Studies Utilizing the Critical Incident Technique 4.5 Other Search Failure Studies 4.6 Related Studies 5.0 Conclusion + Page 2 + COLUMNS Public-Access Provocations: An Informal Column Those Who Don't, Won't By Walt Crawford (pp. 54-56) To retrieve this file: GET CRAWFORD PRV3N1 F=MAIL REVIEWS Zen and the Art of the Internet: A Beginner's Guide to the Internet Reviewed by Billy Barron (pp. 57-59) To retrieve this file: GET BARRON PRV3N1 F=MAIL ----------------------------------------------------------------- The Public-Access Computer Systems Review Editor-in-Chief Charles W. Bailey, Jr. University Libraries University of Houston Houston, TX 77204-2091 (713) 743-9804 LIB3@UHUPVM1 (BITNET) or LIB3@UHUPVM1.UH.EDU (Internet) Associate Editors: Columns: Leslie Pearse, OCLC Communications: Dana Rooks, University of Houston Reviews: Roy Tennant, University of California, Berkeley Editorial Board Ralph Alberico, University of Texas, Austin George H. Brett II, University of North Carolina General Administration Steve Cisler, Apple Walt Crawford, Research Libraries Group Lorcan Dempsey, University of Bath Nancy Evans, Pennsylvania State University, Ogontz + Page 3 + Charles Hildreth, READ Ltd. Ronald Larsen, University of Maryland Clifford Lynch, Division of Library Automation, University of California David R. McDonald, Tufts University R. Bruce Miller, University of California, San Diego Paul Evan Peters, Coalition for Networked Information Mike Ridley, University of Waterloo Peggy Seiden, Pennsylvania State University, New Kensington Peter Stone, University of Sussex John E. Ulmschneider, North Carolina State University Publication Information Published on an irregular basis by the University Libraries, University of Houston. Technical support is provided by the Information Technology Division, University of Houston. Circulation: 3,906 subscribers in 35 countries (PACS-L) and 230 subscribers in 14 countries (PACS-P). Back issues are available from LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet). 
To obtain a list of all available files, send the following e-mail message to the LISTSERV: INDEX PACS-L. The name of each issue's table of contents file begins with the word "CONTENTS." ----------------------------------------------------------------- ----------------------------------------------------------------- The Public-Access Computer Systems Review is a refereed electronic journal that is distributed on BITNET, Internet, and other computer networks. There is no subscription fee. To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet) that says: SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also receive two electronic newsletters: Current Cites and Public- Access Computer Systems News. The Public-Access Computer Systems Review is Copyright (C) 1992 by the University Libraries, University of Houston. All Rights Reserved. Copying is permitted for noncommercial use by computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission. ----------------------------------------------------------------- + Page 57 + ----------------------------------------------------------------- The Public-Access Computer Systems Review 3, no. 1 (1992): 57-59. ----------------------------------------------------------------- ----------------------------------------------------------------- Kehoe, Brendan P. Zen and the Art of the Internet: A Beginner's Guide to the Internet. Chester, PA: 1992. [Computer file] Reviewed by Billy Barron. ----------------------------------------------------------------- Zen and the Art of the Internet is a new guide to the Internet that was written by Brendan Kehoe of Widener University. His goal was to introduce the reader to the resources that are available on the Internet. At the same time, Kehoe tried to avoid system specific information. It should be noted that parts of Zen and the Art of the Internet were derived from other works. Zen and the Art of the Internet starts off with a chapter on network basics. This chapter is a good introduction to the Internet, but it is not a general guide to networking. Rather, it is Internet and TCP/IP specific. If this chapter can be faulted for anything, it is that it oversimplifies some of the material. On the other hand, it definitely should not scare off the novice user. The e-mail and FTP chapters are very good, although they do get technical at times. The e-mail chapter could be improved by the addition of a section on etiquette similar to the excellent one in the FTP chapter. The Telnet chapter is packed with examples of Telnet-accessible services, and it explains how to find out about more services. I was rather disappointed by the omission of any information on tn3270. A description of how Telnet is different on IBM mainframes is also needed. These omissions may lead to some confusion on the part of IBM mainframe users. Kehoe describes other tools that are available on the Internet. These descriptions are well-rounded and useful, but Kehoe has just covered the most common tools. One of the most outstanding sections of Zen and the Art of the Internet is called "Things You'll Hear About." In a lot of ways, this chapter is a FAQ (Frequently Asked Questions) to the Internet, and it will answer many questions of the new network user. 
At the same time, it introduces the novice user to the folklore of the Internet without being intimidating.

+ Page 58 +

Zen and the Art of the Internet also has useful sections that contain information about commercial services, other networks, how to retrieve files, and how to find out more about the Internet. The USENET chapter does a great job of covering the most common misconceptions people have about that network. The document includes a helpful glossary.

The conclusion states "this guide is far from complete--the Internet changes on a daily (if not hourly) basis." Then Kehoe goes on to ask for suggestions. For Zen and the Art of the Internet to be useful in the long run, it will need to be updated on a fairly regular basis. From what I can tell, it sounds like Kehoe is planning on doing this. I'm sending in my suggestions, and I highly recommend you do the same.

Overall, I was very impressed with this document. In fact, the same day that I downloaded it, I had our receptionist make copies and distribute them to the whole Academic Computing Support Staff. In a couple of days, I am going to do the same with our library. My girlfriend's university just got on the Internet, and I'm giving her two sources of information to start with: the first is HYTELNET, and the second is going to be Zen and the Art of the Internet.

It has a few rough spots, but I'm sure that Kehoe will fix them. The biggest problem is that it paints too rosy a picture of the Internet, but this kind of document is intended to get users interested in the network, not to critique it.

I try to stay ahead of most Internet users in terms of my knowledge of what's available and how to access it. Well, I learned a couple of things while reading Zen and the Art of the Internet, so it is not just for novices. At the same time, it is easily understandable by novices. My message to Brendan Kehoe is: Keep up the good work!

Access Instructions

The file is available on host FTP.CS.WIDENER.EDU (147.31.254.132) in the directory pub/zen and on FTP.UU.NET (137.39.1.9) in the directory /inet/doc. Although the author reports that he has signed an agreement with a major publishing house, he has indicated that the network versions will continue to be available.

+ Page 59 +

About the Author

Billy Barron
VAX/UNIX Systems Manager
University of North Texas
BILLY@UNT.EDU

-----------------------------------------------------------------

The Public-Access Computer Systems Review is a refereed electronic journal that is distributed on BITNET, Internet, and other computer networks. There is no subscription fee. To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet) that says: SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also receive two electronic newsletters: Current Cites and Public-Access Computer Systems News.

This article is Copyright (C) 1992 by Billy Barron. All Rights Reserved.

The Public-Access Computer Systems Review is Copyright (C) 1992 by the University Libraries, University of Houston. All Rights Reserved.

Copying is permitted for noncommercial use by computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission.
-----------------------------------------------------------------

+ Page 54 +

-----------------------------------------------------------------
"Public-Access Provocations: An Informal Column." The Public-Access Computer Systems Review 3, no. 1 (1992): 54-56.
-----------------------------------------------------------------

"Those Who Don't, Won't"

By Walt Crawford

It's all well and good to blather on about refinements and extensions to online catalogs. More power to those providing full-text access and adding retrieval for visual materials and sound. Good access to non-textual material is a vital part of modern library catalogs.

What provokes me this time, however, has to do with the end result of most library catalog searching: text. The provocation comes in the term "post-literate," particularly as used by those who say "text-impaired" in the sense of "hung up on text, that old-fashioned narrow channel of communication."

A Post-Literate Society Would Be a Medieval Society

Need I say more? Perhaps--but that heading is the gist of this diatribe, and my optimistic viewpoint shows up in the word "would" rather than "will." I don't believe we're headed for a post-literate society (and desperately hope that we're not), and feel that librarians should make every effort to see to it that we're not.

Some of you will already know the prequel to this column's title: Dick Dougherty's ALA Presidential slogan, "Kids who read, succeed." As a rejoinder to those who urge us to move smoothly into a post-literate society, the follow-on is more important: "Those who don't, won't."

+ Page 55 +

In a society where most people were "visually literate" or "media literate" but lacked solid, well-developed, constantly used reading skills, the literate or "text-oriented" minority would be in control--perhaps not always overtly, but most assuredly where it counts. In that dystopian vision, we would revert to a medieval state where the knowledgeable few rule the ignorant masses. This time the ignorant masses would not think themselves ignorant, since they would be flooded with "information" through the vastly richer channels of sight and sound.

I've heard some of the visions of those who believe that text doesn't matter. Business people will solve problems by working through virtual-reality embodiments of the situation at hand, or use video game-like methods to arrive at the best course of action. Maybe I'm getting gullible in middle age, but (God help me) I believe that some of these people are serious!

Narrowness Can Be a Virtue

Yes, text represents a narrower communication channel than sight and sound. Another way of saying that is that text provides specificity. Text is also linear, which makes it ideal for logical operations, case-building, and the other tools of argument. (For the purposes of this discussion, mathematics is a special case of text: even narrower, even more specific, and generally even more linear.)

Of course text isn't all there is. A description of the Sistine Chapel can't substitute for photographs of the ceiling and chapel itself, which in turn don't really substitute for being there. A description of Stravinsky's "Variations on 'Vom Himmel Hoch'" would be pretty pallid, while the music itself is an astonishing cross-century blend and a considerable pleasure. I watch network TV (and I don't mean PBS) and enjoy it; a description of "Evening Shade" or "Northern Exposure" could surely not replace the experience itself.
Turning to more "factual" matters, I would never approve a proposed extension to my house without seeing appropriate drawings--but I would also never approve the extension without detailed textual and mathematical support for those drawings, in the form of firm costs, structural details, and a proper contract. + Page 56 + Literacy Empowers Without the ability to read carefully, thoroughly and thoughtfully, a person will always be at the mercy of others. That's true of mathematical literacy as well, to be sure: if you can't do approximate calculations in your head, you have no basis for challenging a dishonest tradesperson or a simple error in charging. Reading, thinking, understanding: these provide power, the power to participate fully in the modern world. In a post- literate society, only the reading minority would have that power. That is, I maintain, a future to be avoided rather than dealt with. One final note. If you believe that this column is an attack on multimedia, or that I am arguing that only print is important or that entertainment is unimportant: go back and read it again, this time more carefully. About the Author Walt Crawford is a Senior Analyst at The Research Libraries Group (Mountain View, CA), and is Vice President/President- Elect of the Library & Information Technology Association (LITA), a division of the American Library Association. ----------------------------------------------------------------- The Public-Access Computer Systems Review is a refereed electronic journal that is distributed on BITNET, Internet, and other computer networks. There is no subscription fee. To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet) that says: SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also receive two electronic newsletters: Current Cites and Public- Access Computer Systems News. This article is Copyright (C) 1992 by Walt Crawford. All Rights Reserved. The Public-Access Computer Systems Review is Copyright (C) 1992 by the University Libraries, University of Houston. All Rights Reserved. Copying is permitted for noncommercial use by computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission. ----------------------------------------------------------------- + Page 4 + ----------------------------------------------------------------- Tonta, Yasar. "Analysis of Search Failures in Document Retrieval Systems: A Review." The Public-Access Computer Systems Review 3, no. 1 (1992): 4-53. [Refereed article] ----------------------------------------------------------------- Abstract This paper examines search failures in document retrieval systems. Since search failures are closely related to overall document retrieval system performance, the paper briefly discusses retrieval effectiveness measures such as precision and recall. It examines four methods used to study retrieval failures: retrieval effectiveness measures, user satisfaction measures, transaction log analysis, and the critical incident technique. It summarizes the findings of major failure analysis studies and identifies the types of failures that usually occur in document retrieval systems. 1.0 Introduction Online document retrieval systems often fail to retrieve some relevant documents. More often than not they also retrieve nonrelevant documents. 
Such search failures may occur due to a variety of reasons, including problems with user-system interfaces, retrieval rules, and indexing languages.

Studying search failures presents extremely complicated problems. For instance, it is not clear exactly what constitutes a "search failure." While some researchers study search failures using retrieval effectiveness measures such as precision and recall, others prefer using "user satisfaction" as a criterion in deciding whether a search has failed or not. This paper will look at various (mostly implied) definitions of "search failure" and discuss some of the methods used in failure analysis studies.

+ Page 5 +

2.0 Overview of a Document Retrieval System

The principal function of a document retrieval system is to retrieve all relevant documents from a store of documents, while rejecting all others. A perfect document retrieval system would retrieve ALL and ONLY relevant documents. Maron [1] provides a more detailed description of the document retrieval problem and depicts the logical organization of a document retrieval system (see Figure 1).

-----------------------------------------------------------------

Figure 1. Logical Organization of a Conventional Document
Retrieval System. Source: Maron [2].

-----------------------------------------------------------------

                                               |------<-----|
                                               |            |
|---------------|                        |-----V-------|    |
|   Incoming    |                        |  Inquiring  |    |
|   documents   |                        |   patron    |    |
|-------|-------|                        |-----|-------|    |
        |                                      |            |
|-------V-------|     |------------|     |-----V-------|    |
|   Document    |---->| Thesaurus  |---->|    Query    |    |
| identification|     | Dictionary |     | formulation |    |
|  (Indexing)   |<----|            |<----|             |    |
|-------|-------|     |-----|------|     |-----|-------|    |
        |                   |                  |            |
|-------V-------|     |-----V------|     |-----V-------|    |
|     Index     |     | Retrieval  |     |   Formal    |    |
|    records    |---->|    rule    |<----|    query    |    |
|---------------|     |-----|------|     |-------------|    |
                            |                                |
                            |--------------->----------------|

-----------------------------------------------------------------

+ Page 6 +

As Figure 1 suggests, the basic characteristics of each incoming document (e.g., author, title, and subject) are identified during the indexing process. Indexers may consult thesauri or dictionaries (controlled vocabularies) in order to assign acceptable index terms to each document. Consequently, an index record is constructed for each document for subsequent retrieval purposes.

A user can identify proper search terms by consulting these index tools during the query formulation process. After checking the validity of initial terms and identifying new ones, the user determines the most promising query terms (from the retrieval point of view) to submit to the system as the formal query. However, most users do not know about the tools that they can utilize to express their information needs, which results in search failures because of a possible mismatch between the user's vocabulary and the system's vocabulary.

Maron describes the search process as follows: the actual search and retrieval takes place by matching the index records with the formal search query. The matching follows a rule, called "Retrieval Rule," which can be described as follows: For any given formal query, retrieve all and only those index records which are in the subset of records that is specified by that search query [3].
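Maron's retrieval rule lends itself to a brief illustration. The Python sketch below is not taken from the article; the record structure, the index terms, and the conjunctive query form are assumptions made for the example. It simply treats each index record as a set of assigned terms and retrieves all and only those records whose terms satisfy the formal query.

# A minimal sketch of the retrieval rule: retrieve all and only those
# index records that fall within the subset specified by the formal
# query. Record contents and query structure are illustrative only.

index_records = {
    "doc1": {"aerodynamics", "wing", "stress"},
    "doc2": {"indexing", "thesaurus"},
    "doc3": {"aerodynamics", "indexing"},
}

def retrieve(records, required_terms, excluded_terms=frozenset()):
    """Return IDs of records containing every required term and
    none of the excluded terms."""
    hits = []
    for doc_id, terms in records.items():
        if required_terms <= terms and not (excluded_terms & terms):
            hits.append(doc_id)
    return hits

# Formal query: documents indexed under both "aerodynamics" and "indexing".
print(retrieve(index_records, {"aerodynamics", "indexing"}))  # ['doc3']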
Thus, a document retrieval system consists of (1) a store of documents (or, representations thereof); (2) a population of users each of whom makes use of the system to satisfy their information needs; and (3) a retrieval rule which compares the representation of each user's query with the representations of all the documents in the store so as to identify the relevant documents in the store. There also should be a user interface to allow users to interact with the system. In reality, the ideal document retrieval system discussed in this section does not exist. Document retrieval systems do not retrieve ALL and ONLY relevant documents, and users may be satisfied with systems that rapidly retrieve a few relevant documents. + Page 7 + 3.0 Search Failure Analysis Before reviewing major failure analysis studies, it is helpful to examine some approaches used in studying search failures in document retrieval systems and to discuss the various definitions of "search failure" used by researchers. After all, we cannot analyze search failures if we do not recognize them. 3.1 Measures of Retrieval Effectiveness Retrieval effectiveness measures such as "precision" and "recall" are widely used to evaluate the effectiveness of online document retrieval systems. A few measures, which are discussed below, are also used in the study of search failures. This paper will not review all the measures of retrieval effectiveness suggested in the literature since they are seldom, if ever, used in the analysis of search failures. Precision is defined as the proportion of retrieved documents which are relevant, whereas recall is defined as the proportion of relevant documents retrieved [4]. These two measures are generally used in tandem in evaluating retrieval effectiveness in document retrieval systems. Precision can be taken as the ratio of the number of documents that are judged relevant for a particular query over the total number of documents retrieved. For instance, if, for a particular search query, the system retrieves two documents and the user finds one of them relevant, then the precision ratio for this search would be 50%. Recall is considerably more difficult to calculate than precision because it requires finding relevant documents that will not be retrieved during users' initial searches [5]. Recall can be taken as the ratio of the number of relevant documents retrieved over the total number of relevant documents in the collection. Take the above example. The user judged one of the two retrieved documents to be relevant. Suppose that later three more relevant documents that the original search query failed to retrieve were found in the collection. The system retrieved only one out of the four relevant documents from the database. The recall ratio would then be equal to 25% for this particular search. + Page 8 + "Fallout" is another measure of retrieval effectiveness. Fallout can be defined as the ratio of nonrelevant documents retrieved over all the nonrelevant documents in the collection. The earlier example also can be used to illustrate fallout. The user judged one of the two retrieved documents as relevant, and, later, three more relevant documents that the original query missed were identified. Further suppose that there are nine documents in the collection altogether (four relevant plus five nonrelevant documents). Since the user retrieved one nonrelevant document out of a total of five nonrelevant ones in the collection, the fallout ratio would be 20% for this search. 
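The worked examples above can be restated in a few lines of code. The following Python sketch uses invented document identifiers (only the set sizes matter) to reproduce the figures just given: two documents retrieved, one of them relevant, and four relevant documents in a nine-document collection.

# Reproduces the worked example from Section 3.1. Document IDs are
# invented; only the sizes of the sets matter.

retrieved = {"d1", "d2"}              # 2 documents retrieved
relevant = {"d1", "d3", "d4", "d5"}   # 4 relevant documents in the collection
collection_size = 9                   # 4 relevant + 5 nonrelevant

relevant_retrieved = retrieved & relevant        # {"d1"}
nonrelevant_retrieved = retrieved - relevant     # {"d2"}
nonrelevant_in_collection = collection_size - len(relevant)

precision = len(relevant_retrieved) / len(retrieved)                # 0.50
recall = len(relevant_retrieved) / len(relevant)                    # 0.25
fallout = len(nonrelevant_retrieved) / nonrelevant_in_collection    # 0.20

print(f"precision={precision:.0%} recall={recall:.0%} fallout={fallout:.0%}")
# precision=50% recall=25% fallout=20%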
3.2 Methods of Analyzing Search Failures This section discusses the analysis of search failures using retrieval effectiveness methods (e.g., recall), user satisfaction measures, transaction logs, and the critical incident technique. 3.2.1 Analysis of Search Failures Utilizing Retrieval Effectiveness Measures If precision and recall are seen as performance measures with the given definitions, it instantly becomes clear that "performance" can no longer be defined as a dichotomous concept. As precision and recall are defined as percentages, we can think of "degrees" of search failure or success. This view would probably best reflect different performance levels attained by current document retrieval systems. It is impossible to find a perfect document retrieval system. In reality, retrieval systems are imperfect, and they are better or worse than one another. + Page 9 + Performance measures such as precision and recall can be used in the analysis of search failures. In the precision example in Section 3.1, only 50% of the documents retrieved were relevant, resulting in a precision of 50%. If each nonrelevant document that the system retrieves for a given query represents a search failure, then it is also possible to think of precision as a measure of search failure: failure to retrieve relevant documents ONLY. The more nonrelevant documents the system retrieves for a given query, the higher the degree of precision failures. If no retrieved document happens to be relevant, then the precision ratio becomes zero due to severe precision failures. In the recall example, the recall ratio was 25%, implying that the system missed 75% of the relevant documents in the collection. If each missed relevant document represents a search failure, then it is possible to think of recall as a measure of search failure: failure to retrieve ALL relevant documents in the collection. The more relevant documents the system misses the higher the degree of recall failure. If the system fails to retrieve any relevant documents from the collection, then the recall ratio becomes zero due to severe recall failures. Precision and recall are two different quantitative measures of aggregation of search failures. For convenience, search failures analyzed using precision and recall are called precision failures and recall failures. Precision failures can easily be detected. They occur when the user finds some retrieved documents nonrelevant, even if those documents are assigned the index terms that the user initially asked for in the search query. Users may feel that index terms have been incorrectly assigned to documents that are not really relevant to those subjects. It should be noted that "relevance" is defined as a relationship "between a document and a person in search of information" and it is a function of a large number of variables concerning both the document (e.g., what it is about, its currency, language, and date) and the person (e.g, person's education and beliefs) [6]. (For a comprehensive review of the concept of "relevance," see [7].) + Page 10 + Recall failures mainly occur because index terms that users would normally utilize to retrieve documents about particular subjects do not get assigned to documents that are relevant to those subjects. As stated earlier, detecting recall failures, especially in large scale document retrieval systems, is much more difficult. Researchers have therefore used somewhat different approximations to calculate recall figures in their experiments. 
Although information retrieval textbooks mention "fallout" as a measure of retrieval effectiveness, the author is not aware of any experiment where the fallout ratio has been successfully calculated [8]. Calculating the fallout ratio in large collections is at least as difficult as calculating the recall ratio, if not more so. To calculate the fallout ratio, all nonrelevant documents retrieved during the search must be identified, all nonrelevant documents in the overall collection must be found, and the size of the collection must be established.

It is tempting to say that documents that are not retrieved are probably not relevant; however, since recall failures do occur in document retrieval systems, this is not the case. If all of the unretrieved documents in a collection were scanned, some of them would be relevant. The fallout ratio could then be calculated. It should be noted that this method can only be used for specific queries where the number of relevant documents in the whole collection is known to be small.

"Fallout failures" do occur constantly in document retrieval systems even if it is impractical to quantify them. Whenever the system retrieves too many nonrelevant records, users feel the consequences of fallout failure. Either they must scan long lists of useless records (hence "fallout") or abandon the search.

+ Page 11 +

Notice that fallout failures also can be seen as severe precision failures. Fallout failure has not been adequately studied; however, it is known that users tend to resist scanning through screens of retrieved items. For instance, Larson [9] found that in a large online catalog the average number of records retrieved was 77.5, but users scanned an average of fewer than 10 records per search. It is not clear why the users stopped scanning after a few records. Some may have been satisfied with the results. Others might have abandoned their searches out of frustration because the system retrieved too many unpromising, nonrelevant records [10]. It would be interesting to study what percentage of searches in online catalogs are abandoned because of user frustration with fallout failures.

It is also theoretically possible to envision "perverse" document retrieval systems where, for a given query, the system first would retrieve all nonrelevant documents before it would eventually retrieve relevant ones [11]. However, in real life, "perverse" document retrieval systems are unlikely to exist.

Mainly, retrieval effectiveness measures are used to determine and study three types of search failures: (1) retrieving nonrelevant documents (precision failures); (2) missing relevant documents (recall failures); and (3) retrieving too many unpromising, nonrelevant documents (fallout failures). Failure analysis aims to find out the causes of these failures so that existing systems can be improved in a variety of ways.
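Read this way, the three failure types can be reported directly from the effectiveness ratios. The short sketch below is only illustrative; the function name and the fallout cutoff are arbitrary choices for the example, not values drawn from the literature.

def describe_failures(precision, recall, fallout, fallout_cutoff=0.10):
    """Express a completed search in terms of the three failure types."""
    return {
        "precision_failure": 1.0 - precision,         # share of retrieved items that were nonrelevant
        "recall_failure": 1.0 - recall,               # share of relevant items that were missed
        "fallout_failure": fallout > fallout_cutoff,  # "too many" nonrelevant items retrieved
    }

# The Section 3.1 example: precision 50%, recall 25%, fallout 20%.
print(describe_failures(0.50, 0.25, 0.20))
# {'precision_failure': 0.5, 'recall_failure': 0.75, 'fallout_failure': True}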
They wish to retrieve "some good references without having to examine too many bad ones" [12]. Consequently, it is more important for a document retrieval system to "distinguish between wanted and unwanted items" quickly than to retrieve all relevant items in the collection. It also should be noted that not everyone is satisfied with the most commonly used retrieval effectiveness measures (precision and recall). For instance, Cooper has questioned the use of recall as a performance measure because it takes into account not only retrieved documents, but also unretrieved documents. In his view, this is wasted effort since the relevance of unretrieved documents has little bearing on the notion of subjective user satisfaction [13]. He maintains that "an ideal evaluation methodology must somehow measure the ultimate worth of a retrieval system to its users in terms of an appropriate unit of utility" [14]. 3.2.2 Analysis of Search Failures Utilizing User Satisfaction Measures Some failure analysis studies are based on user satisfaction measures, rather than on retrieval effectiveness measures. Although it may at first seem straightforward, analyzing search failures utilizing user satisfaction measures is a complex process that provides interesting challenges. + Page 13 + First, defining user satisfaction is difficult. Several authors tried to address this issue. Tessier, Crouch, and Atherton discussed such factors as the search output, the intermediary, the service policies, and the "library as a whole" as the main determinants of the user satisfaction [15]. Bates examined the effects of "subject familiarity" and "catalog familiarity" on search success and found that the former has a slight detrimental effect, while the latter has a very significant beneficial effect on search success [16]. Tessier used factor analysis and multiple regression techniques to study the influence of various variables on overall search satisfaction. She found that "the strongest predictors of satisfaction were the precision of search, the amount of time saved, and the perceived quality of the database as a source of information" [17]. Hilchey and Hurych found "a strong positive relationship between perceived relevance of citations and search value" when they performed a statistical analysis on the online reference questionnaire forms returned by the users in a university library [18]. Second, user satisfaction relies heavily on users' judgments about search failures or successes; however, users' judgments may be inconsistent for various reasons. For example, Tagliacozzo found that "MEDLINE was perceived as 'helpful' by respondents who, in other parts of the questionnaire [used in the author's research], showed that they had NOT found it particularly useful" [19, (original emphasis)]. Tagliacozzo warns us: "Caution should therefore be used in taking the users' judgments at face value, and in inferring from single responses that their information needs were, or were not, satisfied by the service" [20]. It follows that it is not usually sufficient to obtain a binary "Yes/No" response from the user about being satisfied or not satisfied with the results. Ankeny found that the use of a two-point (yes-no) scale "appeared to result in inflated success ratings" [21]. When pressed, users are likely to come up with further explanations. For example, a user might say: "Yes, in a way my search was successful even though I couldn't find what I wanted." 
A second user might say that a given search was not successful because "it did not retrieve anything new."

+ Page 14 +

A researcher getting such answers would have a hard time classifying them. The data gathering tools that the researcher employs to elicit information from users should be sensitive enough to handle such answers by asking more detailed questions. After all, a decision has to be made as to whether a search was successful or not.

Further conditions have been introduced in some studies to facilitate this decision-making process. In Ankeny's study, for example, a successful search has three characteristics: the patron must indicate that s/he found EXACTLY what was wanted, that s/he was FULLY satisfied with the search, and that s/he marked none of the 10 listed reasons for dissatisfaction, where the reasons for dissatisfaction ranged from "system problems" to "too much information," from "information not relevant enough" to "need different viewpoint" [22, (original emphasis)]. Nevertheless, it is still possible that a given search may be a failure even if the answers given by a user met all three of these conditions. It was noted earlier that users tend to abandon some searches that retrieve too many items. Many users may prefer to retrieve a few relevant documents quickly. They would not consider a search a "failure" even if the system has missed some relevant documents (i.e., a recall failure).

User satisfaction measures are influenced by both user group and search goal factors. For example, an undergraduate student writing a term paper may be satisfied if a search retrieves a few relevant textbooks. However, the situation is entirely different for a health professional. This user may want to know everything about a certain case because the outcome of missing relevant information may have serious consequences. For example, a health professional investigating a medical procedure on MEDLINE "only found records showing it to be safe, missing the reports of fatalities associated with the procedure" [23].

+ Page 15 +

The above examples show that some caution is needed when interpreting users' indications of satisfaction. There are some published studies that show that "in many cases high levels of reported end-user 'satisfaction' . . . may not reflect true success rates" [24]. Furthermore, as Cheney notes, we do not "know what end users expect of their search results, because no study has examined end users' expectations of database searching. Neither has any study examined the actual quality of end-user search results measured in terms of precision and recall" [25].

So far, the discussion has concentrated on the analysis of search failures based on retrieval effectiveness or "user satisfaction" measures. As part of a carefully designed and conducted experiment under "as real-life a situation as possible," Saracevic and Kantor studied, among other things, the relationship between user satisfaction and precision and recall [26]. Their experiment involved 40 users who each submitted a query that reflected a real information need. Thirty-nine professional searchers did online searches on Dialog databases for these queries. Each query was searched by nine different professionals, and the results were combined for evaluation purposes. The precision ratio for a given search was estimated as the number of relevant items retrieved by the search divided by the total number of items retrieved by the search.
Similarly, recall ratio was estimated as the number of relevant items retrieved by the search divided by the total number of relevant items in the union of items retrieved by all searchers for that question [27]. Five utility measures were used: (1) whether the user's participation and the resultant information was worth it (on a five-point scale); (2) time spent; (3) perceived (by the users) dollar value of the items; (4) whether the information contributed to the resolution of the research problem (on a five-point scale); and (5) whether the user was satisfied with the results (on a five- point scale). + Page 16 + They found that "searchers in questions where users indicated high overall satisfaction with results . . . were 2.49 times more likely to have higher precision" [28]. They interpreted their findings pertaining to the relationship between utility measures and retrieval effectiveness measures as follows: In general, retrieved sets with high precision increased the chance that users assessed that the results were "worth more of their time than it took," were "high in dollar value," contributed "considerably to their problem resolution," and "were highly satisfactory." On the other hand, high recall did not significantly affect the odds for any of those measures. . . . These are interesting findings in another respect. They indicate that utility of results (or user satisfaction) may be associated with high precision, while recall does not play a role that is even closely as significant. For users, precision seems to be the king and they indicated so in the type of searches desired. In a way this points out to the elusive nature of recall: this measure is based on the assumption that something may be missing. Users cannot tell what is missing any more than searchers or systems can. However, users can certainly tell what is in their hand, and how much is NOT relevant [29, (original emphasis)]. 3.2.3 Analysis of Search Failures Utilizing Transaction Logs The availability of transaction logs, which record users' interaction with the document retrieval systems, provides the opportunity to study and monitor search failures unobtrusively. Larson states: "Transaction monitoring, in its simplest form, involves the recording of user interactions with an online system. More complete transaction monitoring also will record the system responses and performance data (such as response time for searches), providing enough information to reconstruct all of the user's interactions with the system" [30]. This includes search queries entered, records displayed, help requests, errors, and the system responses. (For a review of online catalog transaction log studies, see [31].) + Page 17 + Since transaction logs also contain invaluable information about failed searches, researchers have been interested in scanning transaction logs in order to identify failed searches. Several researchers identified "zero hits" from the transaction logs of selected online catalogs and looked into the reasons for search failures [32]. A few others employed the same method when they studied search failures in MEDLINE [33]. These researchers used a rather practical definition of search failure when scanning transaction logs. A search was treated as a failure if it retrieved no records. Needless to say, the definition of search failure as zero hits is incomplete since it does not include partial search failures. More importantly, there is no reason to believe that all "non-zero hits" searches were successful ones. 
Such an assumption would mean that no precision failures occurred in the systems under investigation! Furthermore, "not all zero hits represent failures for the patrons . . . It is possible that the patron is satisfied knowing that the information sought is not in the database, in which case the zero-hit search is successful" [34]. Precedence searching in litigation is an example of a zero-hit search that is successful.

Some newer document retrieval systems such as Okapi and CHESHIRE can accommodate relevance feedback techniques and incorporate users' relevance judgments in order to improve retrieval effectiveness in subsequent iterations [35]. Transaction logs of such online catalogs also record the user's relevance judgment for each record that is displayed. Using these logs, the researcher is able to determine whether the user found a given record to be relevant or not. The availability of relevance judgments in transaction logs has opened up new avenues for studying search failures in online library catalogs. Researchers are now able to study not only zero-hit searches, but also failed searches that retrieve nonrelevant records. Obviously, the rendering of relevance judgments makes it easier to identify precision failures, but there still needs to be some kind of mechanism to identify recall failures.

+ Page 18 +

What constitutes a search failure when the relevance judgment for each retrieved document is recorded in the transaction log? Some researchers came up with yet another practical definition of search failure and analyzed it accordingly. For example, during the evaluation of the Okapi online catalog, a search was counted as a failure "if no relevant record appears in the first ten which are displayed" [36]. This definition of search failure is quite different from one based on precision and recall. It is dichotomous, and it assumes that users will scan at least ten records before quitting. This assumption might be true for some searches and for some users, but not for all searches and users. It also downplays the importance of search failures. Searches retrieving at least one relevant record in ten are considered "successful" even though the precision ratio for such searches is quite low (10%).

Although transaction monitoring offers unprecedented opportunities to study search failures in document retrieval systems and provides "highly detailed information about how users actually interact with an online system, . . . it cannot reveal their intentions or whether they are satisfied with the results" [37]. Some of the shortcomings of transaction monitoring in studying search failures are as follows.

First, it is not clear what constitutes a "search failure" in transaction logs. As mentioned earlier, defining all zero-hit searches as search failures has some serious flaws.

Second, transaction logs have very little to offer when studying recall failures in document retrieval systems. Recall failures can only be determined by using different methods such as analysis of search statements, indexing records, and retrieved documents. In addition, relevant documents that were not retrieved in the first place can be found by performing successive searches of the database.

+ Page 19 +

Third, transaction logs can document search failure occurrences, but they cannot explain why a particular failure occurred.
Search failures in online catalogs occur for a variety of reasons, including simple typographical errors, mismatches between users' search terms and the vocabulary used in the catalog, collection failures (i.e., requested item is not in the system), user interface problems, and the way search and retrieval algorithms function. Further information is needed about users' needs and intentions in order to find out why a particular search failed. Finally, since the users remain anonymous in transaction logs, analysis of these logs "prevents correlation of results with user characteristics" [38]. 3.2.4 Analysis of Search Failures Utilizing the Critical Incident Technique Based on their empirical investigation of tools, techniques, and methods for the evaluation of online catalogs, Hancock-Beaulieu, Robertson, and Neilson [39] found that "transaction logs can only be used as an effective evaluative method with the support of other means of eliciting information from users." One of the techniques to elicit information from users about their needs and intentions is known as "critical incident technique." Data gathered through this technique, which is briefly discussed below, facilitates the study of search failures in document retrieval systems. When it is used in conjunction with the analysis of transaction log data, the critical incident technique permits search failures to be correlated with user characteristics. The critical incident technique was first used during World War II to analyze the reasons that pilot candidates failed to learn to fly. Since then, this technique has been widely used, not only in aviation, but also in defining the critical requirements of and measuring typical performance in the health professions. Flanagan [40] describes the critical incident technique as follows: + Page 20 + The critical incident technique consists of a set of procedures for collecting direct observations of human behavior in such a way as to facilitate their potential usefulness in solving practical problems and developing broad psychological principles. The critical incident technique outlines procedures for collecting observed incidents having special significance and meeting systematically defined criteria. By an incident is meant any observable human activity that is sufficiently complete in itself to permit inferences and predictions to be made about the person performing the act. The critical incident technique essentially consists of two steps: (1) collecting and classifying detailed incident reports, and (2) making inferences that are based on the observed incidents. Recently, the critical incident technique has been used to assess "the effectiveness of the retrieval and use of biomedical information by health professionals" [41]. In the same study, researchers have used this technique to analyze and evaluate search failures in MEDLINE. Using a structured interview process that included administering a questionnaire, they asked users to comment on the effectiveness of online searches that they performed on the MEDLINE database. Each report obtained through structured interviews was called an "incident report." Researchers matched these incident reports against MEDLINE transaction log records corresponding to each search in order to find out the actual reasons for search success or failure. These incident reports provided much sought after information about user needs and intentions, and they put each transaction log record in context by linking search data to the searcher. 
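As a rough illustration of how transaction log data enters into both the transaction log and critical incident approaches, the sketch below scans a hypothetical log for zero-hit searches (the practical failure definition discussed in Section 3.2.3) and links an incident report to nearby log entries by terminal and time. The log format, field names, and matching rule are assumptions made for the example; real systems record different data and require more careful linking.

from datetime import datetime

# Hypothetical transaction log entries; real systems vary widely in
# what they capture.
log = [
    {"time": datetime(1992, 3, 2, 10, 15), "terminal": "T1",
     "query": "find su dewey decimal", "hits": 0},
    {"time": datetime(1992, 3, 2, 10, 17), "terminal": "T1",
     "query": "find su classification, dewey", "hits": 12},
    {"time": datetime(1992, 3, 2, 11, 2), "terminal": "T4",
     "query": "find ti zen internet", "hits": 1},
]

def zero_hit_searches(entries):
    """The 'practical' failure definition: searches that retrieve nothing."""
    return [e for e in entries if e["hits"] == 0]

def match_incident(entries, terminal, reported_time, window_minutes=30):
    """Link an incident report (terminal plus approximate time) to the
    log entries it most likely refers to."""
    return [e for e in entries
            if e["terminal"] == terminal
            and abs((e["time"] - reported_time).total_seconds()) <= window_minutes * 60]

print(len(zero_hit_searches(log)))                              # 1
print(match_incident(log, "T1", datetime(1992, 3, 2, 10, 20)))  # both T1 searches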
+ Page 21 +

Although the critical incident technique enables the researcher to gather information about user needs and intentions so that he or she can better explain the causes of search failures, it also has some shortcomings. Information gathered through the critical incident technique has to be corroborated with transaction log data. The verification of user satisfaction or dissatisfaction via transaction log data may provide further clues as to why searches succeed or fail. However, the researcher may not be able to confirm each and every user's account of his or her search from the transaction logs. As the users are usually not identified in the transaction logs, it is sometimes difficult to find the search in question in the logs. There are a variety of reasons for this problem. First, the user's advance permission has to be sought in order to examine his or her search(es) in the transaction logs. Second, users may not be able to recall the details of their searches after the fact. Third, the logs may not contain enough data about the search: the items displayed and users' relevance judgments are not recorded in most transaction logs.

The lack of sufficient data in transaction logs also limits the effectiveness of the critical incident technique. The researcher has to rely a great deal on what the user says about the search. For instance, if the items displayed by the user, along with relevance judgments, are not recorded in the transaction logs, the researcher will not be able to calculate the precision ratio. Furthermore, the critical incident technique per se does not tell us much about the documents that the user may have missed during the search: we still have to find out about recall failures using other methods.

3.3 Summary

This section discussed various methods of analyzing search failures in document retrieval systems. It emphasized that the issue of search failure is complex, and it demonstrated that no single method of analysis is, by itself, sufficient to characterize all the causes of search failures. The next section will review the findings of major studies in this area.

+ Page 22 +

4.0 Review of Studies Analyzing Search Failures

Numerous studies have shown that users experience a variety of problems when they search document retrieval systems, and that they often fail to retrieve relevant documents. The problems users frequently encounter when searching, especially in online catalogs, are well documented in the literature [42]. However, few researchers have studied search failures directly [43]. What follows is a brief overview of major studies of search failures in document retrieval systems. Not surprisingly, the results of these studies are not directly comparable because they use different definitions and methods of analysis.

4.1 Studies Utilizing Precision and Recall Measures

Several major studies employed precision and recall measures to analyze search failures.

4.1.1 The Cranfield Studies

Cyril Cleverdon, who was Librarian of the College of Aeronautics at Cranfield, England, and his colleagues conducted a series of studies in the late 1950s and early 1960s to investigate the performance of indexing systems [44]. They also studied the causes of search failures in document retrieval systems. This paper only reviews findings that pertain to search failures.
In the first study (Cranfield I), Cleverdon compared the retrieval effectiveness of four indexing systems: the Universal Decimal Classification, an alphabetical subject index, a special facet classification, and the uniterm system of co-ordinate indexing. Some 18,000 research reports and periodical articles in the field of aeronautics were indexed using these four indexing systems, and 1,200 queries were used in the tests [45].

The main purpose of the Cranfield I experiment was to test the ability of each indexing system to retrieve the "source document" upon which each query was based. Researchers knew beforehand that "there was at least one document which would be relevant to each question" [46]. The recall ratio was calculated based on the retrieval of source documents. However, this recall ratio should be regarded as a type of "constrained" recall since the objective was just to find source documents in the collection. The Cranfield I tests showed that "the general working level of I.R. systems appears to be in the general area of 60%-90% recall and 10%-25% of relevance [i.e., precision]" [47].

+ Page 23 +

During the tests, each search was "carried on to the stage where the source document was retrieved or alternatively the searcher was unable to devise any further reasonable search programmes" [48]. Each query was judged to be a success or failure: a search was a success if the source document was retrieved, a failure if it was not. Swanson states: "The decision to measure retrieval success solely in terms of the source document was prompted by an understandable, though unfortunate, desire to determine whether any given document was or was not relevant to the question" [49]. Relevant documents other than source documents, which would have been retrieved during the search, were not taken into account. The success rate for all searches was found to be 78% [50]; source documents were successfully retrieved for most search queries.

Cleverdon's analysis of search failures was based on 329 documents and queries. The total number of search failures was 495 [51]. He classified the causes of search failures under four main headings: (1) question, (2) indexing, (3) searching, and (4) system. Each heading included further subdivisions to specify the exact cause(s) of each search failure. For example, questions could be "too detailed," "too general," "misleading," or just plain "incorrect." Likewise, insufficient, incorrect, or careless indexing; an insufficient number of entries; and a lack of cross references caused further search failures. Included under searching were "lack of understanding," "failure to use all concepts," "failure to search systematically," and "incorrect" or "insufficient searching." The lack of some features in indexing systems, such as synonymity and the inability to combine particular concepts, also caused search failures.

The number of failed searches under each subdivision is given in several tables. The reasons for failures in searches carried out by the project staff are as follows: questions, 17%; indexing process, 60%; searching, 17%; and indexing system, 6%. For searches performed by the technical staff (i.e., the end users), the percentage of failures attributed to searching was somewhat higher (37%).

+ Page 24 +

It appears that well over half of the failures in this study were caused by the indexing process. Cleverdon summarizes the results of the analysis of search failures as follows [52]: The analysis of failures . . .
shows most decisively that the failures were, for more than all other reasons together, due to mistakes by the indexers or searchers, and that a third of the failures could have been avoided if the project staff had indexed consistently, as well as they were capable of doing. Put another way, this means that in every hundred documents, the indexers failed to index adequately five documents, the failure usually consisting of the omission of some particular concept.

The second study (Cranfield II), conducted by Cleverdon and his colleagues, was an attempt to investigate the performance of indexing systems based on such factors as the exhaustivity of indexing and the level of specificity of the terms in the index language. The test collection consisted of some 1,400 research reports and periodical articles on the subject of aerodynamics and aircraft structures. Some 221 queries (all single-theme queries) were obtained from the authors of selected published papers. However, most tests were based on 42 queries and 200 documents [53].

Precision and recall were used to determine the retrieval effectiveness of the indexing systems. It is difficult to cite a single performance figure because the Cranfield II experiment involved a number of different index languages with a large number of variables. It was found that there exists an inverse relationship between recall and precision and that "the two factors which appear most likely to affect performance are the level of exhaustivity of indexing and the level of specificity of the terms in the index language" [54]. As noted in the preface to volume two of the report, a detailed intellectual analysis of the reasons for search failures was not carried out.

+ Page 25 +

4.1.2 Lancaster's MEDLARS Studies

The Cranfield projects tested retrieval effectiveness in a laboratory setting, and the size of the test collection was small (1,400 documents). By contrast, Lancaster studied the retrieval effectiveness of a large biomedical reference retrieval system (MEDLARS) in operation [55]. The MEDLARS (Medical Literature Analysis and Retrieval System) database contained some 700,000 records at that time. Some 300 "real life" queries were obtained from researchers and were used in the tests.

The retrieval effectiveness of the MEDLARS search service was measured using precision and recall. The precision ratio was calculated according to the definition given in Section 3.1. However, it would have been extremely difficult to calculate a true recall figure in a file of 700,000 records because this would have meant having the requester examine and judge each and every document in the collection. Lancaster explains how the recall figure was obtained:

We therefore estimated the MEDLARS recall figure on the basis of retrieval performance in relation to a number of documents, judged relevant by the requester, BUT FOUND BY MEANS OUTSIDE MEDLARS. These documents could be, for example,

1. documents known to the requester at the time of his request,
2. documents found by his local librarian in non-NLM [National Library of Medicine] generated tools,
3. documents found by NLM in non-NLM-generated tools,
4. documents found by some other information center, or
5. documents known by authors of papers referred to by the requester [56, (original emphasis)].

+ Page 26 +

Relevant documents identified by the requester for each query made up the "recall base" upon which the calculation of the recall figure was based. An example illustrates how recall was calculated.
The recall base consists of six documents that are known to the requester to be relevant before the search. Under these circumstances, if "only 4 are retrieved, we can say that the recall ratio for this search is 66%" [57]. Based on the results of 299 test searches, Lancaster found that the MEDLARS Search Service was operating with an average performance of 58% recall and 50% precision.

Lancaster also studied the search failures using precision and recall. He investigated recall failures by finding some relevant documents using sources other than MEDLARS and then checking to see if the relevant documents had also been retrieved during the experiment. If some relevant documents were missed, this was considered a recall failure and measured quantitatively. Precision failures were easier to detect since users were asked to judge the retrieved documents as being relevant or nonrelevant. If the user decided that some documents were nonrelevant, this was considered to be a precision failure and measured accordingly. However, identifying the causes of precision failures proved to be much more difficult because the user might have judged a document to be nonrelevant due to index, search, document, and other characteristics as well as the user's background and previous experience with the document.

To date, Lancaster's study is the most detailed analysis of the causes of search failures that has been attempted. As Lancaster points out:

The "hindsight" analysis of a search failure is the most challenging aspect of the evaluation process. It involves, for each "failure," an examination of the full text of the document; the indexing record for this document (i.e., the index terms assigned . . . ); the request statement; the search formulation upon which the search was conducted; the requester's completed assessment forms, particularly the reasons for articles being judged "of no value"; and any other information supplied by the requester. On the basis of all these records, a decision is made as to the prime cause or causes of the particular failure under review [58].

+ Page 27 +

Lancaster found that recall failures occurred in 238 of the 302 searches, while precision failures occurred in 278 of the 302 searches. More specifically, some 797 relevant documents were not retrieved. More than 3,000 documents that were retrieved were judged nonrelevant by the requesters. Lancaster's original research report contains statistics about search failures along with detailed explanations of their causes. Lancaster discovered that almost all of the failures could be attributed to problems with indexing, searching, the index language, and the user-system interface. For instance, the indexing subsystem in his research "contributed to 37% of the recall failures and . . . 13% of the precision failures" [59]. The searching subsystem, on the other hand, was "the greatest contributor to all the MEDLARS failures, being at least partly responsible for 35% of the recall failures and 32% of the precision failures" [60].

4.1.3 Blair and Maron's Full-Text Retrieval System Study

More recently, Blair and Maron [61] conducted a retrieval effectiveness test on a full-text document retrieval system. They utilized a database that "consisted of just under 40,000 documents, representing roughly 350,000 pages of hard-copy text, which were to be used in the defense of a large corporate law suit" [62]. The tests were based on some 51 queries obtained from two lawyers.
Precision and recall were used as performance measures in the Blair and Maron study. The precision ratio was straightforward to calculate (by dividing the total number of relevant documents retrieved by the total number of documents retrieved). Blair and Maron used a different method to calculate the recall ratio. The way they found unretrieved relevant documents (and thus studied recall failures) was as follows. They developed "sample frames consisting of subsets of the unretrieved database" that they believed to be "rich in relevant documents" and took random samples from these subsets. Taking samples from subsets of the database rather than the entire database was more advantageous from the methodological point of view "because, for most queries, the percentage of relevant documents in the database was less than 2 percent, making it almost impossible to have both manageable sample sizes and a high level of confidence in the resulting Recall estimates" [63]. + Page 28 + The results of Blair and Maron's tests showed that the mean precision ratio was 79% and the mean recall ratio was 20% [64]. Blair and Maron found that recall failures occurred much more frequently than one would expect: the system failed to retrieve, on the average, four out of five relevant documents in the database. They showed quite convincingly that high recall failures can result from free-text queries, where the user's terminology and that of the system do not match. Blair and Maron also observed that users involved in their retrieval effectiveness study believed that "they were retrieving 75 percent of the relevant documents when, in fact, they were only retrieving 20 percent" [65]. 4.1.4 Markey and Demeyer's Dewey Decimal Classification Online Project Markey and Demeyer studied the Dewey Decimal Classification (DDC) system "as an online searcher's tool for subject access, browsing, and display in an online catalog" [66]. Two online catalogs were employed in the study: "(1) DOC, or Dewey Online Catalog, in which the DDC had been implemented as an online searcher's tool for subject access, browsing, and display; and (2) SOC, or Subject Online Catalog, in which the DDC had not been implemented" [67]. They also conducted online retrieval performance tests using recall and precision measures to reveal problems with online catalogs and to identify their inadequacies. Precision was defined in their study as the proportion of unique relevant items retrieved and displayed. This definition of precision differs from the one given in Section 3.1 in that it takes into account only retrieved and displayed items (instead of all retrieved items) in the calculation of precision ratio. The researchers made no attempt to have users display and make relevance assessments about all the retrieved items in order to calculate the absolute precision ratio [68]. + Page 29 + Their estimated recall scores were also based on retrieved and displayed items only, not on all the relevant items in the collection. Understandably, they found it impractical to scan the entire database for every query to find all the relevant items in the collection. They used an estimated recall formula "that combined the relevant items retrieved and displayed in the SOC search for a query and the relevant items retrieved and displayed in the DOC search for the same query" [69]. 
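A minimal Python sketch may make this estimated recall calculation concrete. The sets of item identifiers below are hypothetical, standing for the unique relevant items retrieved and displayed for the same query in each catalog.

    # Estimated recall as described for the DDC Online Project: for a query,
    # the denominator is the union of unique relevant items retrieved and
    # displayed in either catalog. Item identifiers are hypothetical.

    def estimated_recall(relevant_displayed_here, relevant_displayed_other):
        pool = relevant_displayed_here | relevant_displayed_other
        return len(relevant_displayed_here) / len(pool)

    soc = {"r1", "r2", "r3", "r4"}   # relevant items retrieved and displayed in SOC
    doc = {"r3", "r4", "r5"}         # relevant items retrieved and displayed in DOC

    print(f"SOC: {estimated_recall(soc, doc):.0%}")   # 4 of 5 unique relevant items -> 80%
    print(f"DOC: {estimated_recall(doc, soc):.0%}")   # 3 of 5 unique relevant items -> 60%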
In order to find the estimated recall ratio for each search, the number of unique relevant items retrieved and displayed in one catalog was divided by the total number of unique relevant items retrieved and displayed for the same query in both catalogs. No attempt was made to find other potentially relevant items in the database. The estimated recall scores in the study ranged from a low of 44% to a high of 75%. They found that "searches were likely to retrieve and display a large proportion of relevant items that were unique . . . for the same topic in SOC and DOC" even though DOC's estimated recall was lower than that of SOC [70]. They also asked users if they were satisfied with the search results, and "the majority of patrons expressed satisfaction with the search in the system yielding higher estimated recall" [71].

The average precision scores ranged from a low of 26% to a high of 65% [72]. Considering that only a fraction of the items retrieved in the searches were actually displayed, the authors noted that precision was affected by the order in which retrieved items were displayed. They found precision to be a less reliable criterion with which to measure the performance of an online catalog [73]. They asked users which system gave more satisfactory results for their searches and compared users' responses with the precision scores. They concluded that "there was no relationship between patrons' search satisfaction and the precision of their online searches" [74].

Markey and Demeyer also analyzed a total of 680 subject searches as part of the DDC Online Project and found that 34 of them (5%) failed. Two major reasons for subject search failures were identified as follows: (1) the topic was marginal (35%), and (2) the users' vocabulary did not match subject headings (24%) [75]. Their research report gives a detailed account of the failure analysis of different subject searching options in an online catalog enhanced with a classification system (DDC) [76].

+ Page 30 +

Markey and Demeyer apparently did not count "zero retrievals" as search failures. Nor did they include in their analysis partial search failures that retrieved at least some relevant documents. Presumably, this is why the number of search failures they analyzed was relatively low.

4.2 Studies Utilizing User Satisfaction Measures

It was noted earlier (Section 3.2.2) that analyzing search failures utilizing user satisfaction measures is extremely complicated. Few researchers have attempted to look at search failures in light of user satisfaction.

Hilchey and Hurych analyzed 153 online search evaluation forms returned by users in a university library [77]. Almost half of the respondents (47%) found the search results "most relevant." An additional 32% of the respondents graded the results as "half relevant." Only 6% found all search results relevant. In short, 85% of the respondents felt that search results were at least half relevant. It should be noted that the return rate in this study was about 10%. Although the authors claim that the return rate was "unprejudiced in any way," the returned questionnaire forms may have come primarily from satisfied users.

Ankeny reviewed the studies reporting user satisfaction with end-user search services such as MEDLINE and BRS/After Dark [78]. Most end-users seemed to be satisfied with the online search services. Ankeny also reported the results of two studies that he conducted.
In the first study, he surveyed 190 end-users and found that 78% of the users located what they wanted in two business databases (DIALOG Business Connection and Dow Jones News/Retrieval). More than 81% of the users rated the services favorably by giving "an overall rating of 4 or 5 on the five-point scale" [79]. + Page 31 + In the second study, Ankeny surveyed some 600 end-users. He used a stricter measure of search success that had a reliability coefficient of .90. Search success was not measured on a five-point scale in the second study. Rather, in order for a search to be qualified as successful, the user had to answer three questions that affirmed that the user was fully satisfied with the search, found exactly what was desired, and was not dissatisfied in any way. He states: "Of the 600 searches in the sample, 233 met all three criteria for complete success and 367 were less than successful, yielding an overall success rate of 38.8 percent" [80]. Reported reasons for dissatisfaction in 367 "less-than-successful" searches were as follows: system problems; amount, relevancy, or level of the information retrieved; lack of better printed instructions; and lack of more informed and accommodating staff. Kirby and Miller analyzed search failures encountered by MEDLINE end-users employing the Colleague search software [81]. In order to find the search successes and failures, end-users compared their search results with the mediated follow-up search results. "Successful" and "incomplete" end-user searches were identified as follows: "Successful" Colleague searches were those for which the follow-up search added nothing important, as indicated by one of two questionnaire responses: "My search gave satisfactory results, and nothing ESSENTIAL was added by the second search" . . . or "Neither search provided satisfactory results." Both responses were regarded as "successful" in that the end user was no less successful in meeting the information need than the trained search analyst. "Incomplete" Colleague searches were those which had missed important articles, according to end user questionnaire responses after reviewing the follow-up search results" [82, (original emphasis)]. However, end-users were not asked to judge each record retrieved by either search. Rather, "the comparison was based on search terms and combinations recorded on the follow-up search form, and on the number of citations printed in the follow-up search" [83]. Kirby and Miller examined 52 searches. Of the 52 searches, 31 were "incomplete." The major cause of search failures (67.7%) was the search strategy. The rest of the search failures were due to system mechanics and database selection (22.6% and 9.7%, respectively). + Page 32 + 4.3 Studies Utilizing Transaction Logs Several researchers have used transaction logs to study search failures in online catalogs. Dickson [84] studied a sample of "zero-hit" author and title searches using the transaction log of Northwestern University Library's online catalog and analyzed why the searches failed. She found out that about 23% of author searches and 37% of title searches retrieved nothing. Misspellings and mistakes in the search formulation were the major causes of zero-hit searches. Jones [85] examined transaction logs of the Okapi online catalog and identified several unsatisfactory areas in the operation of Okapi due to, among others, spelling errors, failures in subject searching, and user-system interface problems. 
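Studies of this kind typically begin by isolating the searches that retrieved nothing. A minimal Python sketch of that first step is given below, before returning to the Okapi findings; the log format (search type, query string, hit count) and the sample entries are hypothetical.

    # Pulling "zero-hit" searches out of a transaction log for later analysis.
    # The log format and entries below are hypothetical; real logs differ.

    log = [
        ("title",   "factors determining the performance of indexing systems", 1),
        ("author",  "clevrdon, cyril", 0),        # misspelled author name
        ("subject", "aircraft structurs", 0),     # misspelled subject term
        ("subject", "information retrieval", 42),
    ]

    zero_hits = [(kind, query) for kind, query, hits in log if hits == 0]

    for kind, query in zero_hits:
        print(f"failed {kind} search: {query!r}")

    # Each failed query would then be re-entered or inspected to decide whether
    # the cause was a misspelling, a vocabulary mismatch, or a collection
    # failure (the item is simply not in the database).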
He analyzed some 300 subject searches performed on Okapi and found that 25% of them failed: "Using relevance assessments based on a display of the first ten records, the experimenter decided that 62.4% of searches were almost certainly successful, 13% may have been successful, 4.5% were collection failures and 25% failed absolutely" [86]. In a follow-up study, it was found that 17 out of 122 sessions (or 13.9%) failed in Okapi (including 2 sessions that failed due to the collection not containing relevant items). (Most sessions contained more than one search.) In 7 sessions, the users' vocabulary did not match that of the catalog (e.g., "sociology of shopping"). Another 4 sessions failed because the topics expressed by the users were too specific (e.g., "textile industry input-output tables"). Two searches failed because they did not describe the users' needs (e.g., one user entered his query simply as "sterling" although the interviewer found that he was actually looking for "economics--sterling shares and gold") [87].

The most recent Okapi report states that "the proportion of (non-aborted) searches which failed to retrieve any records is very low indeed (3.9% overall)" [88]. The authors of the report claim that the improvement is primarily due to (1) Okapi's "best match" search and (2) stemming and automatic cross-referencing [89].

+ Page 33 +

Peters [90] analyzed the transaction logs of a union online catalog (the University of Missouri Information Network) and found that 40% of the searches in that catalog produced zero hits. He classified the causes of search failures under 14 different groups, including typographical and spelling errors (10.9% and 9.9%, respectively) and the search system itself (9.7%). Approximately 40% of the failures were collection failures (i.e., the item sought was not in the database). However, it should be noted that Peters' study was not based on a rigorous analysis of zero-hit searches by re-entering queries to determine the exact causes of failures. Rather, "the analyzers made intelligent guesses . . . of the probable causes" [91].

Hunter [92] analyzed thirteen hours of transaction logs, amounting to some 3,700 searches performed in a large academic library online catalog. She used the same classification schema as Peters and categorized the causes of search failures under 18 different groups. The overall search failure rate in Hunter's study was found to be 54.2%. The major causes of search failures were identified as the controlled vocabulary in subject searching (29%), the system itself (18%), and typographical errors (15%). However, the study did not explain in detail what sorts of controlled vocabulary failures occurred or what their specific causes were.

C. Walker and her colleagues [93] obtained similar results when they studied the problems encountered by clinical end-users of MEDLINE and GRATEFUL MED. They defined search failure, which they called an "unproductive search," as "one that did not retrieve any citations," and they analyzed 172 such searches [94]. They found that 48% of the search failures occurred because of some flaw in the search strategy. The software in use was responsible for 41% of the search failures. System failures constituted some 11% of all search failures.

Zink [95] analyzed transaction logs of 6,118 searches that took place on the WolfPAC online catalog at the University of Nevada. He found that:

+ Page 34 +

more than one of every four (27.81 percent or 1,702) failed to retrieve at least one bibliographical record.
Subject searches yielded 667 unsuccessful searches, or 39.19 percent of the total number of unsuccessful searches. Author searches resulted in 250 unsuccessful searches (14.69 percent of the total). Searches by all other criteria accounted for 300 unsuccessful searches (17.63 percent of the total) [96].

Collection failures (57.60%), misspellings (18%), and placing the first name "improperly" before the last name (15.20%) caused most of the author search failures. Similar failure rates were also observed for the title searches (collection failures, 61.86%, and misspellings, 14.23%). In 111 unsuccessful title searches (22.89%), searchers seemed to be attempting to find subject or author information. Sixty-three percent of the subject searches failed because the user-entered subject words were not "legitimate" Library of Congress subject headings. Misspellings and collection failures accounted for 23.24% and 10.64% of all subject search failures.

Most of the studies summarized above benefitted from transaction monitoring to the extent that "zero-hit" searches were identified from transaction logs [97]. Researchers examined the zero-hit searches in order to find out why a particular search query failed to retrieve anything in the database. Unlike Lancaster [98], they did not attempt to identify the causes of recall and precision failures.

4.4 Studies Utilizing the Critical Incident Technique

It was mentioned earlier (Section 3.2.4) that Wilson, Starr-Schneidkraut, and Cooper studied searching in MEDLINE using the critical incident technique [99]. The researchers first devised a sampling strategy and developed an interview protocol to elicit the desired information from the subjects. They then developed three "frames of reference" to analyze the interview data: "(1) 'Why was the information needed?,' (2) 'How did the information obtained impact the decision-making of the individual who needed the information?,' and (3) 'How did the information obtained impact the outcome of the clinical or other situation that occasioned the search?'" [100]. After a qualitative analysis of the critical incident reports, the frames of reference were used to create three similar taxonomies.

+ Page 35 +

In the same study, they asked users to explain what they needed the information for and whether they were satisfied with the search outcome. They used incident forms to record the user's account of why a particular search failed or succeeded and, with permission, they tape-recorded the user's comments. They later tried to match these "incident reports" against MEDLINE transaction log records for each search in order to find out the actual reasons for search failures and successes. They examined some 26 user-designated ineffective incident reports in order to "characterize the nature of the ineffective searches, analyze the relationship between what the user said and what the transaction log said happened during the search, and ascertain, by performing an analogous MEDLINE search, whether a search could have been performed which would have met the user's objective" [101]. Most ineffective searches (23 out of 26) were identified as such because the users "could not find what they were looking for and/or could not find relevant materials." An appendix summarizing the analysis of each ineffective search accompanied their research report.

After extensive examination of interview transcripts and transaction logs for ineffective searches, the researchers concluded that users did not appear to comprehend:
1. How to do subject searching.

2. How MeSH [Medical Subject Headings] works.

3. How they can apply that understanding to map their search requests into a vocabulary that is likely to retrieve considerably more relevant materials [102].

It appears that the critical incident technique can be used successfully in the analysis of search failures in online catalogs as well. Matching incident reports against transaction logs is especially promising. Since the analyst will, through incident reports, gather contextual data for each search query, more informed relevance judgments can be made. Furthermore, the technique can also be used to compare user-designated search effectiveness with that obtained through traditional retrieval effectiveness measures.

+ Page 36 +

4.5 Other Search Failure Studies

Some experimental studies have looked at strict matching failures that occurred when users performed catalog searches. Gouke and Pease [103] analyzed users' success rates in matching titles and found that the success rate in finding "nonproblem" titles was 82%, whereas the rate was 48% for "problem" titles. Almost half of the users failed to match simple titles in the online catalog for various reasons (e.g., titles appearing as subjects, hyphenated words, words on the stoplist, foreign titles, and abbreviations).

Alzofon and Van Pulis [104] surveyed 430 users of the LCS online catalog of the Ohio State University Libraries to identify patterns of searching. They also studied the success rates for known-item and subject searches. They replicated the users' searches on the catalog and found that the author-title search had a success rate of 85%, compared with 77% for author searches and 68% for subject searches.

Janosky, Smith, and Hildreth [105] studied the errors that users made in performing searches in the LCS online catalog of the Ohio State University Libraries. They recruited 30 volunteer students who had no prior experience with the online catalog under investigation. Each student searched four queries in the catalog. (The queries were the same for all students.) They performed one subject search and three known-item searches. The authors summarize the procedure and results as follows:

They [users] were asked to search until they either found the item(s) in question or believed that the item(s) was not present in the library system. They were told that it was possible that the item in question was not contained in the library. While searching, subjects were asked to think aloud . . . . A success rate was computed for each search. Since all search items were actually in the library system (subjects were not told this fact), "success" is defined as correctly locating the information requested about an item . . . . For the four searches, the success rate ranged from a high of 58% to a low of 0% [106].

+ Page 37 +

It appears that users experienced serious problems with the mechanical aspects of searching in this catalog, which in turn influenced the success rate considerably. For instance, "HELP-AUTHOR" was the "correct" help command, and users who entered "HELP AUTHOR" failed to get any help about author searches (notice the hyphen between the two words). On-screen and offline instructions in this system that advised users to type in commands "exactly as listed" did not seem to do much to help users recover from such search failures. A more forgiving user interface would have easily prevented similar failures from occurring in the first place.
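As a rough illustration of what such forgiveness might look like, the short Python sketch below accepts a help command whether it is typed with a hyphen, a space, or different capitalization. The command names and the normalization rule are hypothetical and are not those of LCS.

    # A forgiving command parser: hyphens, extra spaces, and case differences
    # no longer cause the request for help to fail. Commands are illustrative.

    HELP_TOPICS = {"AUTHOR", "TITLE", "SUBJECT"}

    def parse_help_command(user_input):
        tokens = user_input.upper().replace("-", " ").split()
        if tokens and tokens[0] == "HELP" and " ".join(tokens[1:]) in HELP_TOPICS:
            return " ".join(tokens[1:])
        return None

    for attempt in ("HELP-AUTHOR", "HELP AUTHOR", "help author"):
        print(attempt, "->", parse_help_command(attempt))
    # All three reach the author help screen instead of producing an error.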
The authors concluded: "It is not sufficient to simply tell users that they have made an error. Failures to deal with the causes of an error often snowballed into a whole string of misinterpretations, resulting in complete failures to solve the problem of using LCS" [107].

4.6 Related Studies

A few studies that were not directly concerned with the causes of search failures, but that nevertheless addressed relevant issues, are summarized below.

Hildreth considers the "vocabulary" problem to be the major retrieval problem in today's online catalogs and asserts that "no other issue is as central to retrieval performance and user satisfaction" [108]. This may be because controlled vocabularies are too complicated for users to grasp in a short period of time. Several researchers have found that the lack of knowledge concerning the Library of Congress Subject Headings (LCSH) is one of the most important reasons why searches fail in online catalogs [109].

Larson [110] found that almost half of all subject searches in MELVYL retrieved nothing. More recently, Larson [111] analyzed the use of MELVYL over a longer period of time (six years) and found a significant positive correlation between the failure rate and the percentage of subject searching. This confirms the findings of an earlier formal analysis of factors contributing to success and satisfaction: "problems with subject searching were the most important deterrents to user satisfaction" [112].

+ Page 38 +

Larson [113] reviewed the literature on subject search failures in online catalogs along with remedies offered to reduce subject search problems. Subject retrieval failures in online catalogs could be reduced in a number of ways, including assigning more subject headings to bibliographic records, providing keyword searching, and enhancing classification retrieval.

Carlyle studied the match between users' vocabulary and LCSH using transaction logs and found that "single LCSH headings match user expressions exactly about 47% of the time" [114]. A study conducted by Van Pulis and Ludy [115] showed that 53% of the users' terms matched subject headings in the online catalog. Vizine-Goetz and Markey Drabenstott extracted queries from transaction logs of three online catalogs (SULIRS, ORION, and LS/2000) and analyzed them "both by computer and manually to determine the extent to which they matched subject headings" [116]. They found that fewer than half of the subject query terms exactly matched the Library of Congress subject headings. These findings suggest that some search failures can be attributed to controlled vocabularies in online catalogs. However, as the authors note, "such analyses . . . reveal little about whether matching terms satisfactorily represent users' topics of interest" [117].

5.0 Conclusion

It appears that there is no agreed-upon definition of what constitutes a search failure in document retrieval systems. In part, this is due to the multiplicity of data-gathering tools and techniques used in the analysis of search failures (e.g., the critical incident technique, controlled experiments, interviews, questionnaires, talk-aloud techniques, and transaction monitoring). Different data-gathering methods have different strengths and weaknesses.

+ Page 39 +

Many of the studies reviewed in this paper examined search failures based on zero retrievals in online catalogs. Partial search failures have been studied much less frequently.
Experiments that investigate the relationship between search failures and user needs or characteristics are even scarcer. This is not surprising because identifying zero retrievals from transaction logs is relatively easy and inexpensive. By contrast, analyzing search failures using precision and recall measures is more expensive and time-consuming. So is the investigation of user needs and interests, which could help researchers make more informed judgments about search failures identified through other means. No single method or technique is sufficient by itself to analyze all search failures in document retrieval systems and to interpret the findings.

As for the causes of search failures, transaction logs of searches that retrieved nothing in online catalogs reveal that users have numerous mechanical problems, such as improperly keyed commands and misspelled words. Such problems can be alleviated to a certain extent by designing more intuitive user interfaces that not only take into account user expertise and task complexity but also give advice and simplify the user's task [118]. Newer online catalogs are dealing with these problems by incorporating more sophisticated stemming algorithms and Soundex-type techniques to correct misspellings.

Transaction log analysis also reveals that users' lack of knowledge of controlled vocabularies and query languages causes many search failures and, in turn, user frustration. Most users are not aware of the role of controlled vocabularies in document retrieval systems. They do not seem to understand the structure of rigid indexing and query languages. Consequently, their search query terms, which are expressed in their own words, often fail to match the titles and subject headings of the documents, causing search failures. "Brittle" query languages based on Boolean logic tend to exacerbate this situation, especially for complicated search queries.

+ Page 40 +

Transaction monitoring is the most appropriate technique for studying search failures when the cause(s) of the failures are obvious (e.g., zero retrievals due to misspellings or collection failures). However, transaction monitoring seems to be less effective in dealing with more complicated failures. For example, partial failures can best be studied with the help of the user. After all, the user is the key person in the analysis of search failures. It is the user who can explain what he or she was trying to do and whether it was successful. Such input from the user puts each search into perspective and provides much-needed contextual information. However, users are not identified in most transaction log studies. Without user feedback, researchers are faced with the unenviable task of coming up with a rational explanation as to why a particular search failed.

Notwithstanding the circumstantial evidence gathered through various online catalog studies in the past, studies examining the match between users' vocabulary and that of online document retrieval systems are scarce. Moreover, the probable effects of mismatching on search failures are yet to be fully explored. Users prefer to be able to express their information needs in natural language, but most contemporary online catalogs cannot accommodate search requests submitted in natural language form. However, it is believed that natural language query interfaces may reduce search failures in document retrieval systems.
Natural language search terms will more likely match the titles of the documents in the database. Consequently, the role of natural language interfaces in reducing search failures in document retrieval systems needs to be thoroughly studied. User input should be sought when analyzing search failures with retrieval effectiveness measures such as precision and recall. The same can be said for failure analysis studies that are based on user satisfaction measures. We should strive for full-scale user involvement as much as possible in every stage of analysis of search failures. Despite user participation in the evaluation process, search failures in document retrieval systems are unlikely to be eliminated altogether. However, only through user participation will we find the real causes of search failures and, consequently, build better document retrieval systems. + Page 41 + Notes 1. M. E. Maron, "Probabilistic Retrieval Models," in Progress in Communication Sciences, vol. 5, ed. Brenda Dervin and Melvin J. Voigt. (Norwood, NJ: Ablex, 1984), 145-176. 2. Ibid., 155. 3. Ibid. 4. C. J. Van Rijsbergen, Information Retrieval, 2nd ed. (London: Butterworths, 1979), 10. 5. David C. Blair and M. E. Maron, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," Communications of the ACM 28 (March 1985): 291. 6. S. E. Robertson, M. E. Maron, and W .S. Cooper, "Probability of Relevance: A Unification of Two Competing Models for Document Retrieval," Information Technology: Research and Development 1 (1982): 1. 7. Tefko Saracevic, "Relevance: A Review of and a Framework for the Thinking on the Notion in Information Science," Journal of the American Society for Information Science 26 (1975): 321-343. See also: Michael Eisenberg and Linda Schamber, "Relevance: The Search for a Definition," in ASIS '88: Proceedings of the 51st ASIS Annual Meeting, Atlanta, Georgia, October 23-27, 1988, ed. Christine L. Borgman and Edward Y .H. Pai. (Medford, NJ: Learned Information, 1988), 164-168. 8. An attempt has been made in Cranfield II to plot recall/fallout graphs. The size of the collection used in this experiment was relatively small (1,400 documents) and many tests were done with 200 documents. Nevertheless, no analysis has been performed to find out the causes of fallout failures. For details, see: Cyril Cleverdon, Jack Mills, and Michael Keen, Factors Determining the Performance of Indexing Systems, Volume 1, Design (Cranfield, England: Aslib, 1966); and Cyril Cleverdon and Michael Keen, Factors Determining the Performance of Indexing Systems, Volume 2, Test Results (Cranfield, England: Aslib, 1966). + Page 42 + 9. Ray R. Larson, "Between Scylla and Charybdis: Subject Searching in the Online Catalog," in Advances in Librarianship, vol. 15, ed. Irene P. Godden (San Diego, CA: Academic Press, 1991), 188. See also: S. E. Wiberley and R. A. Dougherty, "Users' Persistence in Scanning Lists of References," College & Research Libraries 49 (1988): 149-156. 10. J. L. Kuhns implied that frustration usually occurs when a user reaches his or her "futility point" in a given search. The futility point is defined as "the number of retrieved documents the inquirer is willing to browse through before giving up his search in frustration." Source: David C. Blair, "Searching Biases in Large Interactive Document Retrieval Systems," Journal of the American Society for Information Science 31 (July 1980): 271. 11. Michael Buckland and Fredric Gey, personal communication, 1991. 12. 
Robert Wages, "Can Easy Searching be Good Searching? A Model for Easy Searching," Online 13 (May 1989): 80. 13. William S. Cooper, "On Selecting a Measure of Retrieval Effectiveness," Journal of the American Society for Information Science 24 (1973): 87-100, 413-424. Compare this with: Dagobert Soergel, "Is User Satisfaction a Hobgoblin?," Journal of the American Society for Information Science 27 (July-August 1976): 256-259. 14. Ibid., 88. 15. Judith A. Tessier, Wayne W. Crouch, and Pauline Atherton, "New Measures of User Satisfaction with Computer-Based Literature Searches," Special Libraries 68 (November 1977): 383-389. 16. Marcia J. Bates, "Factors Affecting Subject Catalog Search Success," Journal of the American Society for Information Science 28 (May 1977): 161-169. 17. Mark T. Kinnucan, "The Size of Retrieval Sets," Journal of the American Society for Information Science 43 (January 1992): 73. + Page 43 + 18. Susan E. Hilchey and Jitka M. Hurych, "User Satisfaction or User Acceptance? Statistical Evaluation of an Online Reference service," RQ 24 (Summer 1985): 455. 19. Renata Tagliacozzo, "Estimating the Satisfaction of Information Users," Bulletin of the Medical Library Association 65 (April 1977): 248. 20. Ibid. 21. Melvon L. Ankeny, "Evaluating End-User Services: Success or Satisfaction," Journal of Academic Librarianship 16 (January 1991): 356. 22. Ibid., 354. See also: Ethel Auster and Stephen B. Lawton, "Search Interview Techniques and Information Gain as Antecedents of user satisfaction with Online Bibliographic Retrieval," Journal of the American Society for Information Science 35 (March 1984): 90-103. 23. Sandra R. Wilson, Norma Starr-Schneidkraut, and Michael D. Cooper, Use of the Critical Incident Technique to Evaluate the Impact of MEDLINE. (Palo Alto, CA: American Institutes for Research, 1989), AIR-64600-9/89-FR. For hypothetical examples as to the importance of unretrieved but relevant documents, see: Soergel, "Is User Satisfaction a Hobgoblin?," 258-259. 24. Ankeny, "Evaluating End-User Services," 356. 25. Debora Cheney, "Evaluation-Based Training: Improving the Quality of End-User Searching," Journal of Academic Librarianship 17 (July 1991): 155. 26. Tefko Saracevic and Paul Kantor, "A Study of Information Seeking and Retrieving. II. Users, Questions, and Effectiveness," Journal of the American Society for Information Science 39 (May 1988): 177-196. + Page 44 + 27. Tefko Saracevic, Paul Kantor, Alice Y. Chamis, and Donna Trivison, "A Study of Information Seeking and Retrieving. I. Background and Methodology," Journal of the American Society for Information Science 39 (May 1988): 161-176. Note that it is not discussed in this paper how they calculated the precision/recall ratios and what figures (i.e., number of records (a) retrieved, (b) relevant, (c) not relevant) they obtained. As they stressed several times in their report, the recall figures they obtained were not absolute but comparative. For a more detailed account, see Part II of their article. 28. Saracevic and Kantor, "A Study of Information Seeking and Retrieving. Part II," 193. 29. Ibid. 30. Ray R. Larson, "The Decline of Subject Searching: Long Term Trends and Patterns of Index Use in an Online Catalog," Journal of American Society for Information Science 42 (April 1991): 198. 31. Charles W. Simpson, "OPAC Transaction Log Analysis: The First Decade," in Advances in Library Automation and Networking, vol. 3, ed. Joe A. Hewitt (Greenwich, Conn.: JAI Press, 1989), 35-67. 32. J. 
Dickson, "Analysis of User Errors in Searching an Online Catalog," Cataloging & Classification Quarterly 4 (Spring 1984): 19-38; Thomas A. Peters, "When Smart People Fail: An Analysis of the Transaction Log of an Online Public Access Catalog," Journal of Academic Librarianship 15 (November 1989): 267-273; Rhonda N. Hunter, "Successes and Failures of Patrons Searching the Online Catalog at a Large Academic Library: A Transaction Log Analysis," RQ 30 (Spring 1991): 395-402; and Steven D. Zink, "Monitoring User Search Success through Transaction Log Analysis: the WolfPac Example," Reference Services Review 19 (1991): 49-56. + Page 45 + 33. Martha Kirby and Naomi Miller, "MEDLINE Searching on Colleague: Reasons for Failure or Success of Untrained End Users," Medical Reference Services Quarterly 5 (1986): 17-34; and Cynthia J. Walker et al., "Problems Encountered by Clinical End Users of MEDLINE and GRATEFUL MED," Bulletin of the Medical Library Association 79 (January 1991): 67-69. 34. Hunter, "Successes and Failures," 401. 35. Stephen Walker and Micheline Hancock-Beaulieu, Okapi at City: An Evaluation Facility for Interactive Information Retrieval (London: The British Library, 1991), British Library Research Report 6056; and Ray R. Larson, "Classification Clustering, Probabilistic Information Retrieval and the Online Catalog," Library Quarterly 61 (April 1991): 133-173. 36. Stephen Walker and Richard M. Jones, Improving Subject Retrieval in Online Catalogues, 1: Stemming, Automatic Spelling Correction and Cross-Reference Tables (London: The British Library, 1987), 139, British Library Research Paper 24. See also: R. Jones, "Improving Okapi: Transaction Log Analysis of Failed Searches in an Online Catalogue," Vine no. 62 (1986): 3-13. 37. Larson, "The Decline of Subject Searching," 198. 38. Sharon Seymour, "Online Public Access Catalog User Studies: A Review of Research Methodologies, March 1986-November 1989," Library and Information Science Research 13 (1991): 97. 39. Micheline Hancock-Beaulieu, Stephen Robertson and Colin Neilson, "Evaluation of Online Catalogues: Eliciting Information from the User," Information Processing & Management 27 (1991): 532. 40. John C. Flanagan, "The Critical Incident Technique," Psychological Bulletin 51 (1954): 327. 41. Wilson, Starr-Schneidkraut and Cooper, Use of the Critical Incident Technique to Evaluate the Impact of MEDLINE, 2. + Page 46 + 42. Sammy R. Alzofon and Noelle Van Pulis, "Patterns of Searching and Success Rates in an Online Public Access Catalog," College & Research Libraries 45 (March 1984): 110-115; Marcia J. Bates, "Subject Access in Online Catalogs: a Design Model," Journal of American Society for Information Science 37 (1986): 357-376; Christine L. Borgman, "Why are Online Catalogs Hard to Use? Lessons Learned from Information-Retrieval Studies," Journal of American Society for Information Science 37 (1986): 387-400; Pauline A. Cochrane and Karen Markey, "Catalog Use Studies Since the Introduction of Online Interactive Catalogs: Impact on Design for Subject Access," Library and Information Science Research 5 (1983): 337-363; Mary Noel Gouke and Sue Pease, "Title Searches in an Online Catalog and a Card Catalog: A Comparative Study of Patron Success in Two Libraries," Journal of Academic Librarianship 8 (July 1982): 137-143; Charles R. 
Hildreth, Intelligent Interfaces and Retrieval Methods for Subject Searching in Bibliographic Retrieval Systems (Washington, DC: Cataloging Distribution Service, Library of Congress, 1989); Beverly Janosky, Philip J. Smith, and Charles Hildreth, "Online Library Catalog Systems: An Analysis of User Errors," International Journal of Man-Machine Studies 25 (1986): 573-592; Neal N. Kaske, A Comprehensive Study of Online Public Access Catalogs: an Overview and Application of Findings (Dublin, OH: OCLC, 1983), OCLC Research Report # OCLC/OPR/RR-83- 4; Cheryl Kern-Simirenko, "OPAC User Logs: Implications for Bibliographic Instruction," Library Hi Tech 1 (1983): 27-35; Ray R. Larson, "Workload Characteristics and Computer System Utilization in Online Library Catalogs" (Ph.D. diss., University of California at Berkeley, 1986); Gary S. Lawrence, V. Graham, and H. Presley, "University of California Users Look at MELVYL: Results of a Survey of Users of the University of California Prototype Online Union Catalog," Advances in Library Administration 3 (1984): 85-208; Karen Markey, Subject Searching in Library Catalogs: Before and After the Introduction of Online Catalogs (Dublin, OH: OCLC, 1984); Karen Markey, "Users and the Online Catalog: Subject Access Problems," in The Impact of Online Catalogs, ed. J.R. Matthews. (New York: Neal-Schuman, 1986), 35-69; Joseph K. Matthews, A Study of Six Public Access Catalogs: a Final Report Submitted to the Council on Library Resources, Inc. (Grass Valley, CA: J. Matthews and Assoc., Inc., 1982); Joseph Matthews, Gary S. Lawrence, and Douglas Ferguson, eds., Using Online Catalogs: a Nationwide Survey. (New York: Neal-Schuman, 1983); and Chih Wang, "The Online Catalogue, Subject Access and User Reactions: A Review," Library Review 34 (1985): 143-152. + Page 47 + 43. Examples of such studies are (in chronological order): Cyril W. Cleverdon, Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems (Cranfield, England: Aslib, 1962); Cleverdon, Mills and Keen, Factors Determining the Performance of Indexing Systems, Volume 1, Design; Cleverdon and Keen, Factors Determining the Performance of Indexing Systems, Volume 2, Test Results; F. W. Lancaster, Evaluation of the MEDLARS Demand Search Service. (Washington, DC: US Department of Health, Education and Welfare, 1968); F. W. Lancaster, "MEDLARS: Report on the Evaluation of Its Operating Efficiency," American Documentation 20 (1969): 119-142; Dickson, "Analysis of User Errors in Searching an Online Catalog"; Blair and Maron, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System"; Jones, "Improving Okapi: Transaction Log Analysis of Failed Searches in an Online Catalogue"; Karen Markey and Anh N. Demeyer, Dewey Decimal Classification Online Project: Evaluation of a Library Schedule and Index Integrated into the Subject Searching Capabilities of an Online Catalog (Dublin, OH: OCLC, 1986), Report Number: OCLC/OPR/RR-86-1; Kirby and Miller, "MEDLINE Searching on Colleague"; S. Walker and Jones, Improving Subject Retrieval in Online Catalogues; Wilson, Starr-Schneidkraut, and Cooper, Use of the Critical Incident Technique to Evaluate the Impact of MEDLINE; Simone Klugman, "Failures in Subject Retrieval," Cataloging & Classification Quarterly 10 (1989): 9- 35; Peters, "When Smart People Fail"; Ankeny, "Evaluating End- User Services: Success or Satisfaction"; Hunter, "Successes and Failures"; C. 
Walker et al., "Problems Encountered by Clinical End Users of MEDLINE and GRATEFUL MED"; and Zink, "Monitoring User Search Success through Transaction Log Analysis: the WolfPac Example." 44. Cleverdon, Report on the Testing and Analysis; Cleverdon, Mills and Keen, Factors Determining the Performance of Indexing Systems, Volume 1, Design; and Cleverdon and Keen, Factors Determining the Performance of Indexing Systems, Volume 2, Test Results. 45. Cleverdon, Report on the Testing and Analysis, 1. 46. Ibid., 8-9. + Page 48 + 47. Ibid., 89. The design and findings of the Cranfield I experiment have been criticized by many authors. For example, see: Don R. Swanson, "The Evidence Underlying the Cranfield Results," Library Quarterly 35 (1965): 1-20. For a review of the Cranfield tests, see: Karen Sparck Jones, "The Cranfield Tests," in Information Retrieval Experiment, ed. Karen Sparck Jones (London: Butterworths, 1981), 256-284. 48. Ibid., 11. 49. Swanson, "The Evidence Underlying the Cranfield Results," 5. 50. This percentage was obtained by averaging the figures given in the fifth column of Table 3.1 of Cleverdon, Report on the Testing and Analysis, 22. 51. This summary is based on Cleverdon, Report on the Testing and Analysis, Chapter 5. The report also includes the complete summary of the analysis of search failures (Appendix 5A) and "some examples of the complete analysis of the individual documents" (Appendix 5B). 52. Ibid., 88. 53. Cleverdon, Mills, and Keen, Factors Determining the Performance of Indexing Systems, Volume 1, Design; and Cleverdon and Keen, Factors Determining the Performance of Indexing Systems, Volume 2, Test Results. 54. Cleverdon and Keen, Factors Determining the Performance of Indexing Systems, Volume 2, Test Results, i ("Summary"). For the detailed performance figures along with recall/precision graphs, see volume 2 of the full report. 55. Lancaster, Evaluation of the MEDLARS Demand Search Service. 56. Ibid., 16, 19. 57. Ibid., 19-20. 58. Lancaster, "MEDLARS: Report on the Evaluation of Its Operating Efficiency," 123. + Page 49 + 59. Ibid., 127. 60. Ibid., 131. 61. Blair and Maron, "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System." 62. Ibid., 290-291. 63. Ibid., 291-293. 64. Ibid., 293. 65. Ibid., 295. 66. Markey and Demeyer, Dewey Decimal Classification Online Project, 1. 67. Ibid., 109. 68. Ibid., 162. 69. Ibid., 144. 70. Ibid., 146. 71. Ibid., 149. 72. Ibid., 165, Table 42. 73. Ibid., 162. 74. Ibid., 166. 75. Ibid., 182. 76. Ibid.; especially, see Chapter 8, 173-291. 77. Hilchey and Hurych, "User Satisfaction or User Acceptance?" 78. Ankeny, "Evaluating End-User Services," 352-354. 79. Ibid., 354. + Page 50 + 80. Ibid. 81. Kirby and Miller, "MEDLINE Searching on Colleague." 82. Ibid., 20. 83. Ibid. 84. Dickson, "Analysis of User Errors in Searching an Online Catalog," 26. 85. Jones, "Improving Okapi: Transaction Log Analysis of Failed Searches in an Online Catalogue." 86. Ibid., 7-8. 87. S. Walker and Jones, Improving Subject Retrieval in Online Catalogues, 117-119. 88. S. Walker and Hancock-Beaulieu, Okapi at City, 30. The authors also surveyed the users to find out if they were satisfied with their search results using a five-point satisfaction scale. 
Ninety-five out of a total of 120 users (or 80%) indicated that they were satisfied with the search outcome (they marked 4 or 5 on the scale), 19 users (or 16%) "had some reservations" (i.e., they marked 3 on the scale), and 6 users (or 4%) "were negative" (i.e., they marked 1 or 2). Ibid., 24-25. 89. Ibid., 31. 90. Peters, "When Smart People Fail." 91. Ibid., 270. 92. Hunter, "Successes and Failures." 93. C. Walker, et al., "Problems Encountered by Clinical End Users of MEDLINE and GRATEFUL MED." 94. Ibid., 68. + Page 51 + 95. Zink, "Monitoring User Search Success." 96. Ibid., 51 97. The following studies should be exempted from this as their analyses were not based on zero-hit searches only: Jones, "Improving Okapi: Transaction Log Analysis of Failed Searches in an Online Catalogue"; S. Walker and Jones, Improving Subject Retrieval in Online Catalogues; and S. Walker and Hancock-Beaulieu, Okapi at City. 98. Lancaster, Evaluation of the MEDLARS Demand Search Service. 99. Wilson, Starr-Schneidkraut and Cooper, Use of the Critical Incident Technique to Evaluate the Impact of MEDLINE. 100. Ibid., 5. 101. Ibid., 81. 102. Ibid., 83-84. 103. Gouke and Pease, "Title Searches in an Online Catalog and a Card Catalog," 139. 104. Alzofon and Van Pulis, "Patterns of Searching and Success Rates in an Online Public Access Catalog," 113. 105. Janosky, Smith and Hildreth, "Online Library Catalog Systems: An Analysis of User Errors." 106. Ibid., 576. 107. Ibid., 591. 108. Hildreth, Intelligent Interfaces and Retrieval Methods for Subject Searching in Bibliographic Retrieval Systems, 69. + Page 52 + 109. Bates, "Subject Access in Online Catalogs: a Design Model"; Borgman, "Why are Online Catalogs Hard to Use? Lessons Learned from Information-Retrieval Studies"; David R. Gerhan, "LCSH in vivo: Subject Searching Performance and Strategy in the OPAC Era," Journal of Academic Librarianship 15 (1989): 83-89; Klugman, "Failures in Subject Retrieval"; David Lewis, "Research on the Use of Online Catalogs and Its Implications for Library Practice," Journal of Academic Librarianship 13 (1987): 152-157; Karen Markey, "Users and the Online Catalog: Subject Access Problems," in The Impact of Online Catalogs, ed. J.R. Matthews. (New York: Neal-Schuman, 1986), 35-69; Wang, "The Online Catalogue, Subject Access and User Reactions: A Review." 110. Larson, "Between Scylla and Charybdis: Subject Searching in the Online Catalog," 181. 111. Larson, "The Decline of Subject Searching," 208. 112. University of California Users Look at MELVYL: Results of a Survey of Users of the University of California Prototype Online Union Catalog. (Berkeley, CA: The University of California, 1983), 97. 113. Larson, "Classification Clustering, Probabilistic Information Retrieval and the Online Catalog," 136-144 114. Allyson Carlyle, "Matching LCSH and User Vocabulary in the Library Catalog," Cataloging & Classification Quarterly 10 (1989): 37. 115. Noelle Van Pulis, and L.E. Ludy, "Subject Searching in an Online Catalog with Authority Control," College & Research Libraries 49 (1988): 528-529. 116. Diane Vizine-Goetz and Karen Markey Drabenstott, "Computer and Manual Analysis of Subject Terms entered by Online Catalog Users," in ASIS '91: Proceedings of the 54th ASIS Annual Meeting. Washington, DC, October 27-31, 1991, ed. Jose-Marie Griffiths (Medford, NJ: Learned Information, 1991), 157. + Page 53 + 117. Ibid., 161. 118. Michael K. 
Buckland and Doris Florian, "Expertise, Task Complexity, and Artificial Intelligence: A Conceptual Framework," Journal of American Society for Information Science 42 (October 1991): 635-643. Acknowledgements The helpful comments and suggestions of the referees are gratefully acknowledged. About the Author Yasar Tonta, Ph.D. candidate, School of Library and Information Studies, University of California, Berkeley, CA 94720. ----------------------------------------------------------------- The Public-Access Computer Systems Review is a refereed electronic journal that is distributed on BITNET, Internet, and other computer networks. There is no subscription fee. To subscribe, send an e-mail message to LISTSERV@UHUPVM1 (BITNET) or LISTSERV@UHUPVM1.UH.EDU (Internet) that says: SUBSCRIBE PACS-P First Name Last Name. PACS-P subscribers also receive two electronic newsletters: Current Cites and Public- Access Computer Systems News. This article is Copyright (C) 1992 by Yasar Tonta. All Rights Reserved. The Public-Access Computer Systems Review is Copyright (C) 1992 by the University Libraries, University of Houston. All Rights Reserved. Copying is permitted for noncommercial use by computer conferences, individual scholars, and libraries. Libraries are authorized to add the journal to their collection, in electronic or printed form, at no charge. This message must appear on all copied material. All commercial use requires permission. -----------------------------------------------------------------