Chapter 3. A review of IR system evaluation
The focus of this section is on how the methods and past studies of IR evaluation can shape our understanding of what has been or can be achieved in evaluation of search engines. It is not and cannot be an attempt to review the approaches, issues, methods, underlying assumptions, findings or results of the many valuable evaluation studies of IR. Rather we broadly categorise the criteria on which evaluations are based to obtain an in-depth understanding of what is evaluated, the goal of evaluation, and the implications of evaluation in a complex situation with many interrelated system and context parameters. The section concludes with an insight into the shortcomings of the use of a user satisfaction construct as a surrogate measure of system success which leads us to consider the alternative perspective as conceptualised in the proposed framework.
3.1 Cranfield studies
Comparative evaluation of systems has a long tradition of improving the state of the art in IR technology. Researchers and system developers would like to test the truth of their theories about IR and/or to demonstrate a marked improvement in retrieval performance. The criterion for the evaluation of performance effectiveness has, in the main, been based on the overall goal of a retrieval system: to retrieve relevant documents. Such evaluations adopt the Cranfield experimental model based on relevance, a value judgement on retrieved items, to calculate recall (or its surrogate, relative recall) and precision. These dual measures are presented together: recall measures effectiveness in retrieving all the sought information in a database, and precision assesses the accuracy of the search. In an experimental environment all variables can be controlled except the independent variables of interest (such as the indexing language), and comparisons can be based on these measures of effectiveness.
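The two measures can be illustrated with a minimal sketch, assuming binary relevance judgements and an unranked retrieved set. The function name and the document identifiers are hypothetical and serve only to make the definitions concrete.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant in the whole collection
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# with 6 relevant documents in the collection overall.
p, r = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(p, r)
```

Note that recall requires knowledge of every relevant document in the collection, which is what a controlled test collection supplies.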
Many and various criticisms, problems and concerns have been levelled at the validity and reliability of a Cranfield approach to IR evaluation (e.g., Ellis, 1984). Most fundamental is the compromise necessary in defining relevance for such experimentation. For example, it is necessary to assume that relevance judgements can be made independently, that is, that a user will read each document as if new, without being affected by what they have learnt through reading previous ones. Furthermore, the predefined concept of relevance judgements on which recall and precision measures are based assumes that relevance can ignore the many situational and psychological variables which affect it in the real world, where relevance is, as Large et al (1999) state, "in the eye of the beholder". In an interdependent system there may be many manifestations of relevancy unique to an information need which, in turn, is unique to a particular individual (with inherent variation). Assessing the validity of the Cranfield measures for IR evaluation therefore requires an understanding of relevance. The extent of the measurement errors introduced by variations in relevance assessments and "missed" relevant documents is essentially unknown, but has been shown not to affect relative results in comparative tests over a number of queries (Lesk and Salton, 1968). Indeed, in defence of the Cranfield approach, in which test collections are used with information queries and preserved judgement sets, extensive sampling can be used in order to 'offset these compromises' (Salton 1992).
As such, this long-standing approach to IR evaluation using precision and recall computed from a large body of evaluation data has come to be known as the traditional or default model of retrieval testing. More recently, since 1992, the approach has been embodied in TREC (Text REtrieval Conferences, funded by NIST/(D)ARPA) where participants use a standard large-scale test collection (Harman 1995, 1996) and compare system performance using standard evaluation measures of precision, aspectual recall, elapsed time and search satisfaction.
3.1.1 Cranfield-type evaluations of search engines
The majority of studies evaluating search engine performance (Ding and Marchionini, 1996; Gauch and Wang, 1996; Tomaiuolo and Packer, 1996; Chu and Rosenthal, 1996; Clarke and Willett, 1997) are based on some notion of relevance and thus regarded as Cranfield designs (Harter & Hert, 1997). For example, Leighton and Srivastava (1999) evaluated five search engines based on precision on the first 20 results returned for 15 queries asked at a university library reference desk. 'Top 20' precision rates the services based on the percentage of relevant results within the first 20 returned, and they used a variant that adds weight for ranking effectiveness. Overall, AltaVista, Infoseek and Excite performed best. They speculate that this may in part be due to cleaner databases with few duplicate or dead links, and to common features which allow the user to control the search, such as case sensitivity for capitals. Typically, however, when used to evaluate search engine performance these measures are not computed from extensive evaluation data under test conditions to consider a limited set of environment variables. The appropriateness of this approach for the evaluation of web search engines must be questioned with respect to the origins, purpose and assumptions made in the use of recall and precision measures.
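The unweighted form of 'top 20' precision can be sketched as follows. This is a minimal illustration that assumes binary judgements supplied in rank order; it does not reproduce the rank-weighted variant used by Leighton and Srivastava, whose weights are not given here. Dividing by the number of results actually examined, rather than always by 20, is also an assumption.

```python
def precision_at_k(judgements, k=20):
    """Fraction of the first k ranked results judged relevant.

    judgements: booleans in rank order (True = relevant).
    If fewer than k results were returned, the divisor is the number
    of results actually available (an assumption of this sketch).
    """
    top = list(judgements)[:k]
    if not top:
        return 0.0
    return sum(1 for is_relevant in top if is_relevant) / len(top)


# Hypothetical result list: 5 relevant hits in the first 20,
# followed by further relevant hits beyond the cut-off.
ranked = [True] * 5 + [False] * 15 + [True] * 10
print(precision_at_k(ranked, k=20))
```

Results beyond the cut-off do not affect the figure, which is why the measure says nothing about recall.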
3.1.1.1 Recall
A major limitation is that search engines and the web do not provide a controlled environment. As a result, a majority of the studies which use a Cranfield approach report performance based on the precision measure only. Leighton and Srivastava (1999) chose to base the performance measure on precision alone because they argue that in their study, involving undergraduates, precision is more important to the user than recall. That is, searches tend to be exploratory rather than comprehensive. This is a highly debated topic and is touched on in the next section on utility. The calculation of precision, however, assumes that the database is partitioned into retrieved and not retrieved. This is not the case in ranked output, hence the calculation of precision at various cut-off points. Further, in essence, true recall cannot be calculated for searches in a web space because the total number of items returned by a search engine is too great. Thus it is not possible to calculate the number of potentially relevant items for a given query in such a huge and dynamic database. Given the dual nature of these measures it would seem advisable to at least attempt to approximate recall by the pooled method pioneered by the TREC experiments. Clarke and Willett (1997) developed such a method for comparing the recall of three sets of searches conducted on the different indexed collections of AltaVista, Excite and Lycos. Relative recall (the proportion of relevant documents retrieved with one engine amongst all relevant documents found using all search engines and strategies) was calculated by checking how many of the relevant documents found by one were present in the coverage of the other search engines.
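The pooled definition of relative recall given in parentheses above can be sketched as follows. This is a simplified illustration: it pools the relevant documents found by each engine and does not model the additional coverage check that Clarke and Willett carried out. The engine names and URLs are hypothetical.

```python
def relative_recall(relevant_by_engine):
    """Relative recall per engine against the pooled relevant set.

    relevant_by_engine: dict mapping an engine name to the collection of
    relevant document ids that engine retrieved.
    """
    # The pool is the union of relevant documents found by any engine.
    pool = set().union(*relevant_by_engine.values())
    if not pool:
        return {engine: 0.0 for engine in relevant_by_engine}
    return {
        engine: len(set(docs)) / len(pool)
        for engine, docs in relevant_by_engine.items()
    }


# Hypothetical judgements for one query across three engines.
found = {
    "engine_a": {"url1", "url2", "url3"},
    "engine_b": {"url2", "url4"},
    "engine_c": {"url5"},
}
print(relative_recall(found))
```

Because the pool contains only documents that at least one engine found, relative recall is an upper bound on true recall for each engine.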
3.1.1.2 Dynamic database
In the design of any evaluation experiment or investigation it is important that steps are taken to avoid the introduction of bias favouring one service over another. In the web environment the dynamic state of the databases searched (as indexes are generated using autonomous search robots) presents a particular difficulty for comparative evaluation. Most of the studies undertaken do acknowledge this and stress the need to run the queries on all engines at the same time, or in the briefest possible time period. The intention is to prevent bias towards a later evaluation where an engine has been able to index/retrieve new pages or refresh its index.
3.1.1.3 Relevance criteria
A further difficulty in the use of these performance measures in an operational environment is that a lack of standardisation of the criteria for the relevance judgements across the various studies makes any attempt at comparison virtually meaningless. Tomaiuolo and Packer (1996), for instance, do not define their criteria for relevance; others, such as Chu and Rosenthal (1996), have used a three-level scoring method of 1 for relevant, 0.5 for partially/somewhat relevant, and 0 for irrelevant. However, as Oppenheim et al (2000) point out, many have developed their own schema for scoring these points. Clarke and Willett (1997), for example, assigned the score 0.5 if a page consisted of a series of pages which lead to one or more relevant pages, and 0 to sites which could not be opened, whether because of a "file not found" error message or because of excessively slow response times. Duplicate sites were penalised and scored as 0 (as in Leighton and Srivastava, 1999), but mirror sites were scored as unique. Nasios et al (1998) assigned one of five marks to each hit: A - a best possible result; B - a fairly relevant hit that partially or superficially covered the query theme or contained a link to an A type page; C - an irrelevant hit; X - a failure to retrieve a web page due to a broken link or server error; and D - a duplicate hit. Humphries and Kelly (1997) used a five-score system where 4 was assigned to an authoritative site; 3 to an informative site; 2 to an uninformative site; 1 to an unrelated/irrelevant site; and 0 to an error. Leighton and Srivastava (1999) based their criteria on Mizzaro's (1997) framework of the concept of relevance, which views the relevance relationship as three components: 1) topic relates the information resource (the document or information contained within) to the subject area of need; 2) task relates the resource to what the user wants to do with the information; and 3) context relates to everything else, that is, what the user already knows, what reading level the resource is at, how much time and money the resource will cost, etc. The researchers then defined the criteria for relevance categories based on topic along with anticipated tasks or information needs that would be represented by the request for the topic. Whilst they did not employ actual users to evaluate the results, stating that no other study reviewed had done so (p874), they could be seen to attempt to encompass an element of end-user relevance criteria for the evaluation of an operational system.
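To show how a graded scheme changes the resulting figure, the sketch below averages the three-level scores (1, 0.5, 0) attributed above to Chu and Rosenthal (1996). The averaging step, the label names and the function name are assumptions made for illustration; the studies cited differ in how they combine their scores.

```python
# Three-level scores as described in the text; the label names are hypothetical.
SCORES = {"relevant": 1.0, "partial": 0.5, "irrelevant": 0.0}


def graded_precision(labels):
    """Mean graded relevance score over the judged hits."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(SCORES[label] for label in labels) / len(labels)


# Hypothetical judgements for the first four hits of one query.
hits = ["relevant", "partial", "irrelevant", "relevant"]
print(graded_precision(hits))
```

Under a binary scheme the same four hits would score either 0.5 or 0.75 depending on how the partially relevant hit is counted, which illustrates why figures from studies using different schemas are hard to compare.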
3.1.1.4 Queries
A final limitation in adopting a Cranfield approach for comparative evaluation is the requirement that the query expressions are kept constant across the engines. In general this results in the use of query expressions in their most basic form. Statements in the majority of studies were entered with no use of search features such as operators, modifiers or quotes. This, it could be argued, is a realistic approximation of the type of searching done by most users. Jansen et al (1998) found that, of over 50,000 searches performed by 18,000 Excite users, fewer than 7% used AND, and that +/- and double quotes were used in fewer than 6% of searches. However, Leighton and Srivastava (1999), using only unstructured or natural language queries, suggest that the choice of search expression was a weak point in the design. It could be argued that in adopting a precision measure of system performance an underlying model of a user/searcher is assumed. As a result, the most that can be said is that for a given set of users the system performed at this level.
Oppenheim et al (2000) conclude that the idiosyncratic approaches adopted by evaluative studies of search engines based on Cranfield render these inconclusive. They suggest that further evaluation of performance should consider alternatives to recall and precision. For example, Expected Search Length (Cooper, 1968) can calculate 'cost' in the sense of the number of sites a user looks at before sufficient items are examined to satisfy the query. They also recommend as a measure of performance the Back and Summers (unpublished) method, which involves asking users to assign a percentage relevance score to each hit. Both recommendations, it is noted, involve end users directly in making some judgement on the retrieved hits. An alternative evaluation infrastructure for reliably performing repeatable experiments in the context of the web is the Web track, new to TREC 8. The track used a frozen snapshot of the web as its document collection, known as VLC2, representing approx. 18.5 million web pages and 10,000 queries from logs from AltaVista and Electric Monk SE2. Participants submitted the top 20 documents for all 10,000 queries, from which 50 queries were selected to judge the retrieved top 20. Results verified the ability of these systems to handle large amounts of data.
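The idea of search length as a 'cost' can be sketched for the simple case of a strictly ranked list. This is a simplification: Cooper's measure is defined over weakly ordered output and takes an expectation over tied documents, which this sketch does not model. The function name and the `wanted` parameter are hypothetical.

```python
def search_length(judgements, wanted=1):
    """Non-relevant results examined before `wanted` relevant ones are found.

    judgements: booleans in rank order (True = relevant).
    Returns None if the list ends before the need is satisfied.
    """
    if wanted < 1:
        raise ValueError("wanted must be at least 1")
    found = 0
    skipped = 0  # non-relevant results the user had to look at
    for is_relevant in judgements:
        if is_relevant:
            found += 1
            if found == wanted:
                return skipped
        else:
            skipped += 1
    return None


# Hypothetical ranked list; the user needs two relevant items.
print(search_length([False, True, False, False, True], wanted=2))
```

Unlike precision at a fixed cut-off, the result depends on how many items the user needs, so the same ranked list can score differently for exploratory and comprehensive searches.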
3.2 User-oriented evaluation of interactive retrieval systems
Search engine developers' responses to the TREC results reported at the Infonortics 5th search engines meeting are documented in a review from Chris Sherman, "The Fireworks Fly" (2000). In defence of the relatively poor performance levels reported, developers considered the basing of performance on binary relevance judgements to be a poor match for the systems' objectives. Whilst the Cranfield approach to evaluation gives an objective measure of performance, it is based on the assumptions made with regard to the definition of relevance (in a sense, a predefined output). Suggestions of search engines' objectives as offered in Sherman's report include speed of results, getting information from users, browsing categories, and promoting popular sites. As the representative from AltaVista stated, search engines increasingly differ from each other in significant ways because no one model (an analogy was made to car models) will satisfy all needs. The objections to TREC imply that search engine developers would adopt, in preference, a methodology which attempted to evaluate the functionality of such developments from a user perspective. Further, given that it is acknowledged that search engines are targeting market niches, possibly user groups or types of queries, such contextual information will be important in any evaluation undertaken.
Alternative approaches to the evaluation of IR systems which involve the end user address the well-known shortcomings of Cranfield, specifically the predefined output and input which the measures assume. The Cranfield methodology generally excludes the user (with an information need) from making the relevance judgements on which the measures are based, so the measures may indeed be inappropriate or incomplete from a searcher's point of view. With the advent of end-user searching, and for the evaluation of operational systems, it has been argued that actual users of the system should make the relevance judgements to obtain a more realistic assessment of system performance from a user point of view. Furthermore, the approach arguably treats the system as a black box (Robertson and Hancock-Beaulieu, 1992) in assuming that the retrieval situation will be static, that is, a one-off (offline) retrieval situation, with limited, if any, consideration of interactive searching by end users. Harman (2000), commenting on TREC as a test-collection-based evaluation, points out that what is measured is the initial set of results users would see after they input a query but before any interaction. Whilst this point of measurement is important, and some users are satisfied with this, Harman states that "the average precision measure has strong recall component. The recall performance will only be further improved by user interaction and appropriate new tools." Put another way, in a search conducted outside of these experimental conditions, a lack of precision may be owing to a searcher's reluctance to expend effort in narrowing a search. Recent systems, including those for web searching, support a dynamic model of IR permitting interactive searching.
In this model, it cannot be assumed that some preliminary preparation of the query has been done in order to put a one-off, well-formed query to the system; rather, the user will undertake extensive query reformulation via direct interaction with the system. An interactive IR system is thus one in which users' goals and strategies change in response to messages from the system and, as Robertson and Hancock-Beaulieu (1992) state, "the rise of the interactive system has made evaluation methodologies that leave the user outside the system less and less tenable".
3.2.1 Utility
The involvement of users in the relevance assessment for performance evaluation based on recall and precision presents the difficulty that users will bring to the assessment whatever subjective criteria they wish, which with respect to a genuine (rather than invented) query are dynamic and situated in a moment of time-space. Indeed, research into end-user criteria for relevance has revealed a wide range of factors, other than topic, which may be brought to bear on the judgement (Barry & Schamber, 1998). It has therefore been widely debated whether a measure to gauge system effectiveness should be based on the utility, rather than the topical relevance, to the user of the documents retrieved. In such a user-orientated evaluation of system performance the user, seeking utility of the documents retrieved, can be influenced by a number of user factors, such as the situational context of the query, the psychological state of the user (e.g., frustration level) and logistics (e.g., time available).
Significantly, for a user-centred evaluation of system performance, proponents of a utility measure raise some doubts about the assumption that systems should aim for high recall and precision performance (Cooper, 1973). In their review Harter and Hert (1997, p.15) point to research which supports Cooper's Utility Theory, suggesting that users are not interested in topicality, precision, and exhaustive high recall but in the usefulness of the documents retrieved. Cleverdon (1991) asserts that recall is rarely a user requirement in operational systems. Meadow (1986) suggested that users are unconcerned with precision, and Sandore (1990) found that precision did not correlate with user satisfaction. More recently Su (1992), in an empirical investigation, sought a single measure of system success for the evaluation of interactive systems from a user perspective. She justifies the requirement for an evaluation methodology for system comparison and choice which involves end users and their information problems in realistic operational IR situations. To this end, she posed the question whether, by correlating twenty measures of retrieval performance with users' overall judgement of system success, a single best indicator of a successful performance could be found. Her analysis identified seven significant variables and, based on the strength of the correlation, she found 'value of search results as a whole' to be the best single measure. This measure of utility, distinguished from the criterion of aboutness used in a relevance measure, was based on the users' satisfaction with and valuation of the retrieved items as a whole with respect to the actual usefulness of the items to the information searcher.
The arguments for testing the performance of operational systems based on user judgement of output relevance or utility are strong, but add a layer of complexity to the evaluation methodology. Recall and precision measures can be applied to demonstrate, for example, the effectiveness of techniques for stemming in retrieval systems. The implications of utility measures for understanding the influence of system components on performance require careful consideration and interpretation. A range of system factors, such as those offered by search engine developers (see above; these range from speed of operation and quality assurance of results to presentation of results), could impact on a user's judgement of system success based on utility. For example, a user's judgement of the value of the results may be partially determined by how novel the results are. In such an instance the order of presentation of the results is likely to impact on this judgement. Equally, a user may be influenced by the speed at which they are able to identify useful documents, enabled partially by the effectiveness of the ranking technology. Few studies, however, attempt to investigate the impact of system components or mechanisms on user judgement of the search results. Among evaluative studies of search engines, those which look at ranking come closest to investigating system features as an explanation for the results obtained and their impact on user satisfaction.
3.2.1.1 Ranking
Courtois and Berry (1999) expressed their surprise in finding that little research has been done on search engines' ranking of documents in response to simple search queries, given that "results ranking has a major impact on users' satisfaction with web search engines and their success in retrieving relevant documents." They go on to point out that whilst judging relevance of the first 10 to 20 retrieved items may be effective in determining precision, it is not how users use the result list. Rather, they are more likely to scan the list and retrieve only selected documents. In their research they judged the ranking of search engines based on the criteria "all terms" (are documents that contain all search terms ranked higher?), "proximity" (are documents which contain all terms ranked higher where the search terms are contained as a contiguous phrase?), and "location" (are documents which contain all search terms ranked higher where the search terms are contained in titles, headings or metatags?). They then speculated on the linking of the results to search features of the systems. For example, Lycos performed well on the "all terms" criterion and the default use of the operator AND may have enabled this. AltaVista performed well on the "proximity" criterion, which may be a result of its weighting for proximity in the ranking algorithm. The results for "location" were, however, reported to be low across all the search engines. Finally, they report that results varied widely by search topic in that some yielded consistent ranking while others produced lists with a few documents that contained all terms scattered among many that did not.
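One hypothetical way to operationalise the "all terms" question is sketched below: each ranked document is flagged according to whether it contains every query term, and the mean rank of the two groups is compared. This is an assumption made for illustration and is not the scoring procedure reported by Courtois and Berry; the tokenisation and the sample documents are likewise invented.

```python
import re


def contains_all_terms(text, terms):
    """True if every query term occurs as a whole word in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term.lower() in words for term in terms)


def mean_ranks(ranked_texts, terms):
    """Mean rank of documents with all terms, and of those without.

    Ranks start at 1. A group with no documents yields None.
    """
    with_all, without = [], []
    for rank, text in enumerate(ranked_texts, start=1):
        if contains_all_terms(text, terms):
            with_all.append(rank)
        else:
            without.append(rank)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(with_all), mean(without)


# Hypothetical result list for the query "solar energy".
results = [
    "solar energy policy",
    "energy prices",
    "solar panels and energy",
    "wind power",
]
print(mean_ranks(results, ["solar", "energy"]))
```

A lower mean rank for the first group would suggest that documents containing all terms tend to appear nearer the top of the list.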
3.2.2 Usability
Usability studies aim to involve the user more in the evaluation by indicating the factors which influence IR interaction and by providing some understanding as to why or how these impact on performance. Harter and Hert (1997, p.42) draw on the HCI literature for its definition as a measure of "system ability to provide an effective, efficient, satisfying performance of the users task". The usability of retrieval systems has been researched using a range of measures such as accuracy, error rate, action/process variables (number of commands, descriptors, screens accessed, search cycles), retrieval (e.g., recall), and user perceptions of ease of use and satisfaction. Further investigations have analysed the relationships of such measures with user characteristics such as cognitive abilities.
A range of system features may help a user to formulate a query and work with a system to obtain the desired results, possibly to attain the performance levels of recall and precision. A menu of options or a template in addition to the query box might offer assistance to users who are unfamiliar with creating effective search syntax. The relatively intuitive interfaces of some engines take into account that, on average, most people do not search effectively. Thus the intention is to prevent disappointment or, worse, satisfaction with results retrieved from an inept search. Indeed, the interface (and non-retrieval devices) may affect the whole mode of interaction for the user and hence influence the demands the user indirectly puts on the back-end search technology. Several listings and comparisons of search engine features, such as query formulation tools, can be found in the literature (such as Dong and Su (1997); Feldman (1998); Kimmel (1996); Winship (1995)). These comparative listings do not, however, evaluate the features with respect to their impact on search performance, at least not in any systematic or controlled manner. Such an evaluation of the functionality of interactive mechanisms would be desirable given the rapid advance of interface technology as a major area of research and development in these systems. The difficulty, however, is how (if performance is measured by some effectiveness measure(s)) to separate the impact of the back-end index and search mechanisms from that of the front-end tools which affect users' interaction and thus the demands that they make on the technology. All features of a system (and arguably contextual factors such as user characteristics) will have some impact on user interaction, searcher performance, and in turn on the actual system performance.
The problem posed for the evaluation of interactive systems is illustrated by the wide range of issues and interactive features studied under Interactive TREC. The Interactive track, added at TREC's 3rd annual conference, has the goal of developing evaluation methodologies for the interactive task (Harman 1996); that is, an investigation of the process as well as the outcome in interactive searching. Participants are encouraged to investigate different (user) approaches to conducting a TREC search task and to investigate reasons for the results obtained. Researchers have used this venue to investigate a range of issues in comparing searcher performance using different systems/interfaces. For example, investigations have been carried out on the use and utility of relevance feedback and ranking in interactive IR; the effects of topic order, difficulty, and domain on performance; the effects of using visualisation techniques; the extent to which searchers develop new searching behaviours; and the effectiveness of different styles of interaction (Voorhees and Garofolo, 2000).
In the context of the TREC 8 interactive track, Fowkes and Beaulieu (2000) examined searching behaviour with a relevance feedback system to test the hypothesis that feedback would lead to better performance and that searchers would prefer the system with relevance feedback. The findings on searching behaviour were related to the query formulation and reformulation stages of an interactive search process. Overall the norm was to use between 2 and 4 single query terms extracted from the given topic descriptions, and the queries were reformulated in only 15% of the searches. No statistically significant difference was found in retrieval performance with or without relevance feedback, and 75% of searchers did not perceive any difference between the two systems. Further analysis identified 3 levels of task characteristics according to the degree of [searcher] interpretation needed to define a topic. This provided some understanding of how different task characteristics influenced search behaviour. Relevance feedback came into play in different ways dependent on topic complexity. Automatic query expansion was found to be effective in improving simple queries, but for more complex queries interactive query expansion with contributions from both searcher and system appeared to be more effective.
3.2.3 Searcher contexts
The realities of retrieval situations, as represented by the activities of users, define many contextual characterisations of users and tasks as parameters to be captured for an evaluation of interactive systems and facilities. Investigations which have sought to identify factors or (searcher traits as) predictors of search performance help to define these external variables of retrieval setups. With the shift from trained intermediaries to novice end users of IR systems came much research into the impact of individual differences on searcher behaviour and performance. For example, Saracevic et al. (1988) list studies of the effect on online searching of differences in users' search experience, training, cognitive characteristics, and perception of the information need. In their investigation of the nature of information seeking behaviour, Saracevic et al. examined five aspects: users, questions, searchers, searches, and outputs. A major outcome was the correlation of system performance with these variables, thus identifying the external (user) factors that impact on search performance and of which system designers should be aware. It is of interest that Beaulieu et al. (2000) analysed a further two stages of the interactive search process, browsing and selecting documents, and viewing full documents, with respect to user behaviour. Again it was found that different contexts influenced search behaviour. For example, two styles of browsing emerged in users' examination of the documents retrieved. When multiple answers to the query were found searchers worked systematically, but when few answers could be identified searchers were more selective. Scanning was seen to be the most prevalent strategy for evaluating document content, leading the researchers to posit that searchers seek considerable contextual information before making a relevance judgement.
Further, they state that passage retrieval, where the searcher is taken to the best passage that represents the highest scoring document section in relation to query term occurrence, was found to be disorienting and counterintuitive when searching on less familiar topics.
3.2.4 User satisfaction
Evaluation of a retrieval system's performance can thus be conducted in the abstract context of a Cranfield test or in an operational environment involving end users. The latter, as the above review indicates, presents serious challenges, adding layers of complexity to the evaluation design. The users' cognitive state (especially their understanding of the information need) will constantly change as they interact with the system and view documents. Such learning effects necessitate large and costly samples to replicate the search for system comparison. Furthermore, the intrinsic variation in user needs and cognitive characteristics of the searcher is linked in some way to the relevance decisions they make, and to the use and value of different search facilities. A judgement of utility will be subjective and, depending on what is important to the user, different system features will impact on this judgement. Evaluation of interactive features (and their impact on search performance) must be undertaken in a complex test environment where searcher behaviour will impact on search performance and the context of the user's information need will affect the usefulness of the system search features. All of this makes generalisations and reliable comparisons about IR performance difficult.
An alternative approach to evaluation from a user perspective is to attempt to understand how users of the system themselves evaluate performance. The construct of user satisfaction used in system evaluation aims to achieve such a summary expression of users' perceptions based on the usefulness of a system. Throughout the history of evaluation, subjective measures concerning user satisfaction with search experience have been gathered. Lancaster and Warner (1993) report that such studies have consistently shown accessibility and ease of use to be the prime factors influencing the choice of an information source. Our review of user satisfaction and search engines (3.3) would seem to confirm the emergence of key influencing factors, but also reveals the multiple dimensions on which evaluation from a user perspective can be based. The reason for this complexity stems from the variety of criteria, based on user requirements, on which users may judge system success and the variations in user contexts which impact on an expression of system satisfaction. Tessier et al (1977) put forward three assumptions which they claim imply how satisfaction should be measured:
- that a user's satisfaction will be a function of how well the product fits their requirement;
- that the user's state of satisfaction is experienced within the framework of their expectations;
- that people may seek a solution within an acceptable range instead of an ideal or perfect solution.
The remainder of this review seeks evidence for these assumptions, which form the maxims for the proposed framework, set out in Chapter 4, for the evaluation of search engines from a user perspective.
3.3 User satisfaction based evaluations of search engines
Stobart and Kerridge (1996) revealed users' choice of engines to be dictated by speed of access, and by other factors such as size, habit, accuracy of data, user friendliness and the interface. Nahl (1998) involved users in rating their self-confidence, stress level, understanding of the topic, satisfaction, and usefulness. It was found that ease of use and fast response time were important elements in determining self-confidence, stress and satisfaction levels. Furthermore, Nahl concluded that "a search engine is perceived in the context of the information content it gives access to". This would seem to indicate that a user's perception of the ease of use, and thus the value, of a search tool is influenced by the extent to which the search results are of interest to the searcher. Nasois et al (1998) reported that the search results in their investigation of search engine capabilities were judged according to whether they would satisfy an easily pleased user or a hard-to-please user. The suggestion is that user traits may affect the judgement of system success. Golovchinsky (1996) reported that users' view of recall increased with an increasing number of articles displayed on the screen simultaneously. This would suggest that system characteristics may affect users' perception of performance.
Su and Chen (1999) proposed a methodology for a dimensional approach to the evaluation of search engine performance from a user perspective. Based on a tested methodology (Su 1992, 1998), fifteen performance measures were grouped under five criteria: relevance, efficiency, utility, user satisfaction, and connectivity. In recognition of the contribution of user characteristics to IR performance and evaluation, these characteristics were grouped under personal and educational backgrounds, user information needs/search requirements, and search strategies. Eleven participants were recruited to search for their topic on each of the four engines, AltaVista, Infoseek, Lycos and Opentext. Each participant made relevance judgements of the retrieved items, chose the five most relevant from the "top 20", and ranked them in decreasing order of relevance so that user and engine rankings of retrieved items could be compared. Participants were also interviewed to obtain ratings and reasons for satisfaction and utility. The pilot study found a number of differences among the four engines, with none dominating in every aspect of the multi-dimensional evaluation. Lycos retrieved the highest number of relevant and partially relevant items and had the highest mean precision ratio. In spite of this, however, users assigned higher satisfaction ratings on precision to AltaVista. Lycos had the best rank correlation with users' relevance ranking. AltaVista, on the other hand, had the highest validity of links, the highest satisfaction with online documents, search interface and output format, and the highest value of the search results as a whole.
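The two objective measures mentioned above, a precision ratio over the "top 20" and a rank correlation between user and engine rankings, can be illustrated with a minimal sketch. This is not Su and Chen's procedure: the choice of Spearman's coefficient (without ties), the binary relevance judgements, and all of the data values are assumptions made here purely for illustration.

```python
def precision(judgements):
    """Fraction of retrieved items judged relevant (1 = relevant, 0 = not)."""
    return sum(judgements) / len(judgements)


def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical judgements for one participant's "top 20" on one engine.
top20 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Hypothetical rankings of the five items the participant chose:
# the user's order of relevance versus the engine's relative order.
user_rank = [1, 2, 3, 4, 5]
engine_rank = [2, 1, 4, 3, 5]

p = precision(top20)
rho = spearman_rho(user_rank, engine_rank)
print(f"precision = {p:.2f}, rank correlation = {rho:.2f}")
```

A high precision ratio and a high rank correlation are both system-side measures; as the findings above show, neither guarantees the higher satisfaction rating.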
Wang et al (1999) approached the evaluation dimensions from a different viewpoint, that of the customer utilising the service. Their study was carried out using modified SERVQUAL dimensions to measure users' expectations and perceptions of search engines, where good service quality is that which matches or exceeds expectation. Summarised here, these again show a number of dimensions and associated measurement criteria on which users might evaluate a search service. Tangibles (information is well organised; different search methods are available; a large amount of information is available; the search topic can be narrowed). Reliability (good syntax consistency for the keywords in searching; search results are relevant to the query). Responsiveness (search results are provided quickly). Assurance (no repetition of pages/sites; no dead links; information is up to date). Empathy (the layout on first impression is easy to understand; natural language searching is offered; there are help screens, introductory pages or sample searches to guide the user; language selection is offered for documents written in a specific language). Preliminary analysis of service quality indicated that user understanding of, and satisfaction with, the quality of search engines are low. Among other findings, they identify the biggest problem faced by searchers as the "needle in the haystack" phenomenon and state that "the ability to refine a query in a sensible way is very important to improving the quality of search engines" (p. 506).
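The notion of service quality as perception measured against expectation can be sketched as a gap score per dimension. The sketch below follows the general SERVQUAL convention of subtracting expectation from perception; it is not Wang et al's modified instrument, and the rating scale and all ratings shown are hypothetical.

```python
from statistics import mean

DIMENSIONS = ("tangibles", "reliability", "responsiveness", "assurance", "empathy")


def gap_scores(expectations, perceptions):
    """Mean perception minus mean expectation for each dimension.

    A gap of zero or above means the service matches or exceeds
    expectation; a negative gap means it falls short.
    """
    return {d: mean(perceptions[d]) - mean(expectations[d]) for d in DIMENSIONS}


# Hypothetical ratings from two users on an assumed 7-point scale.
expectations = {
    "tangibles": [6, 6], "reliability": [7, 7], "responsiveness": [5, 5],
    "assurance": [6, 6], "empathy": [4, 4],
}
perceptions = {
    "tangibles": [4, 5], "reliability": [4, 4], "responsiveness": [6, 6],
    "assurance": [5, 5], "empathy": [4, 4],
}

gaps = gap_scores(expectations, perceptions)
shortfalls = [d for d in DIMENSIONS if gaps[d] < 0]
print(gaps)
print("dimensions falling short of expectation:", shortfalls)
```

Reporting the gap per dimension, rather than a single overall score, keeps visible the point made above: users may evaluate a search service on several distinct criteria at once.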
3.4 Chapter summary
The evaluation of a search engine's performance in a controlled environment meets an important objective of the system, to retrieve relevant items for a given query. Its limitations, however, have focussed the question of how to evaluate search engines from a user perspective, based on the utility of the retrieved items and the usability of the system itself, given the complex interaction of many user and system variables on performance. The use of user satisfaction as a surrogate measure has a long-standing tradition in evaluation studies. Given Tessier's assumption that user satisfaction is a function of how well the system fits a user requirement, however, it follows that a variety of criteria may be used on which to base a measure of user satisfaction. Furthermore, any measure of user satisfaction is in itself limited if there is no consideration of the system characteristics which have led to a user's judgement, only an assumption that users who give higher scores are using the better systems. In the majority of the studies reported above which combined objective measures of system performance with subjective measures of satisfaction, seemingly contradictory results were obtained. For example, in Su (1998) it was reported that users with a low expectation of finding information expressed high satisfaction with a set of low precision results. It is proposed in our study that a framework for evaluation is needed if we are to make sense of such results, which seem to confirm Tessier's assumption that a user's state of satisfaction is expressed in a framework of expectation.
Our conceptual framework for an evaluation of user satisfaction views a retrieval system as a means by which an individual performs a goal-directed task. On this view, user satisfaction is a multidimensional construct which will vary across user and query contexts. The need to develop a framework for the evaluation of system contribution to the search process is articulated well in Belkin et al. (in Harter and Hert, 1997, p. 26): "if we are going to [be] serious about evaluating effectiveness of interactive IR, we need to develop … new performance measures. … that we develop measures based upon the search process itself and upon the task which has led the searchers to engage in the IR situation." User studies have begun to give some sense of what users are doing during IR interaction and provide valuable conceptual models of the IR process. The reality, however, is that there is little consensus on what epitomises the information-seeking phenomenon and, by extension, different perspectives on a model may lead to a different focus for evaluation. For this reason, we draw on a general model of the information task and define user satisfaction measures within this theoretical view, so that the evaluation focuses on the degree to which system characteristics support users in their task needs. Our model suggests that users will give higher evaluations based not only on the inherent characteristics of a system, but also on the extent to which that system meets their task needs and their individual abilities. A single system could therefore receive very different evaluations from users with different tasks, needs and abilities. Thus our framework for evaluation will incorporate a means by which we can evaluate the usefulness of a system with respect to the task the end user is undertaking.
