
Chapter 4. Development of the Framework

4.1 Introduction

Our aim is to develop a framework for the evaluation of Internet search engines from a user perspective. Towards this end we posit that user satisfaction is a complex multidimensional construct. It is not, however, expressed by the user in the abstract; rather, it constitutes some judgement of how well the service or technology fits a user's requirements. Thus in the framework for evaluation, user satisfaction must be defined within this theoretical basis to link system characteristics to their possible impact on the user task. In this section we present a preliminary construction of such a framework, intended to structure existing measures and variables to provide a meaningful system evaluation from a user perspective. The small scale implementation described is not intended as an evaluation of the search engines used, as such, but rather as a means to test the feasibility of the proposed framework and to gather user data which may be used in its refinement towards an evaluation tool.

The framework proposed conceptualises user satisfaction with a system as a function of system-task fit, expressed within a moderating context of user requirements.

User evaluation criteria
A general statement of the information retrieval task is that a user interacts with a retrieval system in order to retrieve specific items that will satisfy an information need. Based on a general model of the retrieval process we derive statements of user requirements, that is, what a goal directed user is trying to do. Our premise is that meaningful user satisfaction measures can be obtained for system evaluation when defined within these dimensions of the IR task. That is, user satisfaction is an elicited response to the extent to which the system supports a task, and can be evaluated by criteria related to what the user is trying to do, as suggested by the dimensions.

Measures
To obtain some user evaluation along these dimensions, each criterion was operationalised by a set of measures which we considered reflected the user task. This perspective on user satisfaction measures enabled us to link in the system features which support the user in the retrieval task.

Context
The framework proposed further seeks to incorporate a moderating context which may cause users to make different demands on the system which, it follows, will lead to varying user evaluation of the usefulness or functionality of the system features.

In the empirical investigation conducted with a view to developing such an evaluation framework we thus set out to better understand user evaluations of system satisfaction, that is how users are satisfied and on what criteria. User data were collected and analysed accordingly as follows:

Whilst the test did not set out to evaluate the impact of system features on users' ratings, in the spirit of a feasibility investigation we did seek to find whether users' expressions of satisfaction were simply random or whether they were meaningful evaluations of the given system characteristics. By basing our measures on system features which may support a user in a task dimension we expected to observe variations of satisfaction ratings across search engines which differ in the way they support the retrieval task.

In the framework proposed it is suggested that user evaluation of the system may be moderated by some contextual characterisation of user and information query. The impact of this context on user satisfaction was explored in the testing of the framework. That is, we wanted to find if our characterisation of context led to systems receiving different evaluations. The contexts that seem to have the most impact could then be used to develop the framework into a system evaluation which links system features with user evaluations as dependent on certain user/task contexts, showing under which contexts a system obtains a higher rating and thus which features support certain tasks.

This chapter sets out the development of the proposed framework for evaluation and details its implementation for the feasibility study. Since the investigation is not intended to be an evaluation of the search engines as such we refer to the engines used in the study as SystemA, SystemB, and SystemC.

4.1.1 IR task/process models

The identification of a process model, which proposes assumptions as to what the user is trying or wants to do, provides the rationale for our measures. The specific task domain is that users wish to retrieve relevant items to satisfy their information need. Although individuals' information seeking goals can differ quite widely, standard models of the information seeking process contain the core steps of query specification and receipt of results in an interactive cycle. The process model on which we draw (Salton, 1989) identifies interacting steps, which are not necessarily sequential and may be repeated. This gives the dimensions on which users might evaluate system success/satisfaction. These are:

  1. Users will formulate/submit a query;
  2. Users will receive results;
  3. Users will evaluate results - end or modify (Note, a possible feedback loop here); and
  4. Users will evaluate success of the search as a whole

For each dimension, we can relate the criteria, Effectiveness, Efficiency, Utility and Interaction, by which users might evaluate system satisfaction on these task dimensions.

Dimension 1: Users will formulate/submit a query, evaluated on the criterion of interaction (query)
Dimension 2: Users will receive results, evaluated on the criterion of interaction (output)
Dimension 3: Users will evaluate results, evaluated on the criteria of effectiveness & relevance/ranking
Dimension 4: Users will evaluate success of the search, evaluated on the criteria of efficiency & utility
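
The mapping above can also be written down as a small lookup structure. The following Python sketch is purely illustrative: the names `TASK_DIMENSIONS` and `criteria_for` are hypothetical and were not part of the study, although the content mirrors the four dimensions and criteria just listed.

```python
from typing import Dict, List, Tuple

# Illustrative encoding of the four task dimensions and their criteria.
# The identifiers are hypothetical; the content mirrors the list above.
TASK_DIMENSIONS: Dict[int, Tuple[str, List[str]]] = {
    1: ("Users will formulate/submit a query", ["interaction (query)"]),
    2: ("Users will receive results", ["interaction (output)"]),
    3: ("Users will evaluate results", ["effectiveness", "relevance/ranking"]),
    4: ("Users will evaluate success of the search", ["efficiency", "utility"]),
}


def criteria_for(dimension: int) -> List[str]:
    """Return the evaluation criteria attached to a task dimension."""
    _task, criteria = TASK_DIMENSIONS[dimension]
    return list(criteria)


if __name__ == "__main__":
    for number, (task, criteria) in sorted(TASK_DIMENSIONS.items()):
        print(f"Dimension {number}: {task} -> {', '.join(criteria)}")
```

Keeping the task statement and its criteria together in one record makes it straightforward to attach measures to a dimension later without duplicating the task wording.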

We justify the use of this standard model in that it describes the basics of the retrieval process. However, it is important to note that this model has been contrasted with others, such as Bates' (1989) berrypicking model which challenges the view that the information need will remain static throughout the process and that the main value of the search resides in a set of retrieved documents. This alternative model then emphasises the interaction which takes place whereby a user learns, goals are triggered, and information acquired along the way.

The task model is based on a simplified model of the information access process (Hearst, 1999):

[Figure: Information Access Model (Hearst)]

4.1.2 Measures

Measures were developed which defined user evaluations for each criterion along each of the dimensions. Each criterion was thus unpacked to the group of variables on which user satisfaction with an interactive retrieval system can be measured. The measures were identified in the process of defining each criterion when related to the IR task/process dimensions. The intention was to develop user satisfaction evaluation variables which, in the framework, relate to system function (features) and will provide for a system rating based on task fit in an end user searching environment. The majority came from existing (and generally accepted) measures. Those developed for the proposed evaluation framework were mapped, in a sense, to the (function of) system features which supported the dimension in question.

Table 8 Framework for the evaluation of SEs from a user's perspective

Dim1 User satisfaction with Effectiveness (SE features)
  1.1 Precision1 (traditional measurement)
  1.2 Precision2 - user satisfaction with precision
  1.3 P3, comparison of P1 and P2
  1.4 Ranking1 - system
  1.5 Ranking2 - user satisfaction with ranking
  1.6 R3, comparison of R1 and R2

Dim2 User satisfaction with Efficiency (SE features)
  2.1 Search session time
  2.2 Response time
  2.3 Relevance assessment time (in situ)

Dim3 User satisfaction with Utility (Output)
  3.1 Value of search results as a whole
    3.1.1 Satisfaction with results
    3.1.2 Resolution of the problem
    3.1.3 Rate value of participation
    3.1.4 Quality of sources
  3.2 Validity of links
  3.3 Number of links followed up

Dim4 User satisfaction with Interaction (Interface)
  4.1 User satisfaction with output display / visualisation of representation of item
    4.1.1 User satisfaction with manipulation of output
    4.1.2 User satisfaction with visualisation of representation of item
  4.2 User satisfaction with interface
    4.2.1 User satisfaction with query input
    4.2.2 User satisfaction with query modification
    4.2.3 User satisfaction with query visualisation/clarification


4.2 Task Dimensions and User Measures

Four dimensions were identified from IR task/process models, and were used as the basis on which to suggest the criteria on which users might evaluate or rate a system.

Dimension1 Users will evaluate results

Criterion Retrieval performance (effectiveness) will affect user evaluation of SE
Measures of retrieval effectiveness are based on the notion of relevance, and are based on the assumption that given a document collection and a query some documents are relevant to the query and some are not. The objective of the IR system is to retrieve relevant documents and to suppress the retrieval of non-relevant documents. System output can then be evaluated on the basis of how well these objectives are met. Most used are the measures of recall and precision. A user's evaluation of a SE will be partially dependent on the ability of the system to meet these basic criteria.
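
As a reminder of how the traditional measures are calculated, the following Python sketch computes set-based precision and recall for a single query. It is a minimal illustration under the binary relevance assumption described above; the function name `precision_recall` and the document identifiers are hypothetical and do not come from the study.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items in the collection
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: four items retrieved, three relevant in the collection.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Recall requires knowledge of every relevant item in the collection, which is rarely available on the web; this is one practical reason why precision-based measures are the more usual choice in search engine studies.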

In a web based environment with direct user interaction these traditional measures may not be appropriate; instead, other relevance based measures may be used to provide a criterion for evaluating the effectiveness of system performance. That is, relevance will not be measured on a binary (relevant/non-relevant) scale. Instead the concept of relevance will encompass non-binary, relative or partial judgements, differentiated into situational 'usefulness', 'utility' or 'topicality', that is, assessment categories viewed as dimensions of information needs.

Measures This dimension can be evaluated by:

1.1 Precision1 (traditional measurement)
1.2 Precision2 - user satisfaction with precision
1.3 P3, comparison of P1 and P2
1.4 Ranking1 - system
1.5 Ranking2 - user satisfaction with ranking
1.6 R3, comparison of R1 and R2

In the empirical investigation data was gathered and analysed on the measures Precision2 (user satisfaction with precision) and Ranking2 (user satisfaction with ranking). Users were asked to rate on a three point scale the degree of relevance of each item retrieved, leaving it open as to how many individual items each participant assessed. Participants were then asked to rate on a five point scale their satisfaction with the precision of the search results. Satisfaction with ranking order was obtained on a five point scale. An overall rating of effectiveness was obtained from the users' assessment, on a five point scale, of the overall success of the search engine in retrieving items relevant to the information problem or purpose.

Dimension2 Users will evaluate success of the search as a whole

Criterion Efficiency will affect user evaluation of SE
Efficiency seems a little hard to define, but basically is concerned with how efficient the system is in retrieving the required information. Boyce et al. (1994) highlight the difference between effectiveness and efficiency thus, "an effectiveness measure is one which measures the general ability of a system to achieve its goals. It is thus user oriented. An efficiency measure considers units of goods or services provided per unit of resources provided." (p.241). They also state that if the service or good is not judged to be effective then efficiency has little meaning. Also, Dong and Su (1997) state that "response time is becoming a very important issue for many users" (p.79). Therefore, a user's evaluation of a SE will be affected by the system's efficiency. The premise is that users want to retrieve information as efficiently as possible, which may in part equate to as quickly as possible.

Measures: A user will evaluate the dimension by:
2.1 Search session time
2.2 Response time
2.3 Relevance assessment time (in situ)

In the empirical investigation the search session time was noted and used in the analysis. Participants were asked to rate on a five point scale the overall success of the search engine in retrieving items efficiently.

Dimension3 (Output)

Criterion Utility will affect user evaluation of SE
Authors have also argued for other measures, such as utility, so that an information system is evaluated on the basis of how useful it is to its users. Utility has been defined as "the degree of actual usefulness of answers to an information seeker" (Saracevic and Kantor, 1988, p.169). Utility measures are based on users' expressions of degree of satisfaction and value of the retrieved items as a whole. The utility approach highlights many factors, other than relevance, which will affect a user's evaluation of the system's performance. Cleverdon (1991) argued (with Cooper (1973), who put forward a straight utility-theory single measure) that retrieval effectiveness measures should be used in combination with more user-oriented measures. The aim was to produce an evaluation on utility together with other factors, such as subjective satisfaction statements, search costs and time spent, which could be related to nominal recall and precision values in such a way as to indicate how various parameters ought to be changed.

It is clear that utility is concerned with the degree of actual usefulness of retrieved items to the user, yet Saracevic and Kantor (and Su, 1998, p.558) report that standard utility measures do not exist. Saracevic and Kantor used the following evaluative statements: How much time spent reviewing abstracts; Assign a cost value to usefulness of results; What contribution this information made to resolution of problem that motivated your question; Overall how satisfied with results. Indeed, various factors may bear on users' judgements of overall satisfaction with the value of the search results. For example, depending on the user and their information need, users may be influenced by: the extent to which information quality can be assumed based on the source; the extent to which the information is accurate or correct; and the extent to which the information is specific, or at the right level, to user need. In the context of web evaluation we supplemented an 'overall satisfaction with value of search results' with the three variables of Validity of links, Number of links followed up, and Quality of sources, which may impact on user evaluation of satisfaction with the search engine.

Measures A user will evaluate the dimension by:
3.1 Value of search results as a whole
3.1.1 Satisfaction with results
3.1.2 Resolution of the problem
3.1.3 Rate value of participation
3.1.4 Quality of sources
3.2 Validity of links
3.3 Number of links followed up

In the empirical investigation participants were asked to rate on a five point scale the worth of their participation, with respect to the information which resulted; the contribution the information made to the resolution of the problem; satisfaction with results; the quality of the results; and the value of the search results as a whole. Participants were asked to rate on a five point scale the overall success of the search engine in terms of the actual usefulness of the items retrieved.

Dimension4 Users will formulate/submit a query & Users will receive results

Criterion Interaction will affect user evaluation of SE
Interaction is a concept which is often discussed but little defined. In the context of web SEs it is how the user directly interacts and manipulates/commands the system to retrieve the information or specific items they require. Interaction will be largely determined by satisfaction measures alone. Belkin and Vickery (1985) state "satisfaction as a criterion for evaluation of information systems is a concept explicitly intended to extend the range of factors relevant to the evaluation. In particular, the intention is to move away from evaluation according to system performance, the basis of information retrieval, and toward an overall judgement based on user reaction to the system" (p.194).

Based on the features of search engines which might support a user in the IR task of submitting/formulating a query, we defined the measure of 'user satisfaction with interface' as comprising measurement on three variables of user satisfaction with query input, query modification, and query visualisation. User satisfaction with query input may be influenced by the perceived ease with which the user can express a query, for example through the availability of different search methods such as natural language searching or power search to narrow a search topic. User satisfaction with query modification may be influenced by assistance provided in formulating the search, such as suggesting query terms or offering a feedback mechanism. User satisfaction with query visualisation may be influenced by any provision for helping the user to understand the impact of a query. An obvious example is the use of folders, which could have multiple impacts on the user's understanding of the query, such as suggesting different perspectives of the topic or information which might be useful in a different search.

On receiving results the user will be involved in some process of interpreting the results in the given frame of the information need. On a general level users would want to easily see why an item was retrieved and to quickly see its meaning to make a relevancy judgement. Features of a search engine which might support a user in this task lie in its summary representation of items for visualising the 'aboutness' of an item, and the extent to which information is presented in a clear and organised manner. We defined the measure of 'User satisfaction with output display' as comprising measurement on the variables relating to manipulation of the output (e.g. summary display features such as category labels, or sort by) and visualisation of item representation.

Measures A user will evaluate the dimension by:

4.1 User satisfaction with output display
4.1.1 User satisfaction with user manipulation of output
4.1.2 User satisfaction with visualisation of representation of item

4.2 User satisfaction with interface
4.2.1 User satisfaction with query input
4.2.2 User satisfaction with query modification
4.2.3 User satisfaction with query visualisation/clarification

In the empirical investigation data was gathered on user satisfaction on all of these measures using a five point scale.

4.2.1 Context

There are many factors which could be used to characterise the user context, for example: user traits, experience, background and cognition; the information request, its subject and type; and the user's expectation, perception or understanding of the request. (Note also that searcher behaviour, search strategies and tactics will make different demands on the system, affecting performance and thus user evaluation.) Sitting on top of our model of the IR task process used in the development of the evaluation criteria is the context that users have an information need. Thus in our evaluation framework we characterise this context, as a possible moderator of user evaluations, by factors such as user intent, amount of prior knowledge and expectation.

User context was characterised by responding to the questions

In addition, we incorporated such questions as (from Koll, 2000): Searching for known item; Searching for an unknown item; Searching for any item; Searching for the most relevant item; Searching for most of the items; Searching for all of the items; Searching for affirmation that there are no items; Searching for like items; Searching for new items to supplement items already obtained previously.

4.3 Implementation

Twenty three participants were recruited from second year students of the Department of Information and Communications, MMU. A short introduction was given to the participants a few days prior to their search session to explain to them the project to which they were contributing and to present them with the Information Need and User Characteristic questionnaire which they were required to complete before their search session. No restrictions were placed on the type of information they required or the purpose for which it was intended. The following table presents participants' characteristics.

Table 9 User characteristics

Characteristic        Variable options  Number of cases  Percentage
Gender                Male              7                30
                      Female            16               70
Age                   18-20             6                26
                      21-30             6                26
                      31-40             7                30
                      41-50             3                13
                      51-60             1                4
                      61-70             0                0
                      70+               0                0
Academic status       2nd year          23               100
IR experience         None              1                4
                      Some              16               70
                      Lots              6                26
Computer experience   None              0                0
                      Some              14               61
                      Lots              9                39
Internet experience   None              0                0
                      Some              14               61
                      Lots              9                39

The participants were split into two groups searching on two different days. On arrival each student was given a second questionnaire (three copies - one for each SE) which was concerned with participants' ratings of dimensions and measures of each SE. They were instructed to read this before commencing searching. Each student was required to search three particular SEs in an order specified by the Test Administrator, the order of which was varied to remove learning curve effect by a 3x3 Latin square. Therefore, each SE was searched in each of the three positions by an equal number of participants. A short introduction was given to the participants prior to searching. The introduction included: 1) order of SEs to use; 2) how to print-out results; 3) how long to search for (free choice) and, 4) what to search for (free choice).
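
The counterbalancing described above can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used by the Test Administrator: it builds a cyclic 3x3 Latin square and hands out its rows in rotation, and the function names and participant labels are hypothetical.

```python
def latin_square(items):
    """Build a cyclic Latin square: row i is the item list rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def assign_orders(participants, items):
    """Give each participant one row of the square, cycling through the rows."""
    square = latin_square(items)
    return {p: square[k % len(square)] for k, p in enumerate(participants)}


ENGINES = ["SystemA", "SystemB", "SystemC"]

if __name__ == "__main__":
    # Hypothetical participant labels, used only to show the rotation.
    orders = assign_orders(["P1", "P2", "P3", "P4", "P5", "P6"], ENGINES)
    for participant, order in orders.items():
        print(participant, "->", ", ".join(order))
```

In a square of this kind every engine appears exactly once in each row and each column, so each engine is searched first, second and third equally often when the number of participants is a multiple of three.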

Participants were asked to search for an information need of their choice, to use as many reformulations as required and to search for as long as they would under normal conditions. This was to be repeated on the remaining two SEs. Once they retrieved a set of results, i.e. a hitlist, they were asked to print these out. From this hitlist they made relevance judgements which they marked on the printout and handed these in with their completed questionnaires. These relevance judgements were based on a set of guidelines given to each participant before searching. These guidelines defined relevance in terms of a three point scale where R = relevant, PR = partially relevant and NR = not relevant.
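
One way in which three point judgements of this kind could be turned into a single precision-style score for a hitlist is sketched below in Python. The weights (R = 1, PR = 0.5, NR = 0) are an assumption made for illustration only: the report does not state how partially relevant items were scored, and the function name `graded_precision` is hypothetical.

```python
# Assumed weights for the three point scale. These values are illustrative;
# the report does not say how partially relevant (PR) items were scored.
WEIGHTS = {"R": 1.0, "PR": 0.5, "NR": 0.0}


def graded_precision(judgements, weights=WEIGHTS):
    """Average the weight of each judged item on a hitlist."""
    if not judgements:
        return 0.0
    return sum(weights[j] for j in judgements) / len(judgements)


if __name__ == "__main__":
    # Hypothetical hitlist of four judged items (not study data).
    print(graded_precision(["R", "PR", "NR", "R"]))
```

Setting the PR weight to 0 or to 1 gives the strict and lenient binary versions of precision, which makes it easy to see how sensitive a result is to the treatment of partially relevant items.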

In some instances participants were unable to complete three whole searches within the time of the session (two hours). In these cases the Test Administrator accepted only a completed test on a SE. Fifteen participants completed the test on all three SEs, one participant completed the test on two SEs and seven participants completed the test on one SE. In this way 54 searches were collected during the test.
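
The total follows directly from the completion figures given above, which also account for all 23 participants (15 + 1 + 7):

```latex
\[
(15 \times 3) + (1 \times 2) + (7 \times 1) = 45 + 2 + 7 = 54
\]
```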

4.4 Data Analysis

Proposition 1

Our primary aim was to test the assertion that users' evaluation of a system based on satisfaction measures is multidimensional, that is overall satisfaction is not a single construct but a response to how well the system has supported the IR task which may be made on many dimensions.

In the quest to better understand how users evaluate these systems we asked users to give a system an overall success rating. By correlating ratings assigned on the four criteria, as suggested by the task dimensions, we aim to find which, if any, appears to be the most important or contributory factor to users' overall judgement.
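
For readers who wish to reproduce this kind of analysis, the following Python sketch correlates an overall success rating with the rating on one criterion. It rests on an assumption: the report does not name the coefficient used, and Spearman's rank correlation is shown here only because it is a common choice for ordinal five point ratings. The ratings in the example are invented and are not study data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one averaged rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical five point ratings for six searches (not study data).
    overall = [5, 4, 4, 2, 1, 3]
    efficiency = [5, 4, 3, 2, 1, 3]
    print(round(spearman(overall, efficiency), 3))
```

Averaging the ranks of tied values matters here because a five point scale produces many ties, and the simplified textbook formula based on squared rank differences is only exact when there are none.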

Table 10 Global and SE level - Overall success rating correlated against the four criteria

Criterion        Correlation coefficient
                 Global   SystemA  SystemB  SystemC
Effectiveness    .759**   .779**   .795**   .729**
Efficiency       .817**   .843**   .908**   .741**
Utility          .710**   .362     .930**   .806**
Interaction      .592*    .511*    .660*    .580*

* = moderate strength correlation
** = strong correlation

Globally the criterion with the strongest correlation with users' overall rating is Efficiency, followed by Effectiveness, Utility and Interaction. On SystemA the strongest correlation is Efficiency, followed by Effectiveness, Interaction and Utility - where a weaker correlation is demonstrated. On SystemB the strongest correlation is Utility, followed by Efficiency, Effectiveness and Interaction. On SystemC the strongest correlation is Utility, followed by Efficiency, Effectiveness and Interaction.
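
The orderings stated above can be checked mechanically against Table 10. The Python sketch below transcribes the coefficients from that table and sorts the criteria by strength. The `strength` helper is an addition for illustration: the table legend does not give numeric cut-offs, so the thresholds of 0.4 and 0.7 are assumed values, chosen only because they reproduce the asterisks shown in Table 10.

```python
# Coefficients transcribed from Table 10 (overall success vs. each criterion).
TABLE_10 = {
    "Global": {"Effectiveness": 0.759, "Efficiency": 0.817,
               "Utility": 0.710, "Interaction": 0.592},
    "SystemA": {"Effectiveness": 0.779, "Efficiency": 0.843,
                "Utility": 0.362, "Interaction": 0.511},
    "SystemB": {"Effectiveness": 0.795, "Efficiency": 0.908,
                "Utility": 0.930, "Interaction": 0.660},
    "SystemC": {"Effectiveness": 0.729, "Efficiency": 0.741,
                "Utility": 0.806, "Interaction": 0.580},
}


def rank_criteria(coefficients):
    """Order criteria from the strongest to the weakest correlation."""
    return sorted(coefficients, key=lambda name: abs(coefficients[name]),
                  reverse=True)


def strength(r, moderate=0.4, strong=0.7):
    """Label a coefficient. The cut-offs are assumed, not taken from the report."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"


if __name__ == "__main__":
    for level, coefficients in TABLE_10.items():
        ordered = rank_criteria(coefficients)
        labelled = [f"{name} ({strength(coefficients[name])})" for name in ordered]
        print(f"{level}: " + ", ".join(labelled))
```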

The strength of the correlations between the ratings assigned on the four criteria and users' overall success judgement in this study indicates that user satisfaction is a multidimensional construct and that the measures used were valid. That is, a user judgement of system satisfaction is based on a response to the extent to which the system supports the many dimensions of the IR task. The Efficiency criterion held the strongest correlation with the success judgement, and the Interaction criterion held the lowest. This could suggest that efficiency is the most important criterion in the users' minds when assigning a success rating, and that the users in this study have little interest in system interaction.

By further correlation of measures within each criterion we ask whether a user's (low or high) rating on a single variable leads to a low/high rating on the related criterion.

Table 11 Global and SE level - Measures correlated within Effectiveness criterion

Measure                       Correlation coefficient
                              Global   SystemA  SystemB  SystemC
Satisfaction with relevance   .733**   .864**   .639*    .794**
Satisfaction with ranking     .485*    .620*    .371     .541*

* = moderate strength correlation
** = strong correlation

Both Globally and at SE level the strongest correlation between Effectiveness and individual measures is satisfaction with relevance, followed by satisfaction with ranking.

Table 12 Global and SE level - Measures correlated within Efficiency criterion

Measure                 Correlation coefficient
                        Global   SystemA  SystemB  SystemC
Time taken in minutes   .062     .018     -.150    .336

* = moderate strength correlation
** = strong correlation

From these results it can be seen that a negligible correlation between time taken to search and the Efficiency criterion exists.

Table 13 Global and SE level - Measures correlated within Utility criterion

Measure                       Correlation coefficient
                              Global   SystemA  SystemB  SystemC
Rate value of participation   -.557*   -.600*   -.394    -.682*
Resolution of problem         .723**   .756**   .782**   .687*
Satisfaction with results     .755**   .552*    .963**   .769**
Value of results as a whole   .742**   .594*    .913**   .772**
Overall quality of results    .702**   .512*    .804*    .833**

* = moderate strength correlation
** = strong correlation

Globally the strongest correlation between an individual measure and Utility is satisfaction with results. The rate value of participation measure has a negative correlation which indicates that as Utility rises the value of participation decreases. On SystemA the strongest correlation is resolution of the problem; on SystemB the strongest correlation is satisfaction with results; and on SystemC the strongest correlation is overall quality of results.

Table 14 Global and SE level - Measures correlated with Interaction criterion

Measure                                     Correlation coefficient
                                            Global   SystemA  SystemB  SystemC
Importance of ability to change output      .132     -.186    .345     .171
Ease of understanding item/s from hitlist   .452*    .400*    .524*    .431*
Satisfaction with input facility            .486*    .361     .600*    .488*
Importance of ability to modify query       .437*    .476*    .656*    .305
Satisfaction with presentation of query     .506*    .573*    .427*    .524*
How helpful was Help                        .190     .286     -.027    .325

* = moderate strength correlation
** = strong correlation

Globally the individual measure with the strongest correlation is satisfaction with presentation of the query which is of moderate strength. This is followed by satisfaction with facility to input query, ease of understanding item/s from the hitlist, importance of ability to modify query, how helpful was Help and importance of ability to change output. These latter two demonstrate weak correlations.

On SystemA the strongest correlation is satisfaction with presentation of query, while satisfaction with input facility, how helpful was Help and importance of ability to change output demonstrate weak correlations. On SystemB the strongest correlation is importance of ability to modify query, while importance of ability to change output and how helpful was Help show weak correlations. On SystemC the measure with the strongest correlation is presentation of query, with how helpful was Help, importance of ability to modify query and importance of ability to change output demonstrating weak correlations.

The implementation of the test was intended to be exploratory of user evaluations and the framework rather than an evaluation of the systems as such. For this reason we sought to validate our measures as those which are important from a user perspective when evaluating or making some judgement of a system. Towards this end we included open-ended questions to collect user-derived reasons for attributing a satisfaction rating to the system as a whole and a rating to each dimension. In-depth analysis, such as categorisation of the some 250 comments collected, was considered to be beyond the scope of this feasibility study. These comments are, however, used to substantiate our interpretation of the above data analysis.

The correlations of the user ratings on the measures within each criterion would seem to indicate the validity of these measures. User satisfaction with relevance held the strongest correlation with user ratings of the criterion Effectiveness. User-derived reasons for assigning ratings of success on this criterion would seem to confirm this finding.

"Information retrieved was extremely relevant to my needs"
"Although not all items retrieved were relevant those that were were very important ones"
"Most items retrieved appear to have some relevance"
"All items were of a certain amount of relevance"
"Too much irrelevant information"

The measure of search time, however, held a low correlation as a measure of Efficiency. That the Efficiency criterion held the strongest correlation with an overall success judgement suggests that users define system efficiency as something other than the time taken to obtain search results. The user-derived reasons for assigning ratings of success on this criterion would indeed seem to suggest that users relate efficiency to the amount of effort required from themselves to conduct a search. For the purpose of this feasibility study this finding has implications for the further development of user measures in the evaluation framework which will be discussed in the conclusions.

"Ease of use"
"Had to redefine search twice"
"The search terms were attempting to pin down a concept that was hard to verbalise/encapsulate"
"Would become 'extremely efficient' as the user becomes more adept with search terminology phrasing and when an 'advanced search' would be more appropriate"
"Very quick only had to search once"
"One search term locate all items that were of some relevance"
"Needed to define search better"
"Minimum effort but results not good"
"Search engine seemed efficient enough, but the search term was unusual. I think with a more concrete search term the SE would have performed well"

The Utility measures all held strong correlations both globally and across the search engines. Our user-derived reasons would also indicate that these were measures which users themselves used in judging system performance. Further analysis would be required to ascertain if in fact all these measures were simply variations of the same measure "satisfaction with results".

"Those items found were useful"
"I have gained further info on the subject I was search for"
"Current and up to date info was located"

The correlations found with the Interaction measures were relatively low, with user satisfaction with query presentation holding the strongest correlation. The user-derived reasons, however, indicated that there was perhaps some expectation from the users that the system would provide some assistance in modifying the query and that this would impact on their evaluation of system interaction.

"I changed the query once and it was helpful"
"The SE easily allowed the query to be modified"
"I didn't like the style of layout of retrieved item"
"Found it hard to refine search"
"The query was easy to change but yielded no better results"
"Good options to change query"
"Refining the search was hard, I couldn't think of any new queries and the SE didn't offer any help trying to narrow down search queries, like the SystemB SE I usually use"
"Could lead to different routes of enquiry from the initial search term"

Proposition 2 Characteristics of information systems will affect user evaluation on task dimensions

The view taken is that user evaluations are not random, but reflect the characteristics of the system in supporting the users' task. For the purpose of the feasibility study we looked to find evidence that variation across the systems is found in the users' evaluation based on the measures used.

The strongest correlation between users' overall rating and Utility was found on SystemB, with the measures of user satisfaction with results, value of results as a whole, and overall quality of results correlating most strongly with user judgement on this criterion. This would suggest that users' high rating of Utility led to a high rating of system success. In contrast, a very weak correlation was found on SystemA between users' overall rating and Utility. The marked difference in the strength of correlation found between the systems is interesting, but only in that it suggests that users' overall judgement of system success may be more strongly associated with a judgement made on a particular task dimension/criterion depending on the system. In an evaluation study with a far larger sample, more insight and interpretation could be possible from an analysis of the central tendency on the rating scales for the individual measures. For example, in our feasibility study using a small sample it was found that 68% of the users rated the utility of the results from SystemC as contributing very little to the resolution of the problem, and 58% of the users expressed dissatisfaction with the results as a whole. This could be compared with the 29% who expressed dissatisfaction with the results from SystemB.

Following this line of analysis for the Interaction dimension, we can note that the strongest correlation between users' overall rating and Interaction was found on SystemB, with the measures of user satisfaction with input facility and ability to modify query correlating most strongly with user judgement on this criterion. In the analysis of central tendency it was found that 77% of users rated the ability to modify query as important with SystemB.
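The percentages quoted in this kind of central tendency analysis can be read as the share of respondents whose rating on a measure falls at or below a chosen scale point. The following is a minimal Python sketch only: the five-point scale, the function name share_at_or_below and the example ratings are all assumptions made for illustration, and none of the values are data from the study.

```python
from statistics import median


def share_at_or_below(ratings, threshold):
    """Return the percentage of ratings that are <= threshold."""
    if not ratings:
        raise ValueError("no ratings supplied")
    hits = sum(1 for r in ratings if r <= threshold)
    return 100.0 * hits / len(ratings)


# Hypothetical ratings on an assumed 1-5 scale (1 = very dissatisfied).
# Illustrative values only, not data from the feasibility study.
example_ratings = [1, 2, 2, 4, 5, 2, 3, 1, 2, 4]

# Share of respondents counted as "dissatisfied" (rating of 1 or 2).
dissatisfied = share_at_or_below(example_ratings, 2)

# The median is a simple central tendency summary for ordinal ratings.
middle = median(example_ratings)

print(f"dissatisfied: {dissatisfied:.0f}%, median rating: {middle}")
```

With a larger sample, the same calculation could be repeated per system and per measure to compare the distributions of ratings.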

Proposition 3 Query characteristics will affect user evaluation of SE

In the framework proposed it is suggested that user evaluation of the system may be moderated by some contextual characterisation of the user and information query. That is, a user context makes different demands on the system and thus leads to higher or lower user evaluations of satisfaction with the system on dimensions of the information retrieval task.

For the purpose of the feasibility study we sought confirmation or otherwise that the query context will have some moderating effect on the evaluations. To this end we analysed the user/query context where a system received high/low ratings by correlating the four task identifiers (task defined, task purpose, task knowledge and task probability) against the overall satisfaction rating (General Feelings) and the four criteria. Again we stress that our study was not an evaluation of the systems but rather a testing of the feasibility of the framework as an evaluation tool. A greater sample would be required in an evaluation situation to support any analysis at this level.

Globally, across the three engines, moderate strength correlations were found between task definition and General Feelings (.407), Effectiveness (.418) and Efficiency (.482), indicating that as task definition increases so does overall satisfaction, satisfaction with Effectiveness and satisfaction with Efficiency. The correlations between task definition and Utility (.307) and Interaction (.221) were weak. Weak or very weak correlations were obtained between task purpose, task knowledge and task probability on the one hand, and the overall satisfaction rating and the four criteria on the other.
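A correlation of this kind between a task identifier and a criterion rating could be computed as sketched below. This section does not state which coefficient was used; Spearman's rank correlation is assumed here because rating scales are ordinal. The function names and the example ratings are hypothetical and are not taken from the study.

```python
from math import sqrt


def ranks(values):
    """Assign 1-based average ranks so that tied ratings share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical ratings (assumed 1-5 scales), one pair per search session.
task_defined = [1, 2, 3, 4, 5, 3, 2, 4]
efficiency = [2, 2, 3, 5, 4, 3, 1, 4]

rho = spearman(task_defined, efficiency)
print(f"task defined vs Efficiency: rho = {rho:.3f}")
```

Running the same function on the subset of sessions for each engine would give the per-system correlations, although with a small sample such coefficients should be interpreted cautiously.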

The suggestion that a system receives a higher rating of effectiveness and efficiency when the user has a well-defined task is not surprising. It would be reasonable to assume that in such a context the information seeker will have a fairly good idea of the search requirements and will work effectively with the system to obtain the results. The effect of the moderating context will be of more interest if notable variations can be found between the systems evaluated.

Again, within the limitations of the feasibility study, it is noted that the strongest correlations of task defined against Effectiveness (.561) and Efficiency (.702) were found on SystemB. Whilst weak correlations were found globally for task purpose, the data obtained for SystemB showed moderate strength correlations for task purpose with Efficiency (.636) and Utility (.577). By comparison, the corresponding correlations on SystemA were weak (task purpose with Efficiency .161 and with Utility .002), which sets up a line of enquiry as to why a relatively strong correlation was obtained on SystemB. It would not be surprising if a correlation were found for task purpose and Utility across all three engines: a broad query, open to many avenues, could lead to a high rating of the utility of the results. That such a finding is strongly held on only one engine could lead to speculation that a feature of the engine leads to results which better support a broad query. Further analysis of the association of task purpose with the Interaction measures revealed that correlations of moderate strength were obtained only with the measures ease of understanding item/s from hitlist and satisfaction with query presentation on SystemB. This may suggest that the features of SystemB which are related to the visual organisation and representation of the search results better support the user with a broad query.

4.5 Discussion and Conclusions

The aim of this project was to develop a framework for the evaluation of Internet Search Engines with an emphasis on a user-centered perspective. The review of search engine developments revealed a range of indexing and retrieval techniques which are employed to assist casual users in their task of retrieving information. In particular, a range of novel search engine features or characteristics can be seen in the areas of search assistance (query formulation, modification, and visualisation) and results ranking. In this context, it is critical that we have some means to measure the impact system features have on users' satisfaction with respect to what they want to do or achieve with these systems. The review of approaches for the evaluation of retrieval systems served to highlight the complexity of evaluation studies which aim not only to obtain some objective measure of performance but also some measure of the utility of the retrieved results and the usability of the system from a user perspective. Consideration of this complex evaluation situation led us to the proposal of a conceptual framework for system evaluation in which user satisfaction is characterised as a function of system-task fit expressed in a moderating context of the user requirement. Thus the aim of this project was to explore whether the framework can capture the complex interrelations among system and contextual parameters in such a way as to provide meaningful user evaluation of the system. Towards this end we focussed our research on the definition of the construct of user satisfaction, which in the proposed framework would be taken to be the dependent variable. The key to its use as a meaningful measure of the system and its functionality is the view that user satisfaction is a multidimensional construct, a function of the user's task requirement.

For the feasibility study a framework for evaluation was developed along these lines drawing, in the main, on existing user satisfaction measures and contextual characterisations. The general task model of the retrieval process provided the dimensions of user satisfaction and, to an extent, allowed us to identify the system components or features which may impact on the users' judgement of satisfaction with respect to task-support or fit. For the further development of user satisfaction measures, use will be made of alternative models which give emphasis to the dynamic and interactive nature of the retrieval task. The main objective of the feasibility study was to test both the notion that satisfaction is a multidimensional construct and the validity of the measures used.

The analysis of the data collected in the empirical investigation appears to support the notion that user satisfaction is expressed as a multidimensional construct with correlations held among the measures and overall success ratings. To ascertain the validity of the measures used, or to develop new user satisfaction evaluation statements, will require further analysis, in particular of the user-derived reasons for attributing system satisfaction on the dimensions of Efficiency and Interaction.

The conceptual argument underlying the evaluation framework is that user satisfaction, the strength of user evaluations, will be dependent on the system characteristics in supporting the associated task dimension, given the context of task demands and capabilities of the user. In the feasibility study some variation was found in the users' ratings on the criteria and measures across the search engines, indicating that the features of the engines may have in some way contributed to users' evaluations of the systems. It is further possible to speculate that system characteristics such as 'selection of items for inclusion in database' may impact on the Utility judgement, and 'facilities for query modification, such as relevance feedback or suggesting terms' may impact on the Interaction judgement. Some impact of the identified query context was also found, suggesting its moderating effect on users' evaluations of the systems. Again, within the constraints of the feasibility study, we could only speculate with great caution that a characteristic of the system may have better supported a particular query context. In a full-scale evaluation study appropriate statistical techniques, such as regression analysis, would be necessary to explore the relationships held among dependent and moderating variables and to express the overall performance measure as a function of these variables. The implementation of such an evaluation framework would require a far greater sample size than the one used for this feasibility study.
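One common way to express a dependent measure as a function of a predictor and a moderating variable is a regression with an interaction term. The sketch below is offered only as an illustration of that general technique, not as the analysis planned by the authors. It assumes a model of the form satisfaction = b0 + b1*fit + b2*context + b3*fit*context, where "fit" stands for a rating on a task dimension and "context" for a task identifier such as task definition. All names are hypothetical, and the data are synthetic values generated from known coefficients so that the fit can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_moderated(fit_scores, context_scores, satisfaction):
    """Ordinary least squares for y = b0 + b1*f + b2*c + b3*f*c.

    Returns [b0, b1, b2, b3]; b3 is the moderation (interaction) term.
    """
    rows = [[1.0, f, c, f * c] for f, c in zip(fit_scores, context_scores)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, satisfaction)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic, noise-free data on an assumed 1-5 grid (not study data),
# generated from known coefficients 1.0, 2.0, 0.5 and 0.25.
fit = [f for f in range(1, 6) for c in range(1, 6)]
context = [c for f in range(1, 6) for c in range(1, 6)]
satisfaction = [1.0 + 2.0 * f + 0.5 * c + 0.25 * f * c
                for f, c in zip(fit, context)]

coeffs = fit_moderated(fit, context, satisfaction)
print([round(b, 3) for b in coeffs])
```

A non-zero interaction coefficient would indicate that the association between the task dimension rating and satisfaction changes with the context variable. In practice an established statistics package would normally be used, since it also reports standard errors and significance tests.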

Our preliminary findings have revealed the complexity of the construct of user satisfaction as a measure of system performance, but also have indicated to us the potential value of the proposed framework for the evaluation of search engines. We therefore tentatively suggest that with further refinement the proposed framework will provide for a multidimensional user evaluation of search engines and may allow some evaluation of the specific features of search engines from a user perspective. Furthermore, the incorporation of a moderating context in the evaluation may provide a better understanding of the differences found in users' evaluations of the same system. Ultimately our aim would be to develop the evaluation framework so that not only is variation found across systems in users' expression of satisfaction but also that system characteristics can be identified which provide an explanation for variation within the evaluation dimensions and across the users' task contexts. Towards this end, research will continue at the Manchester Metropolitan University to develop a set of user evaluation statements which define the multidimensional construct of user satisfaction by the task dimensions of information retrieval.
