
DEvISE > Final Report: Internet Search Engines

Chapter 2. Internet Search Engines: Factors which affect performance

The categorisation of search engine components and features which may impact on performance follows a logical sequence, considering the database collection, the index, and the user/system search features. The tables are derived from more comprehensive reviews of search engines found in Su & Chen, 1999; Notess, 2000; Feldman, 1998; and Sullivan's searchenginewatch. We focus on only four search engines, AltaVista, Excite, HotBot (Inktomi), and NorthernLight, to provide an indication of the variation found. We acknowledge that these illustrations may contain some inaccuracies given the changing state of search engines. However, the tabulation of features is not intended to evaluate or compare engines, but rather to highlight how engines can be characterised by their features, which provides clues for the development of an evaluation methodology.

2.1 Search Service Coverage

While it is possible to submit a web page to a search service for inclusion in its database, most services will also acquire database information from web pages through the use of agents or robots. Sullivan's table shows a number of factors which may vary across the strategies used by robots for crawling. Depth of crawling refers to the strategy used for following inter-document links: some robots will follow all links, others will sample. Frames and image maps, if not supported by the engine, will impede progress in crawling the web. Learn frequency and instant index refer to strategies used to update the database with new or changed information. Some engines 'learn' how frequently a site changes so that they can re-examine frequently changing sites more often. Instant index refers to the time delay before crawled pages appear in the index. While the process of selecting and/or reviewing quality content is generally reserved for subject-specialised search services, some query-based services also attempt to reduce the size of the database by establishing subsets of reviewed resources or of the most popular ones. Link popularity, when used to determine which pages are included in the index, establishes the popularity of a page through analysis of the number of links to it from other pages.
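The difference between a deep crawl and a shallow one can be illustrated with a minimal sketch. It assumes a hypothetical in-memory link graph (`LINKS`) in place of real web pages, and a hypothetical `max_depth` parameter standing in for whatever limit a given robot applies; it is not a description of how any of the four engines actually crawl.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "products"],
    "about": ["team"],
    "products": ["widget"],
    "team": [],
    "widget": ["manual"],
    "manual": [],
}

def crawl(start, max_depth):
    """Breadth-first crawl that does not follow links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # a shallow crawler stops following links here
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

shallow = crawl("home", 1)  # only the start page and its direct links
deep = crawl("home", 3)     # reaches pages several links away
```

With a limit of 1 the page "manual" is never collected, which is the kind of coverage gap the "Deep Crawl" row in Table 2 points to.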

Table 2 Search engine coverage

| Coverage | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Estimated size | 30m | 50m | | |
| Deep Crawl | yes | no | yes | yes |
| Frames support | yes | no | no | yes |
| Image Maps | yes | no | no | yes |
| Learns Frequency | yes | no | no | no |
| Instant Index | yes | no | no | no |
| Link popularity | no | no | yes | |
| Coverage (content) | www | www & reviewed sites | www | www & special collections/journal articles |

 

2.2 Indexing strategies

The list of indexed elements in the representation varies from service to service. The majority will index every word on the page; others index only frequently occurring words, or words occurring within certain mark-up tags, or only the first x words or lines of HTML files. Stopwords may or may not be applied and, if applied, may include words of very high frequency such as "web". The use of metatags, traditionally intended to improve a search by providing a common ground of indexing terminology, has seemingly been discarded by search engines. Web site developers have reportedly misused metatags, for example by repeating terms many times in an attempt to have a page appear in the top 10 retrieved. HotBot (Inktomi) reportedly enhances its index with human intellectual representations of items. Some services offer a combination of catalogs (selected collections described and classified into a taxonomy) and large full-text collections. These vary in the extent of human involvement in their creation and maintenance, and in the way in which the alternative search modes are offered to the user.
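The indexing choices described above (full text, stopword removal, indexing only the first x words) can be sketched as a small inverted index. The stopword list, the `max_words` parameter and the sample documents are all assumptions made for illustration; real services use their own, undisclosed rules.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "web"}  # illustrative list only

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, stopwords=STOPWORDS, max_words=None):
    """Map each indexed word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        if max_words is not None:
            words = words[:max_words]  # index only the first x words
        for word in words:
            if word not in stopwords:
                index[word].add(doc_id)
    return index

docs = {
    "d1": "The web of search engines",
    "d2": "Search the index and the catalog",
}
full_index = build_index(docs)
partial_index = build_index(docs, max_words=3)
```

In this sketch a query for "web" finds nothing because the word is treated as a stopword, and indexing only the first three words drops "d1" from the postings for "search".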

Table 3 Search engines indexed elements

| Indexing | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Full text | Yes | Yes | Yes | Yes |
| Stopwords omitted/not searched | Yes | Yes | Yes | no |
| Meta descriptions | Yes | Yes | | |
| Meta keywords | Yes | | | |
| Comments | | | Yes | |
| Subject Categories | | | | Uses people to create categories |

 

2.3 Search features (user control of search)

The graphical user interface (GUI) of a search engine provides system designers with a mechanism whereby control of the interaction is placed with either the system or the user. AltaVista, for example, provides options for simple querying, advanced querying, and predefined category browsing. In the opening screen (typically the simple query mode) most search engine interfaces focus on supporting the user's information seeking activities of query formulation and results display, albeit in a somewhat limited fashion. Typically the user is presented with an input box and possibly some guidance as to how to enforce the processing of the query terms (match all/ match any/ treat as exact phrase/ include or exclude a term). Although the interface for simple query appears straightforward to use (enter keywords, click submit, receive hundreds of results), beginner or casual users may find it difficult to use because they are unfamiliar with methods for narrowing search terms to retrieve a manageable number of hits to examine. The typical array of more advanced search capabilities is shown in Table 4. The use of these, for example Boolean operators to specify query term relationships, and truncation or case sensitivity to facilitate the interpretation of a term, assumes considerable experience on the part of the user, with some guidance offered in the help files.

Table 4 Search engine search features

| Search | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Boolean search | Yes | Yes | Yes | Yes |
| Nested parenthesis | Yes | Yes | Yes | Yes |
| Include/exclude (+ -) | Yes | Yes | Yes | Yes |
| Default | OR | OR | AND | AND |
| Proximity/near/adjacency searching | within 10 words | concept search approximates this | no | Relevance ranking gives boost for nearness |
| Phrase search | Yes | Yes | Yes | Yes |
| Stemming/truncation (permit or inhibit automatic stemming, or specify truncation at the terminal) | Yes | No | No | Automatic search for plural and singular word forms |
| Case sensitivity (wholly, partially) | Yes | No | for a person search | Will boost rank if capitals in results when used in query |
| Fielded search (e.g. based on title text, site, url, link, host, domain, anchor, image) | Yes | No | Yes | Yes |
| Limit restrictions (e.g. based on date, language, subject, document type, industry, domain, etc) | Yes | Yes | Yes | Yes |

2.3.1 Users & usage of search features

Search engines offer an array of search features found in traditional online services. Yet whilst many of these features give a trained search intermediary optimal search performance, search engine users are likely to range from experts to casuals (Travis, 1998). Wiggins and Matthews (1998), in summarising the themes of the 1998 Infonortics conference, highlighted the consensus which was the driving force behind many of the developments reported. Professional searchers may be adept at using Boolean operators to refine searches, but novice users are likely to become perplexed and frustrated. Thus it makes sense that on most search engines users are offered statistically based searches first. These are designed to act on natural language descriptions of an information need and to return a list of approximate matches as well as precise matches, with ranking taking care of the potential overload of often long lists of near hits. However, whilst the retrieval models offered by these statistically based ranking algorithms are touted for end or casual users, their effective use makes considerable demands, seemingly beyond the average user.

Surveys of web usage give some sense of what the average web searcher is doing and point to differences between web searches and queries with traditional IR systems. Observations of average web searchers (Spink et al., 1998; Ellis et al., 1998) point out that their ineffective use may be owing to the little understanding most users have of how a search engine interprets a query. Few are aware of whether a search service defaults to AND or OR, and many expect a search engine to discriminate automatically between single terms and phrases. Further, devices such as relevance feedback, seemingly conducive to end-user searching, work well if the user ranks ten or more items, when in reality users will only rank one or two items for feedback (Croft, 1995). Most significant is the finding from a study of one million queries put to Excite that users will enter one or two search terms rather than a full informative summary of the information query (Jansen and Spink, 2000). This is possibly due to difficulty in selecting terms, arising from the way in which users are reported to conduct a search. Koll (1993) explains that users provide few clues as to what they want, as many users approach a search not knowing exactly what it is they are looking for. In adopting the "I'll know it when I see it", or unknown needle in a haystack, approach to information seeking, users cannot be expected to formulate a precise query.
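The practical effect of an engine's default operator can be seen in a minimal sketch. The tiny postings table and the `search` function below are hypothetical and are not taken from any of the engines discussed; the point is only that the same two-word query returns very different result sets under OR and AND.

```python
def search(index, query, default="OR"):
    """Combine the postings of each query term with the default operator."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if default == "AND":
        return set.intersection(*postings)  # documents containing every term
    return set.union(*postings)             # documents containing any term

# Hypothetical postings: term -> documents containing it.
index = {
    "jaguar": {"d1", "d2", "d3"},
    "car": {"d2", "d4"},
}

any_term = search(index, "jaguar car", default="OR")
all_terms = search(index, "jaguar car", default="AND")
```

A user who does not know which default applies has no way of telling whether four hits or one hit is the expected outcome for this query.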

Larsen (1997) is of the opinion that current Internet search systems are prototypes and that their development will not focus solely on the refinement of IR techniques to zero in on the perfect retrieval set. Rather, alternative techniques will evolve to meet the behaviour of average web searchers. In recent years there has been a notable shift towards the introduction of search features which appear to respond to the ways in which users actually search with these systems. Beyond the level of mere statistical keyword matching, developments utilise a variety of technology features to help users get the information they want, even if it is not what they asked for. Such developments centre on the areas of search assistance or query formulation, with subsequent user control in modifying the query and navigating the results. The notion that improved interaction may be the key to obtaining better results is attractive in principle but is tempered by a cautionary observation from Nick Lethaby of Verity Inc, paraphrased in Andrews (1996), that "users don't want to interact with a search engine much beyond keying in a few words and letting it set out results" (p.42). Thus, in categorising the development of search features, we distinguish those which provide searcher assistance from those which shift control back to the system to provide the most likely relevant hits.

2.4 Search features (system control of query)

As it can be assumed that most users do not use advanced search features, or enter complex queries, or indeed want much to do with searching or interaction, search engines are trying to automate query formulation, that is, to shift the burden of coming up with precise or extensive terminology from the user to the system. Some tweaking in this general direction has already been shown in Table 4 (Section 2.3), for example where NorthernLight will boost the ranking of retrieved items containing capitalised query terms. More elaborate are the notions of concept searching, the use of site popularity to improve the relevance ranking of results, and the creation of directories to help the user browse more productively.

2.4.1 Query expansion

Help in improving a user's query formulation may be provided by the use of concept searching. The assumption here is that users will take a quick and simple approach to putting a query to a search engine and that automatic expansion of the query will improve the search expression. On a deeper level, concept processing of a search statement attempts to determine the probable intent of a search (e.g., Excite's ICE technology).

Automatic query expansion uses a system-generated thesaurus, more accurately described as a list of words statistically related by frequency of co-occurrence in documents. Thus a search engine may modify a query by adding those terms with a strong association or high coincidence in documents containing the initial query term(s). This often results in the high recall typical of a thesaurus-based system and, since precision can be adversely affected, the search may subsequently be refined by allowing the user to select relevant items for a reiteration of the search. Excite's ICE technology (1999) reportedly works at a deeper level, applying concept processing to determine the probable intent of the query. Whilst the detailed operation of the technology is confidential, some clue to its working is found in a comparison to Latent Semantic Indexing, which analyses, by correlations of related terms, the separable contents (or concepts) of a document. Probability theory may also be employed in concept processing to look at ideas contained in text as the outcome of probabilities derived from the clustering of certain symbols. For example, if the symbol 'bar' clusters near certain other symbols in a passage, such as 'drink' and 'bottles', then it is likely to refer to a room containing a counter across which refreshments are served rather than a rod, a place at which a prisoner stands, or a European sea-fish. Furthermore, if these clusters of symbols are present in a text, there is a good chance that it is about the said concept even if the word 'bar' is not actually present. As far as the user is concerned, the outcome of such processing is that relevant items may be retrieved even if they fail to contain the original keywords of the search statement. This is quite a significant advance on keyword matching when one considers the various ways in which an information query may be expressed, each as likely as any other, but which often result in little or no overlap in the results obtained when put to the same search engine.
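A co-occurrence "thesaurus" of the kind described above can be sketched in a few lines. This is a deliberately simple illustration, assuming a hypothetical four-document collection and a hypothetical `min_count` threshold; it is not Excite's ICE technology or Latent Semantic Indexing, whose details are not given in the sources.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how often each unordered pair of words shares a document."""
    pairs = Counter()
    for text in docs:
        words = sorted(set(text.lower().split()))
        for a, b in combinations(words, 2):
            pairs[(a, b)] += 1
    return pairs

def expand(query_terms, pairs, min_count=2):
    """Add every word that co-occurs with a query term at least min_count times."""
    terms = set(query_terms)
    expanded = set(terms)
    for (a, b), count in pairs.items():
        if count >= min_count:
            if a in terms:
                expanded.add(b)
            if b in terms:
                expanded.add(a)
    return expanded

# Hypothetical collection echoing the 'bar' example in the text.
docs = [
    "bar drink bottles",
    "bar drink counter",
    "bar iron rod",
    "drink bottles glass",
]
pairs = cooccurrence(docs)
expanded_bar = expand({"bar"}, pairs)
expanded_drink = expand({"drink"}, pairs)
```

Here the query 'bar' is expanded with 'drink', so a document about drinks and bottles could be retrieved even if it never uses the word 'bar'; the same mechanism is also what puts precision at risk.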

2.4.2 Query modification

Providing more user control during query re-iteration and re-formulation, Excite's search wizard and AltaVista's refine function present the user with suggested search terms which frequently occur in the items retrieved. Infoseek's automatic categorisation of documents by topic is likewise offered as a browsable suggestion of topics likely to be relevant to a given search. All of these may assist the user in narrowing a search and provide more precision in the search results.

Another technique for providing user control in the process of query modification is the relevance feedback option (e.g., 'More like this'). Here conventional querying and browsing strategies have been integrated to allow users to specify a particular document and then browse from that document in order to build a request model. This results in an iterative process of query modification and feedback which places the user in control of the interaction. The basic principle is that users control subsequent queries by assessing the relevance of documents, which are then used to modify subsequent query formulation. The query may be restated using high frequency terms from the documents identified as relevant, or the entire contents of the specified document may be used as the search parameters to locate similar documents. Again, as far as the user is concerned, such a search function assists in the specification of the query at an appropriate level without placing too much of a burden on the user to come up with the terminology to be used. To an extent the searcher is assisted in transforming a perceived information need into a search formulation within the vocabulary and command constraints of the system.
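The first variant of relevance feedback mentioned above, restating the query with high frequency terms from a document the user marked as relevant, might be sketched as follows. The function name `more_like_this`, the stopword list and the sample text are assumptions for illustration only, not the behaviour of any particular service.

```python
from collections import Counter

def more_like_this(query_terms, relevant_text, stopwords, top_n=2):
    """Extend the query with the most frequent new terms of a relevant document."""
    counts = Counter(
        word for word in relevant_text.lower().split() if word not in stopwords
    )
    new_terms = [
        word for word, _ in counts.most_common() if word not in query_terms
    ][:top_n]
    return list(query_terms) + new_terms

# Hypothetical document the user has marked as relevant.
relevant_text = (
    "penguin colony antarctica penguin colony "
    "penguin colony antarctica breeding the the"
)
new_query = more_like_this(["penguin"], relevant_text, stopwords={"the"})
```

The user only has to point at one good document; the added terms ('colony' and 'antarctica' in this sketch) come from the system rather than from the user's own vocabulary.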

2.4.3 Query visualisation

Where some form of automatic categorisation of documents by search engines takes place, additional functionality may be offered in the form of the visualisation of multidimensional information about search results. That is, the creation of on-the-fly groupings of search results can aid browsing of the different themes or concepts within the search results. Such organisation of results into categories reduces the potential overload in the retrieval of hundreds or thousands of items and assists users in judging the relevance of the retrieved items. It also has the useful side effect of highlighting to the user the potential ambiguity of the original search terms (as has been noted, users often fail to provide the important contextual information of a query) and thus can be viewed as a form of query assistance. Excite's ICE technology recognises clusters of documents and can base the grouping of the search results on these. Most elegant are NorthernLight's dynamic custom folders (Zorn et al., 1999), based on a categorisation in which documents are mapped to a classification system and tagged accordingly. Custom folders based on the search results set provide the user with a hierarchical overview of the major topics retrieved, allowing drilling down from the broad to the specific.

2.4.4 Popular queries

Search assistance can thus be provided in the form of query expansion, query modification or visualisations of the major topics resulting from the query. These all work towards the general improvement of a typical search in which the user submits a couple of keywords, a strategy which fails to capture important contextual information about the need and the relationships among query terms. Most traditional information retrieval techniques rarely deal with a further complexity in the way in which humans are accustomed to conveying meaning and understanding discourse. Much of what we convey lies in what is not said, as well as in what is said, and is assumed from the context in which the query is stated. A user who enters the term 'penguin' into a search engine is probably searching for information on the bird rather than information on Penguin books or the US rugby club. Similarly, the user who enters the broad term 'travel' is probably looking for good travel reviews or pricing information on holidays, and would be less interested in the technical details of Stephenson's Rocket. Using a Bayesian (probabilistic) approach to retrieval, where knowledge of past events can be used to predict outcomes, prior knowledge of what users are searching for can be factored into the retrieval strategies of search engines.
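The Bayesian idea can be made concrete with a small sketch of Bayes' rule applied to the 'penguin' example. All of the probabilities below are invented purely for illustration (they are not measurements of real search behaviour), and the function name `intent_posterior` is hypothetical.

```python
def intent_posterior(prior, likelihood):
    """Bayes' rule: P(intent | query) is proportional to P(query | intent) * P(intent)."""
    unnormalised = {intent: prior[intent] * likelihood[intent] for intent in prior}
    total = sum(unnormalised.values())
    return {intent: value / total for intent, value in unnormalised.items()}

# Invented numbers: how often past users wanted each topic (prior), and how
# likely someone with that intent is to type just 'penguin' (likelihood).
prior = {"bird": 0.7, "publisher": 0.2, "rugby club": 0.1}
likelihood = {"bird": 0.5, "publisher": 0.4, "rugby club": 0.3}

posterior = intent_posterior(prior, likelihood)
most_likely = max(posterior, key=posterior.get)
```

Under these assumed figures the engine would favour results about the bird, which is the sense in which knowledge of past searches can be factored into retrieval.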

AltaVista's "Ask AltaVista" is a version of the AskJeeves service. AskJeeves works on a large human-generated database of questions based on what people actually search for. When a broad term is entered, AskJeeves suggests a set of questions which the user may have intended, or suggests a set of alternative, more specific queries. A more specific variation of this is AltaVista's RealNames link, which will direct a user to official sites when a brand name search is conducted. HotBot's related searches feature offers searches which are similar to a given query, either more general or more specific. Excite's target results feature responds to certain types of popular queries with targeted information at the top of its results pages. For example, a search on a geographical location such as "New York" will first offer its list of pre-programmed results or custom information, including a city map, tourism resources, current weather, etc. In a sense the search engine infers that this is the type of information the user is most likely to be searching for when entering a general query.

Table 5 Search engine search features (system control)

| Search features (system control) | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Query expansion | | Concept search | | Concept processing? |
| Query modification | Refine (suggest terms) | Search Wizard (suggest terms); More like this (browse feedback) | | |
| Query Visualisation | | Cluster/group search results | | Custom folders |
| Popular queries | RealNames; Related searches; Ask AltaVista | Related searches; Target results | Related searches | |

 

2.5 Results display

Once a search is completed, display and browsing capabilities can help a user to determine which items are of interest. Most search engines will present the retrieved items 10 to a page in a default format showing at least the title and some text. Display formats can usually be changed with options such as sort by date, or clustering by site/sort by URL (to identify pages from the same site and thus prevent any one site from dominating the results). The summary may vary in size and preparation, e.g., some are pre-prepared and others are automatically constructed using text extracted from heading tags, the first x words of text, or the most frequent words. Where search terms are highlighted in the text, the user may gain some indication of why an item was retrieved and whether the context of the retrieved record matches the information need.
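Clustering by site can be approximated by grouping result URLs on their host name. The sketch below uses Python's standard `urllib.parse.urlsplit`; the result URLs are placeholders on reserved example domains, and real engines may define a "site" differently.

```python
from urllib.parse import urlsplit

def cluster_by_site(urls):
    """Group result URLs by host so that one site cannot dominate the list."""
    clusters = {}
    for url in urls:
        host = urlsplit(url).netloc
        clusters.setdefault(host, []).append(url)
    return clusters

# Placeholder results in ranked order.
results = [
    "http://example.com/a",
    "http://example.org/x",
    "http://example.com/b",
]
clusters = cluster_by_site(results)
```

Showing one entry per cluster, with the remaining pages from that site available on request, keeps the first page of ten results from being filled by a single site.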

Table 6 Search engines results display

| Display | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Sort by options | No | Yes | no, but offers clustering | Yes |
| Results at time | 10 | 10 | 10 | 10 |
| Title size | 78 | 70 | 80 | 80 |
| Summary size | 150 | 395 | 170-250 | 150-200 |
| Metatags description | Yes | Yes | Yes | No |
| Highlight search terms | | | | |

2.5.1 Ranking

In terms of judging the results list, Courtois and Barry (1999) argue that users are most likely to scan their results list and retrieve only selected items. However, Cullis (in Sullivan, 1998) found that only 7% of users really go beyond the first three pages of results. Sullivan goes further, saying "most users will find a result they like in the top ten. Being listed 11 or beyond means that many people may miss your web site" (2000). This suggests that users are rarely interested in a comprehensive, high recall search, but rather are satisfied with the retrieval of a couple of relevant hits.

Courtois and Barry (1999) point out that the popularity of search engines is due in part to the perceived ease of use resulting from their use of ranked output. The results and their relevance to a given query are usually ranked by statistical term frequency, location, and possibly proximity of terms in the documents. Simply put, a page which makes frequent mention of the search terms will get a higher rank than a page with only one reference. Similarly, a page with the search term in its title will be considered more relevant than others. How these criteria are applied defines the ranking algorithm and varies among search engines.

HotBot describes term frequency and location as primary factors (Sullivan 1999a). Documents with more occurrences of the search term receive a higher weight, but the overall obscurity of the term within the database also has an impact. In addition, the number of occurrences relative to the document length is considered, so that shorter documents are ranked higher than longer documents with the same number of occurrences. Terms in the title or metatags are weighted higher than terms occurring only within the text. AltaVista considers these factors, as well as the number of terms matched and the proximity of the search terms (AV Search: question 1999). Others provide less information. However, Sullivan (1999b) reports that Excite does index terms in metatags, and retrieves documents by analysis of the document content for related phrases in a process it calls Intelligent Concept Extraction (Excite, 1999).
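The factors listed for HotBot (frequency, obscurity of the term, document length and title location) can be combined in a toy scoring function. The formula below is a common textbook-style combination chosen for illustration; the actual weights and formulas used by HotBot or AltaVista are not published in the sources cited, and the `title_boost` value and sample documents are assumptions.

```python
import math

def score(term, doc, docs, title_boost=2.0):
    """Toy relevance score: length-normalised frequency x rarity x title boost."""
    body = doc["body"].lower().split()
    if not body:
        return 0.0
    tf = body.count(term) / len(body)  # occurrences relative to document length
    df = sum(1 for d in docs if term in d["body"].lower().split())
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df) + 1.0  # rarer (more obscure) terms weigh more
    boost = title_boost if term in doc["title"].lower().split() else 1.0
    return tf * idf * boost

# Hypothetical collection: the first two documents mention the term twice each.
docs = [
    {"title": "Penguin facts", "body": "penguin penguin ice"},
    {"title": "Antarctica", "body": "penguin penguin ice snow wind cold"},
    {"title": "Cars", "body": "engine wheel"},
]
ranked = sorted(docs, key=lambda d: score("penguin", d, docs), reverse=True)
```

The shorter document outranks the longer one even without the title boost, since both contain the term twice; the boost then widens the gap because the term also appears in its title.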

These methods for ranking output on predicted relevance have been experimented with for decades, but are limited to relevance based on topic alone. Barry and Schamber (1998) list at least a dozen further indicators which may determine the relevance of an item to a given user, including factors such as novelty, source characteristics, and availability. Given the utility of ranking, from a user point of view, in minimising the effort in finding an item, search engines have adopted a variety of experimental approaches using off-the-page parameters to boost the ranking of an item.

Link popularity boosts the ranking of a site if it is deemed to be popular based on the frequency with which other documents link to it. Generally speaking, pages with the most links pointing to them are placed higher in the ranking. In practice, however, the technology may be more complex whereby, for example, a link from a reviewed site or one with a good reputation will carry more weight in the overall analysis. Search engines using link popularity, such as Google, can be said to capitalise automatically on the human endorsements of web pages made by site authors when linking or pointing to what are, in a sense, recommended sites. A variation on this use of collective judgements is the use that can be made of the search behaviour of millions of web users in ranking popular sites. Direct Hit is a company which works with search engines (e.g. HotBot) and monitors user clicks on search results (which pages they visit). Over time, a measure is obtained of the popularity of sites: those which are visited more than others rise higher in the popularity rankings. To use this information in a search engine, the user may be offered the Direct Hit option on a page of search results. This will bring up the list of search hits ranked as popular by Direct Hit. For example, in HotBot, Direct Hit results are displayed under the heading "Web matches: top 10". This is usually available only when a popular query is entered, and is usually most effective for one or two word queries looking for information on, for example, a famous person or a particular site. As a result, the ranking of the results delivered by the Inktomi engine begins on the second page of ranked results.
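Link popularity in its simplest form, and the refinement in which links from reputable sources count for more, can be sketched as below. The link graph, the `weights` table and the weight of 3.0 given to a reviewed directory are all hypothetical; this is plain weighted link counting, not the algorithm of Google or of any other engine named here.

```python
from collections import Counter

def link_popularity(links, weights=None):
    """Score each page by the (optionally weighted) number of inbound links."""
    weights = weights or {}
    scores = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each source at most once per target
            scores[target] += weights.get(source, 1.0)
    return scores

# Hypothetical link graph: source page -> pages it links to.
links = {
    "a": ["c"],
    "b": ["c", "d"],
    "directory": ["d"],
}

plain = link_popularity(links)
weighted = link_popularity(links, weights={"directory": 3.0})
```

With plain counting pages 'c' and 'd' tie on two inbound links each; once the link from the reviewed directory is given extra weight, 'd' moves ahead.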

Reviewed status gives pages a boost if a site is listed in an associated directory or forms part of the "reviewed" content provided by the search service. Meta-tags give a boost if a search term appears in a metatag.

Table 7 Search engines ranking boost

| Ranking boost | AltaVista | Excite | HotBot (Inktomi) | NorthernLight |
|---|---|---|---|---|
| Link popularity | Yes | Yes | Yes | Yes |
| Direct Hit | No | No | Yes | No |
| Reviewed status | No | No | No | No |
| Meta-tags | No | No | Yes | No |

 

2.6 Chapter Summary

The review has presented a very broad categorisation of search engine components to show the extent of variation in the features offered by individual search engines which may impact on their performance. Any combination of these features may lead to a more effective search, and thus to improved performance and ultimately user satisfaction with the retrieved results. In the context in which search engines operate (notably, casual users) there has been an increasing trend to provide a range of search assistance features, such that it could be argued, as in our introduction, that search engine developers are targeting a niche: a type of user and/or information query. Future development is uncertain. Trends can be identified, such as automatic categorisation, information visualisation, and the use of bibliometrics on the web. The first two may assist a user in understanding the content of large collections or search results; the last may be used to recommend documents by analysis of citation paths or hyperlink paths. It would appear that the shift towards supporting users in their information seeking task, possibly to the extent of providing information even if it was not requested, will continue to drive advancements in techniques and technologies.

The problem faced by designers is that given the wide range of potential users little is known as to what users want, and how they might use these systems. Critically it is not known how users are satisfied and what impact these more novel features might have on search satisfaction. Thus, it is towards this end that we develop a framework for evaluation which asks how users are satisfied (e.g., whether it be on the retrieved results alone, and/or on their interaction with the system and assistance provided). Further, the framework incorporates a given spectrum of information needs and user types so that we can begin to understand the moderating effect of context on user-system satisfaction.
