Search has become gradually more sophisticated since it was first effectively developed by Gerard Salton in the 1960s, and since the early 1990s the area has been rich in innovation. Bayesian technology, developed by UK company Autonomy, and page-ranking, developed by Google, are two
outstanding examples. In both cases, the commercial results reported by users have been dramatic.
Nevertheless, as Bill Gates, co-founder and chairman of Microsoft, commented recently: “Search is nowhere near where it should be.”
Search, like many technologies, has been the subject of religious wars, with different camps advocating their favoured approach. But a clear new trend is emerging: eclecticism. The more techniques an engine combines, the better it performs.
Free text search
All search engines have at their core an engine for recognising strings of letters as words. Many simple embedded search engines today do little more than this.
Enterprise products such as Verity’s Ultraseek (now part of Autonomy), and AltaVista, the first big Internet search engine, are primarily keyword based – as is, in large part, Google. These engines have been extended with more advanced features, such as Boolean search (AND/OR/NOT queries), synonym libraries, and relevance ranking (usually by measuring the frequency and placement of words). Some engines are also capable of recognising ‘entities’, such as names or places.
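The basic machinery is simple enough to sketch. The toy example below – entirely hypothetical, and no reflection of how any of the products named here are actually built – shows an inverted index supporting a Boolean AND query, with results ranked by word frequency:

```python
from collections import Counter, defaultdict

# Toy corpus; real engines index millions of documents.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "search engines rank documents by keyword frequency",
    "d3": "keyword search engines index keyword lists in documents",
}

# Inverted index: word -> {doc_id: how often the word appears there}.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] += 1

def search(query):
    """Boolean AND search, ranked by total term frequency."""
    terms = query.lower().split()
    # Boolean AND: keep only documents containing every term.
    hits = set(docs)
    for t in terms:
        hits &= set(index[t])
    # Rank the survivors by summed term frequency.
    return sorted(hits, key=lambda d: -sum(index[t][d] for t in terms))

print(search("keyword engines"))  # ['d3', 'd2']
```

Frequency alone is a crude relevance signal – which is precisely why the refinements described in the rest of this article were developed.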
Tags and taxonomies
Free text searching alone, even with Boolean logic and entity recognition, is rarely good enough. Retrieved files can vary wildly in subject and in type of document.
Most commercial enterprise content management systems recognise this shortcoming by using tagging, or metadata. Files are labelled, usually by people but sometimes automatically as a part of a process, so that they can be filtered as fields in a structured database. This increases speed and precision – the results can be grouped in a database, in pages on a website, or in search results.
On the Internet, Yahoo was a pioneer of human categorisation. But as volumes grew, it struggled to keep up and later moved towards automatic ranking and some auto-categorisation.
Similarly, many enterprise customers are wary of categorisation. They can end up with complex ‘taxonomies’ or tagging methodologies that must be closely managed and strictly followed.
Bayesian and contextual search
Before Google became hot property, Mike Lynch, the founder of Autonomy, emerged from a cluster of researchers at Cambridge University applying Bayesian probabilistic theory to search.
Bayesian search engines, such as Autonomy’s IDOL, attempt to extract the “concepts” in documents by analysing them, rather than merely indexing them. This analysis is based on recognising patterns and clusters of words and how they relate to each other. For example, it would gain a ‘sense’ that documents with the words “Bush” and “Saddam” are different from those with “Bush” and “Mulberry”. Many other suppliers, including Autonomy’s bitter rival Fast, now use “Bayesian-like methods”.
In spite of its great success in some areas, some critics believe that Bayesian techniques can produce idiosyncratic results, and that the systems only work well with large volumes of pre-analysed documents and where some system ‘training’ has been possible.
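The flavour of the probabilistic approach can be conveyed with a naive Bayes classifier – a much simpler cousin of the techniques Autonomy and others use, shown here purely as an illustration with a made-up training set. Given a handful of labelled examples, it learns which word combinations signal which topic, resolving the “Bush and Saddam” versus “Bush and Mulberry” ambiguity:

```python
import math
from collections import Counter

# Tiny hand-labelled training set; real systems learn from large corpora.
train = [
    ("politics", "bush saddam iraq war policy"),
    ("politics", "bush saddam sanctions war"),
    ("gardening", "bush mulberry hedge pruning"),
    ("gardening", "mulberry bush planting soil"),
]

# Count word frequencies per class.
word_counts = {"politics": Counter(), "gardening": Counter()}
class_totals = Counter()
for label, text in train:
    words = text.split()
    word_counts[label].update(words)
    class_totals[label] += len(words)

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the class with the highest log-probability
    (uniform priors, Laplace smoothing)."""
    scores = {}
    for label, counts in word_counts.items():
        total = class_totals[label] + len(vocab)
        scores[label] = sum(
            math.log((counts[w] + 1) / total) for w in text.split()
        )
    return max(scores, key=scores.get)

print(classify("bush saddam"))    # politics
print(classify("bush mulberry"))  # gardening
```

The word “bush” on its own is ambiguous; it is the surrounding cluster of words that tips the probability one way or the other. Note the dependence on training data – exactly the limitation the critics point to.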
Linking and rankings
In 1998, search technology was revolutionised by Google. Its big idea was to rank the relevance of pages by giving a weighting not just to the frequency or placement of words on a page, but also to the number and quality of the links pointing to it from other pages. The innovation was based on the principle of academic peer review – in science, the best papers are ‘cited’ by reputable scientists in other papers. Link ranking produced dramatic results in search accuracy, and made categorisation and even Bayesian analysis look distinctly unfashionable. Larry Page and Sergey Brin, Google’s founders, filed their patent for this while studying at Stanford University (it expires in 2011). Effectively, the algorithm found a way to move beyond the actual content to capture some of the human activity around a document.
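The published PageRank idea can be sketched as a power iteration over a link graph – the tiny graph below is invented for illustration, and production systems add many refinements beyond this:

```python
# Hypothetical link graph: page -> pages it links out to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank: each page repeatedly shares its
    score among the pages it links to, so a page's rank reflects
    the number and rank of the pages linking to it."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c -- the most linked-to page
```

The ‘peer review’ principle is visible in the update rule: a link from a highly ranked page passes on more weight than a link from an obscure one.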
Google’s success in ranking has also spawned more innovation among other search companies. Ask.com, for example, now models clusters or communities of sites that use similar words and tend to link to each other. In this way, it is able to offer searchers similar sites when a user begins to show a preference, says Tony Macklin, Ask.com’s European vice president for product management.
Searching the search…
One of the emerging methods for helping searchers to find the information they need more effectively is to ‘search the search’. This involves using some of the information that can be extracted from previous searches around the same area, both by the searcher themselves and by other like-minded people.
At its simplest, this means storing ‘favourites’ more intelligently. Even in the enterprise, the same documents are accessed by members of a team over and over again. And some websites, such as Ask.com, now have a facility to save all previous searches, enabling users to create what is effectively a mini-database of all the websites they have ever visited via that portal.
Within the enterprise, Autonomy’s enterprise platform IDOL can pool search information, so that searches made by one colleague can help narrow the search made by another – a powerful technique for team workers, in particular.
This is expected to become more popular on the Internet. Search engines can model the searches of like-minded individuals to offer up apparently insightful results – a process not dissimilar to Amazon.com’s “People who read this book also liked…”
system of suggesting alternative purchases to online shoppers.
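The underlying mechanism is item co-occurrence, and a minimal sketch – with invented user histories, and no claim to match how Amazon or any search engine actually implements it – looks like this:

```python
from collections import Counter

# Hypothetical histories: the set of items each user has viewed.
histories = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_b", "book_d"},
    {"book_a", "book_b", "book_c"},
]

def also_liked(item, histories, top=2):
    """Rank other items by how often they appear alongside `item`
    in the same user's history."""
    co_counts = Counter()
    for history in histories:
        if item in history:
            co_counts.update(history - {item})
    return [other for other, _ in co_counts.most_common(top)]

print(also_liked("book_a", histories))  # ['book_b', 'book_c']
```

Pooled enterprise search works on the same principle: one colleague’s past choices become a signal that reorders another colleague’s results.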
Whatever automated methods are used to aid search and find meaning, most experts agree that tagging, when done properly, produces definitive, practical results. But the process of creating taxonomies and labelling content is an arduous manual task. Is there a better way?
Two solutions are being tried, so far with mixed results. In the enterprise, one approach is to automatically generate metadata tags from the document. This works best in a formalised, controlled environment, possibly using pre-defined vocabularies. At its Almaden Research Center in California, IBM has developed a method for automatically tagging documents in a very structured way, based not just on a document’s content, but also on the way it is formatted and used.
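One common content-based route to automatic tagging – a generic TF-IDF sketch on made-up documents, not IBM’s method, which also draws on formatting and usage – is to suggest as tags the words that are frequent in a document but rare elsewhere in the collection:

```python
import math
from collections import Counter

# Hypothetical document collection.
docs = {
    "memo1": "quarterly sales report sales revenue sales growth",
    "memo2": "engineering design review schedule",
    "memo3": "sales forecast revenue targets",
}

def auto_tags(doc_id, docs, top=2):
    """Suggest tags using TF-IDF: score each word by its frequency
    in this document, discounted by how many documents contain it."""
    term_freq = Counter(docs[doc_id].split())
    n_docs = len(docs)
    scores = {}
    for word, count in term_freq.items():
        doc_freq = sum(1 for text in docs.values() if word in text.split())
        scores[word] = count * math.log(n_docs / doc_freq)
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(auto_tags("memo1", docs))  # ['sales', 'quarterly']
```

In practice the candidate tags would be constrained to a pre-defined vocabulary – the ‘formalised, controlled environment’ in which this approach works best.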
On the web, another approach is to enable users to tag the documents themselves – a method used on photos on Yahoo’s popular Flickr site, for example. These tags create so-called ‘folksonomies’. But in spite of the appeal, critics do not think web users will be disciplined enough to add metadata to their pages in a consistent, reliable way. To do this, they would need to adopt a ‘collabulary’ – a collaborative vocabulary.
Most researchers agree: it should be possible to ask a computer to find a document by using natural speech. An unproven alternative to Bayesian analysis is semantic analysis: analysing language to understand meaning. This would give more precise answers to vague questions. Although some commercial products exist, the ambiguity and inconsistencies of natural language present major barriers. Recently, IBM’s Thomas J Watson Research Center in New York demonstrated a system called ‘Piquant’, based on its Semantic Analysis Workbench. Visitors report that it appeared to understand meaning well – but were unclear about which technologies are actually used.
Today, there is no such thing as perfect search. But in all likelihood, the ultimate search engine will use many techniques, with Bayesian technology, auto-categorisation, semantic analysis and computer generated ‘disambiguation’ all playing key roles.