Launched in 1998, Google’s stated its mission: “to organize the world’s information and make it universally accessible and useful.” And so it is. Today, everyone Googles – in the U.S, about 12 billion times a month (including search engines that aren’t Google). We are mostly pleased with the results we get. How can it be that we give an automated system a couple of words and it finds reasonably relevant documents among one hundred billion or so possibilities? Will our satisfaction with these tools increase or decrease as the Web and our expectations grow?
At the March 4 Lunch ‘n Learn seminar, Computer Science Professor Andrea LaPaugh gave a peek “under the hood” of major search engines. Core techniques range from word occurrence analysis for text documents, which originating in the 1960s, to Web linking analysis, pioneered by Google’s 1998 PageRank document ranking method.
It’s possible to score the relevance of documents by taking into account their length, by tabulating the frequency of words and phrases, and by giving greater weight to key words and phrases. The appearance of search terms in a section header, in bolded text or italics, within abstracts, or perhaps even within the URL itself can enhance the score in specific queries.
Before you initiate a search, retrieval systems have already recorded all information about web documents in an index that records a list, for each word, of all documents in which the word appears, the positions of the word in each document, and the various attributes that define the use of each occurrence. Indexes do not usually record pairs of words. Such indices would become prohibitively large. But by storing the location of each word, it becomes easy to determine if a searched for phrase, even a lengthy phrase, exists within a document.
PageRank is the algorithm that gave Google its killer performance. The technique uses the idea of random walks, essentially following links to and from web pages. As with scholarly literature, every link reference confers some legitimacy to the linked page. Every web page gets a stable PageRank score that reflects not only word frequency analysis, but also the number of links from and to the page and the probability of being at any given web page.
The CASS Project, within the Princeton Computer Science department, is one of many efforts to investigate retrieval of images based upon features within the images. Led by Kai Li, Moses Charikar, Perry Cook, Olga Troyanskaya, and Jennifer Rexford, the project permits you to click on a picture, for example, and generate similar pictures from its database.
LaPaugh predicted the emergence of a richer Web that could proffer topics without word clues, respond to natural language queries, and return information in ways that are more useful to us individually.
On February 22, 2009, the New York Times ran an article Exploring a “Deep Web” that Google Can’t Grasp. The point, she emphasize, is that much of the information on the web related to financial information, shopping sites, and even much academic research lies behind databases that must be directly queried rather than universally indexed. New technologies are emerging to extend the reach of search engines into these database collections.
The emerging web may use additional clues such as personal or aggregate behavior, essentially how users have interacted with pages, clustering and sampling methods, and individual link analysis. Clusty offers results based on clusters. A search on the word “bat” will cluster results around different topics such as baseball, mammals, images, and conservation. Another site, Cuil, provides useful summaries of sites.
Other research areas being pursued include machine learning to assign more appropriate catalog tags, network analysis of the web’s structure, semantic analysis of Web pages, sampling methods for image and video analysis, and architectures for large data collections.
Andrea LaPaugh is a Professor of Computer Science. Her research is in the development and evaluation of methods for searching and analyzing information. Much of Professor LaPaugh’s work is based on the principles of combinatorial algorithm design. In addition to her work in application areas, she has developed and analyzed algorithms for theoretical combinatorial problems such as graph structure problems. Professor LaPaugh has worked extensively in the development of algorithms for problems in digital design. A major area of research has been VLSI circuit layout: investigating the interactions between placement and detailed routing. She has developed algorithms that use these interactions to find better placements for circuit components.She is currently teaching COS 435: Information Retrieval, Discovery, and Delivery.
A podcast and the presentation are available.