
2.1 A Brief Review: the History of Search Engines

When the World Wide Web was growing during the early 1990s, users were confronted with a serious problem: finding information among millions of documents was becoming an almost impossible task. It became evident that something had to be done, and that something was indexing the entire contents of the World Wide Web.

Gary Taubes described these early days of search engines in his article in Science:

"One of the first attempts to solve the problem relied on anyone who posted a new home page to list the service in a central index-the "Mother of All Bulletin Boards," as its developers at the University of Colorado, led by Oliver MacBryan, called it. Users of the bulletin board could then search it for whatever subjects interested them. For the Mother of All Bulletin Boards to work, however, everyone posting documents on the Web had to know about it-and use it correctly. And even then the result would be a list rather than a subject index. "It was a good notion in principle," says Paul Ginsparg, a Los Alamos National Laboratory physicist who created an electronic preprint archive. "But it didn't really solve the problem." MacBryan was also among the first to try another solution, one that didn't require users to take the initiative: send out a search and retrieval program, which he called the World Wide Web Worm, or WWWW "It's the simplest thing you can do," says University of Colorado computer scientist Michael Schwartz: "Write a piece of software that reaches out across the network and retrieves as many Web pages as it can find and follows links in those pages to find other Web pages." At each page, the program records the address, known as the uniform resource locater (URL), and downloads part of the contents for indexing in a searchable database."

In principle, it is simple to retrieve as many Web pages as one can find and follow the links in those pages to find other pages. In fact, current search engines are still based on this principle. However, if the target is to index the entire contents of the Web, a few things have to be considered. First, there is the size of the Web. It is not easy to estimate its exact size, but according to Steve Steinberg's article in Wired it was 30-50 million pages in May 1996. It is clear that crawling through such a huge number of pages requires quite efficient hardware and software. Steinberg's article also gives a description of the hardware used at the time:

"What happens when everyone in the whole world is connected to the Web, and half of them are trying to use Inktomi at the same time? Perhaps computational power will be the bottleneck. Definitely not a problem, Brewer insisted unflaggingly. Inktomi has been stress tested at more than 2.5 million queries a day with no difficulty - and that's with just four outdated workstations. Hook together 40 state of-the-art computers and Inktomi should be able to handle 100 million queries a day - easy. Sure, the Web is growing exponentially, but microprocessors are on that same curve. Computational power is the least of our worries."

Eric Brewer stated in May 1996 that the Web would grow exponentially but that there would be no problems with computational power. Let's take a more recent estimate and see how these forecasts held up. Lawrence and Giles wrote in their article in Nature that, as of February 1999, the publicly indexable Web contained about 800 million pages. Indeed, the growth has been exponential. What about computational power: would four workstations still do? Let's take a look at the hardware of the Google search site:

"Google runs on a unique combination of advanced hardware and software. The speed you experience can be attributed in part to the efficiency of our search algorithm and partly to the thousands of low cost PC's we've networked together to create a superfast search engine."
"To provide the net's best search results, Google operates what is probably the world's largest Linux cluster that puts many supercomputing centers to shame. For example, the current cluster contains 800 TB of disk storage and has an aggregate I/O bandwidth of about 150 GB/sec (that's bytes, not bits)."

These are quite impressive figures, and they show that search sites must invest heavily in hardware if they want to cope with the exponential growth.

Hopefully, these short fragments from the history of search engines have outlined the main challenges that search engines are confronted with. Next, we will discuss how search engines in general crawl through and index hundreds of millions of pages.

Those who are interested in a more thorough coverage of the history of search engines can read, for instance, the first chapter of Guide to Search Engines by Wes Sonnenreich and Tim Macinta.



This page was made as a project assignment for the course Teletekniikan perusteet.
The page was last updated on 08.12.2000 at 23:46
URL: http://www.netlab.tkk.fi/opetus/s38118/s00/tyot/28/history.shtml