
2.3 Web Robot Program

We have already defined a web robot as a piece of software that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents that are referenced. Now we will discuss more precisely how this is done. When a new web robot starts operating, it needs a known set of documents to start from (the set of documents to be retrieved). It then repeats the following actions (a minimal sketch in Python follows the list):

  1. choose which document to retrieve next
  2. retrieve this document
  3. examine the outbound links and add them to the set of documents to be retrieved
  4. mark the document as having been retrieved and pass it to the indexing program
  5. return to step 1.
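
A minimal Python sketch of this loop might look as follows. The retrieval limit, the regular expression used to find links, and the print placeholder standing in for the indexing program are simplifications of our own, not part of any particular robot:

    import re
    import urllib.request

    def crawl(start_urls, limit=10):
        to_retrieve = list(start_urls)    # documents still to be retrieved
        retrieved = set()                 # documents already retrieved
        while to_retrieve and len(retrieved) < limit:
            url = to_retrieve.pop(0)      # 1. choose the next document
            if url in retrieved:
                continue
            try:
                page = urllib.request.urlopen(url, timeout=10)    # 2. retrieve it
                html = page.read().decode("iso-8859-1", "replace")
            except OSError:
                continue                  # unreachable document, skip it
            for link in re.findall(r'href="(http[^"]+)"', html, re.I):
                to_retrieve.append(link)  # 3. add outbound links to the set
            retrieved.add(url)            # 4. mark as retrieved ...
            print("would index:", url)    # ... and pass to the indexing program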

Choosing the Document to Be Retrieved

Basically, all web robots use some form of breadth-first algorithm when choosing which server to visit next. The reason is simple: if they used a depth-first algorithm, that is, if they tried to retrieve all pages from one web server at a time, they would probably overload that server. Overloading would either significantly slow the server down or even crash it. It is therefore better to retrieve only a few pages from each server at a time, spreading the load among servers and ensuring that every server with useful content has at least some pages represented in the index.
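
One way to implement this spreading of load, sketched below, is to keep a separate queue of URLs for each server and to serve the queues in round-robin order; the class and its interface are an illustration of our own, not a standard component:

    import collections
    from urllib.parse import urlparse

    class BreadthFirstFrontier:
        # One URL queue per server; servers are visited in round-robin
        # order, so no single server receives a long burst of requests.
        def __init__(self):
            self.queues = collections.OrderedDict()  # host -> deque of URLs

        def add(self, url):
            host = urlparse(url).netloc
            self.queues.setdefault(host, collections.deque()).append(url)

        def next_url(self):
            if not self.queues:
                return None
            host, queue = self.queues.popitem(last=False)  # oldest host first
            url = queue.popleft()
            if queue:
                self.queues[host] = queue   # re-queue the host at the back
            return url

    frontier = BreadthFirstFrontier()
    for url in ("http://a.example/1", "http://a.example/2", "http://b.example/1"):
        frontier.add(url)
    print(frontier.next_url())   # http://a.example/1
    print(frontier.next_url())   # http://b.example/1 (a different server next)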

Retrieving a New Document

Retrieving a document is a simple task, exactly what your browser does when you read these pages. An HTML document is retrieved by sending an HTTP (Hypertext Transfer Protocol) request to the server. As users of the "World Wide Wait" know, this may take some time, which makes retrieval a bottleneck for the search engine. The bottleneck can be eased by running several retrieval processes at the same time.
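
Because retrieval is bound by the network rather than the processor, several requests can be in flight at once. The sketch below uses a pool of threads from Python's standard library; the worker count and the example URLs are arbitrary choices of our own:

    import concurrent.futures
    import urllib.request

    def fetch(url):
        # One retrieval process: send the HTTP request and read the reply.
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read()

    urls = ["http://www.example.com/", "http://www.example.org/"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body), "bytes")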

Examining the Outbound Links

Because HTML documents are text documents and all links have a fixed structure (in HTML they are written as <A HREF="URL">link text</A>), examining the outbound links is a fairly easy task. There are efficient programming tools for handling text documents, such as the Perl language, which can be used to parse the links out of an HTML document. It is important to remember that these links can also point to non-indexable files, such as image, sound, or PostScript files.
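
The text above mentions Perl; the same idea can be expressed with Python's standard html.parser module, as sketched below. The list of extensions treated as non-indexable is our own choice for illustration:

    from html.parser import HTMLParser

    NON_INDEXABLE = (".gif", ".jpg", ".png", ".wav", ".au", ".mp3", ".ps")

    class LinkExtractor(HTMLParser):
        # Collects the targets of <A HREF="URL">link text</A> tags,
        # skipping links that appear to point to non-indexable files.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and not value.lower().endswith(NON_INDEXABLE):
                        self.links.append(value)

    parser = LinkExtractor()
    parser.feed('<A HREF="page.html">a page</A> <A HREF="photo.jpg">a photo</A>')
    print(parser.links)    # ['page.html']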



This page was produced as a course assignment for the Teletekniikan perusteet course.
The page was last updated on 08.12.2000 at 23:25
URL: http://www.netlab.tkk.fi/opetus/s38118/s00/tyot/28/webrobot.shtml