What shape is the Web

& why does it matter?

A study undertaken by researchers at AltaVista, Compaq and IBM has effectively overturned previously held views that there is a high degree of connectivity between all web sites.

Andrei Broder, AltaVista's vice president of research and lead author of the study, said, "We realised that something is going on, that the model where everyone is linked to everyone else really doesn't quite work. The old picture where no matter where you start, very soon you'll reach the entire Web is not quite right."

The Web's topography has been charted by analysing 1.5 billion links covering more than 200 million web pages. Unlike previous models portraying the Web as clusters of sites forming a well-connected sphere, the results showed that the Web's structure more closely resembles a bow tie consisting of three major regions (a knot and two bows), and a fourth, smaller region of pages that are "disconnected" from the basic bow-tie structure.

At the center of the bow tie is the knot, termed the "strongly connected core." This core is the heart of the Web. Pages within the core are strongly connected by extensive cross-linking among and between themselves. Links on core pages enable Web surfers to travel relatively easily from page to page within the core. They are also the links most likely to be followed by search engine indexing spiders.

The left bow consists of "origination" pages that eventually allow users to reach the core, but that cannot themselves be reached from it. Origination pages are typically new or obscure Web pages that haven't yet attracted interest from the Web community (they have no links pointing to them from pages within the core) or are linked to only from other origination pages. Relatively closed community sites such as Geocities and Tripod are rife with origination pages that often link to one another but that pages within the core seldom link to.

The right bow consists of "termination" pages that can be accessed via links from the core but that do not link back into the core. "Many commercial sites don't point anywhere except to themselves," said Broder. Instead, corporate sites exist to provide information, sell goods or services, and otherwise serve as destinations in themselves, and there is little motivation to have them link back to the core of the Web.

The final region contains disconnected pages that aren't part of the bow tie. These pages can be connected to origination and/or termination pages but are not directly accessible to or from the connected core.
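To make the four regions concrete, here is a minimal sketch of how pages could be sorted into them once the links are known. It is an illustration only, not the method used in the study: the function names, the page names and the tiny link map are all hypothetical, and it assumes you already know one page that sits inside the core. A page belongs to the core if the known core page can reach it and it can reach the known core page; origination pages can reach the core but cannot be reached from it; termination pages are the reverse; everything else is treated as disconnected.

```python
from collections import deque


def reachable(start, edges):
    """Return every page that can be reached from `start` by following `edges`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in edges.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def classify_bow_tie(links, core_page):
    """Label each page as core, origination, termination or disconnected.

    `links` maps a page to the pages it links to (hypothetical input format).
    `core_page` is assumed to be a page already known to be in the core.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Build the reversed link map so we can walk links backwards.
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)

    forward = reachable(core_page, links)     # pages the core can reach
    backward = reachable(core_page, reverse)  # pages that can reach the core

    regions = {}
    for page in pages:
        if page in forward and page in backward:
            regions[page] = "core"
        elif page in backward:
            regions[page] = "origination"
        elif page in forward:
            regions[page] = "termination"
        else:
            regions[page] = "disconnected"
    return regions


# Hypothetical toy Web: page names are invented for illustration.
links = {
    "portal": ["news"],
    "news": ["portal", "shop"],
    "newblog": ["portal"],
    "shop": ["shop-cart"],
    "island": ["island2"],
}
regions = classify_bow_tie(links, "portal")
for page in sorted(regions):
    print(page, regions[page])
```

In this toy example "portal" and "news" link to each other and form the core, "newblog" links in but receives no links, the "shop" pages can be reached but never link back, and the "island" pages have no path to or from the core. Pages that hang off an origination or termination page without any path to or from the core would also be labelled disconnected by this sketch.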

A surprising finding is the size of each region. The core, as might be expected, is the largest, but it accounts for only about one third of all pages. The two bows are similar in size, at about 25% each, and the balance, about 20%, consists of disconnected pages.

The above has implications for all sites. If a site is part of the core, it is easier to find and more likely to be visited by both search engine spiders and potential customers. The potential penalty for not being part of the core is lower visibility. Of course, you can ask search engines to index your site even if it is not connected to the core, but the Web is growing so quickly that this may take some time. Furthermore, being indexed is not in itself enough if, for example, the site appears on page 10 of an AltaVista results list. You can also spend significant sums on advertising to create awareness of the site's domain name. The bottom line, however, is that for maximum effect a site needs to be part of the core. This raises the question: "I can link to the core, but how do I get the core to link to me?"

One method is to wait for your site to become so well known that "core" sites want to link to it (but do not hold your breath). The better, if simplistic, answer is some form of partnership arrangement with a "core" site. The strategy is simple enough, but how do you identify the right "core" sites with which to partner? The answer, in one word: "research".

Adapted from an article by Chris Sherman, President of Searchwise.net

June 2000