Web Searching (Part One) from the October 2000 Actrix Newsletter

by Rob Zorn

This month and next, I thought it might be a good idea to present articles on searching the web. The Internet is a vast storehouse of knowledge, and that, perhaps ironically, is what makes it so difficult to use as an information source.

I am sure most of us have gone to a "search engine," typed in a key word or phrase, and then received back a list of 40,000 pages that supposedly match our search. As we start rifling through the returned links, we find that about one in every 20 looks like it might actually have some relevance to our search. Disappointingly, only about one in 20 of those (on a good day) seems actually to be helpful.

For that reason a couple of articles are probably in order. This month I thought I'd look at how search pages work (it's not really accurate to call them all search engines) in terms of how they find information in the first place and then how they store it. An understanding of this will be a good foundation for an article next month on how to approach the different search pages to get them to be more helpful in what they find.

Search Engines vs Directories

One of the most popular "search engines" is not actually a search engine at all. Yahoo is in fact a directory, and as such it relies on humans for the things it lists. People submit a short description of their site to the directory, or editors (independent or working for Yahoo) write one for sites they review. When you search at a directory such as Yahoo, the search software only looks for matches in the descriptions that have been submitted. This, of course, has its advantages and disadvantages.

True search engines, such as AltaVista or HotBot, on the other hand, create their listings automatically, often without human intervention. Search engines send out little programs called robots or spiders, which crawl through the world wide web and return what they find to the search engine's index, where it is sorted, either by computers or by real people. A spider visits a web page, reads it, and then follows links to other pages within the site. This is what is meant when someone refers to a site as having been "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.
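For the curious, the way a spider follows links can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the three-page site in SITE is made up and kept in memory so nothing is actually fetched, and the names LinkCollector and crawl are my own.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site, held in memory (an assumption for the example).
SITE = {
    "/index.html": '<a href="/news.html">News</a> <a href="/about.html">About</a>',
    "/news.html": '<a href="/index.html">Home</a>',
    "/about.html": '<a href="/news.html">News</a>',
}


def crawl(start):
    """Visit every page reachable from start, following links breadth-first."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(SITE.get(url, ""))
        for link in collector.links:
            # Stay within the site and never visit the same page twice.
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/index.html"))  # ['/index.html', '/news.html', '/about.html']
```

A real spider would fetch each page over the network and would need to be far more careful about broken links and politeness to the sites it visits.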

The index, sometimes called the catalogue, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with the new information, though this is usually quite a lengthy process.

Search software is the third part of a search engine, after the spider and the index. This is the program that sifts through the millions of pages recorded in the index to find matches to a search. Its second job is to rank its findings in the order of what it believes is most relevant.
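To make the index and the search software a little more concrete, here is a rough Python sketch. It assumes the index is stored as a simple word-to-pages lookup (often called an inverted index), which is one common approach rather than a description of any named search engine. The pages in PAGES and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical pages a spider might have brought back.
PAGES = {
    "page1": "Bob Dylan is a singer and songwriter",
    "page2": "Dylan Thomas was a Welsh poet",
    "page3": "Dylan is also a computer language",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


index = build_index(PAGES)
print(search(index, "Dylan"))       # ['page1', 'page2', 'page3']
print(search(index, "Dylan poet"))  # ['page2']
```

Notice that this sketch only finds matches. Putting them in order of relevance is a separate job.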

All search engines have the basic parts described above, but there are differences in how these parts are tuned. That is why the same search on different search engines often produces very different results.

How Search Engines Rank Web Pages

Imagine walking up to a librarian and saying, "Dylan." A bad librarian would just scowl. A good librarian would start asking you some narrowing questions such as, "Do you mean Bob Dylan, the 20th century's greatest singer/songwriter? Do you mean the Welsh poet Dylan Thomas? Perhaps you want to know about the Dylan computer language..."

Unlike a librarian, search engines don't have the ability to ask a few questions to focus the search. They also can't rely on judgement and past experience to rank web pages, in the way humans can. To determine relevancy, search engines follow a set of rules that mainly have to do with the location and frequency of keywords on a web page. Pages with your entered keywords appearing in the title are assumed to be more relevant than others to the topic.

Search engines will also check to see if the keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning. Frequency is the other major factor. A search engine will analyse how often keywords appear in relation to other words in a web page. Those with a higher frequency are assumed by the search engine to be more relevant to your needs than other web pages, and are therefore ranked more highly in the returned search results. All the major search engines follow this location/frequency method to some degree. None do it exactly the same, however, which is another reason why the same search on different search engines produces different results.
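The location/frequency idea can be sketched roughly as follows. The bonus values and the "first 20 words" cut-off are numbers I have made up purely for illustration, as are the two sample pages. Real search engines keep their exact rules to themselves.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, keyword, title_bonus=3.0, top_bonus=1.0, top_words=20):
    """Score a page for one keyword using location and frequency.

    The weights are arbitrary assumptions, chosen only for this example.
    """
    keyword = keyword.lower()
    title_words = tokenize(page["title"])
    body_words = tokenize(page["body"])

    # Frequency: how often the keyword appears relative to all other words.
    frequency = body_words.count(keyword) / len(body_words) if body_words else 0.0

    total = frequency
    if keyword in title_words:
        total += title_bonus  # location: keyword appears in the title
    if keyword in body_words[:top_words]:
        total += top_bonus    # location: keyword appears near the top
    return total


# Two hypothetical pages.
pages = [
    {"title": "Folk music history",
     "body": "Folk music has many figures, and one of them is Dylan."},
    {"title": "Bob Dylan discography",
     "body": "Dylan released many albums. Dylan toured widely."},
]

ranked = sorted(pages, key=lambda p: score(p, "Dylan"), reverse=True)
for page in ranked:
    print(page["title"])
```

With these made-up weights, the discography page comes out on top because "Dylan" is in its title and makes up a larger share of its text.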

To begin with, some search engines index more web pages than others. Some search engines also index web pages more often than others. What this means is that no search engine has the exact same collection of web pages to search through.

Search engines may also boost the ranking of a web page for reasons of their own. For example, Excite uses link popularity as part of its ranking method. It can tell which of the pages in its index have a lot of links pointing to them. These pages are deemed a little more important, since a page with many links to it is probably well-regarded on the Internet.
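In its simplest form, link popularity amounts to counting how many other pages point at each page. The sketch below does just that. It is not Excite's actual method, which isn't public, and the page names in LINKS are invented.

```python
from collections import Counter

# Hypothetical record of which pages link to which (page -> pages it links to).
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def inbound_link_counts(links):
    """Count how many different pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # ignore a page linking to itself
                counts[target] += 1
    return counts


counts = inbound_link_counts(LINKS)
print(counts.most_common(1))  # [('c.html', 3)]
```

A search engine could then add a small bonus to the score of pages like c.html that many others point to.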

Meta tags are also important. These are tags hidden within a page's HTML code that contain a description of the site, or a list of the site's keywords according to the designer. Here, for example, are the relevant meta tags from http://editor.actrix.co.nz.

<meta name="keywords" content="Actrix, Actrix Networks, Newsletters, Actrix Newsletters, New Zealand ISP, Rob Zorn, Norrie the Nerd, Norrie, Editor">

<meta name="description" content="Actrix Newsletters are written and published in an effort to keep Actrix customers informed about the Internet. We aim to make the newsletters reasonably short and easy to read with a good balance of material including ideas and tips on how you get the most out of the Internet, news about technical developments and information about our services.">

These tags are put there specifically for the robots or spiders to find. The keywords are designed to alert a search engine to the search terms our site might be relevant to. The description is what we'd like to appear for a "searcher" in the summary when their search engine lists our site.

Many web designers mistakenly assume that meta tags are the real secret to boosting their web pages' rankings. HotBot and Infoseek do give a slight boost to pages with keywords in their meta tags, but some search engines, such as Lycos, don't read them at all, and there are plenty of examples where pages without meta tags still get highly ranked.
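For a search engine that does read meta tags, pulling them out of a page might look something like the Python sketch below. It uses the standard library's HTML parser. The class name MetaTagReader is my own, and the sample page is a shortened, made-up version of the tags shown earlier.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects the content of every <meta name="..." content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Abbreviated sample page (an assumption for the example).
HTML = (
    "<html><head>"
    '<meta name="keywords" content="Actrix, Actrix Networks, Newsletters">'
    '<meta name="description" content="Newsletters about the Internet.">'
    "</head><body>Welcome</body></html>"
)

reader = MetaTagReader()
reader.feed(HTML)

keywords = [word.strip() for word in reader.meta["keywords"].split(",")]
print(keywords)                    # ['Actrix', 'Actrix Networks', 'Newsletters']
print(reader.meta["description"])  # Newsletters about the Internet.
```

What a search engine then does with the keywords and description, if anything, is up to its own ranking rules.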

Search engines may also penalise pages or exclude them altogether if they detect search engine spamming. An example is when a word is repeated hundreds of times on a page (usually hidden in the HTML), to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods such as these, and no longer fall for them. I've included a short article below which includes a few tips for designers who would like their pages to rank better.
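As a rough idea of how the repeated-word trick might be spotted, the sketch below flags a page when a single word makes up too large a share of its text. The 30 percent threshold and the 50-word minimum are guesses of mine for illustration only. Real search engines use checks they don't publish.

```python
import re
from collections import Counter


def looks_like_keyword_stuffing(text, max_share=0.3, min_words=50):
    """Return True if one word accounts for more than max_share of the text.

    Both thresholds are arbitrary assumptions made for this example.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < min_words:
        return False  # too short to judge fairly
    top_word, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > max_share


# Hypothetical pages: one stuffed with a repeated word, one with varied text.
stuffed = "cheap " * 200 + "flights to Wellington"
varied = " ".join(f"word{i}" for i in range(60))

print(looks_like_keyword_stuffing(stuffed))  # True
print(looks_like_keyword_stuffing(varied))   # False
```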

Admittedly, so far we haven't covered much in the way of making your searches more effective. I will save that for next month. Hopefully, though, I've provided some background knowledge on which to build, and that can only be a good thing.

In the meantime, here are some links to the more popular search pages.

Ask Jeeves      AltaVista       Excite          Go
Google          HotBot          Lycos           Metacrawler
Northern Light  WebCrawler      Yahoo