Friday, August 12, 2005

Short list of search engines

A search engine must satisfy two preliminary criteria to be webometrics-friendly:

1. It must have a large, independent, self-crawled database
2. Its retrieval system must allow results to be filtered by URL-related delimiters
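As an illustration of the second criterion, the sketch below builds query strings that restrict results to a web domain and, optionally, to a file type. It assumes Google-style `site:` and `filetype:` operators; the exact delimiter syntax differs from engine to engine, and the function name and the example domain are hypothetical.

```python
def build_site_query(domain, filetype=None, terms=""):
    """Build a query string limited to one domain (assumed operator syntax)."""
    parts = []
    if terms:
        parts.append(terms)
    # "site:" is assumed here; check each engine's own delimiter syntax.
    parts.append(f"site:{domain}")
    if filetype:
        # "filetype:" is likewise an assumption, used to isolate rich files.
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical domain, for illustration only.
    print(build_site_query("example.edu"))
    print(build_site_query("example.edu", filetype="pdf"))
```

The number of results an engine reports for such a query is the raw figure a quantitative analysis would start from.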

Taking into account these requirements, currently only six engines are useful for quantitative analysis purposes:

- Google (www.google.com)
- Yahoo Search (search.yahoo.com)
- MSN Search (search.msn.com)
- Teoma (www.teoma.com)
- Gigablast (www.gigablast.com)
- Exalead (www.exalead.com)

Google's database probably exceeds 11 bn pages and includes good coverage of the so-called rich files and other dynamic and special file types. On the negative side, the number of dead links in this engine has increased considerably. Individual records of large databases (e.g. PubMed) are also indexed but, strangely, not in full.

Current figures for Yahoo are not available, as its databases grew greatly during July. An educated guess is around the 10-12 bn mark. The result counts returned for a large number of queries are misleading, as the reported number decreases when browsing beyond the first page of results.

MSN Search appears to have the most comprehensive geographical coverage, including Asian regions that the other major engines usually do not index well.

Gigablast and Exalead still have very small databases. However, as the overlap among engines is so low, combined search using several engines is clearly an option.
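A combined search can be sketched as follows: merge the URL lists returned by each engine and measure how much each pair of engines overlaps. This is a minimal sketch, not a tested methodology. The engine labels and URLs are made up, the URL normalization is deliberately crude, and overlap is measured here with the Jaccard index, which is one possible choice among several.

```python
from itertools import combinations


def normalize(url):
    """Crude normalization: drop scheme and trailing slash, lowercase all."""
    url = url.strip().lower()
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url.rstrip("/")


def combine_results(results_by_engine):
    """Return the merged URL set and the pairwise Jaccard overlap."""
    sets = {
        engine: {normalize(u) for u in urls}
        for engine, urls in results_by_engine.items()
    }
    union = set().union(*sets.values())
    overlap = {}
    for a, b in combinations(sorted(sets), 2):
        both = sets[a] | sets[b]
        # Jaccard index: shared URLs divided by all URLs seen by either engine.
        overlap[(a, b)] = len(sets[a] & sets[b]) / len(both) if both else 0.0
    return union, overlap


if __name__ == "__main__":
    # Hypothetical result lists, for illustration only.
    results = {
        "engine_a": ["http://example.org/a", "http://example.org/b/"],
        "engine_b": ["https://example.org/b", "http://example.org/c"],
    }
    union, overlap = combine_results(results)
    print(len(union), overlap)
```

A low pairwise overlap means the merged set is much larger than any single engine's list, which is the case for combining engines.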