Digital Minefield

Why The Machines Are Winning

The Elements of Search, Part Three


We use search engines to search the Internet. We do, but only indirectly. It wouldn’t be practical otherwise. Do we build our own airplanes to fly from New York to Los Angeles? No, we buy a ticket on a scheduled airline. Like regular air travel, search engines help us get where we want to go, but the destinations number in the billions and change every second.

Search engines are more like a library’s catalog. When we search the catalog, we’re not searching the actual library. The same is true of a search engine: we’re searching its “catalog” (or cache) of the Internet. Library catalogs are easy to search because they have an organized structure. So does a search engine’s cache, which is what makes it quick to search.
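To make the catalog idea concrete, here is a minimal sketch in Python. It is not how any real search engine builds its cache; the page addresses, the sample text, and the function names `build_catalog` and `search` are all made up for illustration. It only shows why an organized structure is quick to search: each word is looked up directly instead of rereading every page.

```python
from collections import defaultdict

def build_catalog(pages):
    """Map each word to the set of page addresses that contain it."""
    catalog = defaultdict(set)
    for address, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(address)
    return catalog

def search(catalog, query):
    """Return the addresses of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then narrow with each later word.
    results = set(catalog.get(words[0], set()))
    for word in words[1:]:
        results &= catalog.get(word, set())
    return results

# Hypothetical pages standing in for the Internet.
pages = {
    "example.com/planes": "airlines fly from new york to los angeles",
    "example.com/books": "a library catalog lists every book",
    "example.com/search": "a search engine keeps a catalog of pages",
}
catalog = build_catalog(pages)
```

Calling `search(catalog, "library catalog")` consults only the catalog, never the pages themselves, which is the point of the library comparison.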

What’s in this cache? It’s different for every search engine, but that shouldn’t really concern us. What we should care about is how complete, accurate, and up-to-date the information is. As you read this post we can assume it can be found by searching the Internet. However, the instant it was posted, it was still waiting to be found and placed into a search engine’s cache.

How does a Web page on the Internet get into a cache? For one thing, search engines are continually searching the Internet. However, even at the speed of light, this is not instantaneous. And the Internet is really, really big. And it’s continually changing.

How big? Searching for a single ubiquitous target (like the letter “a” or “e”) yields about 25 billion hits. But those are just the pages with text. Many pages have no text. Some are just pictures, and some are video, and some music, and so on. So the number of pages on the Internet (currently) is probably closer to 50 billion than 25 billion.

These numbers change every day. Recent figures show the Web gains over 100 thousand Web sites a day. And loses nearly as many. The net (pun intended) gain is about 30 thousand. But those are just sites, and a site could have only a few pages. Or, like this blog server, many, many millions.

At any given instant, what percentage of Internet Web pages are stored in a search engine’s cache? Impossible to answer. And useless since it will change in the very next instant. Another difficult question to answer is how much information in the cache is valid; that is, does the referenced page exist? But even then, what’s true at one moment may not be true at another.
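The validity question can at least be stated precisely, even if it can't be answered for the real Internet. The sketch below is an illustration only: `valid_fraction` is a hypothetical helper, and the two collections of addresses are invented. It measures what share of cached references still point to pages that exist at one particular moment.

```python
def valid_fraction(cached_addresses, live_addresses):
    """Share of cached references whose page still exists right now."""
    if not cached_addresses:
        return 0.0
    still_there = sum(1 for address in cached_addresses if address in live_addresses)
    return still_there / len(cached_addresses)

# Hypothetical snapshot: four cached references, one of which has vanished.
cached = ["example.com/a", "example.com/b", "example.com/c", "example.com/d"]
live = {"example.com/a", "example.com/b", "example.com/c", "example.com/new"}

fraction = valid_fraction(cached, live)
```

Here three of the four cached references are still good. The new page at `example.com/new` is not in the cache at all, and a moment later both collections could be different.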

How complete, accurate, and up-to-date is the information in a search engine’s cache? The best we can hope for is pretty good. More than that is unrealistic. Perfect, it can’t be.

Obviously, such an enormous undertaking is monstrously expensive. Next week, why it’s free. Or is it?
