Bundaberg Regional Council

Invisible Web

It is commonly assumed that you can search for and find every document, file and page on the World Wide Web. In practice this is not possible, for several reasons. The scale alone makes it hard to imagine: there are billions of Web pages in existence, and a great many more are added each day.

One reason you cannot search every single Web page is the way search engines work. Search engines use special programs called 'spiders' that crawl the Web looking for new Web pages. When a spider finds a new page, the page is indexed and added to that search engine's database of Web pages. This is a time-consuming, labour-intensive and ongoing process. Therefore, when you search the Web using a search engine, you are really searching only the database of pages held (indexed) by that search engine at that particular time, not the whole of the World Wide Web.
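
The idea can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the miniature 'web' below, its page names and the functions crawl_and_index and search are all made up for the example. It shows that a search looks only at the index built during the last crawl, so a page added afterwards cannot be found until the spider visits again.

```python
from collections import deque

# Hypothetical miniature "web": page name -> (page text, links to other pages).
TOY_WEB = {
    "home": ("welcome to the library", ["catalogue", "events"]),
    "catalogue": ("search the library catalogue", ["home"]),
    "events": ("library events this month", []),
}


def crawl_and_index(web, start):
    """Follow links from `start` and build a word -> set-of-pages index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        text, links = web[page]
        # Index every word on the page the spider has just visited.
        for word in text.split():
            index.setdefault(word, set()).add(page)
        # Queue any linked pages the spider has not visited yet.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Look the word up in the index only; the web itself is never consulted."""
    return sorted(index.get(word, set()))


index = crawl_and_index(TOY_WEB, "home")
print(search(index, "library"))  # ['catalogue', 'events', 'home']

# A new page is published after the crawl has finished.
updated_web = dict(TOY_WEB)
updated_web["hours"] = ("new opening hours", [])
updated_web["home"] = ("welcome to the library", ["catalogue", "events", "hours"])

print(search(index, "opening"))  # [] because the old index has not seen the page
print(search(crawl_and_index(updated_web, "home"), "opening"))  # ['hours']
```

The new page exists on the toy web the whole time; it only becomes searchable once a fresh crawl has added it to the index.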

Another reason you cannot search every Web page with traditional search tools is something called the Invisible Web. What is the Invisible Web? Basically, it is the information or Web pages that you cannot retrieve ('see') using conventional search tools such as search engines and subject directories.

Content ends up in the Invisible Web, where search tools cannot index it, for two kinds of reasons: technical barriers and policy decisions. Technical barriers include the page's code containing instructions that tell search tools not to index it, the content being accessible only with a password, and the page requiring the computer user to enter data (e.g. into a form of some sort). Databases are a prime example of material 'held' in the Invisible Web, because their records are typically displayed only in response to a query entered into a form.
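
The three technical barriers can be illustrated with a short Python sketch. It assumes one common form of the 'do not index' instruction, a robots meta tag whose content includes noindex, and it treats a password input field as a sign that a password is needed. The class BarrierFinder and the function indexing_barriers are hypothetical names chosen for this example; a real spider's checks would be more involved.

```python
from html.parser import HTMLParser


class BarrierFinder(HTMLParser):
    """Records signs in a page's HTML that would stop a spider indexing it."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.has_form = False
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # e.g. <meta name="robots" content="noindex">
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True
        elif tag == "form":
            self.has_form = True
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True


def indexing_barriers(html):
    """Return a list of reasons (possibly empty) why the page may go unindexed."""
    finder = BarrierFinder()
    finder.feed(html)
    finder.close()
    reasons = []
    if finder.noindex:
        reasons.append("page asks not to be indexed")
    if finder.has_password_field:
        reasons.append("password required")
    elif finder.has_form:
        reasons.append("content sits behind a form")
    return reasons


print(indexing_barriers('<meta name="robots" content="noindex"><p>Private notes</p>'))
# ['page asks not to be indexed']
print(indexing_barriers('<form><input type="text" name="title"></form>'))
# ['content sits behind a form']
```

A library catalogue search page is a typical case of the second result: the spider can read the empty form, but not the records that only appear after someone types in a query.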

Policy decisions arise because indexing Web pages is such a big task that it can be quite costly in terms of time, money and space in the database. The company that owns or operates a search engine therefore follows a policy setting out what it can exclude from its database. The format of a page is one criterion that determines exclusion. Spiders are designed to read HTML (HyperText Markup Language). Documents in Adobe PDF, Microsoft Word, Excel and PowerPoint formats are examples of special non-HTML formats that are excluded from most search engine databases. One exception is the Google search engine, which now provides the ability to search for these special documents.
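
A format-based exclusion policy of this kind could look something like the following Python sketch. The list of excluded file extensions, the function name would_index and the example.org addresses are assumptions made for illustration; each search engine sets its own policy, and a real one would not rely on the file extension alone.

```python
import posixpath
from urllib.parse import urlparse

# Assumed policy for this example: skip common non-HTML document formats.
EXCLUDED_EXTENSIONS = {".pdf", ".doc", ".xls", ".ppt"}


def would_index(url, excluded=EXCLUDED_EXTENSIONS):
    """Return True if the policy allows the page at `url` to be indexed."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in excluded


print(would_index("http://example.org/guide.html"))   # True
print(would_index("http://example.org/report.PDF"))   # False
```

Under this policy the PDF report still exists on the Web, but it never enters the database, so no search of that database can return it.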

Next time you are searching the Web, be aware that you may not be 'seeing' everything there is on the Web about your topic. There are sites dedicated to helping you search the Invisible Web. You can find some of these sites by typing the words 'invisible web' into a search engine.

For more detail on this topic, see the following sources used for this article:
· Sherman, C. & Price, G. 2001. The Invisible Web: uncovering information sources search engines can't see. Information Today. Medford, NJ. (http://www.invisble-web.net). In the Library at: ANF 025.04 she
· How Search Engines Work
· Invisible Web Information
· Google

Bundaberg Regional Library Service 2002-2009
Bundaberg, Queensland, Australia