What is the Deep Web? A first trip into the abyss
The Deep Web (or Invisible Web) is the set of information resources on the World Wide Web that are not indexed by standard search engines.
According to several studies, the principal search engines index only a small portion of overall web content; the remaining part is unknown to the majority of web users.
What would you think if you were told that under our feet there is a world larger than ours and much more crowded? You would probably be shocked, and that is the typical reaction of those who first grasp the existence of the Deep Web: a network of interconnected systems that are not indexed, estimated to be hundreds of times larger than the visible web, around 500 times.
A particularly effective description comes from Mike Bergman, founder of BrightPlanet, who compared searching on the Internet today to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed.
Ordinary search engines find content on the web using software called "crawlers", which discover pages by following hyperlinks from one page to the next. This technique is ineffective for finding the hidden resources of the Deep Web, which can be classified into the following categories:
- Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
- Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
- Private Web: sites that require registration and login (password-protected resources).
- Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
- Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers which prohibit search engines from browsing them and creating cached copies).
- Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
- Text content using the Gopher protocol and files hosted on FTP that are not indexed by most search engines. Engines such as Google do not index pages outside of HTTP or HTTPS.
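To make a few of these categories concrete, here is a minimal sketch of a link-following crawler run against a tiny in-memory site. Everything in it is an assumption made for illustration: the `SITE` map, the `example.test` host, the `toy-crawler` user agent, and the robots.txt rules are all hypothetical, and real crawlers are far more elaborate. It only shows why dynamic, unlinked, and limited-access pages never reach the index.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

HOST = "http://example.test"  # hypothetical host, never contacted

# Hypothetical toy site: each path maps to the paths it links to.
SITE = {
    "/": ["/about", "/search", "/members/"],
    "/about": ["/"],
    "/search": [],                 # form page; results exist only per query
    "/search?q=deep+web": [],      # dynamic content: nothing links to it
    "/orphan": ["/"],              # unlinked content: no inbound links
    "/members/": ["/members/profile"],
    "/members/profile": [],
}

# Hypothetical robots.txt asking crawlers to stay out of /members/.
ROBOTS_TXT = ["User-agent: *", "Disallow: /members/"]


def crawl(site, start="/", robots_lines=ROBOTS_TXT, agent="toy-crawler"):
    """Breadth-first crawl that follows links and honours robots.txt."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    seen, queue = set(), deque([start])
    while queue:
        path = queue.popleft()
        if path in seen or path not in site:
            continue
        if not robots.can_fetch(agent, HOST + path):
            continue  # limited-access content is skipped
        seen.add(path)
        queue.extend(site[path])  # only linked pages are ever discovered
    return seen


indexed = crawl(SITE)
missed = sorted(set(SITE) - indexed)
print("indexed:", sorted(indexed))
print("missed: ", missed)
```

In this toy setup the crawler indexes only the home page, the about page, and the empty search form. The query result page, the orphan page, and everything under `/members/` stay invisible, which is the Deep Web in miniature.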