2009/02/25

Deep Web

One day last summer, Google's search engine quietly trundled past a milestone: it added the one trillionth address to the list of Web pages it knows about. But as big as that number may seem, it represents only a fraction of the entire Web. Beyond those trillion pages lies an even vaster trove of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines. The challenges the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still cannot answer questions like “What is the best fare from New York to London next Thursday?” or “When do the Yankees play the Red Sox this year?” The answers are readily available - if only the search engines knew how to find them. Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web's hidden corners. When that happens, it will do more than improve the quality of search results - it could ultimately reshape the way many companies do business online.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.
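
To make the distinction concrete, here is a minimal sketch in Python of how a surface-Web crawler works: it starts from a seed page and follows hyperlinks breadth-first, so any content that only appears after filling in a search form is never visited. The seed URL and page limit are placeholders, not details of any real crawler.

```python
# Minimal breadth-first crawler sketch: follow <a href> links from a seed page.
# Pages reachable only through a query form never enter this traversal.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    seen, queue = set(), deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that cannot be fetched or decoded
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return seen

if __name__ == "__main__":
    print(crawl("https://example.com"))  # placeholder seed page
```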

“The crawlable Web is the tip of the iceberg,” said Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.

“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we are trying to do is help you explore the haystack.”

That haystack is infinitely large. With millions of databases connected to the Web, and an endless number of possible permutations of search terms, there is simply no way for any search engine - no matter how powerful - to sift through every possible combination of data on the fly.

To extract meaningful data from the Deep Web, search engines have to analyze users' search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (like, say, museum catalogs or auction houses), and what kinds of queries those databases will accept.
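
The paragraph above describes what is sometimes called query brokering. The toy Python sketch below illustrates the idea: map the user's term to a topic, pick the databases registered under that topic, and restate the query in each database's own field names. Every source name, field name and keyword list here is invented for the example.

```python
# Toy query broker: route a search term to the databases most likely to hold
# relevant records, reformulated in each database's own query vocabulary.
SOURCES = {
    "museum_catalog":  {"topic": "art",    "query_field": "artist"},
    "auction_house":   {"topic": "art",    "query_field": "creator"},
    "flight_schedule": {"topic": "travel", "query_field": "route"},
}

TOPIC_KEYWORDS = {
    "art":    {"rembrandt", "picasso", "vermeer", "painting"},
    "travel": {"flight", "fare", "airline"},
}

def broker(term):
    """Return (source, source-specific query) pairs for the matching databases."""
    topic = next((t for t, words in TOPIC_KEYWORDS.items()
                  if term.lower() in words), None)
    return [(name, {meta["query_field"]: term})
            for name, meta in SOURCES.items()
            if meta["topic"] == topic]

print(broker("Rembrandt"))
# [('museum_catalog', {'artist': 'Rembrandt'}), ('auction_house', {'creator': 'Rembrandt'})]
```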

The approach may sound simple in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.

“This is the most interesting data integration problem imaginable,” said Alon Halevy, a former computer science professor at the University of Washington who now leads a team at Google trying to solve the Deep Web conundrum.

Google's Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms - “Rembrandt,” “Picasso,” “Vermeer” and so on - until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
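
A rough sketch of that probing strategy, assuming a form can be submitted programmatically and its result count read back: try lists of guessed terms, record which topics produce hits, and use the tallies as a crude model of what the database holds. The submit_form function below is a stand-in with canned answers; a real system would post to the form's action URL and parse the result page.

```python
# Probe a database's search form with guessed terms and infer its subject.
from collections import Counter

PROBE_TERMS = {
    "art":      ["Rembrandt", "Picasso", "Vermeer"],
    "medicine": ["aspirin", "insulin", "penicillin"],
}

def submit_form(term):
    """Stand-in for submitting the form: pretend only art terms return records."""
    fake_art_index = {"rembrandt": 12, "picasso": 30, "vermeer": 4}
    return fake_art_index.get(term.lower(), 0)

def profile_database():
    """Tally which topics get hits and guess what the database is about."""
    hits = Counter()
    for topic, terms in PROBE_TERMS.items():
        for term in terms:
            if submit_form(term) > 0:
                hits[topic] += 1
    return hits.most_common(1)[0][0] if hits else "unknown"

print(profile_database())  # -> 'art'
```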

In a similar spirit, Prof. Juliana Freire of the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org), which eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated computational guessing game of its own.

“The naive way would be to query all the words in the dictionary,” Ms. Freire said. Instead, DeepPeep starts by posing a small number of sample queries, “so we can use that to build up our understanding of the databases and choose which words to search for.”

Based on that analysis, the program then automatically fires off search terms in an effort to dislodge as much data as possible. Ms. Freire says the approach retrieves more than 90 percent of the content stored in any given database. Her work has recently drawn attention from one of the largest search engine companies.
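
The sampling idea can be sketched in a few lines of Python: seed the process with a handful of queries, mine the words in each retrieved record for new candidate terms, and keep going until nothing new turns up. The in-memory FAKE_DB below stands in for a remote, form-backed database; it is not part of DeepPeep.

```python
# Iterative harvesting sketch: words found in results become the next queries.
FAKE_DB = {
    "rembrandt": ["Rembrandt self-portrait", "Rembrandt etching Amsterdam"],
    "amsterdam": ["Amsterdam canal scene", "Vermeer exhibition Amsterdam"],
    "vermeer":   ["Vermeer Girl with a Pearl Earring"],
}

def query(term):
    """Stand-in for submitting one term to the database's search form."""
    return FAKE_DB.get(term.lower(), [])

def harvest(seed_terms, max_rounds=25):
    """Pull out records by feeding words from earlier results back as queries."""
    tried, records = set(), set()
    frontier = set(t.lower() for t in seed_terms)
    for _ in range(max_rounds):
        if not frontier:
            break
        term = frontier.pop()
        tried.add(term)
        for record in query(term):
            records.add(record)
            # every word in a retrieved record becomes a new candidate query
            frontier.update(w.lower() for w in record.split()
                            if w.lower() not in tried)
    return records

print(harvest(["Rembrandt"]))
```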

As the major search engines begin to experiment with incorporating Deep Web content into their results, they must figure out how to present different kinds of data without overcomplicating their pages. That poses a particular dilemma for Google, which has long resisted the temptation to make significant changes to its tried-and-true format for search results.

“Google is facing a real challenge,” said Chris Sherman, executive editor of the Web site Search Engine Land. “They want to make the experience better, but they have to be supercautious about making changes for fear of alienating their users.”

Beyond consumer search, Deep Web technologies may eventually let businesses use data in new ways. A health site, for example, could cross-reference data from pharmaceutical companies with the latest findings of medical researchers, and a local news site could extend its coverage by letting users tap into government records stored in databases.
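
As a small illustration of that cross-referencing idea, the Python snippet below joins two made-up data sets on a shared drug name; both tables and their field names are invented for the example.

```python
# Join records from two independent sources on a shared key.
drug_catalog = [
    {"drug": "Drug A", "manufacturer": "Acme Pharma"},
    {"drug": "Drug B", "manufacturer": "Globex Labs"},
]
research_findings = [
    {"drug": "Drug A", "finding": "shortened symptom duration in a recent trial"},
]

findings_by_drug = {row["drug"]: row["finding"] for row in research_findings}
combined = [
    {**row, "latest_finding": findings_by_drug.get(row["drug"], "none reported")}
    for row in drug_catalog
]
print(combined)
```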

That level of data integration could eventually point the way toward something like the Semantic Web, the long-promoted - but so far unrealized - vision of a Web of interconnected data. Deep Web technologies hold out the promise of achieving similar benefits at a much lower cost, by automating the analysis of database structures and cross-correlating the results.

“The huge thing is the ability to connect different data sources,” said Mike Bergman, a computer scientist and consultant who is credited with coining the term Deep Web. Mr. Bergman said the long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.
