Untangling the Web: Man Plus Machine

By Ken Yarmosh - May 22, 2006 12:00 AM

The Web is expanding at an incredible rate. According to Technorati, a leading blog search engine, the size of the "blogosphere" continues to double every six months. Technorati now claims to track over 37 million blogs, while Google says it has indexed over 8 billion web pages. Both numbers represent a stunning amount of data. How best to organize all of this data is still being debated.

From the outset, people recognized that organizing information was necessary to make the Web useful. Even with hundreds instead of millions or billions of sources, users could not easily find what they wanted. There really was no central repository where information about information (often referred to as "meta-data") was stored.

There have generally been two methods to help solve the problem of information overload: 1) Use human expertise. 2) Use computer-based algorithms. Two of today's Internet giants - Yahoo! and Google - each began by using one of these approaches.

Yahoo! started out as "Jerry and David's Guide to the World Wide Web". It was basically a categorized hierarchical listing of sites Jerry Yang and David Filo - co-founders of Yahoo! - thought were interesting (Yahoo! stands for Yet Another Hierarchical Officious Oracle). Yahoo! continued to grow with other people adding interesting sites to the Yahoo! directory. Thus, Yahoo! was initially people-driven -- Yahoo! Search and other Yahoo! technologies came later.

Then along came Google. Its founders, Sergey Brin and Larry Page, had a very different approach to solving the same problem. They were convinced that they had a unique way of "retrieving relevant information from a massive set of data." Their search algorithm revolutionized the way people found information on the Internet. To "Google" something is now part of everyday language.
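
Their key insight was to treat links between pages as signals of importance. The sketch below is a toy, textbook-style version of link-based ranking in that spirit; it is not Google's actual algorithm, and the damping factor and iteration count are conventional illustrative choices.

# A toy sketch of link-based ranking in the spirit of PageRank.
# Not Google's actual algorithm; damping and iteration count are
# textbook values, and dangling pages are ignored for brevity.

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page shares its current rank among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Pages "a" and "b" both link to "c", so "c" ends up ranked highest.
print(rank_pages({"a": ["c"], "b": ["c"], "c": ["a"]}))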

But even with Yahoo!, Google, and other search engine companies working on it, the problem of efficiently indexing and retrieving information on the Web is far from solved. Search engines are imperfect in their own right, and blogs have added another level of complexity. Bloggers typically write about current events related to politics, technology, or other topics of interest, but traditional search engines were not really built to track real-time content, especially not content written by average Joes and Janes.

Just as before, the expertise of humans and the automation of technology are being used to try to sort through the noise created by blogs and the 24-7 news cycle. The difference is that, this time, the two approaches are not entirely mutually exclusive. In fact, in some cases, the two are beginning to work hand-in-hand.

More and more traditional print publications are getting their writers to blog. Essentially, these writer/bloggers act as filters, directing readers to interesting links or sharing quick opinions on current news items. BusinessWeek's Blogspotting, written by Stephen Baker and Heather Green, is a good example of real-time news analysis by a business magazine.

On the other side of the coin, there are sites like TailRank. Its focus is to use proprietary real-time ranking algorithms to discover the most linked-to mainstream and blogosphere news items. TailRank then clusters related sources and lists them on its homepage in order of popularity. A site called memeorandum does something similar for politics and technology (with sister sites tracking gossip and baseball).
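
TailRank's actual algorithms are proprietary, so the following is only a rough sketch of the general idea: count how many distinct sources link to each story, group the sources under the story they cite, and list the stories by link count. The data shapes here are invented for illustration.

from collections import defaultdict

# Inputs are assumed to be (source, story_url) pairs, e.g. a blog post
# linking to a news article.
def rank_stories(citations):
    sources_by_story = defaultdict(set)
    for source, story in citations:
        sources_by_story[story].add(source)
    # Most linked-to stories first, each grouped with the sources citing it.
    return sorted(sources_by_story.items(),
                  key=lambda item: len(item[1]), reverse=True)

citations = [
    ("blog-a.example.com", "news.example.com/story-1"),
    ("blog-b.example.com", "news.example.com/story-1"),
    ("blog-c.example.com", "news.example.com/story-2"),
]
for story, sources in rank_stories(citations):
    print(len(sources), story, sorted(sources))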

While BusinessWeek and TailRank represent the traditional man versus machine approach, other relative newcomers have decided to meld the two ideas. Digg.com, for example, is a community-driven site that allows users to "digg" stories. The stories that get dug the most eventually propagate through the system to the Digg homepage.
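
A minimal sketch of that kind of vote-driven promotion appears below. The promotion threshold and data shapes are invented for illustration; Digg's real promotion logic was more involved than a simple vote count.

# A minimal sketch of community-vote promotion, loosely modeled on the
# Digg-style "man plus machine" idea. The threshold is hypothetical.
PROMOTION_THRESHOLD = 25

class Story:
    def __init__(self, title, url):
        self.title = title
        self.url = url
        self.voters = set()  # users who voted for the story; each counts once

    def digg(self, user):
        self.voters.add(user)

def front_page(stories):
    """Return promoted stories, most-voted first."""
    promoted = [s for s in stories if len(s.voters) >= PROMOTION_THRESHOLD]
    return sorted(promoted, key=lambda s: len(s.voters), reverse=True)

story = Story("Example headline", "example.com/post")
for n in range(30):
    story.digg(f"user-{n}")
print([s.title for s in front_page([story])])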

The "man plus machine" methodology of sites like Digg are still in their early stages of existence. But the response to them has been overwhelmingly positive. According to stats from Alexa Traffic Rankings, in just about a year and a half of existence, Digg has surpassed the traffic of Slashdot -- previously the de facto source of "news for nerds".

The age-old battle of man versus machine may come to an end -- at least on the Web. Yahoo!, Google, and even non-tech companies such as washingtonpost.com have noticed this new man-plus-machine paradigm and have begun to leverage it.

To this point, mostly early adopters are taking advantage of services like Yahoo!'s recently acquired social bookmarking site Delicious. But the acceptance of these ideas by the Internet goliaths means they are now accessible to everyday, less technically savvy Web users. The technology is not quite mature enough for all users to embrace, but great progress is being made. A more useful, more organized Web is just around the corner.

The author is Editor of Corante Web Hub.
